From: Andres Freund
Date: Thu, 6 Dec 2018 04:28:24 +0000 (-0800)
Subject: ZHEAP on Pluggable V1.
X-Git-Url: http://git.postgresql.org/gitweb/static/gitweb.js?a=commitdiff_plain;h=4a9c21d57f7177b932bc76490990029ffffe7046;p=users%2Fandresfreund%2Fpostgres.git

ZHEAP on Pluggable V1.

Squashed commit of the following:

Pluggable Storage Integration +

commit 309145fb8ab59c993e26855cf0bf51d37902773f (zheap/master)
Author: mahendra
Date: 2018-12-10 11:14:44 +0530

Fix WAL logging for tpd map update during Rollback

We were not WAL logging the tpd map update in a few cases; fix that.

By Mahendra Singh Thalor and Amit Kapila

commit deda887629ca8bba859c0fa2a2e459e2c922c2bd
Author: mahendra
Date: 2018-12-10 10:45:04 +0530

BugFix: Correcting block number read in tpd_xlog_free_page

In the tpd_xlog_free_page function, we were trying to read from block 3 the data written in block 2. Hence correcting the block number.

Patch by me, reviewed by Amit Kapila

commit 79738e7038e267c2c685cc5a5221153269317eeb
Author: Dilip Kumar
Date: 2018-12-07 10:54:33 +0530

Revert "For slots marked as invalid xact, skip fetching undo for some cases"

This reverts commit 796299af2ca14800f8d69560d33f89ac293c1d15.

commit 796299af2ca14800f8d69560d33f89ac293c1d15
Author: Kuntal Ghosh
Date: 2018-12-05 17:27:18 +0530

For slots marked as invalid xact, skip fetching undo for some cases

For a slot marked as invalid xact, if the corresponding xid precedes the oldest xid having undo, or the corresponding undo record is already discarded, we can consider the slot as frozen.

Patch by me. Investigated by Amit Kapila, Dilip Kumar and me.

commit e26c804789fdcf24fc588a734b2196a81a51eef4
Author: Dilip Kumar
Date: 2018-12-05 13:09:06 +0530

Merged review comment fixes from pg hackers patch

Dilip Kumar and Amit Kapila

commit 1fee88b7ec5d9aed7e1c0429292906e30f457f0d
Author: Mithun CY
Date: 2018-12-04 02:11:08 -0800

Fix CLANG's warnings.

By Mithun C Y, review by Amit Kapila.

commit 5b48c0c175dabfaf34099f0cc7feefe0b2367a0c
Author: Mithun CY
Date: 2018-12-04 01:35:40 -0800

Fix thinkos, caught by CLANG compilers.

By Mithun C Y, review by Dilip Kumar.

commit a1e1859d6910f49bb6eec8d04113dbba78deee3a
Author: Dilip Kumar
Date: 2018-12-03 14:54:38 +0530

Improve comments and code for previous commit.

commit 132fefaf953e632e07ddc53f8e0198e655cf0896
Author: Dilip Kumar
Date: 2018-12-02 16:35:02 +0530

In ZGetMultiLockMembers we fetch the undo record without a buffer lock, so it's possible that a transaction in the slot can roll back and rewind the undo record pointer. To prevent that, we acquire the rewind lock before rewinding the undo record pointer, and the same lock will be acquired by ZGetMultiLockMembers in shared mode. In other places where we fetch the undo record we don't need this lock, as we are doing that under the buffer lock. So remember to acquire the rewind lock in shared mode wherever we are fetching the undo record of a non-committed transaction without a buffer lock.

Mahendra Thalor
Reviewed and modified by Dilip Kumar

commit c200352aa6917b8c3d4a0e771a1f19d6f4eec6d9
Author: Rafia Sabih
Date: 2018-11-30 14:06:07 +0530

Bugfix in TPD recovery

Patch by me, reviewed by Dilip Kumar

commit fd9152fa3ecfe1d0d8e6d06571d91f082d0f4ee9
Author: Dilip Kumar
Date: 2018-11-29 07:17:27 -0800

Remove multilocker flags for unwanted cases

Currently, we are setting the multilocker flag whenever a lock is acquired on the tuple, which has some performance penalty.
As part of this patch, we have avoided setting the multilocker flag for the case when an updater is taking a lock through the EvalPlanQual mechanism.

Amit Kapila and Dilip Kumar

commit 79882e67fe4572081fc65a0dc28c20a773ef0533
Author: Kuntal Ghosh
Date: 2018-11-29 18:26:12 +0530

Fix issue in TPD allocate entry

While allocating an entry for TPD, we traverse all the tuples in the page, check for tuples that correspond to the last slot, and update the corresponding offset map in TPD with the actual transaction slot.

commit 8a6d050f121d2e588defee48d4a3ad9e43ac6136
Author: Rafia Sabih
Date: 2018-11-29 16:57:02 +0530

Bugfix in recovery of TPD free page

Patch by me, reviewed by Amit Kapila

commit 47b86637a3d1ae71102c1a7a7704aaf7e95feffc
Author: Kuntal Ghosh
Date: 2018-11-28 14:10:10 +0530

Fix issues in TPD

1. If the previous and next block numbers are invalid for a TPD page, then that is the only TPD page in the relation. We should handle the case correctly in TPDFreePage.
2. While allocating a new TPD page, we ask for a new page from the FSM. It's possible that the FSM returns a zheap page on which the current backend already holds a lock in exclusive mode. Hence, try for a conditional lock. If it can't get the lock immediately, extend the relation and allocate a new TPD block.
3. In TPDPageLock, if a TPD buffer is already pruned, we don't take a lock on the same. Here, we should take the opportunity to clear the TPD location from the corresponding zheap buffer.
4. In lazy vacuum, we don't vacuum a TPD page. Instead, we try to prune the page. But if it's already pruned, we should skip it.
5. There are several places in the code where we lock a TPD page before entering a critical section. For non-inplace updates on different buffers, it is possible that both the old and new zheap buffers correspond to the same TPD buffer. Hence, we should be careful that we don't try to lock the TPD page two times. Else, we'll be waiting on ourselves.

Patch by me. Reviewed by Amit Kapila.

commit 8c31fef78166e7df53470f43c6bd1d53d18341e7
Author: Kuntal Ghosh
Date: 2018-11-28 14:31:09 +0530

Skip using tuple if corresponding item is deleted

If the item id is marked as deleted, we can't check the tuple. Else, it'll lead to a segmentation fault.

Patch by me. Reviewed by Amit Kapila.

commit c66b03f50c074236882fb0370a87dad8df4a597c
Author: Amit Kapila
Date: 2018-11-27 09:10:44 +0530

Fix Windows build.

commit 35e2e1b5907d85ee02d74b728a23ab7865405e64
Author: Rafia Sabih
Date: 2018-11-26 14:34:16 +0530

Forget Local buffers

For zheap, forget the local buffers whenever used.

Patch by me, reviewed by Amit Kapila

commit 3a1f846c48cea443478478622c12b6df28fe15a6
Author: Amit Kapila
Date: 2018-11-25 10:12:13 +0530

Fetch CID from undo only when required.

If the current command doesn't need to modify any tuple and the snapshot used is not of any previous command, then it can see all the modifications made by the current transaction till now. So, we don't even attempt to fetch the CID from undo in such cases.

Patch by Amit Kapila, reviewed and verified by Dilip Kumar

commit e48be86489daf6bc78071d7fe7fe5e1eb5b2a705
Author: Mithun CY
Date: 2018-11-23 07:58:01 -0800

Bug Fix, correct an invalid Assert.

slot_xid is uninitialized and the Assert condition is invalid.

Patch by Mahendra Thalor
Review by Amit Kapila.
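A minimal standalone illustration (names and values are hypothetical, not taken from the patch) of the epoch-qualified XID comparison described in the "For slots marked as invalid xact, skip fetching undo for some cases" commit above: a slot can be treated as frozen once its (epoch, xid) pair precedes the oldest xid that still has undo.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Combine a 32-bit epoch and a 32-bit xid into one orderable 64-bit value. */
    static uint64_t
    make_epoch_xid(uint32_t epoch, uint32_t xid)
    {
        return ((uint64_t) epoch << 32) | xid;
    }

    /*
     * A slot whose (epoch, xid) precedes the oldest xid that still has undo can
     * be treated as frozen: its undo has already been discarded.
     */
    static bool
    slot_is_effectively_frozen(uint32_t slot_epoch, uint32_t slot_xid,
                               uint64_t oldest_xid_with_epoch_having_undo)
    {
        return make_epoch_xid(slot_epoch, slot_xid) <
               oldest_xid_with_epoch_having_undo;
    }

    int
    main(void)
    {
        uint64_t oldest = make_epoch_xid(2, 1000);

        printf("%d\n", slot_is_effectively_frozen(1, 5000, oldest));  /* 1: older epoch */
        printf("%d\n", slot_is_effectively_frozen(2, 1500, oldest));  /* 0: still has undo */
        return 0;
    }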
commit 12bf9650d247c64d96868ada87fcda07a2ff46b5
Author: Dilip Kumar
Date: 2018-11-23 03:31:06 -0800

Merge review comment fixes on the community patch

Dilip Kumar

commit d06ab949c8774f6989ffefc65a100b7208516a06
Author: Amit Kapila
Date: 2018-11-22 15:53:22 +0530

Fix the modification of TPD map during rollback.

If the previous transaction slot points to a TPD slot, then we need to update the slot in the offset map of the TPD entry. This is the case where, during the DO operation, the previous updater belongs to a non-TPD slot whereas now the same slot has become a TPD slot. In such cases, we need to update the offset-map.

Patch by Mahendra Singh, reviewed by Amit Kapila

commit 9f1fd917e954b6b430b2575e84bd669166ee8cdf
Author: Mithun CY
Date: 2018-11-21 22:09:16 -0800

Make the expected file to be generated by a source file.

Patch by Mahendra Thalor
Review by Mithun C Y.

commit 3e255b05d87a65183adc8cac204529937c978e13
Author: Beena Emerson
Date: 2018-11-20 11:47:49 +0530

Fix crash in pg_stat_* functions

Commit b04aeb0a05 added Asserts to ensure that we hold some relevant lock during relation_open. This commit corrects the locks used in relation_open for the zheap pg_stat functions to avoid the Assertion failure.

Patch by me, reviewed by Dilip Kumar

commit 0e2f8eb102744121d7203213c87c124f75578665
Author: Rafia Sabih
Date: 2018-11-19 13:07:25 +0530

Reset undo buffers in case of abort

In case of errors, when the transaction aborts, the undo buffers and the buffer index are now reset.

Patch by me, reviewed by Dilip Kumar and Amit Kapila

commit 88f9e6e42d0c769158b283b378c4c0d19af91139
Author: Dilip Kumar
Date: 2018-11-18 04:43:36 -0800

Fix trailing space in previous commit

commit b925c7ab4df48f42951b9690216d9c9e78f25ba3
Author: Dilip Kumar
Date: 2018-11-18 04:18:04 -0800

Only try to lock the TPD page if the zheap page has TPD slots in it.

Patch by Mahendra Thalor, reviewed by Dilip Kumar

commit 257e8e7990e9d85a373a41d0dce17f7f56ed5c61
Author: Kuntal Ghosh
Date: 2018-11-15 17:46:14 +0530

Add expected file for multiple-row-versions isolation test

In this test, we get a serialization failure due to in-place updates in zheap. But this is expected behavior.

Discussions: https://postgr.es/m/CAGz5QCJzreUqJqHeXrbEs6xb0zCNKBHhOj6D9Tjd3btJTzydxg@mail.gmail.com

commit 1b44ba6c5154fa6eed3e56bfb97285a78aa58003
Author: Kuntal Ghosh
Date: 2018-11-02 12:00:41 +0530

Implement ALTER TABLE SET TABLESPACE for zheap

To alter the tablespace for a zheap table, we copy the pages one by one to the new tablespace. Following is the algorithm to perform the same. For each zheap page:
a. If it's a meta page, copy it as it is.
b. If it's a TPD page, copy it as it is.
c. If it's a zheap data page, apply pending aborts, then copy the page and the corresponding TPD page if we've rolled back any transaction from the TPD.

Patch by me. Reviewed by Amit Kapila.

commit 2b5f6aa07e8d705f238bf8b1a308e6f761e05a98
Author: Kuntal Ghosh
Date: 2018-11-14 11:35:00 +0530

Create a wrapper function for fetching transaction slots for a page

We've created a wrapper function GetTransactionsSlotsForPage for fetching all transaction information for a zheap page and its corresponding TPD page.

Patch by me. Reviewed by Amit Kapila.
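A small standalone sketch (the page kinds and helpers are stand-ins, not the patch's code) of the per-page dispatch described in the "Implement ALTER TABLE SET TABLESPACE for zheap" commit above:

    #include <stdio.h>

    typedef enum { META_PAGE, TPD_PAGE, DATA_PAGE } PageKind;

    static void
    copy_one_page(int blkno, PageKind kind)
    {
        switch (kind)
        {
            case META_PAGE:
            case TPD_PAGE:
                /* meta and TPD pages are copied verbatim */
                printf("block %d: copy as is\n", blkno);
                break;
            case DATA_PAGE:
                /*
                 * Apply pending aborts first, then copy the page; if a rollback
                 * touched the corresponding TPD page, copy that TPD page again.
                 */
                printf("block %d: apply pending aborts, then copy\n", blkno);
                break;
        }
    }

    int
    main(void)
    {
        PageKind rel[] = {META_PAGE, DATA_PAGE, TPD_PAGE, DATA_PAGE};

        for (int blkno = 0; blkno < 4; blkno++)
            copy_one_page(blkno, rel[blkno]);
        return 0;
    }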
commit 0931d56b34aaf5c9c2b665da751aec051aa7d90a
Author: Dilip Kumar
Date: 2018-11-15 01:50:11 -0800

Fix assert

It must be in a critical section only if it's not in recovery.

commit 96ede480d229e874e204895856bcded0fb33f68b
Author: Dilip Kumar
Date: 2018-11-15 01:41:43 -0800

Remove relfilenode and tablespace id from the undo record and store reloid

There was no reason why we stored the relfilenode and tablespace id in the undo record; instead, we can just store the reloid. Earlier, we might have kept it thinking that we would perform rollback without a database connection, but that's not the case now. This saves space, and it will also be useful when we need to transfer the undo records across relfilenodes, e.g. ALTER TABLE SET TABLESPACE.

Dilip Kumar, reviewed by Amit Kapila

commit 842e9e36821e2ad1ce2f557631a2af8bb96f2d23
Author: Dilip Kumar
Date: 2018-11-15 01:28:07 -0800

Fix cosmetic review comments in undoinsert

Dilip Kumar, review by Amit Kapila

commit 01f981fdeddad6eff83f1a12c8751a373df56a1c
Author: Rafia Sabih
Date: 2018-11-14 20:48:14 +0530

Discard undo logs in single-user mode

For zheap, discard the undo logs at commit time when in single-user mode.

Patch by me, reviewed by Dilip Kumar and Amit Kapila

commit 1bcef1bf35738b060e5c412428aac502245abe2b
Author: Rafia Sabih
Date: 2018-11-14 17:03:47 +0530

Bugfix for visibility map buffer

In case the page is newly extended, the vmbuffer will not be valid. Hence, avoid checking its status in zheap insert and update.

Reported by Neha, patch by me, reviewed by Amit Kapila

commit 1b1cace0bb7195025ea5bd19848953f10fe5cd30
Author: Dilip Kumar
Date: 2018-11-13 05:14:26 -0800

Removed invalid assert in extend_undo_log

It is possible that while we try to extend the undo log it has already been extended by discard (recycling an old undo log), so this situation is valid.

Patch by Mahendra, reviewed by Dilip Kumar and Amit Kapila

commit aa25f816c637f9123f67a4c7e1fa14f64c15d72c
Author: Dilip Kumar
Date: 2018-11-11 19:35:32 -0800

Fix compiler error

commit 48d8236385a8894f17e642bb182e6be01b5fcfb7
Author: Amit Kapila
Date: 2018-11-12 08:33:05 +0530

Fix warning.

commit 0495698cb6e1e6d29ddb146332ddc61299c99aeb
Author: Amit Kapila
Date: 2018-11-10 16:15:45 +0530

Prune entire TPD page in one shot if possible.

We have used the tpd_latest_xid_epoch stored in the page to prune the entire TPD page. Basically, if tpd_latest_xid_epoch precedes the oldest xid having undo, then we can assume all the entries in the page can be pruned. Apart from that, I have changed the logic so that the page is freed during pruning if all the entries are removed from it. This will ensure that pages are reclaimed whenever they are empty, not only during vacuum. In passing, I have fixed another related issue: after we get a new page from the FSM, we need to ensure that it is a TPD page before using it.

Patch by Amit Kapila, reviewed and verified by Dilip Kumar

commit 2c0af615df41867e26fd80900f0e38f032750752
Author: Kuntal Ghosh
Date: 2018-11-09 15:22:26 +0530

While computing infomask for updating a tuple, don't copy update flags

When we compute the infomask during updating a tuple, we don't have to copy the update flags. We compute it later in zheap_update.

commit 48f1a2d2973ae25b33adfeac4129769fadfea134
Author: Kuntal Ghosh
Date: 2018-11-09 13:27:35 +0530

Small bugfix in UndoDiscardOneLog

We can't compare log->meta.insert directly with the undo_ptr. We need to create an undo pointer first.
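An illustrative sketch of why the "Small bugfix in UndoDiscardOneLog" commit above has to build a full undo record pointer before comparing; the bit layout here is an assumption for demonstration only, not the patch's encoding.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical split: high bits hold the undo log number, low bits the offset. */
    #define UNDO_LOG_OFFSET_BITS 48

    static uint64_t
    make_undo_rec_ptr(uint64_t logno, uint64_t offset)
    {
        return (logno << UNDO_LOG_OFFSET_BITS) | offset;
    }

    int
    main(void)
    {
        uint64_t discard_ptr = make_undo_rec_ptr(3, 0x2000);
        uint64_t insert_offset = 0x1000;    /* a raw insert location within log 3 */

        /*
         * Comparing discard_ptr with the bare offset would be meaningless; the
         * offset must first be turned into a full undo record pointer.
         */
        uint64_t insert_ptr = make_undo_rec_ptr(3, insert_offset);

        printf("insert %s the discard point\n",
               insert_ptr < discard_ptr ? "precedes" : "is at or after");
        return 0;
    }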
commit 0b374e91021b49795d2ec05243e3c9c24363d7d5
Author: Mithun CY
Date: 2018-11-08 22:09:05 -0800

Bugfix in TPDPageGetTransactionSlots to handle truncated pages.

Block numbers start from zero, so when comparing with the total number of blocks in the relation, we need to adjust the block number accordingly.

Fix by Mithun C Y, review by Amit Kapila.

commit 38e5a0cee902dae0c18961d91699ba254ea0fcb3
Author: Kuntal Ghosh
Date: 2018-11-08 15:16:02 +0530

In zheap_update, clear the in-place update flag for non-inplace updates

Reported by Gunjan Kumar.

commit 05c36ee9f886c9feb274360d9b523ea0d9e4dac7
Author: Kuntal Ghosh
Date: 2018-11-08 11:51:20 +0530

Fix a testcase in expected file for trigger.sql regression test

commit f52a281728a425df027d4a201c09730c5a8f0c42
Author: Kuntal Ghosh
Date: 2018-11-08 11:14:55 +0530

While locking the tuple, set cid as FirstCommandId

While locking the tuple, we set the command id as FirstCommandId since locking doesn't modify the tuple, it just updates the infomask.

commit 2595ad4d099abf21d59017f8cebfa416fe454d2d
Author: Kuntal Ghosh
Date: 2018-11-07 15:07:49 +0530

Fix a bug for projecting table oid

There are certain places in the code where we form a heap tuple using ExecCopyTuple and store it in the slot along with the zheap tuple. In that case, we don't copy the tableOid to the heap tuple. Hence, for fetching the tableOid, we should prefer a zheap tuple instead of a heap tuple.

commit a3d175ba1b4a089b74a9f68be4b035695d57d698
Author: Mithun CY
Date: 2018-11-07 21:11:32 -0800

Post push compilation error fix for 903ce21849

By Mithun C Y

commit fdd2e03c972269b4a0fe95b13433ce55ac1f8bd7
Author: Dilip Kumar
Date: 2018-11-06 02:11:35 -0800

Bugfix in recovery of invalid_xact and freeze

TPD page slot information was not set if BLK_NEED_REDO is false for the heap page. Also, during recovery the relation descriptor was used to take the pruning decision, which is wrong because during recovery it should be done by a separate WAL record.

Dilip Kumar and Amit Kapila

commit 0fc2cb13d810d7979b51c2b6ea44fa4f9119f7a4
Author: Kuntal Ghosh
Date: 2018-11-05 16:31:38 +0530

Fix shared memory size for rollback hash table

While creating the shared memory segment for the rollback hash table, we should set the size of the segment correctly.

Patch by me. Investigated by Dilip Kumar and me.

commit 62c4099af465e7f739aa0f391724eb4efb02ba08
Author: Kuntal Ghosh
Date: 2018-11-05 13:52:25 +0530

Fix warnings in tpd.c

commit 9e203e16ffd03867bda789c33f0bdeece04df7a0
Author: Kuntal Ghosh
Date: 2018-11-04 21:57:23 +0530

Assign OID correctly while converting tuple

commit b8bcb9891a1fc24e1c02933ee84eba1a017af692
Author: Rafia Sabih
Date: 2018-11-05 13:46:47 +0530

Handling rollbacks in single user mode for zheap

Never push rollback requests to a worker when in single-user mode.

Patch by me, reviewed by Amit Kapila

commit a95fc28e9073d6e7815f3772e1255efe2967c847
Author: Amit Kapila
Date: 2018-11-05 11:03:54 +0530

Add empty TPD pages to FSM.

The empty TPD pages are added to the FSM by vacuum, which prunes such pages as well. The empty pages from the FSM can be used either by zheap or TPD when required. We need to be aware that when we access TPD pages, such a page can be pruned or truncated away by vacuum. After pruning the TPD page, it can be freed by removing it from the chain of TPD pages. As of now, only vacuum can add TPD pages into the FSM, but we might want to improve this someday so that backends can also do the same.

Patch by me with help from Dilip Kumar, who also tested and verified the patch.
commit 49a60ee0c3bbb844621bc8167f8e04ca0c032509
Author: Rafia Sabih
Date: 2018-11-05 10:14:13 +0530

Adding new output files for regress-suite

Some system attributes are not supported for zheap tuples; hence, new .out files are added for zheap for combocid and transactions.

Patch by me, reviewed by Kuntal Ghosh

commit 3d8d5f45b4b928fbcc9d1ff29c3f8fc822009a4f
Author: Amit Kapila
Date: 2018-11-05 08:09:29 +0530

Fix valgrind error.

Reported by Tomas Vondra, patch by me, verified by Mithun

commit 8e3a8679dfb0248699155eec8f07098df7d7f3ad
Author: Rafia Sabih
Date: 2018-11-02 16:44:26 +0530

Change case in one error message

Pointed out by Kuntal Ghosh, patch by me.

commit 1bcb9489f3077ef03616ff19ff46b0e3a2980aa6
Author: Amit Kapila
Date: 2018-11-02 16:36:32 +0530

Fix the errors reported by valgrind.

In passing, I noticed the xidepoch was not assigned properly, so fixed that as well.

Reported by Tomas Vondra, patch by me, verified by Mithun C Y.

commit f69fa9edc361c374cb7ce8780b5e63b6412c9d08
Author: Rafia Sabih
Date: 2018-11-02 16:26:04 +0530

Bugfix in heap_truncate for zheap relations

Initialize the meta page for zheap relations only when the complete relation is truncated.

Patch by Amit Kapila, tested by me

commit 704e3b777c385e319ded0b0fa317409fc41dc7a1
Author: Kuntal Ghosh
Date: 2018-11-02 15:40:19 +0530

Add expected file for rowsecurity regression test

In the rowsecurity test, there is a test case that uses a table sample scan with bernoulli distribution. The output of a tablesample scan partially depends on the block number of the relation page from which the tuples are fetched. For zheap, block number 0 is the meta page; hence, tuples are stored from block 1 and the output of the table sample scan can be different.

commit 153ea5c17e65144382c8c68a22d46527f0f7de2d
Author: Mithun CY
Date: 2018-11-01 08:09:35 -0700

Release undo buffer locks after WAL replay of XLOG_UNDO_APPLY_PROGRESS

By Mithun C Y and Mahendra Thalor.

commit 5fed5fe9eaa00c1196dc47d18b91c9e2bb336aab
Author: Kuntal Ghosh
Date: 2018-11-01 15:33:47 +0530

Add expected file for stats.sql regression test

In zheap, we increase pgstat_info->trans->tuples_updated only for non-inplace updates; otherwise, vacuum would be triggered for in-place updates as well. But for heap, we always increase tuples_updated since heap always creates a new version of the tuple during updates. Hence, we need to fix the output in the expected file for zheap.

commit 987049f5b11351ee6b985ddc76e978e2d77fa403
Author: Kuntal Ghosh
Date: 2018-11-01 14:52:39 +0530

Add expected file for strings.sql regression test

There is a test case in strings.sql that counts the number of pages for a toast table. In zheap, toast tables are also created in zheap format, which always includes a zheap meta page. We should count that in the expected file.

commit 3ef3b9b67ae27d9568ce9380c9f646bd894b8bd0
Author: Kuntal Ghosh
Date: 2018-11-01 12:00:50 +0530

Add storage_engine in reloptions regression test

commit 94777e98221cf49e42f9a7bddf5947072424e368
Author: Mithun CY
Date: 2018-10-31 22:23:28 -0700

Add alternative expected files for zheap regression.

The order of zheap tuples in pages will be different from heap. Hence, zheap-specific expected files are needed.

By Mithun C Y

commit 7570b3bed887531f31dcc25ea602233c8d56ee6d
Author: Dilip Kumar
Date: 2018-10-31 22:06:56 -0700

Fix warning of hash seq scan leak and also remove an unwanted assert from the zheap mask.
Dilip Kumar, reviewed by Amit Kapila and Kuntal Ghosh

commit 14fdcc4b9eda81577d81ba3c7b040f86850e78eb
Author: Rafia Sabih
Date: 2018-11-01 09:54:26 +0530

System attributes for Zheap

For zheap tuples, Xmin is given as the xid which last modified the tuple. The other system attributes Xmax, Cmin, and Cmax are not supported for zheap, for now.

Patch by me, reviewed by Amit Kapila

commit 794b8f4df739d3565663f51ca7117dbe855980c2
Author: Amit Kapila
Date: 2018-11-01 09:11:31 +0530

Update README.md.

commit 6ce786252d5ddaac6fdb07506f6cde3d66e093cd
Author: Amit Kapila
Date: 2018-10-31 19:11:40 +0530

Updated readme to match latest status.

commit 8c780e8474aed4c38ae482d94683cba1be0e9a29
Author: Dilip Kumar
Date: 2018-10-31 06:34:45 -0700

Fix assert in RollbackFromHT

Ideally, the number of entries should be <= ROLLBACK_HT_SIZE.

commit 3017c86cdaba643d61f32eab351b63acfa381720
Author: Dilip Kumar
Date: 2018-10-31 06:18:52 -0700

Merge additional test in zheap expected file

commit 8bf58ba522c27429d0da0260a35abc5d519feece
Author: Dilip Kumar
Date: 2018-10-31 05:58:00 -0700

Fix defect in undo worker connection

Take a database object lock before connecting to the database so that the database does not get dropped concurrently.

Dilip Kumar, reviewed by Amit Kapila

commit 007af507d94b992f0a37caa10015d134a0782b07
Author: Kuntal Ghosh
Date: 2018-10-31 14:35:02 +0530

Skip aborting rewound undo records

Before discarding undo records, the undo discard worker checks whether it has to issue a rollback request for the corresponding aborted transaction. It's possible that the transaction got aborted by some other backend at the same time and the undo records got rewound. Hence, the undo worker should recheck whether the undo records got rewound; in that case, there is no need to issue a rollback request.

Reviewed by Amit Kapila

commit 28bae62a6745f223e99e1af27328ed6218752bb4
Author: Amit Kapila
Date: 2018-10-31 16:01:02 +0530

Eliminate alignment padding wherever possible.

We omit all alignment padding for pass-by-value types. Even in the current heap, we never point directly to such values, so the alignment padding doesn't help much; it lets us fetch the value using a single instruction, but that is all. Pass-by-reference types will work as they do in the heap. Many pass-by-reference data types will be varlena data types (typlen = -1) with short varlena headers, so no alignment padding will be introduced in that case anyway, but if we have varlenas with 4-byte headers or fixed-length pass-by-reference types (e.g. interval, box), then we'll still end up with padding. We can't directly access unaligned values; instead, we need to use memcpy. We believe that the space savings will more than pay for the additional CPU costs.

We don't need alignment padding between the tuple header and the tuple data as we always make a copy of the tuple to support in-place updates. Likewise, we ideally don't need any alignment padding between tuples. However, there are places in the zheap code where we access the tuple header directly from the page (e.g. zheap_delete, zheap_update), for which we need them to be aligned at a two-byte boundary.

Amit Kapila and Kuntal Ghosh

commit 580dcc4fc69e57dac919d1ac1068f876505f6c17
Author: Rafia Sabih
Date: 2018-10-31 14:59:31 +0530

Fix for a compiler warning

Reported by Neha Sharma

commit 7f2397b0b87a9941a34998ffc63990eab32f09f1
Author: Mithun CY
Date: 2018-10-31 00:43:06 -0700

WAL-log page extension if it was not done previously.

Patch by Amit Kapila, review by Mithun C Y.
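A minimal standalone example of the memcpy-based access to unaligned values mentioned in the "Eliminate alignment padding wherever possible" commit above (plain C for illustration, not zheap code):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        char    buf[16];
        int32_t value = 42;
        int32_t fetched;

        /* store a pass-by-value attribute at an odd, unaligned offset */
        memcpy(buf + 1, &value, sizeof(value));

        /*
         * A direct dereference such as *(int32_t *) (buf + 1) is not portable
         * for unaligned data; memcpy into a local variable is the safe way to
         * read it, at the cost of a few extra instructions.
         */
        memcpy(&fetched, buf + 1, sizeof(fetched));

        printf("%d\n", fetched);
        return 0;
    }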
commit 9ffbe108474a898e76302bb834d040f288b78e67
Author: Kuntal Ghosh
Date: 2018-10-30 13:17:34 +0530

Fix bug in registering TPD buffers for WAL

While creating a WAL record, if a concurrent checkpoint occurs, WAL insertion may fail. In that case, we need to prepare the WAL record again. At the same time, we should clear the registered TPD buffer array; else, it'll not be registered in WAL on the next try.

Patch by me. Investigated by Amit Kapila, Dilip Kumar and me. Reviewed by Amit Kapila and Dilip Kumar.

commit b30e67a310db0bd75a44094c8acd9e87c1f5c51e
Author: Kuntal Ghosh
Date: 2018-10-31 11:51:01 +0530

Fix eval-plan-qual isolation test

After rebasing the zheap branch with the latest PG HEAD, there are some additional testcases in the eval-plan-qual isolation test. Add those changes in the regression out file targeted for zheap.

commit 799ca622b624f0b13916ea8ce4ea1afad51d5f1c
Author: Amit Kapila
Date: 2018-10-31 11:50:49 +0530

Return error on trying to update a row moved to another partition.

The approach used here is to set the ZHEAP_MOVED (ZHEAP_DELETED | ZHEAP_UPDATED) flag when the tuple is moved to a different partition. All the cases that are handled in heap need to be handled for zheap as well.

Amit Kapila, based on an earlier patch by Amit Khandekar which used a different approach.

commit d9918b8ec386619a3cd2ac28d8f2ea8698aeec99
Author: Dilip Kumar
Date: 2018-10-31 11:22:29 +0530

Defect fix post rebase

commit 10492266c5789bbec1fdbfd13220dcacbea633c6
Author: Kuntal Ghosh
Date: 2018-10-30 10:16:43 +0530

TPD fix by Dilip RM43761

commit 7524cd8cf9f7a20bfb44aefe32f0085bec1ce385
Author: Dilip Kumar
Date: 2018-10-30 18:36:32 +0530

During recovery restore dbid from WAL

Patch by Mahendra Thalor
Reviewed by Dilip Kumar and Kuntal Ghosh

commit e73d2bf79db041177d35c9f3333e126318460f8c
Author: Dilip Kumar
Date: 2018-10-30 17:57:00 +0530

Bug fix after rebase

Dilip Kumar and Kuntal Ghosh

commit 5340d395c5d63e56f0398fee816393a613e385fb
Author: Mithun CY
Date: 2018-10-25 05:32:50 -0700

Avoid page compaction if the tuple is marked updated, deleted or inplace updated by an open transaction.

By Mithun C Y

commit ed9356efb9f5ad9622beef61ce75cfb9b8c5ea39
Author: Kuntal Ghosh
Date: 2018-10-24 10:35:11 +0530

Fix an isolation test case for zheap

In zheap, when a transaction rolls back, it needs to perform undo actions on the modified relation. For that, it needs a lock on that relation. If any concurrent transaction holds an exclusive lock on the relation, the rollback operation will wait until the lock is available.

commit 5711aac11e6443d16e6b0aeb2219c9ce0ac6f3b7
Author: Amit Kapila
Date: 2018-10-23 21:24:07 +0530

Fix compiler warning.

Reported by Thomas Munro.

commit 407f6350335597bb2029109df3c0a4d090410a4f
Author: Amit Kapila
Date: 2018-10-23 14:12:58 +0530

Move functions that can allocate memory or allow locking outside critical section.

At a few places in the code, TPDPageGetTransactionSlotInfo was being called from a critical section, which leads to an assertion failure. Lock the TPD page wherever required before entering the critical section.

commit 390416aa17bc215e541c3cf83acd5e336bb6e350
Author: Kuntal Ghosh
Date: 2018-10-23 12:13:23 +0530

Bugfix in tpdxlog.c

commit a68d7ef43e07849bd75bd5e39634e3ab29efbe0b
Author: Kuntal Ghosh
Date: 2018-10-23 12:11:07 +0530

Fix a variable initialization

commit 09a0518389ccce88b381b3f44b4d88769feb7634
Author: Mithun CY
Date: 2018-10-22 18:10:00 -0700

Fix ordering issues in regression tests.

With zheap we have inplace updates, so the order of tuples in a zheap page will be different from heap.
Adding an order by clause to the tests to get consistent results for both heap and zheap.

By Mithun C Y

commit b5f1649f55aa366f239fe37dc2e06c9b2cecdf17
Author: Rafia Sabih
Date: 2018-10-18 16:37:00 +0530

Bug fix in zheap_lock_tuple

Added a code path in zheap_lock_tuple to check for the latest copy of the tuple in case it is modified by some aborted transaction.

Issue reported by Neha Sharma, patch by me, and reviewed by Amit Kapila

commit f9a00ee12997bc0ac823ac4419b8aa3f7024aee6
Author: Kuntal Ghosh
Date: 2018-10-18 11:06:04 +0530

Add expected file for eval-plan-qual isolation test

Include metapages in ctid for zheap tuples. Also, updated a test case related to self joins. Basically, when performing a self join, if it needs to pass through the EvalPlanQualFetch path, it's possible that both sides of the join see the same value due to in-place update. This behaviour is different from heap, but similar to other undo-based storage.

commit 804d6edc03cc8aa9b5bb0428bccda27a48c1c8fb
Author: Kuntal Ghosh
Date: 2018-10-18 10:09:54 +0530

Include meta page in isolation test results

In the vacuum-reltuples isolation test, include the zheap metapage in relpages.

commit c0f90396dd27d28c27d0b59e682680289fcba669
Author: Amit Kapila
Date: 2018-10-17 18:28:05 +0530

Bump the number of initial TPD entries.

By mistake, commit 1400d8f84b changed the number; it was changed just for testing purposes, but it should have been removed before commit.

commit 14d67f52665110d64666e1addc255df15015dc6b
Author: Amit Kapila
Date: 2018-10-17 16:49:10 +0530

Support inplace update of TPD entries.

Whenever a TPD entry needs to be extended, we call TPDPagePrune to ensure that it will create space adjacent to the current offset for the new (bigger) TPD entry, if possible. We use compactify_ztuples to ensure that space adjacent to the existing entry can be created. For an inplace update of a TPD entry, we just replace the old entry with the new entry at the same location, similar to inplace updates of tuples. We also write a WAL record for this operation.

Amit Kapila and Kuntal Ghosh, reviewed and verified by Dilip Kumar

commit 83c27f9fae800fb73eff68f1987136adaf7965ab
Author: Rafia Sabih
Date: 2018-10-17 06:26:24 +0530

Bugfixes in zheap update, delete, and lock

- Modify the infomask only after writing the corresponding undo.
- Always be inside a critical section when modifying a tuple.
- If the current member is the only locker with multilockers, then there is no need to lock the tuple again.

Issue reported by Neha Sharma. Patch by me, reviewed by Amit Kapila and Kuntal Ghosh

commit 69b6b791ffaa0d04cc10bc0603c8bbc39f58dfa8
Author: Amit Khandekar
Date: 2018-10-15 14:31:43 +0530

Improve test coverage of update.sql.

Commit 521461cfd91b9b44f0ecc392b3470922e78246b0 had removed some redundant RETURNING clauses in some statements, in order to make the output consistent. This commit reverts those changes and instead uses a WITH clause over the update statements, so that we can use ORDER BY for consistent ordering. This helps retain the RETURNING clauses, which also makes sure we don't reduce the test coverage that we may have got for the update-partition-key feature.

Patch by me; idea suggested and changes reviewed by Mithun Cy.
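A tiny illustration of the combined flag described in the "Return error on trying to update a row moved to another partition" commit above; the bit values are hypothetical, and only the ZHEAP_MOVED = ZHEAP_DELETED | ZHEAP_UPDATED combination mirrors the commit message.

    #include <stdio.h>

    /* Bit values are made up for illustration; only the combination matters here. */
    #define ZHEAP_DELETED   0x0010
    #define ZHEAP_UPDATED   0x0020
    #define ZHEAP_MOVED     (ZHEAP_DELETED | ZHEAP_UPDATED)

    int
    main(void)
    {
        unsigned int infomask = ZHEAP_DELETED | ZHEAP_UPDATED;

        /* both bits set together signal "row moved to another partition" */
        if ((infomask & ZHEAP_MOVED) == ZHEAP_MOVED)
            printf("tuple was moved to another partition\n");
        return 0;
    }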
commit 0a261c3bae670e9b3748a9eeb82706a96a2b4f32
Author: Kuntal Ghosh
Date: 2018-10-15 11:27:51 +0530

Fix payload length in standby

commit dac8323bfc43f7c0dec9c02dd53ca761b3c098b8
Author: Kuntal Ghosh
Date: 2018-10-15 10:15:32 +0530

After fetching xid, recheck for frozen xid

commit abdd14af03e32e9abfe7b7ed0c6bb88a0a6184af
Author: Kuntal Ghosh
Date: 2018-10-15 09:53:14 +0530

Clear TPD entry only under exclusive lock

When we clear a TPD entry from a TPD page, we should have an exclusive lock on the TPD buffer. Hence, perform a check for the same before clearing the entry.

commit 52b84f9fb21468080c8ea252336a1994f915ca8b
Author: Dilip Kumar
Date: 2018-10-15 10:55:07 +0530

Bug fix in recovery

Multiple bug fixes in recovery of the TPD and undo action WAL.

Dilip Kumar, reviewed by Amit Kapila

commit 6ebf7d2e1f12fcd92bb5c7ebe11e83eda3824c32
Author: Dilip Kumar
Date: 2018-10-15 10:51:17 +0530

Bug fix in tpd

If the TPD entry is pruned, then we can directly use the last slot for the transaction; there is no need to try to extend the TPD entry.

Amit Kapila, reviewed by Dilip Kumar

commit aa09e789b0b40f9593f97135c6c9af4c74464023
Author: Kuntal Ghosh
Date: 2018-10-10 13:47:25 +0530

Skip calling zheap_fetchinsertxid for non-serializable xacts

For non-serializable xacts, we can skip calling zheap_fetchinsertxid to fetch the targetxmin that is passed to PredicateLockTid. A call to zheap_fetchinsertxid is costly since it has to fetch undo records.

commit 2b2963b37cb1316de21ed65e8c859b76467616ca
Author: Kuntal Ghosh
Date: 2018-10-10 11:07:35 +0530

Fix compiler warnings

commit 0a82317f0378b35122df6bef1d473826f149c121
Author: Dilip Kumar
Date: 2018-10-08 11:58:38 +0530

Bugfix in undoworker commit.

The shared memory size was not calculated for the undo launcher; also remove one unwanted hunk.

commit c77321a0364149c3831844f02082579a070d47f0
Author: Mithun CY
Date: 2018-10-07 14:47:06 -0700

Fix regression tests and output for row order changes.

With the introduction of inplace updates in zheap, the order of rows within a page has changed. This fix changes the queries whose output depends on stored row order. Those queries are rewritten to have an order by clause to get consistent results for both heap and zheap.

By Mithun C Y

commit e5a37a151fc52b0bc10b2dd6af13af3b71136360
Author: Amit Khandekar
Date: 2018-10-05 22:20:44 +0530

Fix a concurrency issue with UPDATEs and triggers.

In GetZTupleForTrigger(), after it calls zheap_lock_tuple(), a check for hufd.in_place_updated_or_locked was missing. So if there was a concurrent in-place update, it used to conclude that the tuple was deleted. Add the missing check.

Patch by me, reviewed by Dilip Kumar.

commit f5c0d3dd3cbab42628d311fbb50614f1bce93ec1
Author: Amit Khandekar
Date: 2018-10-05 22:01:36 +0530

Regression test changes for update.sql

update.sql had to undergo some modifications because the RETURNING clause was returning rows in a different order. Also, some UPDATE statements that updated multiple rows had to be modified such that they update only a single row, because the erroring row was different in the case of zheap. There were two erroring rows, and whichever came first in the update scan result got displayed in the error message. This test also has some more ORDER BY clauses.

commit bbd7a64c3495b77bf77406511afa4f1183a7fcc4
Author: Amit Khandekar
Date: 2018-10-05 21:37:18 +0530

Fix tuple handling with trigger functions and transition capture.

In ExecUpdate(), for zheap tables, ExecARUpdateTriggers() was called only if resultRelInfo->ri_TrigDesc is true, which is incorrect.
This function also needs to be called for transition capture even when trigdesc is false. Due to this, transition table rows were not getting generated. Furthermore, in ExecInsert(), for zheap tables, ExecARUpdateTriggers() was getting called using tuple when it should be called using ztuple. Fixed both of these by making the newtuple parameter of ExecARUpdateTriggers() a void * type and renaming it to newtuple_abstract, so that we can pass tuple or ztuple according to the table storage. And then in ExecARUpdateTriggers(), do the conversion from zheap to heap if it's a zheap tuple. On similar lines, do changes for ExecARInsertTriggers(), where trigtuple is made trigtuple_abstract. Now that this function accepts an abstract tuple, fix another pending issue in copy.c: in CopyFrom() and CopyFromInsertBatch(), pass either zheap or heap tuples to ExecARInsertTriggers().

Reviewed by Dilip Kumar and Amit Kapila.

commit 51da00138eeceda6ea57e59760e49b9d60518274
Author: Rafia Sabih
Date: 2018-10-05 17:04:22 +0530

This commit completes the work of adding the new lock type for sub-transactions in zheap that was started in commit 13b72c94cc77f6471e46f6e9da29423994753240. This commit handles the cases of waiting on sub-transactions when using a dirty snapshot, when inserting index tuples in btree, and while checking for constraint violations.

Patch by me, reviewed by Amit Kapila

commit 0cb8b6c800d4437c0e413c8bd2473181d843f1e3
Author: Mithun CY
Date: 2018-10-05 02:22:45 -0700

Fix condition for page pruning.

In zheap_page_prune_guts, the condition for pruning was wrongly set. This patch fixes the same. Along with it, the patch also fixes code that caused a NULL pointer dereference.

By Amit Kapila and Mithun C Y

commit 1662919e20b45e515849304f68e310b28939aae6
Author: Amit Kapila
Date: 2018-10-04 19:36:10 +0530

Support Cluster/Vacuum Full in zheap.

As of now we only rewrite LIVE tuples and we freeze them before storing them in the new heap. This is not a good idea as we lose all the visibility information of tuples, but OTOH, the same can't be copied from the original tuple as that is maintained in undo and we don't have a facility to modify undo records. We have some ideas on how to do that, and those are documented in rewritezheap.c.

Patch by Amit Kapila with help from Mithun C Y, who also reviewed the patch.

commit 62606342d267ad671d7084798ac9febd1230919c
Author: Kuntal Ghosh
Date: 2018-10-04 11:22:11 +0530

Fix compiler warnings

commit bfe41c9db8b397d9c230b36f5c7f6f7826b7fa64
Author: Kuntal Ghosh
Date: 2018-10-04 10:49:57 +0530

Fix assert for insertion of frozen tuple

If we're freezing a tuple during insertion, we can use the HEAP_INSERT_SKIP_WAL optimization since we don't write undo for the same. Hence, adjust the assert in zheap_prepare_insert accordingly.

commit 8defd8aa1852f24ca81552efbf158d3abeda7662
Author: Kuntal Ghosh
Date: 2018-10-03 19:07:57 +0530

In zheap parallel scan, use SnapshotAny if required

commit 268ab31eebddb41e4f5c9bfb026b9e88f24d4416
Author: Kuntal Ghosh
Date: 2018-10-03 19:06:48 +0530

During in-place updates, update t_hoff correctly

commit 14dd056736a4cdf4949f798bf1386b80671ada85
Author: Kuntal Ghosh
Date: 2018-10-03 19:01:57 +0530

Initialize startblock for zheap scan

Since the zheap scan unconditionally uses scan->rs_startblock in zheap_getnext, we should initialize it by default. Otherwise, valgrind barks loudly. This may also result in undefined behaviour in release mode.
commit bf1b20a81832f1c007690cda95088aef857bd34d
Author: Kuntal Ghosh
Date: 2018-10-03 19:00:25 +0530

Fix scan initialization for zheap tables

Use zheap_beginscan_parallel in _bt_parallel_scan_and_sort for scanning zheap tables.

commit e090a7c5b3f82578d38860cf0e0244aee5e585e7
Author: Dilip Kumar
Date: 2018-09-28 15:03:14 +0530

undoworker for handling the rollback

This patch introduces two types of workers: the discard-worker and the undo-launcher. The discard worker's main responsibility is to discard the older undo, and the undo-launcher will process the rollback hash table and launch undo-workers, one for each dbid. The undoworker will take the database connection and start processing all the requests for that db. Once the undo-action is applied, it will mark the transaction header in the undo as processed and remove the entry from the rollback hash table.

Dilip Kumar
Reviewed by Amit Kapila

commit 93ed5d44e3f9c62b78fa651d7a9082cfc57791f6
Author: Kuntal Ghosh
Date: 2018-09-25 13:28:47 +0530

Fix some TPD issues

1. Introduced Asserts for ZHEAP_METAPAGE before calling ZheapInitPage. The metapage should not be initialized using this method.
2. Introduced Asserts to identify the zheap metapage after reading the same.
3. Fixed tpd_desc.
4. In TPDPageAddEntry, we shouldn't shuffle the item pointers.
5. In TPDAllocatePageAndAddEntry, when a new page is not added, we shouldn't overwrite the previous and next block numbers in the tpd opaque space.
6. Set LSN in the page header for the meta page.
7. For PageSetUNDO while locking a tuple, send set_tpd_map_slot=false.
8. Properly initialize the first and last tpd page in the meta page.
9. In zheap_xlog_update, send the proper undo pointer in TPDSetUNDO.
10. Fixed the storage of the old tpd slot in xlrec.
11. Fixed a compactify_ztuple memory overrun issue.

Patch by Dilip Kumar and me. Reviewed by Amit Kapila.

commit 8683fff396da98ad00ae3b853714b0788bff99b6
Author: Kuntal Ghosh
Date: 2018-09-27 16:54:01 +0530

Fix compiler warnings in xact.c

commit 3f86bb91ece478ab9ab562cc1972bb85d9f2a965
Author: Rafia Sabih
Date: 2018-09-26 15:32:12 +0530

Introducing sub-transactions lock type in zheap

With this patch, the sub-transactions will have a new type of lock for them. Now, instead of waiting for the top-level transaction lock, the tuples are free once the sub-transaction is committed. This way, we neither waste our transaction-slots by assigning transaction ids to sub-transactions, nor do we suffer in performance by waiting on top-level transactions.

Patch by Amit Kapila and me.

commit f0a2c8247aad9fe1569709599c5b267f473a6024
Author: Dilip Kumar
Date: 2018-09-25 17:22:41 +0530

Bugfix in inserting the undo record when the previous transaction and the current transaction are in different undo logs.

Earlier it was just comparing the block number to find whether we had already read the buffer or not, but that's not enough when we are comparing block numbers that are in two different undo logs.

Dilip Kumar
Reported by Thomas and reviewed by Amit Kapila

commit 4060bd90a3a47dfc9e5b396b3071c9c4d6bd0dcd
Author: Dilip Kumar
Date: 2018-09-25 17:17:48 +0530

Add comment for specific handling of getting CID for an inserted tuple with lockers.
Dilip Kumar, reviewed by Amit Kapila

commit e5eee5471d739b1abf084feed6635c5a526e43b4
Author: Dilip Kumar
Date: 2018-09-25 14:15:21 +0530

Fix compilation error

commit 0fe57bf56a54fecf31d7b810c47853ba8e00925b
Author: Kuntal Ghosh
Date: 2018-09-25 13:26:05 +0530

Fix zheap_mask declaration

commit 61226ffd40ad0841d6c1cd4bbf792ec9e85349ba
Author: Kuntal Ghosh
Date: 2018-09-20 11:44:55 +0530

Don't call GetTransactionSlotInfo for all-visible tuples

For an all-visible tuple, the corresponding slot can be re-used or pruned. Hence, we shouldn't call GetTransactionSlotInfo to retrieve the transaction information for that slot.

commit 20c3244baba9a72530fc4747da962190a15623f2
Author: Kuntal Ghosh
Date: 2018-09-25 12:04:00 +0530

Check WAL consistency for TPD and zheap meta pages

commit 5f8604946763e6b537f8f68f382b5da62a7d5a86
Author: Mithun CY
Date: 2018-09-24 06:24:18 -0700

Test fix, Allow inplace updates if the row can fit on the same page.

By Mithun C Y, review by Amit Kapila.

commit 0b83a48a01eb57ffd8a247c8adbe3c85de1d8d98
Author: Amit Kapila
Date: 2018-09-24 18:46:55 +0530

Allow inplace updates if the row can fit on the same page.

Currently, we perform inplace updates only when the new row is smaller than the old row, or if the old row is the last row on the page and it has space after it. However, there are more cases where we can perform inplace updates: (a) if there's no free space immediately following the tuple, but there is space in the page to accommodate the entire tuple; (b) if there's no free space immediately following the tuple, but there is space in the page to accommodate the delta (new_tuple_size - old_tuple_size).

We allow the pruning function to rearrange the page such that it can make space adjacent to the tuple being updated. This is only possible if the page has at least enough space for (newtupsize - oldtupsize). Otherwise, we still try to prune the dead/deleted tuples to see if the new tuple can be accommodated on the same page, which will allow inplace updates.

To perform pruning, we make a copy of the page. We don't scribble on that copy; rather, it is only used during repair fragmentation to copy the tuples. So, we need to ensure that after making the copy we don't operate on the tuples; otherwise, the temporary copy will become useless. It is okay to scribble on the itemids or the special space of the page. While rearranging the page, tuples will be placed in itemid order, which will help speed up future sequential scans. Note that we use the temporary copy of the page to copy the tuples, as writing in itemid order will overwrite some tuples. We have also changed the patch such that REDO will perform repair fragmentation only if it has been done during the DO operation.

Amit Kapila, Mithun C Y.

commit 85262536130fbf40afb96b4d7a64692d31171cc5
Author: edb
Date: 2018-09-19 09:57:29 +0530

Fix compiler warning in tpd.c

Reported by Mithun

commit de73a4d3c89af83d08f3e2822f0ad07090d89f77
Author: Mithun CY
Date: 2018-09-18 19:46:36 -0700

Bugfix in vacuuming zheap table.

The function count_nondeletable_pages was using the vac_strategy variable defined for heap even when it is called for a zheap relation. This patch fixes it to use zheap's vac_strategy variable.

Reported by Neha Sharma, fix by Mithun C Y, review by Amit Kapila.

commit 92fdaf6736bd975fcf6fba3486e2578781c16404
Author: Mithun CY
Date: 2018-09-18 19:36:25 -0700

Remove no longer needed fixme comments on visibility map.

Reported by Amit Kapila.
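A simplified standalone model (not the actual zheap logic; the struct and function names are made up) of the space accounting described in the "Allow inplace updates if the row can fit on the same page" commit above:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct
    {
        int space_after_tuple;  /* free bytes immediately following the old tuple */
        int page_free_space;    /* total free bytes on the page after pruning */
    } PageState;

    static bool
    can_update_inplace(const PageState *p, int oldlen, int newlen)
    {
        if (newlen <= oldlen)
            return true;                /* shrinking or same size always fits */
        if (newlen - oldlen <= p->space_after_tuple)
            return true;                /* grows into the space right after it */
        /* otherwise pruning may rearrange the page if the delta fits somewhere */
        return newlen - oldlen <= p->page_free_space;
    }

    int
    main(void)
    {
        PageState p = {0, 64};

        printf("%d\n", can_update_inplace(&p, 100, 120));  /* 1: delta fits on the page */
        printf("%d\n", can_update_inplace(&p, 100, 200));  /* 0: needs a non-inplace update */
        return 0;
    }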
commit 2509bd2e93e2cc16f4555cd2112895badc304471
Author: Mithun CY
Date: 2018-09-18 19:35:15 -0700

Bug fix in execute_undo_actions_page.

Move ZheapInitPage after XLogInsert, as TPD-related information in the page is needed by some of the functions called before it.

By Mithun C Y and Amit Kapila.

commit c5817d6ade8c1ba423bb0d04f983a65e718138d0
Author: Amit Kapila
Date: 2018-09-18 08:10:04 +0530

Support variable sized TPD entries.

We extend a TPD entry when (a) while reserving a slot, we find that there aren't enough slots in the TPD entry or the offset-map doesn't have enough space, or (b) while getting the existing TPD entry, we find that the offset-map doesn't have enough space. If we find space in the same TPD page, then we perform an inplace update of the TPD entry; otherwise, a non-inplace update is performed. In a non-inplace update, we mark the old entry as deleted, and later during pruning, if we encounter any deleted entry, we directly prune it. Currently, the implementation of in-place updates is not complete, so we always perform a non-in-place update.

Patch by Amit Kapila, Dilip Kumar and Kuntal Ghosh.

commit f17c96014308f8f01385a82e41be8601073d147e
Author: Kuntal Ghosh
Date: 2018-09-14 16:43:14 +0530

In replay of freeze_xact, read TPD buffer before using it

commit 40ce6e610d138de1ebb2944ce619db9f53df61f1
Author: Kuntal Ghosh
Date: 2018-09-14 12:39:18 +0530

Fix a flag in xl_zheap_lock

In the xl_zheap_lock flags, we've skipped the second bit for no good reason.

Reported by Dilip Kumar.

commit 58000e7691eda261ddc99566f24b03692bf2b19a
Author: Kuntal Ghosh
Date: 2018-09-13 17:51:11 +0530

Implement serializable isolation APIs for zheap

For PredicateLockTid, ConflictIn and ConflictOut, we pass the tid to identify the tuple. For conflict out, we've introduced a function ZHeapTupleHasSerializableConflictOut that performs all the zheap-related work to figure out whether the reader conflicts out with any other writers. In ZHeapTupleHasSerializableConflictOut, we refetch the tuple and check the recent status of the tuple. Using that, we decide whether we conflict out. We have special handling for a tuple which is in-place updated or whose latest modifying transaction got aborted. In that case, we check whether the latest committed transaction that modified that tuple is a concurrent transaction. Based on that, we take a decision whether we have any serialization conflict.

Patch by me with help from Amit Kapila. Reviewed by Amit Kapila and Thomas Munro.

commit b401a9c4da9e9038a0a72e4c31f31ba87bd43dbb
Author: Kuntal Ghosh
Date: 2018-08-07 19:49:32 +0530

Make serializable code independent of storage

This code aims to make PredicateLockTuple, CheckForSerializableConflictIn and CheckForSerializableConflictOut independent of the storage tuple. The PredicateLockTuple and CheckForSerializableConflictIn methods can work with the tid only. However, CheckForSerializableConflictOut requires the storage tuple to check the latest visibility status of the tuple. Hence, I've separated *SatisfiesVacuum and its usage towards conflict resolution into a separate storage-specific function. I've also renamed PredicateLockTuple to PredicateLockTid.

Patch by me. Reviewed by Amit Kapila and Thomas Munro.

commit 44d776147cdafc55ea3fe95569bb039352e580ac
Author: edb
Date: 2018-09-10 21:27:38 +0530

Change behaviour of zheap_fetchinsertxid

Earlier, this function was returning the xid which inserted the tuple. But the tuple can be inserted by multi_insert and inplace-update operations as well. Hence, we should handle that.
commit e5b821e0cfc905ea32e11a9619cf797de2342104
Author: Rafia Sabih
Date: 2018-09-06 18:28:49 +0530

Modifying the size of rs_visztuples in HeapScanDescData

The size of rs_visztuples was kept at a fixed value, which was causing failures in the regression suite when running with an increased block size. Now, it is modified to the value of MaxZHeapTuplesPerPageAlign0.

Patch by me, reviewed by Amit Kapila.

commit 0136ec40cb54df56f52af626df680ae5981e8fd3
Author: Rafia Sabih
Date: 2018-09-06 17:07:53 +0530

Bugfix in toast table updation

Patch by me and reviewed by Amit Kapila

commit e12bb90c6751d60564535a7effc8e08994d61f3c
Author: dilip kumar
Date: 2018-09-03 11:32:30 +0530

Fix compilation warning in test_undorecord.c

Reported by Amit Kapila.

commit c11609baa42bd0332bc90a97a2d27814a66346b9
Author: Thomas Munro
Date: 2018-09-03 15:43:14 +1200

Reorder handling of ONCOMMIT_TEMP_DISCARD.

The previous coding accidentally caused ONCOMMIT_NOOP to enter the new ONCOMMIT_TEMP_DISCARD case.

Patch from Rafia.

commit 16a7bf70c67dae185d74e5cd988af8bb6d2bb4fd
Author: Thomas Munro
Date: 2018-09-02 05:02:57 +1200

Fix test_undo undo_append_file() procedure.

commit 6f94ac34370ffd305bf702fa514136eab98e43f7
Author: Thomas Munro
Date: 2018-04-23 22:46:34 +1200

Delete undo log files in dropped tablespaces in recovery.

When a tablespace is dropped, we clear out any remaining undo log files.

commit c7e00a820e5daaa3c7a268cdae195a859c5476af
Author: Thomas Munro
Date: 2018-04-23 18:43:50 +1200

Handle missing undo log segment files during WAL replay.

In recovery, segment files may not be present because they will be deleted by later WAL records. Following the example of regular relation files, we'll supply empty files. Previously, I created unexpectedly missing files during startup, which didn't work correctly if we crashed after dropping a tablespace (I couldn't create the files in the tablespace directory if it had already been deleted). This new approach does what other PostgreSQL code does, and creates a new tablespace directory as required (expecting it to be deleted by a later WAL record).

Thomas Munro

commit 3bbf152f4d330e582e950a45790d0e375d93bc5c
Author: dilip kumar
Date: 2018-08-26 14:11:59 +0530

Fix thinko in assert.

commit 1e9d17cc2406683b014f872ce873c845adafe3af
Author: dilip kumar
Date: 2018-08-26 14:10:21 +0530

Fix discrepancy between ZHeapTupleSatisfiesUpdate and HeapTupleSatisfiesUpdate

Also, for the non-inplace update it was trying to fetch the cid from the prev_undoptr, but in the case of a non-inplace update it will not find the previous version in the undo, so get the cid while copying the tuple.

Dilip Kumar, Amit Kapila
Reviewed and tested by Ashutosh Sharma

commit 4cc08c3ee0c31346af8c8a02a45a7bb759c3ed68
Author: Kuntal Ghosh
Date: 2018-08-10 22:24:22 +0530

Refetch tuple after reserving slots in the page

When we reserve a slot in the page, we sometimes freeze some tuples in the same page. Hence, we should re-fetch the tuple to update the slot information.

Patch by me. Reviewed by Amit Kapila.

commit 056a37b464058d6c386b6337ad65589f15c1a9b3
Author: Kuntal Ghosh
Date: 2018-08-10 16:49:11 +0530

An optimization for temp tables

For temp relations, we don't have to check all the slots since no other backend can access the same relation. If a slot is available, we return it from here. Else, we freeze the slot in PageFreezeTransSlots. Note that for temp tables, oldestXidWithEpochHavingUndo is not relevant as the undo for them can be discarded on commit.
Hence, comparing the xid with oldestXidWithEpochHavingUndo during visibility checks can lead to incorrect behavior. To avoid that, we mark the tuple as frozen for any previous transaction id. That way, we don't have to compare the previous xid of the tuple with oldestXidWithEpochHavingUndo.

Patch by me. Reviewed by Amit Kapila.

commit 10c225fbd9a3aa8239efaf8cf6d00e8b7e9d589d
Author: Kuntal Ghosh
Date: 2018-08-10 13:50:15 +0530

GUC to enable/disable undo launcher

We've introduced a new GUC variable called disable_undo_launcher to enable/disable the undo launcher process. If true, the postmaster won't register the undo launcher. By default, it's set to false. This is a postmaster option; changing the value requires a restart.

Patch by me. Reviewed by Amit Kapila.

commit e613fa25d714ddba99f2b3675894f3d0d141df4f
Author: dilip kumar
Date: 2018-08-23 16:40:35 +0530

Undorecord cleanup.

Created two files, undorecord.c and undoinsert.c, where undorecord.c mainly deals with how to insert and read the undo record, and undoinsert.c provides external interfaces to prepare, insert and fetch undo and deals with the buffer management required for undo records.

Dilip Kumar
Review by Amit Kapila

commit dccab1c216d8677bb6ad716e13adc08da44f79a9
Author: dilip kumar
Date: 2018-08-23 10:39:53 +0530

Bug fix in pg_control_checkpoint

Reported by Andres

commit 6d3f5716cb4257e8226212fbe002f05cc0d60145
Author: Kuntal Ghosh
Date: 2018-08-17 18:54:13 +0530

Fix issue in vmbuffer WAL replay

While logging visibility map changes for zheap, we don't register the zheap buffer. That's intentional, since we don't set any hint bits in the zheap buffer for tracking visibility. But we have to store the block number in the WAL so that we can track the block number for setting the corresponding visibility map bit while replaying the WAL for the same.

commit 9fc32f608a0900f8ed051cc010e72a40abe318ac
Author: Kuntal Ghosh
Date: 2018-08-20 14:39:05 +0530

Fix prevxid for insert/multiinsert WAL replay

We should set the prev xid as frozen during WAL replay of insert and multiinsert.

commit 9fcccee5d272c437a898171a4a97317ba1430ee0
Author: Kuntal Ghosh
Date: 2018-08-20 16:25:02 +0530

Fix slot index in undo_xlog_reset_xid

commit 0047fe43fe0fa5a10833d7f864d81b6a2e4ac4ed
Author: Kuntal Ghosh
Date: 2018-08-17 17:59:59 +0530

In zheap_update, fix WAL record for lock tuple

This is a leftover of commit 9b1f493a6335d07024. In zheap_update, we also lock the tuple. Hence, we need to fix the transaction slot related issue that was fixed as part of the above-mentioned commit.

commit e5f5c83c3120ac90c0a8501a584935a0ceb782f4
Author: Kuntal Ghosh
Date: 2018-08-16 16:13:10 +0530

Lock undo buffer while preparing an undorecord

Currently, we're locking the undo buffers in InsertPreparedUndo. This is called under a critical section, and if we encounter any error while taking the lock under the critical section, it'll lead to a server crash. We can easily avoid this situation by taking the lock in PrepareUndoInsert.

commit 25593d3c8351df93c9d5d57ec5b88ba94f10df7a
Author: Rafia Sabih
Date: 2018-08-21 11:37:09 +0530

Bugfix in temp table rollbacks

If rollback_oversize is set to zero, then rollbacks of temp tables were pushed to RollbackHT. This is corrected so that the backend itself performs the rollbacks.

Reported by Neha Sharma, patch reviewed by Amit Kapila

commit 420810a9cc500cdc539d34e8e4e49296337453e8
Author: Rafia Sabih
Date: 2018-08-20 12:00:03 +0530

Bugfixes in UserAbortTransactionBlock

- If the rollback request could not be pushed, then the backend executes the undo actions.
- Corrected arguments for PushRollbackRequest.

Reported by Dilip Kumar.

commit 771f43b086060f2b1bba25a857693886aa0ef4e9
Author: Mithun CY
Date: 2018-08-19 19:21:41 -0700

Bug fix: in lazy_cleanup_index and lazy_vacuum_index we shall use the vac_strategy passed to it by either the zheap or heap relation.

By Mithun C Y, reviewed by Amit Kapila.

commit bd0b43b93ed204749347d3c885ee258942cc8221
Author: Rafia Sabih
Date: 2018-08-17 15:55:22 +0530

Bugfix for speculative insert in toast table

Reported by Kuntal Ghosh, reviewed by Amit Kapila

commit cb36f7e7e848cfb11a5062684451788a41f048f9
Author: Rafia Sabih
Date: 2018-08-17 15:53:50 +0530

Bugfix in ZMultilockers

Found as part of a bug reported by Neha, reviewed by Amit Kapila

commit 22e1c1c33e1e5ed129de8fbfa539d943ae42c844
Author: Kuntal Ghosh
Date: 2018-08-16 13:46:54 +0530

For insert/multi-insert, set previous xid as frozen

When we modify a tuple, we set the previous xid for this tuple as frozen if its previous modifier xid is older than the discarded xid. For insert/multi-insert, we don't have any previous modifier. Hence, we can set it as frozen unconditionally.

commit 9668bfac227acb13c87dcdf8a54e440aab768e07
Author: Rafia Sabih
Date: 2018-08-16 09:32:43 +0530

Indentation fix.

Reported by Andres.

commit 45e405946ff54b5461fdeb7f2a4826c7a87a890a
Author: Kuntal Ghosh
Date: 2018-08-14 17:14:23 +0530

Use ZHeapTupleHasInvalidXact wrapper consistently

commit 043603d58f4dfc4bd8b1f7ddaa4f2c73e5713f24
Author: Kuntal Ghosh
Date: 2018-08-14 17:12:46 +0530

Update regression tests file in pageinspect

commit 4d0a2ebf9e8bc40f44430174d9990b26b473405c
Author: Kuntal Ghosh
Date: 2018-08-14 16:19:35 +0530

Bugfix in zheap_update for non-inplace updates

For non-inplace updates, we should always mark the tuple as locker-only if we're propagating the key-share lock to the new tuple.

Reported by Rafia Sabih. Reviewed by Amit Kapila. Patch by me.

commit 1b8c7eba990341809096a245937c48e2fdb875d2
Author: Kuntal Ghosh
Date: 2018-08-14 14:49:36 +0530

Fix bug in regression tests for pageinspect

The previous commit of pageinspect forgot to attach the zheap.sql and zheap.out files required for the regression tests.

commit 02d983e5a5b4a6e2b76b59023668a98457465060
Author: Amit Kapila
Date: 2018-08-14 13:43:31 +0530

Propagate lockers information.

We were not propagating the lockers information when the tuple has the multi-lockers bit set. Fix it.

Reported by Ashutosh Sharma, patch by Amit Kapila.

commit b16f0fec8868482773d5a7bc37ec9369dcfaa476
Author: Kuntal Ghosh
Date: 2018-08-14 10:52:08 +0530

In bitmap scan, don't keep the pin on the buffer

For a bitmap scan, we always use pagemode to scan the tuples. Hence, there is no need to keep the lock on the tuple.

commit 191d6f03c143097d63e17e057935a30f3a4db6bc
Author: Rafia Sabih
Date: 2018-08-13 15:25:56 +0530

Bug fixes

- Fixes for incorrect retrieval of the TPD buffer.
- Fixes for incorrect size calculation of the undo-header at recovery time.
- Fix for accessing an uninitialised variable in TPDPageGetTransactionSlots.
- Fixed one space issue in zheapamxlog.c.

Came across these while working on an issue reported by Neha Sharma.

Reviewed by Dilip Kumar

commit 5fcf8c59430fe4ae105d544d9f406c1e7e3e3959
Author: dilip kumar
Date: 2018-08-10 17:34:05 +0530

Bug fix in undo action

When the previous version of the tuple has a TPD slot, we need to pass a flag to the function so that it can set the TPD slot in the offset map. Mistakenly, that option was always passed as 0.
Dilip Kumar Reviewed by Kuntal Ghosh commit ef717922bb7f357a772e2f274766e1d1049beb37 Author: dilip kumar Date: 2018-08-10 15:27:29 +0530 Fix Typo Reported by Andres. commit f68d2c8862ee2f0981ccbc767c1435cca647b804 Author: Amit Khandekar Date: 2018-08-10 09:35:12 +0530 Allow in-place updates for some expression indexes. This is a zheap port of commit c203d6cf81b4d: "Allow HOT updates for some expression indexes." Since we can do HOT updates for such expressions, allow in-place updates for the same expressions. Add a new regression test zheap_func_index.sql derived from the existing test func_index.sql. The new test uses pg_stat_get_xact_tuples_inplace_updated(). Amit Khandekar, reviewed by Amit Kapila. commit c5e306aa3afc6b8ac2975be699772c76734d0ff0 Author: Amit Khandekar Date: 2018-08-10 09:24:10 +0530 Add a pg_stat function for inplace updates in a transaction. There was already a pg_stat_get_tuples_inplace_updated() function for getting in-place updates in a session. But the _xact_ version was missing. Amit Khandekar, reviewed by Amit Kapila. commit 24c5086eb332d3e0549238f25641887c9c9d03fb Author: Kuntal Ghosh Date: 2018-06-25 20:22:29 +0530 Add pageinspect functions to analyze zheap page and zheap tuples It adds two functions: zheap_page_items to inspect the tuples in the page. It also adds another function zheap_page_slots to inspect the transaction slots in the special space. If the page contains a TPD slot, then zheap_page_slots doesn't show it since it doesn't contain any transaction information. TPD slot information can be shown in the future once the structure of the TPD page is stabilized. Patch by me, reviewed by Ashutosh Sharma commit a8442adccf916a5de5c79f4c8a155cc6d4fe9919 Author: Kuntal Ghosh Date: 2018-08-08 17:28:47 +0530 Initialize some local tuples in zheap scan APIs It also fixes a bug in zheapgetpage. We should not copy the tuple for deleted item pointers. commit abb680be862a99fc23b3abfa28f159b545eba5e6 Author: Kuntal Ghosh Date: 2018-08-09 14:22:37 +0530 Fix some compiler warnings commit 11b199c7d779a2beb6ba9217303d4b73359665c2 Author: dilip kumar Date: 2018-08-09 11:39:09 +0530 Allow freezing and reusing of the TPD slots Currently, for zheap pages we allow freezing slots of all-visible xids and also allow reusing the slots of committed xids. This patch implements the same for the TPD slots. The mechanism of freezing and reusing the TPD slots is the same as for the zheap page slots. Dilip Kumar reviewed by Amit Kapila commit 67ca8248109f52c10f06347ac82ab20693547b7f Author: Amit Kapila Date: 2018-08-09 12:04:50 +0530 Remove unnecessary changes related to zheap from RelationGetBufferForTuple. Now that we have a corresponding separate function for zheap, we can remove the zheap related changes from the function RelationGetBufferForTuple. commit e18858cffcbf964d7cf3aab7a087673bd492652b Author: dilip kumar Date: 2018-08-09 11:22:40 +0530 Bug fix in execution of undo action Set the proper slot on the TPD offset map while replaying undo actions if the prior version of the tuple is pointing to the TPD slot. Prior to this, we were simply overwriting the older tuple version but the slot was not updated in the TPD offset map. Dilip Kumar review by Amit Kapila commit db2d9ae8a2c609e2535932b1bbcc9bb9181ca7b8 Author: Mithun CY Date: 2018-08-08 22:10:29 -0700 Post push fix for 2c9d6d9216ad28be2 which by mistake reused a flag value.
By Mithun C Y commit 2c9d6d9216ad28be25b5e4d60c07f5fb2db6096b Author: Mithun CY Date: 2018-08-08 21:49:15 -0700 Bug fix, move calls of visibilitymap_pin or visibilitymap_status outside the critical section at execute_undo_actions_page. commit 7b799efee32450a93b5200774d3550893d4049e0 Author: Kuntal Ghosh Date: 2018-08-07 10:44:00 +0530 Handle whether a tuple is self-locked before modifying it While deleting/updating a tuple, we should check whether the tuple is already locked in desirable lockmode by the current transaction. We've missed this check in zheap_delete/zheap_update when the tuple is marked with multilocker flag. Patch by me and Dilip Kumar. Reported by Thomas Munro. commit b51357225493b0436bedece52075de58334d3abd Author: Kuntal Ghosh Date: 2018-08-07 14:14:24 +0530 Fix a bug while fetching a pruned tuple from all-visible page In zheapgetpage, when a pruned tuple is fetched from an all-visible page, we return NULL (and we should return NULL). But, in page-scan mode, we increase rs_ntuples by mistake. commit 2b9bb36f1a7bdcf1e2792e31d943fad3bbc9fa8b Author: dilip kumar Date: 2018-08-07 08:54:58 +0530 Fix various bugs in TPD with multi-lockers - Locker is setting wrong slot in TPD offset map - Locker is not calculating proper TPD slot for members. Dilip Kumar Reviewed by Amit Kpila commit 3967b9019e051842c2947aa0333c4938facdb636 Author: dilip kumar Date: 2018-08-06 16:02:02 +0530 A minor bug fix in TPD code, missing break statement in switch case Reported by Andres. commit 7949de7ff8bd4cd9d99f5cd81ad1211af5a55c7e Author: Mithun CY Date: 2018-08-05 05:58:36 -0700 Support visibility map for zheap. With the support of visibility map for zheap relation, vacuum task and Index Only scan can skip looking into all visible pages. Also, on page flag PD_ALL_VISIBLE is no more in use for zheap. Index Only scan for zheap is enabled with this patch. By Amit Kapila, Mithun C Y Review by Amit Kapila. commit 585226697f91c61926361ad565a10f7a78144ad5 Author: Amit Khandekar Date: 2018-08-03 12:44:03 +0530 Pass the right priorXMax to ValidateTuplesXact() In zheap_get_latest_tid(), ZHeapTupleGetTransInfo() was getting called with 'resulttup'. But because 'resulttup' is returned from ZHeapTupleSatisfiesVisibility(), it can be a tuple generated from an undo record. And if we use ZHeapTupleGetTransInfo() to get the xid that modified resulttup, it returns the xid that created the original tuple instead of the xid that modified the tuple. This results in the wrong xid being passed as priorXMax to ValidateTuplesXact(), which in turn leads to assertion failures. So move the ZHeapTupleGetTransInfo() call before ZHeapTupleSatisfiesVisibility() call, so that we could use 'tp' rather than resulttup. 'tp' gets freed by ZHeapTupleSatisfiesVisibility(), so we could not use tp after ZHeapTupleSatisfiesVisibility() call. Added a new isolation test. commit 7ad7a9aedaf883e02caf5b172374344faa4507b8 Author: Kuntal Ghosh Date: 2018-08-02 19:44:07 +0530 Fix some compiler warnings commit db8af0bf6388182415d0cbcde55882a1bf8de84c Author: Kuntal Ghosh Date: 2018-08-02 14:18:00 +0530 Report the scan location for zheap meta page During scan, we report our scan position for synchronization purposes. We do this before checking for end of scan so that the final state of the position hint is back at the start of the rel. But, if we skip metapage, the scan location may point to a block at the end of the relation. 
Patch by me, reviewed by Amit Kapila commit caf343183b8940670f9f953014f6815883f9f66c Author: Kuntal Ghosh Date: 2018-08-02 14:14:50 +0530 For parallel scan, find and set the scan's startblock Probably, we've missed this change during earlier rebase. commit ec6c8b936c6de6af49816e29dd04ebc363aad5e6 Author: Thomas Munro Date: 2018-07-26 16:43:44 +1200 Fix silly bug in undolog_xlog_discard(). When an XLOG_UNDOLOG_DISCARD record is replayed, we need to tell the checkpointer to forget about any files that we are about to unlink. I was using the wrong variable, so if a single XLOG_UNDOLOG_DISCARD record caused segment files 1, 2, 3 to be unlinked, I was telling it to forget about fsyncing 3, 3, 3. Then it would eventually try to fsync 1 and 2 and try to raise an error. Repair that. Additionally, in the error path resulting from the above bug, I was also calling FilePathName() on a File that I had failed to open. That causes an assertion failure. Repair that too. Thomas Munro, reported by Neha Sharma, RM43571 commit 9d8be03646a5775000e1008badcabf0a5ff828bc Author: Thomas Munro Date: 2018-07-26 17:56:27 +1200 The startup process shouldn't attach to undo logs. When replaying XLOG_UNDOLOG_META records, the startup process was recording that it was attached to the referenced undo log. That caused corruption of the freelists when it tried to detach on exit. During recovery we shouldn't attach at all; instead we use the xid->undo log mapping. Thomas Munro, RM43614 commit adaaac4518c68b378ec2c60a2a8b29353de6b190 Author: Rafia Sabih Date: 2018-08-01 11:52:41 +0530 Toast table support for zheap Now, toast tables for zheap tables are created of zheap type. This imporves the performance in terms of memory by reducing the bloat in toast tables. With zheap type toast tables, as soon as a transaction deletes a tuple and commits, the space can be utilised for the next insertion. Since, toast tables are larger in size compared to ordinary tables and their updation is handled by insertion + deletion, zheap storage is likely to benefit significantly. Reviwed by Amit Khandekar and Amit Kapila. commit b2509b38661f97c1e1308a5b043a1f34d4d9ebeb Author: Rafia Sabih Date: 2018-08-01 11:46:12 +0530 Bugfix in GetLockerTransInfo The initialization of trans_slots was missing. commit b455083f2a902df76ba43e38fb16260fd26fb4f0 Author: Kuntal Ghosh Date: 2018-07-31 16:21:47 +0530 In zheap, we cannot ignore trans status of backends executing vacuum For zheap, since vacuum process also reserves transaction slot in page, other backend can't ignore this while calculating OldestXmin/RecentXmin. commit df3e6a10a3ba9c5fe81dd80fc1bfd105b7b42478 Author: Kuntal Ghosh Date: 2018-07-31 16:15:23 +0530 During vacuum, reserve sufficient offsets in tpd page In lazy_vacuum_zpage_with_undo, we should allocate sufficient space in tpd page to store the highest unused offset from zheap page. Since, we've to reserve space before determining the unused offsets, we reserve space for maximum used offset in the zheap page. commit 58e36c8929684198c40bb326cfb455e719b7a11d Author: Amit Kapila Date: 2018-07-31 14:51:17 +0530 Support pruning in TPD pages. The basic idea is process all the TPD entries in the page and remove the old entries which are all-visible. We attempt pruning when there is no space in the existing TPD page. Also, while accessing TPD entry, we can consider the entry as pruned, if we find that the ItemIdIsUnused or the block number in TPD entry is different from the heap block number for which we are accessing the TPD entry. 
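The pruned-entry rule just described amounts to a two-part check. A minimal C sketch, using invented stand-in types (TPDItemId, TPDEntryHeader and tpe_blkno are assumptions for illustration, not the actual zheap definitions):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    /* Hypothetical stand-ins for the on-page structures involved. */
    typedef struct TPDItemId
    {
        bool        unused;      /* corresponds to ItemIdIsUnused() on the TPD page */
    } TPDItemId;

    typedef struct TPDEntryHeader
    {
        BlockNumber tpe_blkno;   /* heap block this TPD entry was allocated for */
        /* ... offset map and extended transaction slots would follow ... */
    } TPDEntryHeader;

    /*
     * Treat a TPD entry as pruned when its item id is unused, or when the
     * entry no longer refers to the heap block we are looking it up for
     * (its slot was reused for another block after pruning).
     */
    static bool
    TPDEntryIsPruned(const TPDItemId *itemId, const TPDEntryHeader *entry,
                     BlockNumber heapBlk)
    {
        return itemId->unused || entry->tpe_blkno != heapBlk;
    }
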
Amit Kapila with help from Dilip Kumar. commit 1c37ab83d083241d8738444980ca5510e8e5deb0 Author: Amit Kapila Date: 2018-07-31 14:23:08 +0530 Store Itemids in TPD page. Earlier heappage always point directly to the actual offset of TPD entry in a TPD page. This won't work after pruning as even if the particular page's TPD entry is not pruned, we might not be able to directly access the TPD entry as offset might have moved. Now, we can reach our TPD entry if we can traverse all the TPD entries and TPD entry has block number in it, but that is quite inefficient. To overcome this problem, it is better to store Itemids in the TPD page. Amit Kapila, reviewed and verified by Dilip Kumar commit ed01319cc1d9541b3293b051d3d91eb35b7c448d Author: Thomas Munro Date: 2018-07-31 18:43:17 +1200 Improve smgr README. Wordsmithing. Thomas Munro commit 99f54c695641304061847f6124f04fc9f82a5317 Author: Thomas Munro Date: 2018-07-31 18:33:22 +1200 Add missing case to undolog_identify(). commit 9a956e391937f685fd3284620c0c6796362cfdf0 Author: Thomas Munro Date: 2018-07-31 18:30:04 +1200 Add basic undo log storage tests. Exercise basic undo log storage code under make check-world. Thomas Munro commit 576f60e34994d4e5777c3e1e17c9987199c46a1f Author: Rafia Sabih Date: 2018-07-27 17:55:32 +0530 Discard temp table undo logs for zheap At the commit of a transaction, backend discards temp table undo logs. Reviewed by Dilip Kumar and Amit Kapila commit c61887386927699c1f40c8ad2da0bb49948d38eb Author: Kuntal Ghosh Date: 2018-07-27 17:21:48 +0530 In zheap_update, fix condition for reserving slots. commit 142f2a014312c668eddc20d9a43d5c81baa48675 Author: Kuntal Ghosh Date: 2018-07-27 12:37:02 +0530 An optimization to improve COPY FREEZE in zheap In COPY FREEZE, when we insert a tuple, we always mark it as frozen. Hence, there is a possible performance optimization for the same scenario: 1. We can skip inserting undo records for the tuples to be inserted. 2. There is no need to reserve a transaction slot. Here is the implementation details: 1. Set skip_undo = true if HEAP_INSERT_FROZEN is mentioned. 2. If skip_undo is true, we don't have to reserve a transaction slot in the page. Also, we skip preparing and inserting undo records for the to-be-inserted tuples. 3. For recovery, a new WAL flag XLZ_INSERT_IS_FROZEN is introduced. It's true if HEAP_INSERT_FROZEN is mentioned. During WAL replay, we skip preparing and inserting undo records if XLZ_INSERT_IS_FROZEN is set in WAL records. Patch by me. Reviewed by Amit Kapila. commit 3dd753713bfdedebd5d38cdd709d61f4cf97b60e Author: Kuntal Ghosh Date: 2018-06-22 12:46:15 +0530 Cosmetic changes in zheap_multi_insert and zheap_xlog_multi_insert This commit removes the usage of undo record information at other places in the same function. This makes the coding easy when we intend to skip undo insertions. Patch by me. Reviewed by Amit Kapila. commit bf8288d70bad7da53ea6ab3457ac234ca08ff4f9 Author: Kuntal Ghosh Date: 2018-07-27 16:29:44 +0530 In ZHeapGetVisibleTuple, fetch transaction slot correctly When we call GetTransactionSlotInfo to fetch the transaction info from undo, we should pass TPDSlot=false since we fetch the slot from the item pointer. Item pointer never stores TPD slots. Reported by Neha Sharma. Patch by Mithun CY. Reviewed by Amit Kapila. commit 0f4f8ad4b402ab3163c78691c70ebcf38cc54384 Author: Kuntal Ghosh Date: 2018-07-27 12:15:32 +0530 In ZHeapGetVisibleTuple, handle non-mvcc snapshots Reported by Neha Sharma. Patch by me. Reviewed by Amit Kapila. 
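As a rough illustration of the skip_undo decision described in the COPY FREEZE commit above, here is a hedged C sketch. HEAP_INSERT_FROZEN, skip_undo and XLZ_INSERT_IS_FROZEN are names taken from the commit text, but the flag values and the function shape are simplified assumptions, not the real zheap_insert:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative values only; the real option/flag bits may differ. */
    #define HEAP_INSERT_FROZEN      0x0004   /* caller wants tuples frozen at insert */
    #define XLZ_INSERT_IS_FROZEN    0x08     /* WAL flag so redo also skips undo */

    static void
    zheap_insert_sketch(int options, uint8_t *wal_flags)
    {
        /*
         * Frozen inserts never need to be rolled back to an older visible
         * version, so no undo record and no transaction slot are required.
         */
        bool skip_undo = (options & HEAP_INSERT_FROZEN) != 0;

        if (!skip_undo)
        {
            /* reserve a transaction slot in the page, then prepare and insert undo */
        }

        /* Record the decision in WAL so replay makes the same choice. */
        if (skip_undo)
            *wal_flags |= XLZ_INSERT_IS_FROZEN;
    }
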
commit 8d395035cc9a711b6761006b5b2433aa4077ea12 Author: Kuntal Ghosh Date: 2018-07-27 11:12:05 +0530 Fix some compiler warnings commit 2a932c0c1580a699740b5628562f014381dd48c5 Author: Kuntal Ghosh Date: 2018-07-25 16:50:58 +0530 Handle locked-only tuple in ZHeapTupleSatisfiesOldestXmin In ZHeapTupleSatisfiesOldestXmin, we can't take any decision if the tuple is marked as locked-only. It's possible that inserted transaction took a lock on the tuple. Later, if it rolled back, we should return HEAPTUPLE_DEAD, or if it's still in progress, we should return HEAPTUPLE_INSERT_IN_PROGRESS. Similarly, if the inserted transaction got committed, we should return HEAPTUPLE_LIVE. The subsequent checks in the function already takes care of all these possible scenarios, so we don't need any extra checks for locked-only tuple. commit bca6d63cdc78fd8e8d5804b0296b2d1c436af7d5 Author: dilip kumar Date: 2018-07-26 17:52:14 +0530 Fix bug in prepared transaction. StartPrepare is only copying the first member of the array instead of copying whole array. Dilip Kumar Reviewed by Amit Kapila commit 199b463c81cb253b4bb77645629dc9b621ab865c Author: dilip kumar Date: 2018-07-26 17:51:43 +0530 fix the regression test for the zheap. commit 85b792b7a12c91e59abe0189e79587fd516ea1c2 Author: dilip kumar Date: 2018-07-26 17:49:08 +0530 Create an auxiliary resource owner for the undoworker because now we need to have valid resource owner for accessing the buffer. Dilip Kumar Reviewed by Amit Kapila commit f41d0055f5129b99530f1fc0f612b21627e816b2 Author: Kuntal Ghosh Date: 2018-07-25 12:09:10 +0530 Handle single in-progress locker in ZHeapTupleSatisfiesUpdate For single in-progress locker, ZHeapTupleSatisfiesUpdate returns locker's xid and transaction slot along with latest modifier/inserter's xid and transaction slot. We also send the single locker's trans info and latest modifier/inserter's to compute_new_xid_infomask. Kuntal Ghosh and Amit Kapila commit afe785f273dfbb393c0026677e00c32296e7406e Author: Rafia Sabih Date: 2018-07-25 09:06:26 +0530 Rollbacks of zheap temp tables We diligently take care to never push the rollback requests to undo worker for temp tables and the backend itself performs the required undo actions. Reviewed by Dilip Kumar and Amit Kapila commit 9282d623ae17a9ff3f15e5e8e000150a8aee6011 Author: dilip kumar Date: 2018-07-23 18:41:22 +0530 Undo actions is not executed in many cases when failure occurs during commit/abort path. This commit fixes the same. Dilip Kumar Reviewed by Amit Kapila commit 3ac8b197c0f12f91f0f8f93426e818f902036c1e Author: Rafia Sabih Date: 2018-07-23 17:02:02 +0530 Bug fix in PushRollbackHT In PushRollbackHT, if the start_urec_ptr is not given then get it from the log, as done in execute_undo_actions. Reported by Neha Sharma. commit 5c599edf74c44a02332ed50ee939bf974bde6fef Author: Kuntal Ghosh Date: 2018-07-19 16:43:19 +0530 Remove unnecessary GetTransactionSlotInfo() call commit e68f8d97ecbdf730fdd910b0e683542684c6c9d8 Author: Amit Kapila Date: 2018-07-19 08:34:50 +0530 Fix Rollback of multilockers. Uptill now, on rollback, we never change the slot of tuple if the multilockers flag is set on the tuple. This is because we can't find the next highest locker (there could be multiple lockers with same lock level) even by traversing undo chains. To overcome this problem, we come up with a new design where we ensure that the tuple always point to the transaction slot of latest inserter/updater. 
For example, say after a committed insert/update, a new request arrives to lock the tuple in key share mode, we will keep the inserter's/updater's slot on the tuple and set the multi-locker and key-share bit. If the inserter/updater is already known to be having a frozen slot (visible to every one), we will set the key-share locker bit and the tuple will indicate a frozen slot. Similarly, for a new updater, if the tuple has a single locker, then the undo will have a frozen tuple and for multi-lockers, the undo of updater will have previous inserter/updater slot; in both cases the new tuple will point to the updaters slot. Now, the rollback of a single locker will set the frozen slot on tuple and the rollback of multi-locker won't change slot information on tuple. We don't want to keep the slot of locker on the tuple as after rollback, we will lose track of last updater/inserter. Amit Kapila, Kuntal Ghosh and Dilip Kumar commit 143adfee5ad287dedc744e06eaae0c0ed390005e Author: Kuntal Ghosh Date: 2018-07-16 17:14:54 +0530 Small fix in GetTupleFromUndoForAbortedXact commit dd2f4010834f885af0c2baffa16bc4f6cfeb0664 Author: Kuntal Ghosh Date: 2018-07-10 23:16:25 +0530 Implement ZHeapTupleSatisfiesVacuum for pruning/vacuum For pruning/vacuum, we can skip the tuples inserted/modified by an aborted transaction. It'll be handled by future pruning/vacuum calls once the pending rollback is applied on the tuple. This optimization allows to avoid fetching prior version of the tuple from undo. Patch by me. Reviewed by Amit Kapila. commit d5d9f72eca46bc01efea01f3236f1a3d279459d7 Author: Kuntal Ghosh Date: 2018-07-16 16:45:29 +0530 Handle aborted xact in ZHeapTupleSatisfiesOldestXmin If the latest transaction for the tuple aborted, we fetch a prior committed version of the tuple and return it along with prior comitted xid and status as HEAPTUPLE_LIVE. If the latest transaction for the tuple aborted and it also inserted the tuple, we return the aborted transaction id and status as HEAPTUPLE_DEAD. In this case, the caller *should* never mark the corresponding item id as dead. Because, when undo action for the same will be performed, we need the item pointer. Patch by Amit Kapila and me. Reviewed by Amit Kapila. commit 3cc15be2061d294fe2762da471c321394bf06ebb Author: Kuntal Ghosh Date: 2018-07-13 13:26:43 +0530 In zheap, don't support the optimization for HEAP_INSERT_SKIP_WAL If we skip writing/using WAL, we must force the relation down to disk (using heap_sync) before it's safe to commit the transaction. This requires writing out any dirty buffers of that relation and then doing a forced fsync. For zheap, we've to fsync the corresponding undo buffers as well. It is difficult to keep track of dirty undo buffers and fsync them at end of the operation in some function similar to heap_sync. This commit skips copy_relation_data and copy_heap_data. We need to revisit the same once we implement ALTER TABLE.. SET TABLESPACE and CREATE CLUSTER/VACUUM FULL feature for zheap. Reviewed by Dilip Kumar and Amit Kapila commit d261daaee572c0d6ea953948428a93eec95eee79 Author: Rafia Sabih Date: 2018-07-10 15:45:05 +0530 Modified output file for triggers.sql for zheap When storage_engine = zheap, the behavior for one test case is different than heap. The difference in behavior is because of inplace updates in zheap and non inplace updates in heap. This changed behavior is acceptable for zheap, hence, adding a new output file for zheap. 
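A compact way to read the aborted-transaction rules in the ZHeapTupleSatisfiesOldestXmin commit above is as a small decision function. This is only a sketch with invented names (SketchTupleStatus and the parameters are assumptions), not the actual visibility routine:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Invented names mirroring the HEAPTUPLE_* statuses mentioned above. */
    typedef enum { SKETCH_TUPLE_DEAD, SKETCH_TUPLE_LIVE } SketchTupleStatus;

    /*
     * If the aborted transaction also inserted the tuple, report DEAD and
     * hand back the aborted xid; the caller must not mark the item id dead
     * because the pending undo action still needs the item pointer.
     * Otherwise, fall back to the prior committed version and report LIVE.
     */
    static SketchTupleStatus
    aborted_xact_status(bool aborted_xact_was_inserter,
                        TransactionId aborted_xid,
                        TransactionId prior_committed_xid,
                        TransactionId *report_xid,
                        bool *fetch_prior_version)
    {
        if (aborted_xact_was_inserter)
        {
            *report_xid = aborted_xid;
            *fetch_prior_version = false;
            return SKETCH_TUPLE_DEAD;
        }

        *report_xid = prior_committed_xid;
        *fetch_prior_version = true;    /* prior version comes from undo */
        return SKETCH_TUPLE_LIVE;
    }
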
Reported by Neha commit 1ca9a92cde64fcc798c5a0a5e7dbc0b8aae7454f Author: Kuntal Ghosh Date: 2018-07-09 11:58:50 +0530 Bug fix in UndoRecordAllocateMulti Reported by Neha, reviewed by Amit Kapila and Dilip Kumar commit 2eb1df348fe34bd4d68b0892a5e63d6d8a9ddb47 Author: Kuntal Ghosh Date: 2018-07-09 11:57:45 +0530 In recovery, transaction id should be sent in UndoSetPrepareSize Reported by Neha, reviewed by Amit Kapila and Dilip Kumar commit c4b6cc166f11fbc22147e4f5567791ac9bbc572e Author: dilip kumar Date: 2018-07-05 17:39:50 +0530 Handling when a transaction span across undo logs and avoid single WAL logged operation to span across multiple log. The first part of the patch handle a case, when a single WAL logged operation which needs multiple undo records (e.g non-inplace update, multi-insert) we avoid it to go in multiple logs. For that we first allocate all the undo required for the operation in one allocate call. And the second part handle the discarding and rollback when a transaction span across undo logs. Path by Dilip Kumar Reviewed by Amit Kapila. commit 7ed998fd1d369ce8012cd69bed7b020a367c9294 Author: Amit Khandekar Date: 2018-07-03 12:14:47 +0530 Handle changed relfilenode while executing undo actions. While an undo worker executes undo actions for a relation, the same relation can be truncated. In that case, RelidByRelfilenode() when called using the relfilenode saved in the undo record returns invalid relation oid. Use this behaviour to figure out that the relation is truncated, and abort the undo actions. Because this was not handled earlier, the algorithm in execute_undo_actions() failed to identify when exactly the undo records switch to new pages, and kept on fetching records, without releasing them, thus leaving behind those many records and their buffers unreleased. This eventually leads to "no unpinned buffers available" error in the server log. This was reproducible easily when truncate is run immediately after interrupting a long insert, and with shared_buffers set to minimum value. Amit Khandekar, reviewed by Amit Kapila. Reported by Neha Sharma. commit a96820d722ca0244c8bd9a9749f18e5b2604c45d Author: Amit Khandekar Date: 2018-06-28 09:42:58 +0530 Prevent usage of uninitialized tuple during tuple routing. TransitionCaptureState.tcs_original_insert_tuple is of type HeapTuple, and it was assigned a tuple variable which remains unitinitialized in case of zheap table. Fix it so that the zheap tuple is converted to heap tuple and then assigned to tcs_original_insert_tuple. Fixed this in both ExecPrepareTupleRouting() and CopyFrom(). Due to this issue, trigger.sql regression test used to crash on some environments. Amit Khandekar, reviewed by Kuntal Ghosh. commit 304dfa7af0547db1f22d72b30b42baa92647b136 Author: dilip kumar Date: 2018-06-27 20:49:04 -0700 Fix compilation warning commit 89c80035015098c563c104e31406dd23d483fc0f Author: Kuntal Ghosh Date: 2018-06-13 16:49:04 +0530 Change minimum transaction slots per zheap page to one We distinguish a zheap page from a TPD page by comparing the special page size. TPD pages always have 1 slot in its special space. Hence, we've to set minimum slot as two in zheap pages. Tested on Windows by Ashutosh Sharma commit 1265603a63e90def6a1738f5cca962a13d59e43d Author: Amit Kapila Date: 2018-06-25 08:08:51 +0530 Free payload data only when it is allocated. commit 045032bdedff74d583bd54c68a1560f63b7cb967 Author: dilip kumar Date: 2018-06-21 06:24:13 -0700 Isolation test for tpd patch. 
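The "Handle changed relfilenode while executing undo actions" commit above hinges on one lookup: if the relfilenode saved in the undo record no longer maps to a relation, the table was truncated and there is nothing to roll back. A hedged sketch, assuming the backend's relfilenodemap.h of that era and omitting error handling:

    /* Sketch only: assumes PostgreSQL backend headers are available. */
    #include "postgres.h"
    #include "utils/relfilenodemap.h"

    /*
     * The undo record stores the relfilenode it applies to.  If the relation
     * has since been truncated (or dropped), the relfilenode maps to no
     * relation, so the undo worker can abandon the undo actions for it.
     */
    static bool
    undo_target_still_exists(Oid reltablespace, Oid relfilenode)
    {
        Oid reloid = RelidByRelfilenode(reltablespace, relfilenode);

        return OidIsValid(reloid);
    }
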
Patch by Rafia Sabih Reviewed by Dilip Kumar commit 6bbe1f3fde618cf35f64da2726f1100963ba5576 Author: dilip kumar Date: 2018-06-21 06:14:12 -0700 Post TPD commit fix. After Rereserve the slot, trans_slot is not set back to the new slot. Ideally we should get the same slot but if our slot moved to TPD than it can be old_slo +1. Also, removed the invalid Assert and converted to if condition. Patch by Dilip Kumar Reviewed by Amit Kapila commit 3d3697e196c2f44eceaa07b77d00bc9d90ff0c19 Author: Amit Kapila Date: 2018-06-21 16:32:10 +0530 Support TPD which allows transaction slots to be extended beyond page boundary. TPD is nothing but temporary data page consisting of extended transaction slots from heap pages. There are two primary reasons for having TPD (a) In the heap page, we have fixed number of transaction slots which can lead to deadlock, (b) To support cases where a large number of transactions acquire SHARE or KEY SHARE locks on a single page. The TPD overflow pages will be stored in the zheap itself, interleaved with regular pages. We have a meta page in zheap from which all overflow pages are tracked. TPD Entry acts like an extension of the transaction slot array in heap page. Tuple headers normally point to the transaction slot responsible for the last modification, but since there aren't enough bits available to do this in the case where a TPD is used, an offset -> slot mapping is stored in the TPD entry itself. This array can be used to get the slot for tuples in heap page, but for undo tuples we can't use it because we can't track multiple slots that have updated the same tuple. So for undo records, we record the TPD transaction slot number along with the undo record. This commit provides basic support of TPD entries where a fixed number of entries per page can be allocated in extended pages. There is more to do in order to complete the support of TPD. The remaining work consisits of 1. Reuse transaction slots in TPD entry. 2. Allocate bigger TPD entries once the initially configured slots or offsets are exhausted. 3. TPD page pruning and once the page is clean, we can add it to the FSM. 4. Find the free TPD page from FSM. 5. Chain of TPD entries when the single TPD entry can't fit on a page. Amit Kapila with the help of Dilip Kumar and Rafia Sabih commit 15536e74ed38ba970f30684850ad013d2174d6e8 Author: Mithun CY Date: 2018-06-20 04:31:51 -0700 Move Assert to right place. By Mithun C Y review By Dilip Kumar commit 9724bc889e4a99309e74912c739416c4345cd5d1 Author: Mithun CY Date: 2018-06-19 22:24:35 -0700 Release restriction on page compactification. Previously a30d278e8d we did not allow compactification of page if it ever have deleted uncommitted tuple in it. This is not necessary and hence removing same. By Mithun C Y, review by Amit Kapila. commit a07a3976029c6322cc3e3f954eae0cd606e776b2 Author: Mithun CY Date: 2018-06-18 05:29:17 -0700 Fix bugs related to concurrency in zheap update While updating the tuple we might need to release the buffer lock and reaquire same. During that period where we do not hold buffer lock a concurrent process might have moved the tuple to undo and/or pruned the page. So whenever we reacquire the buffer lock check if itemid is deleted and readjust the position of tuple in page buffer. Patch by Mithun C Y and Amit Kapila. commit 2448790b692e8f6581cce59b4d08abc977dbd20b Author: Kuntal Ghosh Date: 2018-06-12 11:13:34 +0530 In PrepareUndoInsert, don't call SubTransGetTopmostTransaction In PrepareUndoInsert, we always send the top transaction id. 
Hence, there is no need to fetch the parent xid for a transaction. commit 4abb8ba5305ef04f7164be58f4c0c94b11ab793c Author: dilip kumar Date: 2018-06-14 05:00:13 -0700 LogUndoMetaData was called inside XLogBeginInsert, and LogUndoMetaData itself calls XLogBeginInsert for inserting the meta WAL. Patch by Dilip Kumar commit 7c44b88155a14186ca91ca010cf84d5f7be5fd93 Author: dilip kumar Date: 2018-06-14 04:54:20 -0700 Conditions for pushing a rollback request to the worker were not consistent; it should only be pushed for the top transaction. Also, when there is an error in the top transaction, we will not have an xid for the top transaction, so we have removed the xid-based key for the rollback hashtable. Patch by Rafia Sabih Review by Amit Kapila tested by me. commit 2797c4bc90ad986804b698a4eaa39991ffcf2094 Author: Amit Kapila Date: 2018-06-14 15:37:44 +0530 Implement stats for in-place updates Reuse the existing hot-update stat variable for in-place updates instead of introducing a new variable as that would increase the size of the stats structure. It appears ugly to overload the unrelated variable, but it is better to discuss in the community before introducing a new variable. Beena Emerson, reviewed by Mithun C Y and me commit 0b8ef5ea24a55c23d54821fa4bc3667319e84801 Author: Kuntal Ghosh Date: 2018-06-08 17:12:59 +0530 Use high 4-bits of xl_info to store undoaction WAL operation type commit 80e9fe40e186dac8ffbd4b9bf2479cb1fca4f770 Author: ashu Date: 2018-06-11 13:04:33 +0530 Pass '/c:' option with findstr command on Windows. Commit 3e9f07a8fe22 uses the findstr command to remove the lines having the "Options: storage_engine='zheap'" pattern from the results/*.out files on Windows platform but it doesn't pass the correct option to findstr to ensure that only the lines having the given pattern get removed from the results/*.out file. Ashutosh Sharma, Reported by Amit Kapila. commit a600e4e055877dbb9071f40f215a8f8bb8662d79 Author: Rafia Sabih Date: 2018-06-08 17:00:07 +0530 Buffer leak fix in zheap_insert commit 303634f18969c53dc31b6668668232f327d7bed6 Author: dilip kumar Date: 2018-06-07 23:08:42 -0700 During rollback we rewind the insert location. Now, if the insert location is used by some other transaction, then it updates its own next pointer with its own insert location, resulting in a cycle, and due to that the undo worker gets stuck in this cycle. Patch by Dilip Kumar Review by Amit Kapila and Kuntal Ghosh commit 6fe4dfdda5db82e0e24eeaefa4cdc63a3277be49 Author: dilip kumar Date: 2018-06-07 23:06:34 -0700 ZHeapPageGetCid currently compares with RecentGlobalXmin to identify old undo; ideally it should compare with the oldest xid having undo. Due to this it fetches many extra undo records, hence the performance is low. Fix by Dilip Kumar review by Amit Kapila commit 363c7aa95eeb35f57908ed7c777817cfa38a0f1d Author: Kuntal Ghosh Date: 2018-06-04 22:09:13 +0530 While performing undo, ignore out of range block numbers If a table is truncated just before performing undo actions on the same, it's possible to encounter out of range block numbers. In that case, we can safely ignore those blocks since we don't have to perform rollback on the same. commit 7cd5b08da9380552376a69456ca11c6f97ce1a1c Author: ashu Date: 2018-06-07 17:32:52 +0530 Store the information about lockmode in undorecord and initialize the flags variable in xl_zheap_lock to zero.
Commit 0763c68645e9 introduced the support for different tuple locking modes and added the necessary changes in zheap_lock_tuple_guts() to store lockmode information in the undo record but missed making similar changes in zheap_update(), due to which the undo record pointers for non-inplace updates were not the same on master and standby nodes, thereby resulting in an assertion failure. Additionally, commit f45251a37fe2 introduced a flags variable in xl_zheap_lock to store transaction slot related information and did the necessary changes for it in zheap_lock_tuple_guts() but missed doing the same in zheap_update(). Ashutosh Sharma, reported by Neha Sharma, reviewed by Dilip Kumar. commit 643bf383e77acb5eca372380267d01ed01d3a545 Author: Mithun CY Date: 2018-05-31 03:17:44 -0700 Correct style issues in macro definition. By Mithun C Y, comments by Robert Haas. commit aaf09cb0491869b5d6e84e4cc68cfe28418ef6ad Author: Kuntal Ghosh Date: 2018-05-30 15:37:24 +0530 In zheap insert option, HEAP_INSERT_FROZEN should be used In zheap_insert/multi_insert, we can use the HEAP_INSERT_FROZEN flag, similar to heap, to indicate that the inserted tuple should be frozen after insertion. commit 409cf3957bee5894cacc603670587476171277fe Author: dilip kumar Date: 2018-05-30 01:42:43 -0700 Applying pending undo action before modifying the page. Currently, if a transaction wants to update a tuple and we find that the other modifier is aborted and its undo actions are not yet applied, we simply modify the page, which will create unpredictable behaviour as executing the undo actions may roll back the changes made by our transaction. This commit first applies the pending undo actions only for that page and then performs the changes. Patch by Dilip Kumar and Amit Kapila Reviewed by Amit Kapila Tested by Kuntal Ghosh commit 736ff4cb26355992800792dc64954e55694a82fb Author: dilip kumar Date: 2018-05-30 01:41:06 -0700 Bugfix in condition check while applying the undo action. Earlier, undo was not collected when the two undo pointers were in different logs. Patch by Dilip Kumar Reviewed By Amit Kapila commit c708efefa812ffe54d94841b3c54c906b263a3da Author: Kuntal Ghosh Date: 2018-05-30 12:00:01 +0530 In zheap_update, don't propagate lockers when no lockers are present When no lockers are present, we should not propagate any locker related information to the newly inserted tuple. Patch by Dilip Kumar, reviewed by Amit Kapila and me. commit 640c67490b76c096405fe38253ae222439707ba5 Author: ashu Date: 2018-05-25 13:20:53 +0530 Correct the insert and discard pointer of the undo log segment file when resetting the undo logs. Report by Neha Sharma, Initial Analysis by Ashutosh Sharma, Patch by Thomas Munro. commit dc28828a8aa5a54fd8a5aced67c4a69cad84faf7 Author: Kuntal Ghosh Date: 2018-05-24 14:39:34 +0530 Allow DML commands that create zheap tables to use parallel query This commit applies the required changes for zheap corresponding to the commit e9baa5e9fa147e00a2466. Reviewed by Amit Kapila commit be186ca7f02be4c8d88632bc9fff37d2a7680267 Author: Thomas Munro Date: 2018-05-25 15:30:31 +1200 Fix corruption of oldest_data. After commit 9ebe7511 it could be left pointing to space before log->discard, and we'd later try to read from there and possibly see a bunch of zeroes. Thomas Munro, RM43553, reviewed by Rafia Sabih commit ba2406238eb93e400ac223a3d8e78097984072c6 Author: Amit Kapila Date: 2018-05-25 08:31:46 +0530 Fix the freespace recording by vacuum The freespace was not being updated unless there were some deleted/dead tuples in the page.
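The "Applying pending undo action before modifying the page" commit above follows a simple pattern: finish the aborted transaction's per-page rollback first, then make our own change. A hedged sketch with invented state names (PageState and its fields are not real zheap structures):

    #include <stdbool.h>

    /* Hypothetical stand-ins; the real code works with buffers and slots. */
    typedef struct PageState
    {
        bool last_modifier_aborted;   /* did the previous modifier roll back?  */
        bool undo_already_applied;    /* were its undo actions applied yet?    */
    } PageState;

    static void
    apply_pending_undo_for_page(PageState *page)
    {
        /* roll back only this page's changes of the aborted transaction */
        page->undo_already_applied = true;
    }

    /*
     * Before modifying a page, make sure any aborted-but-not-yet-rolled-back
     * changes on it are undone, so that a later undo pass cannot wipe out
     * our own modification.
     */
    static void
    zheap_modify_page_sketch(PageState *page)
    {
        if (page->last_modifier_aborted && !page->undo_already_applied)
            apply_pending_undo_for_page(page);

        /* ... now it is safe to perform our own update/delete on the page ... */
    }
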
Report and initial analysis by Ashutosh Sharma, patch by me commit ab0049c4bb174c332052b187ce60b3f01466bb18 Author: Kuntal Ghosh Date: 2018-05-24 11:11:40 +0530 Revert last commit 853849138bf05bcac85 We're already releasing the lock in function IsPrevTxnUndoDiscarded. Although, we could've released the lock in the same function where it was taken, but let it be as it is for now. commit d7e1b3f1c30456b25fbbaa22ffa6bb9791f60a6e Author: Kuntal Ghosh Date: 2018-05-23 16:38:33 +0530 In PrepareUndoRecordUpdateTransInfo, release discard lock before leaving page commit e329d580a973fcb41dc37b6283fc0db67e8262f3 Author: Kuntal Ghosh Date: 2018-05-21 12:28:12 +0530 In execute_undo_actions, release undo records before exiting When undo record is discarded, we return from execute_undo_actions immediately. But, we should release the undo records and corresponding undo buffers collected earlier in the same function for rollback purpose. commit 9dcdb6468df34f45b2975e5c4cac69a05ed27ac2 Author: Mithun CY Date: 2018-05-22 23:14:02 -0700 Revert the restriction on page pruning. Revert code which disallowed pruning if there exist any open transaction on the page. Patch By Mithun C Y commit 7023615ca40a2041e79863f56c2c1fa6b7c58cda Author: Amit Khandekar Date: 2018-05-23 09:42:55 +0530 Pass nobuflock=false to ZHeapTupleGetTransInfo(). In zheap_get_latest_tid(), ZHeapTupleGetTransInfo() was called with nobuflock=true, which is wrong because the buffer is already locked. Discovered by Kuntal Ghosh, patch by Amit Khandekar. commit ad0dd279a096e5f9763f44799dfc78c18c4bbd26 Author: Amit Khandekar Date: 2018-05-23 09:12:09 +0530 Avoid using tuple freed by a visibility function. In zheap_get_latest_tid(), the tuple that is passed to ZHeapTupleSatisfiesVisibility() was being used subsequently, ignoring the fact that the visibility function frees the passed-in tuple. Rather than the passed-in tuple, use the tuple returned by the visibility function in the subsequent code. While at it, make sure that the returned tuple is also freed if it is different than the passed-in tuple. Discovered by Neha Sharma while testing TidScan implementation for zheap. Amit Khandekar, reviewed by Kuntal Ghosh. commit e47016cdfba942a07cc3c9fb98d698e22251cb77 Author: Amit Khandekar Date: 2018-05-22 09:38:16 +0530 Fix an issue with copying overlapping memory areas. For adjusting zheap tuple location, the tuple header was getting corrupted because memcpy() was used to copy, and the old and new tuple header areas may sometimes overlap, which memcpy does not handle. This was discovered when trigger.sql regression test used to crash on some environments. So use memmove() instead of memcpy(). memmove() is meant to handle this scenario of overlapping areas by first copying the data from source location into a temporary location. Amit Khandekar, reviewed by Dilip Kumar. commit ec2d2d7b51bd843b88c2668d548fb90da60fcc67 Author: Kuntal Ghosh Date: 2018-05-17 10:33:47 +0530 Fetch correct cid for undo tuples In ZHeapTupleGetCid, we should fetch the correct cid from undo. Patch by me, reviewed by Amit Kapila commit ef42a733e045b5624308ce37f4d745b0b9e7d221 Author: Kuntal Ghosh Date: 2018-05-21 13:09:17 +0530 Fix compiler warnings commit 804476bd9929532a83fd0b30ddacfa00b68c5028 Author: Thomas Munro Date: 2018-04-23 14:38:34 +1200 Add a smgrsync() implementation for undofile.c. Teach the checkpointer to go through an sgmr API call to mark a (RelFileNode, forknum, segment) as in need of flushing to disk at the next call to smgrsync(). 
Previously it called RememberFsyncRequest() in md.c directly, but undofile.c needs to be able to participate in this scheme too. So, add a new function smgrrequestsync(), and have it forward to mdrequestsync() or undofile_requestsync() as appropriate. For now, there is a LOG message when undo segment files are fsync'd, like the existing create/recycle/unlink messages. These will be removed in future. This contains a small amount of code that is copied from md.c, but the fsync queue machinery is being redesigned so this can be rebased later to use a common fsync queue. See commitfest entry 18/1639 (work independent of zheap). Thomas Munro, RM43460, reviewed by Amit Kapila commit 74c0db38b6469fd92bfca32b41ecbadafd99ed40 Author: Thomas Munro Date: 2018-05-15 01:22:41 +1200 Free up undo log DSM segments and trim pg_undo file size. To prevent the pg_undo file from getting gradually larger and the number of undo log DSM segments from gradually increasing in a long running system, update the lowest non-discarded undo log number at each checkpoint so that we can free up resources. In passing, change UNDO_LOG_STATUS_DROPPED to UNDO_LOG_STATUS_DISCARDED, a name that better describes the state. Thomas Munro, RM43532, reviewed by Amit Kapila commit 9b18132fe9ad889e0aaecb2d2019e3bcf9433018 Author: Thomas Munro Date: 2018-04-24 16:45:58 +1200 Recycle memory used in recovery for the xid->undo log map. At each checkpoint, free up memory used to hold information about which undo log old xids are attached to. Also rename associated variables and functions to make things clearer. Thomas Munro, RM43532, reviewed by Amit Kapila commit 8cb810f82a694d794429f0b8e28328baf6334bc5 Author: Rafia Sabih Date: 2018-05-17 15:06:51 +0530 Bug fix in discard mechanism When undo actions were applied by backend and the undo record pointer is rewound, discard the corresponding undo logs and skip performing undo actions. Reported by Neha Sharma, reviewed by Dilip Kumar and Amit Kapila commit 56762bc994dfa14e79c29f13231bd0f6e99e0782 Author: Kuntal Ghosh Date: 2018-05-15 17:16:55 +0530 Fix compiler warnings commit e40d26a76fb67a3b2f1cf144a2e76c8301be6a4b Author: Thomas Munro Date: 2018-05-15 05:12:47 +1200 Improve smgr.c and undofile.c modularity. Add a void pointer called "private_data" to SmgrRelationData where undofile.c can put its state, instead of the previous horrible hack where it was using relm->md_seg_fds. md.c should really use the new member too, but that'll be a patch for another day. Thomas Munro commit bc419ace343682505687cbb39f402d4e90cf228a Author: Thomas Munro Date: 2018-05-15 17:04:31 +1200 Remove obsolete comment from undolog.c. commit 62e17972070c1b994f2ae63abe07f951f59bdbc3 Author: Thomas Munro Date: 2018-05-15 13:40:29 +1200 Update copyright date to 2018. For all undolog-related files. commit b591268e289fba15aeb9f1160d7c164e241c6a11 Author: Thomas Munro Date: 2018-05-09 18:44:56 +1200 Add some user-facing documentation about undo logs. This commit documents the pg_stat_undo_logs view and the layout of files on disk. More wordsmithing will be needed. Thomas Munro commit 350dadc12b77a4efccafbb5f11663c07370bfd01 Author: Thomas Munro Date: 2018-05-09 19:16:24 +1200 Undo log README tweaks. Author: Thomas Munro commit 22560511be9b6d49034573736feb0c7a830e1a3a Author: Thomas Munro Date: 2018-04-16 16:15:54 +1200 Skip unnecessary reads of newly allocated undo log pages. Whenever we're inserting new undo data that happens to fall at the start of a page, we know that there can be no pre-existing data on the page. 
Therefore we can ask bufmgr.c to zero it out instead of reading it from the storage manager. To facilitate this, create a new BufferReadMode RBM_ZERO, just like RBM_ZERO_AND_LOCK except without the content lock. undorecord.c expects to acquire the content lock a bit later. Since each backend has sole write access to write to the undo log, it's not necessary for bufmgr.c to acquire the lock for us. Thomas Munro, RM43486, reviewed by Amit Kapila commit b88edc54636b2649101106ad08b95c16fbfc1d7f Author: Mithun CY Date: 2018-05-14 21:46:38 -0700 Move previous transaction's undo updation inside critical section Update the previous transaction's undo record inside InsertPreparedUndo after we have actually inserted the undo record. Patch by Mithun C Y Review by Dilip Kumar. commit 87075c70ffaeea5c570340df7aa144823d94a510 Author: Mithun CY Date: 2018-05-14 21:35:47 -0700 Get deleted rows from zheap_fetch. In zheap_fetch if ItemId is set as deleted we need to fetch the old version of rows from undo. Patch by Mithun C Y Review by Kuntal Ghosh commit 229aaef467f263bf779cab93c2bf05011a03f335 Author: Amit Khandekar Date: 2018-05-14 18:18:30 +0530 Fix an issue in parallel btree index build. Commit 9da0cc35284b added support for parallel btree index build. While imitating those changes for zheap in IndexBuildZHeapRangeScan() with commit a33e61f999f03, some changes got missed, due to which an already-unregistered snapshot is tried to be freed again. Added the missing changes. Reviewed by Mithun Cy and Amit Kapila. commit 43d9001c390852fae721b9351a8404f3e063b0d5 Author: Kuntal Ghosh Date: 2018-05-07 16:10:29 +0530 During ROLLBACK, set frozen/invalid xact flag correctly Reviewed by Amit Kapila commit e18393d52f04e6c5ade7529581b521d2f7387bdf Author: Kuntal Ghosh Date: 2018-05-14 12:05:48 +0530 In UndoLogAllocate, set is_first_rec to false by default commit db7d7ee6110a47b420d2be2dc88381e39a099f9c Author: Amit Kapila Date: 2018-05-13 09:29:14 +0530 Update README.md to reflect the current status of zheap. commit 19235c3d03f0505f91b5056871c79a70e79430c4 Author: Kuntal Ghosh Date: 2018-05-08 12:26:35 +0530 Add new expected .out files for zheap to compensate the failures only happening due to inplace updates. commit 4396609900c4bd5efdd7ecebc8d3966411304786 Author: Rafia Sabih Date: 2018-05-08 11:41:45 +0530 Fetch the tuple from undo for aborted transactions when checking the visibility for dirty snapshot Reviewed by Kuntal Ghosh commit 27814264865d23ab8cfae46cdbe9202117ae7c6c Author: Rafia Sabih Date: 2018-05-07 13:18:29 +0530 Thinko fixed for execute_undo_actions caller commit d8a3a11957e1e4647da0782308ca89b5ac931b1b Author: Rafia Sabih Date: 2018-05-07 13:11:26 +0530 Fixed a thinko in RollbackFromHT Reported by Dilip Kumar commit 35a9231220d0866847cd7bd3729c5d231b10e3b5 Author: Thomas Munro Date: 2018-05-04 15:45:25 +1200 Document the undo log IO wait events. Add the new undo log wait events to monitoring.sgml. Thomas Munro commit 7dc8a6862586586b48eadfe708044e679f86f153 Author: Thomas Munro Date: 2018-05-04 15:06:23 +1200 Report read/write/sync wait events for undo checkpoints. Like other file IO, let's make these show up in pg_stat_activity. Thomas Munro, RM43509, based on feedback from Amit Kapila commit ff7193ad4d4472014344e0983ce90adce9be3229 Author: Thomas Munro Date: 2018-04-23 23:19:49 +1200 CRC verification for pg_undo checkpoint files. Add CRC32C checksums to the per-checkpoint files stored under pg_undo. 
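For context, the CRC32C checksums mentioned just above would typically follow PostgreSQL's pg_crc32c pattern. A hedged sketch of stamping and verifying a pg_undo checkpoint payload; the data/len arguments stand in for the serialized undo-log metadata, whose actual layout is not shown here:

    /* Sketch only: assumes the PostgreSQL source tree (port/pg_crc32c.h). */
    #include "postgres.h"
    #include "port/pg_crc32c.h"

    /* Compute a CRC32C over the payload, in the style used for WAL records. */
    static pg_crc32c
    undo_checkpoint_crc(const char *data, size_t len)
    {
        pg_crc32c crc;

        INIT_CRC32C(crc);
        COMP_CRC32C(crc, data, len);
        FIN_CRC32C(crc);
        return crc;
    }

    /* On read, recompute and compare against the value stored in the file. */
    static bool
    undo_checkpoint_crc_ok(const char *data, size_t len, pg_crc32c stored)
    {
        pg_crc32c crc = undo_checkpoint_crc(data, len);

        return EQ_CRC32C(crc, stored);
    }
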
Thomas Munro, RM43509, reviewed by Amit Kapila commit 5ec98770f1e4a1b6fc515447a7d01e1139a8eb27 Author: Thomas Munro Date: 2018-05-01 17:48:01 +1200 Fix compiler error in test_undo.c on Windows. Per CI build report. commit e8ee49180be2391e87a0858d6c2b09eca5fbd716 Author: Amit Kapila Date: 2018-05-03 18:55:15 +0530 Avoid calling PageGetUNDO We need to avoid calling PageGetUndo as it re-access the transaction slots the second time. We can already get it via PageReserveTransactionSlot. This is okay till now, but with TPD, we need to again access the TPD page which will be costly. Patch by me, reviewed and edited by Dilip Kumar commit 83986b54c2b6445517502d1750f5523ed2690bba Author: Kuntal Ghosh Date: 2018-05-03 14:23:22 +0530 Small fix in CopyTupleFromUndoRecord We should free the memory for zheap tuple only after allocating memory for the copied undo tuple. commit c8ce97a9d55c0c3a3f537a389513ce10de1b6c0e Author: Kuntal Ghosh Date: 2018-05-03 11:51:11 +0530 Remove unnecessary undo type from CopyTupleFromUndoRecord Reported and reviewed by Amit Kapila commit a2a88babb3e190f526cdb3fae496c41fc9c1d2f2 Author: Amit Kapila Date: 2018-05-03 12:45:30 +0530 Fix WAL replay of XLOG_ZHEAP_UNUSED We forgot to call PageSetUNDO during wal replay of XLOG_ZHEAP_UNUSED. Patch by me, verified by Mithun C Y commit 9b1f493a6335d0702479bc56506dc34287b46b35 Author: Amit Kapila Date: 2018-05-02 16:54:50 +0530 WAL replay for lock tuple was not using correct transaction slot to update the transaction information This commit fixes the issue by logging and using the correct transaction slot during replay of lock tuple. Patch by me, reviewed by Dilip Kumar commit 6a74940a505bd5e302cc9ac7b3adbb3625c560a8 Author: dilip kumar Date: 2018-05-01 23:19:24 -0700 Bug fix in undo record start header update during recovery Patch by Dilip Kumar Reviewed by Rafia Sabih commit a5c5e1706570f379eb8bf6a12a9226af60912016 Author: Kuntal Ghosh Date: 2018-05-02 13:42:36 +0530 Handle non-exixtent/empty files in pg_regress Commit 3e9f07a8fe222b0e9 excluded storage_engine option from result files using grep/findstr command. But, these commands returns non-zero values in case of non-existent/empty files. Hence, we've to skip the checks for the same. commit fd4c197eca2a0846d08727361073cbc1f82d7cbf Author: dilip kumar Date: 2018-05-01 22:51:15 -0700 Fix tuple lock wal replay lock mode was not stored in lock tuple WAL and also not stored in undo during replay. This commit fixes the same. Patch by Dilip Kumar Reviewed by Amit Kapila commit 3775d97153d1cd63bc98d561816e9662cd03eb53 Author: Kuntal Ghosh Date: 2018-04-30 17:14:08 +0530 Fix copy undo payload Reported by Amit Kapila commit 02a89f973c2be94019d695fb0d917494db319473 Author: dilip kumar Date: 2018-05-01 21:45:18 -0700 Bug fix in update wal replay. Patch by Dilip Kumar Reviewed by Amit Kapila commit c5f63f27f73618b994bd561057d3f4e9c0da15ff Author: Amit Kapila Date: 2018-05-01 12:42:26 +0530 The check to ensure whether undo is discarded was missing at few places. This patch fixes two such occurrences. Patch by me, reported and reviewed by Rafia Sabih commit f336c7a16c72e0048f897f85b1e38b24adc8ff8e Author: Kuntal Ghosh Date: 2018-04-30 11:15:40 +0530 Add new expected .out files for zheap to compensate the failures only happening due to inplace updates. 
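The "check to ensure whether undo is discarded" mentioned above is a recurring guard in this series: never fetch an undo record that lies behind the discard horizon; treat such transactions as all-visible instead. A minimal sketch with hypothetical names (UndoRecPtr comparison and the helper names are assumptions for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t UndoRecPtr;    /* stand-in for the real undo pointer type */

    /* Anything before the oldest retained undo pointer has been discarded. */
    static bool
    undo_is_discarded(UndoRecPtr urec_ptr, UndoRecPtr oldest_undo_ptr)
    {
        return urec_ptr < oldest_undo_ptr;
    }

    static bool
    fetch_undo_record_if_present(UndoRecPtr urec_ptr, UndoRecPtr oldest_undo_ptr)
    {
        if (undo_is_discarded(urec_ptr, oldest_undo_ptr))
            return false;           /* caller treats the slot as frozen */

        /* ... safe to read the undo record here ... */
        return true;
    }
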
commit 1a8e463d48e1bbf5ea9abb1e07b1512a097fb963 Author: Rafia Sabih Date: 2018-04-30 10:49:06 +0530 Additional flag in execute_undo_actions for relation lock Callers of execute_undo_actions now provide a flag based on if they a lock on the relation. If the caller have relation lock already then no need to lock again in execute_undo_actions. This is required particularly in cases when rollbacking the prepared transactions or rollback to savepoints. In such cases the transaction have already held the relation locks and we need not take another. Reviewed by Dilip Kumar commit 5f229e04af2f187fc757e8f59ba8c075e56cf46c Author: dilip kumar Date: 2018-04-27 04:21:42 -0700 While vacuuming a large zheap table, update upper-level FSM data every so often. for detail refer commit 851a26e26637aac60d6e974acbadb31748b12f86 of PG. Patch by Dilip Kumar Reviewed by Amit kapila commit 90dbf29c4b815ea3b51c8736f412a67d59d5f4f1 Author: Thomas Munro Date: 2018-04-27 21:49:59 +1200 Log the same undo segment messages in REDO and in DO. During DO we currently output LOG messages when undo segment files are recycled etc. Output the same messages in recovery. All of these messages will later be removed, but it's helpful for testing to show them for now. Thomas Munro commit 292856c5c73f80f9089441a9079ce76ab17b9764 Author: ashu Date: 2018-04-27 12:53:10 +0530 Add new expected .out files for zheap to compensate the failures only happening due to inplace updates. Patch by Ashutosh Sharma, suggested by Amit Kapila, reviewed by Kuntal Ghosh. commit 0925b489735ab6ba01c3c57a85175947ec04a373 Author: ashu Date: 2018-04-27 12:51:00 +0530 Allow pg_regress module to exclude storage_engine option printed when viewing the definition of zheap table using \d command. Additionally, also add a new reloption_1.out file to compensate for the diffs generated due to storage_engine option in reltopions field of pg_class table. Patch by Ashutosh Sharma, reviewed by Kuntal Ghosh. commit c13e3a64be6fe571f48c1abc8a837c063594ce1d Author: Kuntal Ghosh Date: 2018-04-18 18:17:16 +0530 Implement Foreign Key Constraint for zheap For zheap, when we call the triggers we convert a zheap tuple to heap tuple. But, we don't store any transaction related information on the heap tuple. Hence, I've to make the following two assumptions/considerations for the foreign key implementation in zheap: 1. RI_FKey_fk_upd_check_required: In this function, we check if the original row was inserted by our own transaction, we must fire the trigger whether or not the keys are equal. For zheap, since we don't have the transaction related information, we always fire the trigger. So, even if the keys are equal, we cannot skip the trigger for zheap. 2. validateForeignKeyConstraint: During ALTER TABLE..ADD CONSTRAINT FOREIGN KEY, this function is called to validate whether we can add the foreign key constraint. First, it tries to fire a LEFT JOIN query to test the validity. If the user doesn't have proper access rights to pktable/fktable, we scan the table in pagescanmode and fire trigger for each tuple. Before firing the trigger, it takes buffer lock to check whether it can skip the trigger. For zheap, we don't retain the pin on the buffer. Hence, we've to read the buffer again before locking the same. Patch by me, Reviewed and tested by Ashutosh Sharma commit 4113663bfa26941ef3b5eec9589e70dfe3643e96 Author: ashu Date: 2018-04-27 11:34:49 +0530 Update t_infomask2 field of old tuple correctly during inplace updates. 
Patch by Amit Kapila, reviewed by Ashutosh Sharma commit 06fb3f80e349e0a329606702cd94cd2bdc740946 Author: Mithun CY Date: 2018-04-25 23:56:02 -0700 Mark all_dead if ItemId is already dead. By Mithun C Y review by Ashutosh Sharma commit 60731dbac9433dc640d3412504423e15260b1ed6 Author: Mithun CY Date: 2018-04-25 23:52:55 -0700 Fix rebase issue Prototype change of GetOldestXmin was not updated while rebasing to postgres main branch. Above patch corrects same. commit 7e10f40e60dcd2704ba98781ae98d658b3dcbb5a Author: Thomas Munro Date: 2018-04-25 15:07:34 +1200 Track undo logs' is_first_rec correctly in recovery. Previously we could get confused about whether an undo record is the first in a transaction during recovery. Use XLOG_UNDOLOG_ATTACH to set the flag, since that is always emitted before the first zheap WAL record for each transaction. If a checkpoint happens to come between that and the first zheap operation, it doesn't matter because then an XLOG_UNDOLOG_META will be inserted and that will restore the is_first_rec flag, assuming it was also tracked correctly during DO (that's a separate investigation). Thomas Munro, RM43459, reviewed by Dilip Kumar commit 3a82f2f014ac86b7587bc85f12a5df6fc4bb1514 Author: Thomas Munro Date: 2018-04-25 21:10:30 +1200 Make src/test/modules/test_undo compile. After many recent changes it wasn't building. Repair. It isn't useful for testing at the moment, because if you add test content to undo logs it causes the undo worker to crash. I need to figure out a way to prevent that from happening and then convert this into a useful set of tests of undo log machinery. Watch this space. Thomas Munro commit 0bf11c41400b26ec50a2edd4fb70777268993d56 Author: Thomas Munro Date: 2018-03-26 09:59:39 +1300 Change undo log segment size to 1MB. The previous size was 4MB. That size wasn't chosen with much thought, and it creates a fairly large disk footprint for systems with many concurrent backends. Let's try the smaller and nice, round number of 1MB and see how frequently we finish up doing filesystem operations. Early testing with pgbench on powerful machines seem acceptable, since the undo worker is very easily able to recycle segments fast enough so that foreground processes never have to create a new one. Thomas Munro commit 5b7fbed0c316ec15cdf8fc566a503e5428946dd7 Author: Thomas Munro Date: 2018-03-09 16:18:12 +1300 Add a README file describing the undo log storage subsystem. Add src/backend/access/undo/README, and update src/backend/storage/smgr/README to describe the new storage manager. Thomas Munro commit 2b620d2e8b3bc0708106e122d42464f417406071 Author: Thomas Munro Date: 2018-03-20 05:19:45 +1300 Make temporary undo logs use backend-local buffers. For now temporary undo data is not discarded, except at startup when it's all discarded at once. This will be addressed in later commits. Thomas Munro, RM43422 commit 1f176882f95411b5a4fd18740687143fd4cd2c0e Author: Thomas Munro Date: 2018-04-23 14:30:14 +1200 Basic tablespace and persistence support for undo (take II). You can now be attached to a separate undo log for each persistence level (permanent, unlogged, temporary). Undo logs can now be created in tablespaces other than pg_default by setting the new GUC "undo_tablespaces". Temporary undo logs are not yet backend-local; a separate commit will add that. A separate commit will also fix some details of crash recovery. 
Thomas Munro, RM43422, reviewed by Rafia Sabih, Dilip Kumar, Amit Kapila commit 25eb200b1aa3d05cb1d8365946fc03c3ca1c96fc Author: Thomas Munro Date: 2018-04-23 13:01:51 +1200 Introduce XLOG_UNDOLOG_META WAL records. During checkpoints, undo log meta-data is captured at an arbitrary time after the redo point is chosen. In the case of an online checkpoint, this means that we might capture incorrect meta-data. Correct that by inserting an XLOG_UNDOLOG_META record before the first WAL record that writes to each undo log after a checkpoint. Dilip Kumar, RM43459, reviewed by Thomas Munro commit 47c7e3b5199d561953f693e18c2f85dae7b56264 Author: Thomas Munro Date: 2018-04-23 12:32:00 +1200 Remove code for consistent pg_undo checkpoint files. Instead of trying to make pg_undo files consistent, we have decided to allow them to contain data from any arbitrary time after the redo point. A follow-up commit will introduce new WAL records that will be emitted to correct them. This is not a complete revert of commit d8a02edf as there were some refactorings and small fixes that seem worth keeping. Thomas Munro, RM43459 commit b43da384a36b13679f5f0ec12aa72e218656d0d1 Author: dilip kumar Date: 2018-04-24 07:09:41 -0700 Fix the assert. In recovery we can not ensure that we are attached to the undo log from which we are allocating. commit f306a1c5c10e63eb38c790be504e154413372a90 Author: Mithun CY Date: 2018-04-24 06:35:27 -0700 Force page init on Insert rollback In zheap_xlog_insert we see insert of first and only tuple on the page we re-initialize the page. Force page init on insert or multi insert rollabck so wal consistency check of page on standby still satisfy. This overrides previous commit 15179e57b124 Patch by me and review by Kuntal Ghosh. commit ffc5ea7e8577d3cce4b5e44fda55a4f7df6bed75 Author: Kuntal Ghosh Date: 2018-04-13 14:13:44 +0530 Fix ZHeapPageGetCtid for deleted tuples commit d8107861935ad0e202aeb14ad116722af5d22af5 Author: Kuntal Ghosh Date: 2018-04-24 18:26:23 +0530 Handle deleted item pointers for no-key-exclusive mode Reported by Amit Kapila commit 156e25415049a1d9a940bf709abc5fd5f550ee7e Author: Kuntal Ghosh Date: 2018-04-20 15:15:59 +0530 Fix assert in zheap lock,update and delete tuple For deleted item id, we don't retrieve the tuple. Hence, we should check for the same. commit af6f63e740cd904f234475bf530a9290b982b38b Author: dilip kumar Date: 2018-04-24 03:35:42 -0700 make zheap changes in slot_getsysattr function. This function is now called by execCurrentOf. commit 8b6041f28b8e2554da439da7701500bb2af2c712 Author: Thomas Munro Date: 2018-04-23 11:11:57 +1200 Fix warnings in non-assertion build. Clang complained about uninitialized variables when there was no assertion. commit 2b96b1be41fad5aa00e8d8b8db63aada925f7581 Author: Kuntal Ghosh Date: 2018-04-20 19:17:48 +0530 Fix a bug in ALTER DOMAIN .. ADD CONSTRAINT The previous commit doesn't use beginscan correctly for heap/zheap relation. commit 7cb019bbcec083052f8436f4489fe6e2957b223a Author: Beena Emerson Date: 2018-04-20 16:11:19 +0530 Correct the behaviour of ALTER DOMAIN .. ADD CONSTRAINT in zheap Add zheap support in functions validateDomainConstraint and AlterDomainNotNull which will check if any tuple violates the newly added constraint. 
Reviewed by Kuntal Ghosh commit dbcfe95943c7f5ddd5fbcb3b84daa62a32c3abf9 Author: Beena Emerson Date: 2018-04-20 14:22:16 +0530 Allow foreign tables to be added as partitions of a zheap table Since foreign tables do not support the storage_engine option, exempt them from the checks where we check if the partitions have the same storage_engine option as their ancestors. commit 439869f9fc77a8e6121a74fc433eef7d822951b8 Author: Kuntal Ghosh Date: 2018-04-20 00:38:59 +0530 Add isolation expected file of vacuum-reltuples test for zheap In zheap, we don't have to take a cleanup lock for vacuuming the relation. Hence, the vacuum command won't skip the buffer even when another backend holds a pin on the same. commit d50ed5e003d5104171540e3bcb45335bc941cd4a Author: Mithun CY Date: 2018-04-18 13:00:28 -0700 Force Refragmentation on Insert rollback In zheap_xlog_insert, when we see an insert of the first and only tuple on the page, we re-initialize the page. Force pruning on insert or multi-insert so that the WAL consistency check of the page on standby is still satisfied. Patch by me and review by Kuntal Ghosh. commit 240b960b6cf36a5611aa13b7f67e6cebedd15b63 Author: Rafia Sabih Date: 2018-04-18 14:26:43 +0530 Introducing a queue for passing rollback requests to undo worker To increase the efficiency of the rollback mechanism in zheap, we now have a rollback queue. Now, any rollback request that exceeds the threshold -- rollback_overflow_size -- is added to this queue. Whenever the undo worker is idle, it checks if the rollback queue has some entries and executes the required undo actions, if any. The rollbacks required for 'rollbacks to savepoint' are not added to this queue, rather the backend itself executes the required undo actions for them. The rollback queue is implemented as a shared hash table. Reviewed by Beena Emerson and Amit Kapila commit fee9b99b1ccf3dca9a822d4e41e6a3a801598eb1 Author: Mithun CY Date: 2018-04-17 03:37:04 -0700 Block VACUUM FULL on zheap table, which got unblocked by commit 0d27be592d82a44158d Reported by Thomas Munro and Kuntal Ghosh. commit 49743ef26167946e60292aa3251b6bbdfdacd015 Author: Thomas Munro Date: 2018-04-16 17:26:56 +1200 Fix uninitialized variable. Per compiler warning from clang. commit 31cd2baf750596b961c9838a61078f3c9f3eb70f Author: Mithun CY Date: 2018-04-13 05:12:13 -0700 Fix for warnings introduced by commit 0d27be592d82a44158d Reported by Kuntal Ghosh. commit 274b0cb5c7ad77da7f4f98db23619a30571f6ac0 Author: Kuntal Ghosh Date: 2018-04-12 12:53:26 +0530 Declare get_old_lock_mode as static inline Otherwise clang complains about the inline function declaration. Reported by Thomas Munro, reviewed by Amit Kapila commit ee296fd6b96d8294dba9437f30353797e756d6da Author: Amit Kapila Date: 2018-04-12 10:07:15 +0530 Two pass vacuum We need vacuum in zheap for non-delete-marked indexes; however, we can use undo to reduce three-pass to two-pass vacuum. When a row is deleted, the vacuum will directly mark the line pointer as unused, writing an undo record as it does, and then mark the corresponding index entries as dead. If vacuum fails midway through, the undo can ensure that changes to the heap page are rolled back. If the vacuum goes on to commit, we don't need to revisit the heap page after index cleanup. We must be careful about TID reuse: we will only allow a TID to be reused when the transaction that has marked it as unused has committed. At that point, we can be assured that all the index entries corresponding to dead tuples will be marked as dead. 
Currently, due to lack of visibility map for zheap, we scan all the pages during vacuum. A future patch which will introduce visibility map in zheap will remove that limitation. Patch by me, Mithun has fixed few bugs and added implementation for Rollback action, also he has done basic verification of the patch commit 4f3a0106661ab1f52be6a4b1c1de37216c710535 Author: Amit Kapila Date: 2018-04-11 16:51:11 +0530 Support different tuple locking modes The basic idea is that we maintain each lockers information in undo and the strongest lockers information on the tuple. If there is more than one locker, then we set multi_locker bit on tuple. Now, if the multi_locker bit is set and the new locker conflicts with the strongest locker, then we traverse all the undo chains in the page and wait for all the conflicting lockers to finish. As we have to wait for all the lockers by releasing the lock on the buffer and then reacquire the buffer lock after waiting for all the transactions is finished, in the meantime, a new locker (say key share) can take a lock on the tuple and we won't be able to detect it unless we do something special. Now, one might think that as before waiting for multiple lockers we have acquired a heavyweight lock on tuple by using heap_acquire_tuplock, no other transaction can acquire xid-based lock (something like key share), but that is not true, as that is allowed for both heap as well as for zheap (till now). Now, the heap can detect such a case because it always creates a new multixact whenever a new locker is added to the existing set of lockers and it puts the newly create multixact id in xmax of tuple, so in above case after reacquiring the buffer lock we can just check if the xmax has changed and if so, then we redo the *TupleSatisfies check and again wait for new lockers. For zheap, after reacquiring the buffer lock, check again if there is any new locker on the tuple and to find this we need to again traverse the undo chains. Although this doesn't sound the best design, it is not clear whether it can really create the problem. If we want we can optimize by checking whether LSN of the page is changed, then only go for chasing all the undo chains, sure that won't work for unlogged tables, but still it is a good optimization. Another thing is that this approach is quite simple, so we inclined to go with this approach. We clear the multi_locker bit lazily like when we are already traversing all the undo chains to verify if there is any new locker on the tuple after taking the buffer lock. The visibility routines don't need any special handling as we are already storing the strongest locker information on a tuple which can help us get the transaction information of updater which is what is required to check visibility of tuple. Patch by me with help from Kuntal Ghosh and Dilip Kumar, reviewed and tested by Kuntal Ghosh commit 5bf3a603cd575ec188e42b1e0b7b0978f0fc54ed Author: Amit Kapila Date: 2018-04-11 16:12:12 +0530 Retrieve the transaction slot of modified tuple This is required for the upcoming tuple locking patch. Patch by me, reviewed by Dilip Kumar commit f2922167ffb0134854ca1fceea7b80d1a657cc23 Author: Amit Kapila Date: 2018-04-05 11:03:14 +0530 Fix the incorrect update of infomask for inplace updates Ensure to copy everything from new tuple in infomask apart from visibility flags. 
Patch by me, reported by Kuntal and reviewed by Ashutosh Sharma commit 1dc99668b499b30a6fb7146e7c6156178143ced7 Author: dilip kumar Date: 2018-04-04 00:30:38 -0700 If full_page_writes is enabled and the buffer image is not included in the WAL, then we can rely on the tuple in the page to regenerate the undo tuple during recovery, as the tuple state must be the same as now; otherwise, we need to store it explicitly. But in the current code it is not stored externally even if the page image is included in the WAL. Introduced new API XLogInsertExtended. Unlike XLogInsert, this function will not retry the WAL insert if the page image inclusion decision got changed; instead it will return immediately. Also, it will not calculate the latest value of the RedoRecPtr like XLogInsert does; instead it will take it as input from the caller, so that if the caller has decided not to include the tuple info (because the page image is not present in the WAL) it can start over again (if the page image inclusion decision got changed during WAL insertion). And for zheap, wherever we need to include tuple info for generating the undo record, we will call this new function instead of XLogInsert. Patch by Dilip Kumar. Reviewed and modified by Ashutosh Sharma and Amit Kapila commit 9b3f81dd68751cd8fc61a588e07c0299c52b1e7b Author: dilip kumar Date: 2018-04-03 01:44:20 -0700 Remove undo for INVALID_XACT_SLOT Previously we needed this undo to identify the exact xid which was there in the slot before reusing it. And, that was stored as uur_prevxid of the undo record. Now, we are already including uur_xid (the transaction id which has inserted the undo record), so we can find the actual slot xid just by traversing the undo chain. This will reduce the complexity of the code. And, this will also remove the limitation that one transaction is writing the undo in other slots. So, by removing this limitation we can rewind the insert location during rollback from the backend. Patch by Dilip Kumar. Review and defect fixes by Amit Kapila commit a354f4e7ad32176b57a47251fb8ac2366d0bff3c Author: ashu Date: 2018-04-03 13:44:33 +0530 Include undolog.h in undorecord.h to fix compilation error on MSVC. Ashutosh Sharma commit 44f5d0336add411302e9de94329ac7cc8808508c Author: Thomas Munro Date: 2018-03-20 00:26:36 +1300 Fix uur_next corruption by replacing global variables with UndoLogControl. Previously we could corrupt our uur_next chain when the undo log you're attached to changes. This broke some later commits. Get last_xact_start and prevlen in UndoLogControl instead of maintaining a local copy. They can be read directly from shmem without locking (but not written) by the backend that is currently attached. prev_txid remains as a global variable, and it needs to be cleared whenever the undo log changes. Perhaps it could be stored in UndoLogControl too, but that leads to some circularities. Thomas Munro, RM43421, reviewed by Rafia Sabih commit a0536401ad8f228ea7bffa08ccbb9f69d62fb4d1 Author: Thomas Munro Date: 2018-03-12 15:29:05 +1300 Remove UndoDiscard and move its state into UndoLogControl. Previously, a per-undo log object "UndoDiscard" was used to track the progress of the undo worker machinery. It was in a shared memory array of fixed size, which isn't going to work. Move that state into the UndoLogControl object for each undo log. Since this change requires undodiscard.c to have direct access to UndoLogControl objects and to iterate over them, create new functions UndoLogNext(), UndoLogGet() with extern linkage to do that. 
A better interface is probably needed -- to review later when we work out the type of access that multi-process undo worker infrastructure will need. Register the LWLock tranches. Thomas Munro, RM43420, reviewed by Dilip Kumar and Amit Kapila commit 709a41dc526b9eecda1d4891f758b3d268af2650 Author: Thomas Munro Date: 2018-03-08 16:01:03 +1300 Move UndoLogControl struct into header to prepare for wider use. A later commit will make use of it from other translation units so let's not define it in undolog.c. This requires hiding the definition when included from FRONTEND code. Perhaps this should have a new header of its own or some other reorganization, but for now let's just do it conditionally. Thomas Munro, RM43420 commit 8b74af923855cd72e1fca40b8822d546fc07a1ab Author: Kuntal Ghosh Date: 2018-03-22 11:07:35 +0530 For calculating latest removable xid, don't use raw xid from page This commit also modifies ZHeapTupleGetTransInfo to work with deleted item pointers when we have a buffer lock on the page. Reviewed by Amit Kapila commit 38345d269e4bb4908b9e48cb3a4a955d2579a0cf Author: Amit Khandekar Date: 2018-03-26 12:23:43 +0530 Support Tid Scan for tables with zheap storage. Use the zheap equivalent of heap_fetch() to fetch the next tid. To support WHERE CURRENT OF, have a zheap-equivalent of the heap_get_latest_tid() function. This function in turn uses the zheap visibility function. For following the ctid chain for non-in-place-updated tuples, have the MVCC visibility function pass back the ctid of the new tuple. Although zheap_get_latest_tid() accepts a snapshot, it may not work with snapshots other than an MVCC snapshot, because the corresponding visibility functions for those snapshots are not modified to return the new tuple ctid. This would be done in later commits, since it is not necessary for Tid Scan. Furthermore, the callers of heap_get_latest_tid() (e.g. currtid_byreloid) should be modified to call zheap_get_latest_tid() for zheap tables. This also would be done in later commits. Patch by Amit Khandekar, reviewed by Amit Kapila and Kuntal Ghosh. commit ec068b0903c8dc9788b06e0405f6d4c94ba24dcc Author: dilip kumar Date: 2018-03-25 22:14:13 -0700 Handling the rollback if error in commit path During the commit, if an error occurs before updating the status in the clog, then we need to track the undo pointers and apply the undo actions. Patch by Dilip Kumar, review by Amit Khandekar. commit 37994ff7060acc5317ccfd8c48c2ef440e4a3504 Author: Kuntal Ghosh Date: 2018-03-23 13:55:50 +0530 Fix README.md for better readability commit 698f1fb1bb35f423891c7ae64b4f6f7f1f1f2074 Author: Kuntal Ghosh Date: 2018-03-21 16:22:49 +0530 Make transaction slots per zheap page as compile-time parameter One can specify the same using --with-trans_slots_per_page= while configuring the postgres installation. Allowed values are 1, 2, 4, 8, 16, 31. By default, it is assigned to 4 slots per page. The changes required for Windows have been done by Ashutosh Sharma. Patch by me, reviewed by Amit Kapila and Ashutosh Sharma commit acbc6636929b21c6df12e4c69964a2c4035652c6 Author: Kuntal Ghosh Date: 2018-03-23 12:33:34 +0530 Bugfix in btree and hash delete items Sizes of xl_btree_delete and xl_hash_vacuum_one_page were not properly calculated. 
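As a side note on the "Make transaction slots per zheap page as compile-time parameter" commit above, here is a minimal sketch of how such a configure-time knob could be enforced at build time. The macro name ZHEAP_PAGE_TRANS_SLOTS is an assumption made for illustration only, not the actual symbol used by the patch.

    /* Illustrative sketch only: a configure-time transaction-slot count
     * wired into the build.  ZHEAP_PAGE_TRANS_SLOTS is a placeholder name. */
    #ifndef ZHEAP_PAGE_TRANS_SLOTS
    #define ZHEAP_PAGE_TRANS_SLOTS 4        /* default: 4 slots per page */
    #endif

    /* The commit above allows only 1, 2, 4, 8, 16 or 31 slots per page;
     * reject anything else at compile time. */
    #if ZHEAP_PAGE_TRANS_SLOTS != 1 && ZHEAP_PAGE_TRANS_SLOTS != 2 && \
        ZHEAP_PAGE_TRANS_SLOTS != 4 && ZHEAP_PAGE_TRANS_SLOTS != 8 && \
        ZHEAP_PAGE_TRANS_SLOTS != 16 && ZHEAP_PAGE_TRANS_SLOTS != 31
    #error "trans_slots_per_page must be 1, 2, 4, 8, 16 or 31"
    #endif

The configure option would simply define this macro, so an unsupported value fails the build instead of producing a broken page layout.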
commit a8bae1f3400d9e98e46268040521f517d556efe9 Author: Rafia Sabih Date: 2018-03-23 12:02:35 +0530 Handling rollbacks in prepared transactions Reviewed by Dilip Kumar and Amit Kapila commit 2fa48d89c290f309fa92b719d1070b5f329f4daa Author: ashu Date: 2018-03-23 10:58:40 +0530 For speculative insertion, store a dummy speculative token in the REDO function (zheap_xlog_insert()) so that, the size of undorecord in DO and REDO function matches with each other. Ashutosh Sharma, Reported by Neha Sharma. commit 574f673a1cf6d494b2ef0817962aee93bb0659a1 Author: Rafia Sabih Date: 2018-03-22 10:51:23 +0530 Fix for warnings introduced by commit 70f35f756834028c Reported by Amit Kapila commit 89cb28898369622578be42f7a131c024b7bb5f05 Author: Kuntal Ghosh Date: 2018-03-19 18:13:52 +0530 Skip unnecessary palloc for deleted item pointers Reviewed by Amit Kapila commit 4765f6669408f9e13e3cebd1c33a668d718b5b98 Author: Rafia Sabih Date: 2018-03-20 16:19:00 +0530 Bug fixes in rollback 1. Check for running transactions using top transaction instead of current transaction state in XactPerfromUndoActionsIfPending 2. Insert start undo record if we rewound the previous start undo record pointer while rollbacking the subtransaction containing the start undo record of the transaction. Issues reported by Neha Sharma, reviewed by Dilip Kumar, Amit Kapila commit 1830381aa7c44d0ba24fb1fc6fc94cab87165705 Author: dilip kumar Date: 2018-03-19 23:44:10 -0700 Removed unwanted function "SetUndoPageLSNs" commit 6a2b8d3f80829987908005941de9c9583577430e Author: ashu Date: 2018-03-19 16:36:54 +0530 Place the if-check used to decide whether the last tuple in a page can be inplace updated or not outside the check to know if a page needs to be pruned, otherwise, even if the tuple to be updated is a last tuple in a page, it won't go for inplace updated if the page pruning is not required. Ashutosh Sharma, Reported by Neha Sharma, Reviewed by Mithun CY. commit 6953e07fcf5bd914c31beb6adcc756ea34974940 Author: Kuntal Ghosh Date: 2018-03-16 14:28:39 +0530 Fix zheap_xlog_multi_insert to release the buffer properly In zheap_xlog_multi_insert, we should unlock and release the buffer even after restoring the block from backup image. Issue reported by Tushar Ahuja commit 350f34b92fcac83aa9447cb8b7685f1f5e3655e4 Author: Mithun CY Date: 2018-03-15 22:09:38 -0700 Bugfix advance latest RemovedXid only if tuple is dead. By Mithun C Y, Review Dilip Kumar commit e63e2365e1ccc78424032fafa8c3a5a012955c1e Author: Thomas Munro Date: 2018-03-08 11:50:50 +1300 Fix incorrect worker name displayed if undo launcher/worker dies. Previously bgw_type was uninitialized, causing at least some systems to show a value that confusingly just happened to be "logical replication launcher" to be displayed by the postmaster in error messages. commit 5541c5927873c34aecbf5bc82b7610fc8b035e94 Author: Kuntal Ghosh Date: 2018-03-14 14:17:48 +0530 Fix UndoFetchRecord for fetching UNDO_MULTI_INSERT For UNDO_MULTI_INSERT undorecords, we've to check whether our offset number falls in the offset range stored in the payload of undorecord. We've added a callback function in UndoFetchRecord to check whether an undorecord satisfies any blocknumber, offset number and xid. Reviewed by Amit Kapila commit 986cfc9f789e51fa796b2afdb8ee093c740172a3 Author: Mithun CY Date: 2018-03-13 22:12:30 -0700 Bug Fix in CopyTupleFromUndoRecord For certain undo record type we will not have tuple data associated with them. For Copying such tuple we need an input tuple. 
Above patch Asserts and address those issues. Patch By Mithun C Y Review By Amit Kapila commit 28078c9b70d8d6d7a9af9f5f5789b3a739af0ca9 Author: Kuntal Ghosh Date: 2018-03-07 16:03:23 +0530 In zheap_prepare_insert, reset visibility bits in infomask/infomask2 Dilip Kumar and Kuntal Ghosh, reviewed by Amit Kapila commit 1cd9b192209084bcea39c8603a1de54e8481e5ee Author: Kuntal Ghosh Date: 2018-03-08 14:26:14 +0530 Enable WAL consistency check for zheap We've added zheap_mask function to mask unimportant fields in zheap page before consistency check. Patch by me, reviewed by Mithun CY commit d1d0d9de831695f218f6390f0fea742ddce98b46 Author: Rafia Sabih Date: 2018-03-07 13:17:25 +0530 Fix compiler warnings commit ac3838b9ab274d2e18e7f4ec19b02ece442d043b Author: Amit Kapila Date: 2018-03-06 17:30:13 +0530 Support Insert .. On Conflict The design is similar to current heap such that we use the speculative token to detect conflicts. We store the speculative token in undo instead of in the tuple header (CTID) simply because zheap’s tuple header doesn’t have CTID. Additionally, we set a bit in tuple header to indicate speculative insertion. ZheapTupleSatisfiesDirty routine checks this bit and fetches a speculative token from undo. Amit Kapila and Ashutosh Sharma commit 96eeccc2d9c095fcea050e3c797b9934c9d1cfcc Author: Kuntal Ghosh Date: 2018-03-06 12:31:21 +0530 Fix MinZHeapTupleSize definition commit 7ba6c3c85e913faa02cb7507093abf648d05a3b2 Author: dilip Date: 2018-03-02 01:17:23 -0800 Commit (db9d0b2988c829b8e0599ebf087a10b98cb9690d) calculated the latest_urec_ptr in case of SUB_INPROGRESS but it should have done that for SUB_ABORT case. Pointed out by Rafia Sabih commit 89cb2efebb0b039786a1ffde7b48ff16b2ece128 Author: Kuntal Ghosh Date: 2018-03-01 13:47:14 +0530 Bug fix in zheap_xlog_multi_insert This fixes a type casting error. It also fixes the condition for changing the offset range. commit 0eb31eb589c1430eecf430a9c04f988b98914d8e Author: dilip Date: 2018-03-01 05:55:22 -0800 Bugfix in rollback Latest_urec_ptr is not calculated before executing undo actions, fixed the same. Patch by me, Reviewed by Amit Kapila. commit 8043baef73e74311eb399305e99e12bb8e86f868 Author: akapila16 Date: 2018-03-01 17:26:52 +0530 Create README.md This document is to help users understand how to use zheap and open issues. Amit Kapila, reviewed by Robert Haas commit 82320f78eec7eb44584d00728ed84e6a89d5d55b Author: ashu Date: 2018-03-01 17:12:41 +0530 Remove memory leak in various zheap related functions. Ashutosh Sharma, reviewed by Amit Kapila. commit 0618b93ac8f431de813f23d5c8a7b68812ef729c Author: ashu Date: 2018-03-01 17:02:12 +0530 Initialize the new zheap page allocated during update operation on zheap tables correctly in zheap_xlog_update(). Ashutosh Sharma, reviewed by Amit Kapila, reported by Tushar Ahuja. commit 729d4cb6b09c98a5ecc57a670f55b468ab598d45 Author: Amit Kapila Date: 2018-03-01 16:34:44 +0530 Design of zheap This readme covers overall design of zheap. This is to help developers and or users to understand the zheap. Later, we might split this into multiple README's. 
Amit Kapila, Robert Haas and Dilip Kumar commit 17ae6f76fb85a1036111eb62f6bf0a2375531306 Author: Kuntal Ghosh Date: 2018-02-28 11:27:23 +0530 Fix memory leak in zheap_multi_insert wal replay Reported by Ashutosh Sharma commit 0d03216692b1d38b646126fd6300de1e619bc2db Author: ashu Date: 2018-02-27 16:30:11 +0530 Allow zheap_lock_tuple() to release lock on a zheap page if the tuple may be updated but the desired lock on a tuple is already acquired. Patch by me, as per the suggestions from Amit Kapila. commit 37911c71c174dd3b95befe2d5f7b44d3dc421f62 Author: dilip Date: 2018-02-25 20:58:02 -0800 Currently, on standby, we don't have DiscardUndoInfo like we have on the master side. So, on master before accessing any undo buffer we hold lock on DiscardUndoInfo in shared mode and undoworker hold that lock in exclusive mode. But on standby side we discard undo directly by WAL. So even though we check that undo is not discarded, but by the time we try to access the buffer undo may get discarded by the wal. Patch fixes the problem by checking the standby recovery conflict with other snapshot. Patch by Dilip Kumar, Reviewed by Ashutosh Sharma and Amit Kapila commit c6457f72c4e7695680118902f4c662bedba3a965 Author: Mithun CY Date: 2018-02-21 01:34:42 -0800 Remove unrelated code of previous commit In commit 5ba1e2f4c605f63a6deca278f39c9bfa05afb239 we have committed some unrelated code this patch removes same. By Mithun C Y commit c04a4c83e19f0e9e10cb6aab0ab12ff8a241880f Author: Kuntal Ghosh Date: 2018-02-19 21:18:29 +0530 Fix zheap insert options for COPY Patch by me with help from Dilip Kumar commit 17aaf8247ae3ef5763f5a164528f5c9f38dbe345 Author: Amit Kapila Date: 2018-02-20 16:37:37 +0530 Remove redundant header inclusions. commit 82681700ad5bd7b976438138d1c970eb6c55720a Author: Mithun CY Date: 2018-02-20 02:26:27 -0800 Check ItemIdIsDeleted on reacquire of BufferLocks In some cases after reacquiring the BufferLocks ItemId's might have been pruned so check if it is deleted before accessing ItemId's. Mithun C Y, reviewed by Amit Kapila commit cf4614bf7a7b294eae264d4123d76584d66f0384 Author: Mithun CY Date: 2018-02-16 05:06:28 -0800 Bugfix in ItemId status check. We have used tuple header flag ZHEAP_INVALID_XACT_SLOT instead of itemId flag ITEMID_XACT_INVALID while checking its status this caused undefined behavior. And, fixed a condition in zheap_search_buffer. Patch By Mithun C Y reviewed by Amit Kapila commit 418c2e1ac7c46497b3d994bb3db5ea6850d66c50 Author: dilip Date: 2018-02-16 02:12:37 -0800 bugfix in FetchTransInfoFromUndo The actual condition to break the while was if undo type is UNDO_INVALID_XACT_SLOT and undo_xid is input xid. This was broken in some of the previous commit. Patch by me reviewed by Amit Kapila commit 051a7a2a3849a1140627de7414f79a868a292c38 Author: Beena Emerson Date: 2018-02-16 14:59:51 +0530 Provide zheap support in check_default_allows_bounds For a zheap partitioned tables, check if the default partition has rows that meet the constraints of the new partition. Beena Emerson, reviewed by Rafia Sabih and Amit Kapila commit a9c5724aca95877300414606bec648e8f512e045 Author: Rafia Sabih Date: 2018-02-13 16:02:37 +0530 To add the storage_engine option in conf.sample file. commit d61ff363ce28733fb518ccfa5fe9d32f71b7994d Author: dilip Date: 2018-02-12 15:29:33 +0530 wal log oldestxid having undo Currently oldestxid having undo is not durable and value is also not sent to standby. 
As part of this patch, this value is included in the checkpoint record so that it remains valid after a server restart. Patch by Dilip Kumar. Reviewed by Kuntal Ghosh and Amit Kapila commit 16fa749cf1c025efb744f1ebd246690a541fe0ac Author: Mithun CY Date: 2018-02-12 02:02:57 -0800 Set page prunable on Insert undo, inplace update with reduced lengths. Patch by Mithun C Y Reviewed by Amit Kapila. commit 50f9fb43a3c302cf4fc206227786fba26ce72167 Author: Thomas Munro Date: 2018-02-09 23:52:42 +1300 Create missing undo log segments during recovery. During recovery, we might discover that the segment files that existed at the time of the checkpoint don't exist, because they'll be deleted by later WAL traffic. We'll create zero-filled files to avoid errors, and trust that the contents of the files will never be needed, because later WAL records discard them. commit bc2ed1a1cefcb56749b3de382c6d6056e7837bd4 Author: Thomas Munro Date: 2018-02-09 23:10:22 +1300 Make undolog meta-data checkpoints consistent. The earlier prototype code captured undo log meta-data at an arbitrary point in time somewhere after the redo point. That worked only for clean shutdowns, at which point it was consistent and correct. To do the job properly, this commit keeps track of two copies of each undo log's meta-data in memory: the current meta-data, and a snapshot as of the last checkpoint. In order to maintain the checkpoint snapshot, every operation that modifies an undo log's meta-data must check if we are now on the other side of a redo point. Since the shared memory access and lock contention would be expensive, we only actually do that while a checkpoint is in progress, from a moment just before the redo point is chosen up until we discover that we are now on the other side of a redo point, which should ideally mean that it happens only once. RM43038, Thomas Munro commit df81f34c6d62ec7601314d5561b8bf1be1596f53 Author: Mithun CY Date: 2018-02-08 23:37:10 -0800 Fix crashes in page access after page pruning. The issue is that in many places, while fetching or updating a tuple, we failed to check if the tuple has been deleted/updated and pruned after we have released the buffer lock. This caused invalid access of pruned tuples from the page. Now we check if the itemid is deleted before accessing the tuple data in the page. commit b26c26c7e0651cbf4cd5fbfe865ee2226b776ee0 Author: Thomas Munro Date: 2018-02-09 10:29:42 +1300 Fix an ordering bug when discarding undo buffers. We need to forget about undo buffers before we remove or recycle undo segment files, since otherwise a concurrent backend might try to write a buffer in order to evict it and discover that the file is gone. We also need the same logic during recovery, so let's refactor that code into a function called from both UndoLogDiscard() and undolog_xlog_discard(). Thomas Munro, based on report from Neha Sharma and diagnosis by Kuntal Ghosh commit 25c38e52afb9d4252117fcb4bc575173b43f7492 Author: ashu Date: 2018-02-08 17:57:44 +0530 Validate CHECK constraints on zheap relations. Ashutosh Sharma, reviewed by Dilip Kumar and Amit Kapila. commit 1ef3bcd30c337456386ccfbafff1a913f111782a Author: ashu Date: 2018-02-08 17:47:55 +0530 Implement Table Rewrite performed during execution of ALTER TABLE command in zheap. Ashutosh Sharma, reviewed by Dilip Kumar and Amit Kapila. 
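To make the "Make undolog meta-data checkpoints consistent" commit above more concrete, here is a minimal sketch of the two-copies-of-meta-data idea. All structure, field and function names below are assumptions invented for the example; they are not the actual undolog code.

    /* Illustrative sketch of keeping a live copy and a checkpoint snapshot of
     * an undo log's meta-data.  Names are placeholders, not real symbols. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct UndoLogMetaDataSketch
    {
        uint64_t insert;        /* stand-ins for the real meta-data fields */
        uint64_t discard;
    } UndoLogMetaDataSketch;

    typedef struct UndoLogControlSketch
    {
        UndoLogMetaDataSketch meta;             /* current meta-data */
        UndoLogMetaDataSketch checkpoint_meta;  /* snapshot as of the redo point */
        bool checkpoint_meta_valid;
    } UndoLogControlSketch;

    /* Called (under the log's lock) before each meta-data modification, but
     * only while a checkpoint is in progress: the first time we notice that
     * we have crossed the redo point, save the current meta-data so that the
     * checkpointer writes that snapshot rather than the live, moving copy. */
    static void
    maybe_capture_checkpoint_snapshot(UndoLogControlSketch *log,
                                      uint64_t current_insert_lsn,
                                      uint64_t redo_point_lsn)
    {
        if (!log->checkpoint_meta_valid && current_insert_lsn >= redo_point_lsn)
        {
            log->checkpoint_meta = log->meta;
            log->checkpoint_meta_valid = true;
        }
    }

After the checkpoint completes, the snapshot would be discarded and the check skipped again until the next checkpoint begins, which is why the commit says the capture should ideally happen only once per log per checkpoint.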
commit 652911807ff33f35a7c297fdde709fbfc30f43be Author: ashu Date: 2018-02-08 17:44:50 +0530 Bugfixes in zheap_to_heap and heap_to_zheap APIs Allow zheap_to_heap and heap_to_zheap to allocate the values and nulls array based on the number of attributes specified in tuple descriptor rather than tuple header. Ashutosh Sharma, reviewed by Dilip Kumar and Amit Kapila. commit fc6d0379c922e8c7cf797ec7c7247eb845c6ddb4 Author: ashu Date: 2018-02-08 17:43:55 +0530 Restrict clustering of zheap tables Ashutosh Sharma, reviewed by Dilip Kumar and Amit Kapila commit fe470e36e6ef990861d210cde6fc659657554ac3 Author: Thomas Munro Date: 2018-02-07 21:56:43 +1300 Don't try to use DSM segments for undo logs in single-user mode. Thomas Munro, per bug report from Kuntal Ghosh commit 5f10daf1c6a1a91b4311f0a578fb9f86eb1fd5cf Author: Amit Kapila Date: 2018-02-06 14:13:29 +0530 Avoid stack overflow in visibility routines Till now, we were traversing the undo chain in a recursive way which could easily lead to stack overflow for very large transactions. Traverse the undo chains in a non-recursive way. Amit Kapila, reviewed by Dilip Kumar commit 2b3c4b122d756e4645c69709ed069eb754705a07 Author: Amit Kapila Date: 2018-02-05 18:28:47 +0530 Implementation of 'For Update and For Share' tuple lock modes This requires tuple to be locked in Exclusive or Shared mode and it will conflict update, delete and other modes of locks. Currently, multiple lockers for shared mode are not supported, that can be done as a separate patch. For other types of lock modes, user will get error "unsupported lock mode". Amit Kapila and Kuntal Ghosh commit 57a987ada64779896e4c5653d7e65a628f51233c Author: Mithun CY Date: 2018-02-05 03:28:56 -0800 Bug fix in bitmap scan of zheap Consider zheap pages when calculating MAX_TUPLES_PER_PAGE Patch by Mithun C Y Review by Amit Kapila commit 35c926be66fb6d991335e93ded5ae59e1919bf2b Author: dilip Date: 2018-02-02 13:37:59 +0530 Warning fix commit 1d455e7504fd2a662c0624f36a42f1b303afdb20 Author: dilip Date: 2018-02-02 11:14:50 +0530 Minor check fix while recovery from worker Reviewed by Rafia Sabih commit 167901ea80db5e86c64e68630ac5c1082acf464b Author: dilip Date: 2018-02-02 09:56:44 +0530 Bugfix in wal recovery Flag, is_first_rec is not reset after allocating the first undolog for the transaction and it was considering transaction header for subsequent allocation for the transaction. Patch by Dilip Kumar Reviewed by Amit Kapila. commit 1ddec06687a1c662d898c865d0a51fe85587040e Author: Rafia Sabih Date: 2018-02-01 17:53:05 +0530 Fix for Xmax value of unmodified zheap tuples Now, output InvalidTransactionId as the value of xmax for the zheap tuples which are unmodified. Previously, it was set to FrozenTransactionId which was incoherent with the behaviour of heap. Reported and reviewed by Ashutosh Sharma commit d8c3e25148045a50a0f0a767cf803dad588e42d1 Author: Rafia Sabih Date: 2018-01-31 18:11:18 +0530 Bug fix for undo actions If backend tries to apply undo actions for a record which is already discarded by undo worker, then exit quietly. Reported by Neha Sharma commit 13ecdf57369df7a70087ce1a814fd5586dc420cb Author: Kuntal Ghosh Date: 2018-01-29 15:18:32 +0530 During recovery, set correct urecptr for non-inplace updates commit 7f8c4546c7456a233b7317a08d730c797e23032f Author: Thomas Munro Date: 2018-01-26 15:18:35 +1300 Fix recovery of undo logs after a standby crash. A standby should never delete undo log meta data files that are referenced by its own control file, or it would fail to start up. 
RM43132, analysis and patch by Ashutosh Sharma, tweaked by me commit 83159870bccd343672cb228580ec5f3c779c6b34 Author: Rafia Sabih Date: 2018-01-25 18:55:09 +0530 Recovery of zheap relations when rollbacks are pending Now, the undo actions are performed at the time of restart for all those transactions that were in progress at the time of system crash or the last time when system was up. This ensures proper recovery of transactions involving zheap relations. A caveat to note here is that undo worker is initialised with the connection to default database 'postgres'. Reviewed by Dilip Kumar and Amit Kapila commit c308be7d90a8a8ffb068203303cf0c1c99ff4fb2 Author: Kuntal Ghosh Date: 2018-01-25 11:00:10 +0530 Fix definition of SizeOfZHeapMultiInsert Reported by Amit Kapila commit 34619026014c203f2bd9a2d151ffae2778140ea4 Author: Kuntal Ghosh Date: 2018-01-24 17:43:10 +0530 Allow ANALYZE on zheap table using VACUUM ANALYZE command Also, we forward the warning to LOG to avoid regression failure of some test cases. commit d5454ef8d5e0556a1fceff48dcf13d55ecc8cffa Author: Kuntal Ghosh Date: 2018-01-24 13:47:56 +0530 Assign table oid while constructing heap tuple from zheap tuple commit 81e55b488c707be50437484d21a1e92b553b291d Author: Amit Kapila Date: 2018-01-24 12:39:39 +0530 Move zheap_insert's buffer modifications inside critical section Previously, it was not done because we thought we need to add the tuple in page before forming the undo record as undo record requires blockid and offset which we can only get after adding the tuple. However, on closer inspection, actually, we only need blockid which we can get from the buffer. Patch by me, reviewed by Rafia Sabih commit 9351f042a09197bcae827889e3b78ddea2e7b68d Author: Kuntal Ghosh Date: 2018-01-02 12:07:15 +0530 Concurrent index creation on zheap tables Add the ability to create indexes 'concurrently' on zheap relations, that is, without blocking concurrent writes to the table. Patch by me, Reviewed and tested by Ashutosh Sharma, Amit Kapila commit 4780ba565205c638a890f053908157d128ea2470 Author: Beena Emerson Date: 2018-01-23 14:44:20 +0530 Support storage_engine option for partitioned relations Allow partitioned table to have the storage_engine option and throw error when user tries to create a partition with storage_engine different from the parent. Reviewed by Rafia Sabih, Ashutosh Sharma commit dec2957cbddadefda7b49fbf9c3a0d920d01a08b Author: Rafia Sabih Date: 2018-01-23 12:28:41 +0530 Bug fix for GiST indexes on ZHeap relations. commit 9444dc2d14dc3f11d4a79156577475afe6e3f393 Author: Kuntal Ghosh Date: 2018-01-23 12:00:34 +0530 Fix warning in nodeSamplescan Reported by Amit Kapila commit bc3ff0e94567b0e85de7830113e84468fa003ed8 Author: Amit Kapila Date: 2018-01-23 11:11:13 +0530 Add comments to elaborate why we always use Top Transaction Id in zheap. commit a1b34455e7e1ca3b48373c192c25773e84b33b95 Author: Mithun CY Date: 2018-01-22 02:46:07 -0800 Bug fix in space reuse Fix Null pointer access. Patch by Mithun C Y reported by Ashutosh Sharma. 
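The "Move zheap_insert's buffer modifications inside critical section" commit above refers to the standard backend ordering: all page changes, the MarkBufferDirty() call and the WAL insertion happen inside one critical section, and the page LSN is set from the returned record pointer. A rough sketch under that assumption follows; it is not the actual zheap_insert() code, and the rmgr/record identifiers are placeholders.

    /* Generic page-modification/WAL pattern, not the real zheap_insert(). */
    #include "postgres.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "utils/rel.h"

    #define SKETCH_RMGR_ID   0   /* placeholder: the real code uses the zheap rmgr */
    #define SKETCH_XLOG_INFO 0   /* placeholder: the real code uses its insert opcode */

    static void
    insert_sketch(Relation relation, Buffer buffer)
    {
        START_CRIT_SECTION();

        /* ... place the tuple on the page; only the block number, available
         * from the buffer itself, is needed to prepare the undo record ... */

        MarkBufferDirty(buffer);

        if (RelationNeedsWAL(relation))
        {
            XLogRecPtr  recptr;

            XLogBeginInsert();
            XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
            /* ... register the record data for the insert ... */
            recptr = XLogInsert(SKETCH_RMGR_ID, SKETCH_XLOG_INFO);
            PageSetLSN(BufferGetPage(buffer), recptr);
        }

        END_CRIT_SECTION();
    }
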
commit 34be2349dca4d238ec3fe99fbe4bbc299ca0c615 Author: Kuntal Ghosh Date: 2018-01-22 12:28:25 +0530 Fix definition of SizeOfZHeapMultiInsert Reported by Amit Kapila commit 0257c1a99b1c1958d6c1ed535d219ab7ad4e73f6 Author: Rafia Sabih Date: 2018-01-19 15:55:49 +0530 Add zheap relevant functions in DefineQueryRewrite Reviewed by Ashutosh Sharma commit 38cef475cb6c6b74765be07b2e1f419661c002f6 Author: dilip Date: 2018-01-19 13:18:40 +0530 Bug fix in space reuse The item is getting accessed without checking whether it's deleted or not. Patch by Dilip Kumar reviewed by Mithun C.Y. commit ce9e110fb857e050aea92c5f1affbf7b7246d6b1 Author: ashu Date: 2018-01-19 11:54:19 +0530 Bugfix in zheap_update(). pfree memory allocated for payload bytes during non-inplace update in zheap_update(). Ashutosh Sharma, With some help from Dilip and Kuntal. commit 079443ce69d8425b42bbd217579753f6ce8d8ee6 Author: dilip Date: 2018-01-17 13:59:30 +0530 Perform undo action in case of error. If there is some error while executing an SQL statement, the patch takes care of applying the undo actions required for rolling back the work done. Patch by Dilip Kumar Reviewed by Rafia Sabih and Amit Kapila commit 9229daa6df42bc3eb60de0b4ef0728ea3a67a011 Author: dilip Date: 2018-01-18 11:08:17 +0530 Block TidScan for the zheap Currently, tidscan is not implemented for zheap, so give an error. commit 878ea5ed9c76ce7481a86d636728cc3e7b8e288a Author: Kuntal Ghosh Date: 2018-01-17 15:47:05 +0530 Bugfix in IndexBuildZHeapRangeScan We should avoid freeing the zheap tuple from the slot in IndexBuildZHeapRangeScan when pageatatime scan mode is used for scanning the underlying relation. Reported by Ashutosh Sharma commit 4ce7628b00bec6b0058ec9aeb4eade82075fd0b5 Author: Rafia Sabih Date: 2018-01-17 17:19:29 +0530 Extend the support of exclusion constraints to zheap relations Reviewed by Ashutosh Sharma commit a842d28fcb9de980a290268d8c557e1749bb232b Author: dilip Date: 2018-01-17 13:58:37 +0530 Bug fix in rewind If we are under a subtransaction then just reuse one slot, because during the rollback of the subtransaction we will rewind the undo insert location and the undo written for invalidating the slot will be overwritten. So, it is better to invalidate only one slot which our transaction is going to use. Patch by Dilip Kumar reviewed by Amit Kapila commit e5e03186920b5a79025956258b17fe87b3d50699 Author: Kuntal Ghosh Date: 2018-01-11 17:48:06 +0530 Fix mask value used in ZHeapTupleHeaderGetNatts ZHeapTupleHeaderGetNatts should use ZHEAP_NATTS_MASK to fetch the number of attributes from t_infomask2. commit 4c6d4c685f52b0a7d998097734d11b5387ee05c3 Author: Kuntal Ghosh Date: 2018-01-11 11:49:25 +0530 Implement TableSampleScan for zheap Reviewed by Mithun CY commit 68bee6f61f146ef71630d3809482a1944ef40ddd Author: Mithun CY Date: 2018-01-10 03:26:41 -0800 Fix cursor fetch backward scan for zheap Reviewed by Kuntal Ghosh commit ed720bd223e3ce9842681ccacd4de84b052c619c Author: Mithun CY Date: 2018-01-10 03:01:21 -0800 Get rid of KeyTest in zheap scan zheap does not support catalog tables, hence there is no need for keytest as in heap. So adding asserts to acknowledge the same. 
Reviewed by Kuntal Ghosh commit cb19200ec6837fa2fb1aef37ad045db5661c892e Author: Kuntal Ghosh Date: 2018-01-10 15:31:23 +0530 Fix brin summarize for zheap Reported by Mithun CY commit 9d6ee4a45df4911854ac32d205eeaab648ce4fb0 Author: Kuntal Ghosh Date: 2018-01-10 11:01:34 +0530 Implement bulk insert strategy for zheap_insert Reported by Ashutosh Sharma commit 8253166c9f8100e6ce75049004787215f5befe37 Author: Kuntal Ghosh Date: 2018-01-08 17:43:42 +0530 Fix sysattributes fetching for the ZHeap Make sysattributes fetching work with 64-bit transaction ids. commit e9a4184a96df7f96b53f1a849cdb75cae6b4ee2a Author: Kuntal Ghosh Date: 2018-01-09 17:27:39 +0530 Handle COPY FROM for non-multi-insert mode commit 78d680eb0deef07ead5b93c70168284a73ea30ec Author: Kuntal Ghosh Date: 2018-01-09 16:11:59 +0530 For triggers, tuple table slot shouldn't try to free memory itself commit 250fef496972efec042ec83a490d377b1621f4a1 Author: Amit Kapila Date: 2018-01-08 14:00:45 +0530 Support Epoch in Zheap pages The idea is to make the change related to 64-bit transaction ids only for the zheap pages whereas heap pages still operate with 32-bit transaction ids. We will need wraparound and freeze vacuums for metadata (system tables) stored in heap, but not for data stored in zheap tables. The way to make 64-bit transaction ids in zheap is to store the epoch along with the transaction id in each transaction slot (which will make each transaction slot 16 bytes: 4 bytes transaction id, 4 bytes epoch, 8 bytes undo pointer). Now, if we somehow ensure that there is no undo for any transaction whose age is 2-billion transactions (the wraparound limit), then we can easily make out the visibility using the current epoch and oldest_xid_having_undo. The way to achieve it is to stop the system if it reaches such a situation. We piggybacked on the existing wraparound machinery to raise different warning messages once the system reaches that stage. If the epoch+xid in the page is less than oldestXidWithEpochHavingUndo then it is all visible; otherwise, the transaction will belong to the current epoch and all the current rules of the current transaction system will work. The reason for relying on the 2-billion transaction age limit is that the current system (transaction-related functions like TransactionIdPrecedes) relies on that and we don't want to change it. We won't need any vacuum for freezing the transaction ids or wraparound vacuums for zheap pages after this commit. Amit Kapila, with some contribution by Dilip Kumar, reviewed and verified by Kuntal Ghosh and Dilip Kumar. commit 0cc6dbc97f9320a21cad0d7deb86a9867b51e97f Author: dilip Date: 2018-01-08 11:19:01 +0530 Bugfix in rollback to savepoint When rollback (partial or complete) is done from the backend then rewind the insert location of the undo log. commit 455bb1d00b8258f0e76b88cbae9705e308c8900a Author: dilip Date: 2018-01-08 10:22:20 +0530 Bugfix in parallel query Added function for parallel begin scan on the zheap table. Patch by me, reported by Mithun C.Y, reviewed by Amit Kapila commit e9d75ef1410647ef223ebf52528d8010ede6f118 Author: Kuntal Ghosh Date: 2017-12-29 12:11:00 +0530 Implement ANALYZE on zheap tables Reviewed by Amit Kapila commit a74fac0c485211a102ffe4208a9549f0f5f5de52 Author: Mithun CY Date: 2018-01-04 19:16:00 -0800 Bug fix for zheap page pruning When pruning we should ignore deleted itemids, as there is no tuple space associated with them. Also, if a tuple is non-inplace updated, then consider the previous space associated with it for pruning. 
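To make the "Support Epoch in Zheap pages" description above concrete, here is a small stand-alone sketch of the 16-byte transaction slot layout and the epoch+xid comparison it describes. The type and function names are invented for the example and do not match the actual zheap code.

    /* Sketch only: layout and 64-bit comparison described in the commit above. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct TransSlotSketch
    {
        uint32_t xid;        /* 4-byte transaction id */
        uint32_t epoch;      /* 4-byte epoch */
        uint64_t urec_ptr;   /* 8-byte undo record pointer */
    } TransSlotSketch;       /* 16 bytes per slot */

    /* Combine epoch and xid into a 64-bit value for the visibility shortcut:
     * anything older than oldest_xid_with_epoch_having_undo is all-visible
     * without consulting undo. */
    static inline uint64_t
    full_xid(uint32_t epoch, uint32_t xid)
    {
        return ((uint64_t) epoch << 32) | xid;
    }

    static inline bool
    slot_is_all_visible(const TransSlotSketch *slot,
                        uint64_t oldest_xid_with_epoch_having_undo)
    {
        return full_xid(slot->epoch, slot->xid) < oldest_xid_with_epoch_having_undo;
    }
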
commit 3d4a58d41d19cb296633801169c31eeb7caf4c23 Author: Rafia Sabih Date: 2018-01-04 17:51:48 +0530 Bug fixes for storage_engine option Ignore the storage_engine option for views or partitioned relations. Avoid adding storage_engine option multiple times when it's given in conf file as well as in CREATE statements. Reported by Ashutosh Sharma Reviewed by Amit Kapila and Ashutosh Sharma commit 34dfda185019a93abdb77894558e37df9cd2b9b1 Author: Kuntal Ghosh Date: 2018-01-02 14:02:53 +0530 Implement SatisfiesNonVacuumable for zheap Patch by me, Reviewed by Amit Kapila commit 63e4455ec4009eae16fc2d4bbcc8c14767f0c5dd Author: Amit Kapila Date: 2017-12-22 13:57:59 +0530 Remove redundant function definition added by commit 8aeb255f. commit fb16b15dd108f1769587a5f81d57ac5e9a9478ab Author: Amit Kapila Date: 2017-12-22 13:51:13 +0530 Support space reuse within a page Allow reuse of space for transactions that deletes or non-in-place updates the tuples or updates the tuples to shorter tuples. It has the capability to reclaim the space for deleted tuples as soon as the transaction that has performed the operation is committed. There is some difference between the way space is reclaimed for transactions that are committed and all-visible vs. the transactions that are committed but still not all-visible. In the former case, we can just indicate in line pointer that the corresponding item is dead whereas for later we need the capability to fetch the prior version of tuple for transactions to which delete is not visible. As of now, we have copied the transaction slot information in line pointer so that we can easily reach prior version of tuple. We try to prune the page for non-in-place updates when there is insufficient space on the page. The other times where we can try to prune the page could be while Inserting if there is insufficient space on a page or when buffer is being evicted or read, but we can do that as a separate patch. Mithun C Y and Amit Kapila. commit a1bce7facfd467c804ccab660f9befb3df8eae1f Author: dilip Date: 2017-12-22 10:55:30 +0530 Avoid using UndoDiscardInfo if it is not initialized If UndoDiscardInfo is not yet initialized, use UndoLogIsDiscarded to check whether an undo record is discarded. It also initialize UndoDiscardInfo for the same log. Reported by Mithun. 
Patch by me commit 4baf44c765524baf9833308af1d3807987f6889e Author: dilip Date: 2017-12-18 17:47:48 +0530 Fix warnings in zheap code Reported by Thomas Munro commit 92c2b94de69c49504f3b7d79d676b364870ff279 Author: Rafia Sabih Date: 2017-12-12 14:53:54 +0530 Bug fix for prevlen of the first record of the log Reviewed by Dilip Kumar commit 6616d18f3080f2d80b21c891d2376ffe58b51614 Author: Kuntal Ghosh Date: 2017-12-08 18:23:09 +0530 Restrict transition table creation for zheap commit 97119d13a22de3759574c30ebc442cf7e00e3a49 Author: Kuntal Ghosh Date: 2017-12-08 16:55:25 +0530 Fix defect in applying undo action in case of abort commit c8287a5eaac5bc934645e4cc9b51da12d8f3f71e Author: Kuntal Ghosh Date: 2017-12-08 18:25:57 +0530 Fix dirty-snapshot behaviour for tuples inserted by aborted transactions Reviewed by Amit Kapila commit 1369b1c6dc27d7eeac63791725f9e010ee09e495 Author: Kuntal Ghosh Date: 2017-12-07 17:50:30 +0530 Restrict foreign key triggers on zheap tables commit 1c1befcf0e8a08ff1f5cdd49110a0f4a0140afbc Author: Kuntal Ghosh Date: 2017-12-07 17:49:51 +0530 Restrict INSERT ON CONFLICT on zheap tables commit e135508ebd7292042d8d12b820198025e569b9b8 Author: dilip Date: 2017-12-07 17:40:47 +0530 Fix issue in PageFreezeSlot Memory was freed without checking for NULL; fixed the same. Reported by Kuntal Ghosh commit 1ba3aebd1b0d1789c99ba816237b487f950e40ff Author: dilip Date: 2017-12-07 16:07:34 +0530 Fix issue in regression test Avoid executing undo action when transaction is not in progress Reported by Kuntal Ghosh commit 787fa1ff9697dcf872546812ad606a7e41950287 Author: Kuntal Ghosh Date: 2017-12-06 15:46:37 +0530 During recovery of inplace-update, fix old tuple header Reviewed by Amit Kapila commit 2eec4974f8da5a7db17524250fd466fbe8995dc7 Author: Kuntal Ghosh Date: 2017-12-06 15:43:52 +0530 Update tuple length during inplace-update During inplace-update, the tuple should be updated to the new tuple. Reviewed by Amit Kapila commit 03014036363d2c494ee80f95d64f5abd5bdae48a Author: Kuntal Ghosh Date: 2017-12-06 15:26:15 +0530 Store block number before preparing the undo record In undorecord, we should store the block number before preparing the undo record. Otherwise, the expected size of the undo record won't consider the size for including the block info. Reviewed by Amit Kapila commit e7fb026e79d9b724edc54d1b78d2725651d77757 Author: dilip Date: 2017-12-07 11:19:04 +0530 Fix defect in applying undo action in case of abort Reviewed by Kuntal Ghosh commit c646ff19e7fdaddba0b8757ce14e8ab0e824329b Author: dilip Date: 2017-12-06 10:35:03 +0530 Fix Typo Reported by Rafia commit dac6bdc623bd7b8502dd6642f784024a50abfb1d Author: dilip Date: 2017-12-05 18:55:41 +0530 Fix warnings commit 59eb32500326cfb4db8b1336d4ab9d97aa6741f4 Author: Amit Kapila Date: 2017-12-05 18:22:01 +0530 Rewind the undo pointer after applying undo actions After applying all the undo actions for a page, we clear the transaction slot on a page if the undo chain for the block is complete; otherwise, rewind the undo pointer to the last record for that block that precedes the last undo record for which the action is replayed. This will help us in knowing whether the undo action is already replayed during recovery. 
Patch by Amit Kapila, reviewed by Dilip Kumar commit 43c6e1c541a4e9d6cc310c1e4b46e514ddfc9f98 Author: Kuntal Ghosh Date: 2017-12-05 14:30:26 +0530 UndoDiscardInfo should be updated under an exclusive lock on the mutex Reviewed by Dilip Kumar commit 41c6d9d337e2b09448ff324991c639793e61889e Author: Kuntal Ghosh Date: 2017-12-05 14:28:00 +0530 UndoGetOneRecord should be called under a shared lock on UndoDiscardInfo Reviewed by Dilip Kumar commit 49b755159affcc0f0e23f3591ddc46d2a9a1e7fc Author: dilip Date: 2017-12-01 15:27:16 +0530 Remove generating the subtransaction id in zheap Currently in heap we need to maintain the subxact id so that we can identify the changes done within a subxact and properly identify the visible tuple if the subxact status is not the same as that of the main transaction. But in zheap we have undo, and with the help of that we can immediately revert the effect of the subxact, so we never need to track the status of the subxact. This will also help in maintaining the limited slots inside the zheap page, because if we assign a separate xid to a subxact then we may need to provide an extra slot for each subxact. Patch by Dilip Kumar reviewed by Amit Kapila commit c0c59ddea242f191eaeb55ddb2598ec31e77b96a Author: Kuntal Ghosh Date: 2017-12-04 17:50:36 +0530 During slot-reuse replay, use physical tuple if not included in wal record If the tuple is not included in the wal record, access the physical tuple to fetch the corresponding slot number for the same tuple. Reported by me, patch by Dilip Kumar. commit a83904b8bb7e5aa81e8906d2daefed70afdacd80 Author: Mithun CY Date: 2017-12-04 20:48:01 -0800 Zheap tuples are locally allocated hence set tts_shouldFree as true. Reviewed by Kuntal commit 5d561693d232b766ed618f48498b9a667cc1ac56 Author: dilip Date: 2017-12-04 17:36:45 +0530 Fix issue in prevlen Return without releasing the lock Reported by Kuntal Ghosh commit b5071ac66cf206801b0de4f0cc9de4390bb59cf8 Author: Kuntal Ghosh Date: 2017-12-04 14:34:27 +0530 Avoid using UndoDiscardInfo just after undo shmem initialization If UndoDiscardInfo is not yet initialized, use UndoLogIsDiscarded to check whether an undo record is discarded. It also initializes UndoDiscardInfo for the same log. Reported by Neha Sharma, patch by me, reviewed by Dilip Kumar commit f916f158f39676de2ab755118924518e3c6f45b1 Author: dilip Date: 2017-11-28 09:16:00 +0530 Make prevlen crash safe Currently the transaction's previous record length is not crash safe, but there is a possibility that a transaction can be spread across checkpoints; in such a case, while preparing the first undo record, we do need to have the length of the previous record which was inserted before the checkpoint. For fixing the same, we are maintaining this value in the undo meta and WAL logging it along with other undo meta. As part of this patch we also removed one unused structure member in the undo meta. 
Patch by Dilip Kumar Reviewed by Amit Kapila commit f20d78fe0d048ba600162e0f5d794399e0dbee13 Author: Kuntal Ghosh Date: 2017-11-22 16:27:25 +0530 Write WAL for zheap_multi_insert Reviewed by Dilip Kumar, Amit Kapila commit ed0f47495ef7de6c0e0e8940b6ac630704b0bf77 Author: Kuntal Ghosh Date: 2017-11-22 16:08:25 +0530 Set uur_type for multi_insert at begining Reviewed by Amit Kapila, Dilip Kumar commit e13cb875476963959cea79edd114fa8d701909e0 Author: Kuntal Ghosh Date: 2017-11-14 14:49:30 +0530 Fix assertion failure in zheap_delete Reported by Neha Sharma commit ee06e282b6bb1c64d6c0820fb2c690778af71f71 Author: Kuntal Ghosh Date: 2017-11-21 20:32:49 +0530 Derive latestRemovedXid for hash deletes by reading zheap pages Reviewed by Amit Kapila commit bf78457e2e7706338d00f60fbe487b315b626ff5 Author: Kuntal Ghosh Date: 2017-11-13 21:02:12 +0530 Derive latestRemovedXid for btree deletes by reading zheap pages Reviewed by Amit Kapila commit 53b6d509674a0aa1b59c00306bdec63c9cad3110 Author: Kuntal Ghosh Date: 2017-11-22 10:26:35 +0530 Implement exclusion constraints for zheap Patch by Ashutosh Sharma. Reviewed by Kuntal Ghosh, Amit Kapila. commit a4f00f80ba4745c45be7d747bdd7d4a6363a78f7 Author: Amit Kapila Date: 2017-11-22 09:33:47 +0530 Implement SnapshotSelf for zheap Returns the visible version of tuple (including effects of previous commands in current transaction) if any, NULL otherwise. This will be required for features like exclusion constraints to work with zheap. Ashutosh Sharma and Amit Kapila commit a1f49ef37d8bbfd19741bc76b0c749000c9d281d Author: Rafia Sabih Date: 2017-11-21 12:13:44 +0530 Bug fix in rollbacks in zheap The patch fixes the case where ending undo record pointer is not provided. This is particularly required in rollback to savepoint type of scenarios. RM 42958, reviewed by Amit Kapila. commit 64ecef026b765bba8262b3acce21b3c308e2b02a Author: Amit Kapila Date: 2017-11-21 08:54:58 +0530 Implement the API to advance the latest xid that has removed the tuple. This will be required by the future patches to implement free space management and xlog for indexes like btree and hash. commit b6d4a835ee52a6c0eec1e2e29c0dc84f866ab42a Author: Amit Kapila Date: 2017-11-20 17:43:12 +0530 Implementation of undo actions for zheap operations Implemented undo actions for delete, update, lock tuple, multi-insert operations. Patch by Amit Kapila, reviewed by Rafia Sabih commit 6828caf8f4d53788756079e0059a0d117e92afe3 Author: Amit Kapila Date: 2017-11-20 17:27:37 +0530 Infrastructure to execute undo actions at page-level Starting from the last undo record of a transaction read the undo records in a backward direction till first undo record of a transaction and accumulate the actions per page. As soon as the action is for a different page, we execute the accumulated actions of previous page. We are logging the complete page for undo actions, so we don't need to record the data for individual operations. We can optimize it by recording the data for individual operations, but again if there are multiple operations, then it might be better to log the complete page. Patch by Amit Kapila, reviewed by Rafia and Dilip commit c00df8c30bd5baa1eb1ea14f89747ee5ea7d7fe0 Author: Kuntal Ghosh Date: 2017-11-20 09:56:20 +0530 Write WAL for zheap_lock_tuple Patch by Ashutosh Sharma and Kuntal Ghosh. Reviewed by Amit Kapila. 
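The "Infrastructure to execute undo actions at page-level" commit above describes walking a transaction's undo records backwards, accumulating actions per page and applying each batch when the block changes. Below is a small stand-alone sketch of that accumulation loop; the types and helpers are invented for the example and are not the real undo API.

    /* Sketch of the per-page accumulate-and-flush loop described above. */
    #include <stddef.h>

    typedef struct UndoRecSketch
    {
        unsigned block;                 /* block the record applies to */
        struct UndoRecSketch *prev;     /* next-older record of the same xact */
    } UndoRecSketch;

    /* Stand-in: the real code would lock the buffer, apply each record's
     * action and WAL-log the resulting page. */
    static void
    apply_actions_to_page(unsigned block, UndoRecSketch **recs, size_t nrecs)
    {
        (void) block; (void) recs; (void) nrecs;
    }

    static void
    execute_undo_actions(UndoRecSketch *last_rec)
    {
        UndoRecSketch *batch[64];
        size_t nbatch = 0;

        /* Walk backwards from the transaction's last undo record towards its
         * first one, grouping consecutive records that touch the same block. */
        for (UndoRecSketch *rec = last_rec; rec != NULL; rec = rec->prev)
        {
            if (nbatch > 0 &&
                (rec->block != batch[0]->block || nbatch == 64))
            {
                /* Different page (or batch full): flush what we have so far. */
                apply_actions_to_page(batch[0]->block, batch, nbatch);
                nbatch = 0;
            }
            batch[nbatch++] = rec;
        }
        if (nbatch > 0)
            apply_actions_to_page(batch[0]->block, batch, nbatch);
    }
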
commit c00823a0787680eb8c0f02b8d998c6cf3bad39b1 Author: Rafia Sabih Date: 2017-11-17 15:59:40 +0530 Bug fix for rollback inserts The page header size was not included in the calculation of prevlen for undo records when the record starts from a new page. The patch fixes the issue. commit b7bda571bb9c01fe2cb9355f9fd498cb6aa06078 Author: dilip Date: 2017-11-16 18:10:42 +0530 Write WAL for invalidating the transaction slot The important consideration for logging the slot-invalidation operation is that we need to identify the tuples pointing to these slots, and also the transaction id in the slot, while performing the redo operation for the undo log. So, we are conditionally (when full page writes are off) writing tuple offsets and the slot xid in WAL as well. In case full page writes are on, we can rely on the tuples and slots from the page. Patch by Dilip Kumar, Reviewed by Kuntal Ghosh and Amit Kapila commit 9c9a8f88ad2b8fea8a319e0d9aad40924e1f5684 Author: dilip Date: 2017-11-16 18:09:18 +0530 Write WAL for slot Freeze operation Write the information of the slots which got frozen, and during recovery we can generate the information of the tuples which are pointing to these slots and mark them frozen, the same as we are doing in the DO operation. Patch by Dilip Kumar, Reviewed by Kuntal Ghosh and Amit Kapila commit f1be5834b52f5023522af5ce862f9c2c40d7048e Author: Amit Kapila Date: 2017-11-15 17:17:17 +0530 Store old version of tuple in WAL for delete Patch by me, reported and reviewed by Ashutosh Sharma commit e1a47f763d17e4bc33af7ae64fa234f7fefbed96 Author: Rafia Sabih Date: 2017-11-14 13:26:53 +0530 Support for SELECT INTO statements for zheap This enables the creation of tables in zheap format through SELECT INTO statements. Reviewed by Amit Kapila. commit e1afa7d5eacde17d322dadb659f1d4c78d628cdb Author: Rafia Sabih Date: 2017-11-14 11:56:43 +0530 GUC for storage_engine Now, we have a GUC -- storage_engine to specify the storage_engine option, which can be modified before server start. Currently, we have two options for this GUC, viz. heap and zheap, with its default value being heap. Note that once this GUC is set to zheap, relations are created in zheap format, irrespective of whether with (storage_engine ='zheap') is specified in the CREATE TABLE statement or not. Reviewed by Amit Kapila. commit 2587d533e90aa99701d3125f2dc0ac00c9a54029 Author: Amit Kapila Date: 2017-11-13 14:11:58 +0530 Use better way to set item pointer. Suggested by Kuntal Ghosh. commit 2c3e87e4e9efdef30663c93ee606c12de10c78de Author: Amit Kapila Date: 2017-11-13 13:09:03 +0530 WAL log update operation in zheap The important consideration for logging the update operation is that we need to generate the entire old tuple while performing the redo operation for the undo log. So, we are conditionally (when full page writes are off) writing the complete tuple in WAL as well. In case full page writes are on, we can rely on the tuple from the page. For the new tuple, we are writing just the diff tuple as we are doing for heap. Patch by me, reviewed by Kuntal Ghosh. commit a6eb456784db76b444121ef171d635e4c4a01e13 Author: Amit Kapila Date: 2017-11-13 08:56:41 +0530 Fix type of undo record in zheap_multi_insert commit d1859b51bd51ac39d8915f40df7ff679c3df0461 Author: Amit Kapila Date: 2017-11-10 18:31:54 +0530 Fix the calculation of previous undo record length When the undo record crosses a page boundary, we need to consider the page header length, which was being ignored. Fix the same. 
commit 4159a0175ce16bd5684314b63a23c0cf9fc8733e Author: Thomas Munro Date: 2017-11-09 22:27:44 +1300 Show unattached undo logs as null xid and pid in pg_stat_undo_logs. commit fce53d149e2b6e67e40eb33b92517cd58673cd4b Author: dilip Date: 2017-11-08 17:13:31 +0530 Enhance zheap_lock_tuple Handled the case when the same tuple is locked multiple times by the same transaction. Patch by me, reviewed by Ashutosh Sharma and Amit Kapila. commit 91cd6e14fa9e2a2a9a4bb929750951ed7cfa68ef Author: dilip Date: 2017-11-08 17:03:49 +0530 Bug fix in trigger for zheap Currently we are always getting the older as well as the newer version of the tuple from zheap. But in case of an inplace update, both have the same tid and will fetch the same tuple. The patch fixes the bug by getting the older version of the tuple from the undo log. Patch by me, reviewed by Ashutosh Sharma and Amit Kapila. commit 0ef921f6fc46aa6a3fd20eac9d199a3e173c1fc1 Author: Kuntal Ghosh Date: 2017-11-06 16:00:43 +0530 Restrict ExecLockRows on zheap relations As of now, we don't support FOR SHARE/FOR UPDATE on zheap tables. So, throw an appropriate error for the same. Reviewed by Dilip Kumar commit 3e73bc3ccd71c28c962749dcda4e54bc8f245022 Author: Amit Kapila Date: 2017-11-08 11:18:50 +0530 WAL log delete operation in zheap The important consideration for logging the delete operation is that we need to generate the entire tuple while performing the redo operation for the undo log. So, we are conditionally (when full page writes are off) writing the complete tuple in WAL as well. In case full page writes are on, we can rely on the tuple from the page. Patch by me, reviewed by Ashutosh Sharma. commit eaf3708c605930073dcb0d64caad1c9dd13efb01 Author: Kuntal Ghosh Date: 2017-11-07 15:42:10 +0530 Fix assert in bitgetzpage commit 01104c4a5af83cff71704bb399172e9c050929c0 Author: dilip Date: 2017-10-27 13:58:08 +0530 Support sysattributes fetching for the ZHeap Properly fetch the xmin, xmax, cmin and cmax from the undo. Patch by Rafia, some defects fixed by me, reviewed by me and Amit. commit 50f4297716c242fee4cb8e34c2903ccef297983f Author: Amit Kapila Date: 2017-10-25 18:11:37 +0530 Undo Action and Replay for Insert Operation The undo insert action is to mark the corresponding item as dead if the relation has any index, and unused otherwise. The need to mark the item as dead is that we can't completely clear the item till the corresponding index entry is marked as dead. Currently undo actions are replayed on Rollback or Rollback to savepoint. Ideally the action should be replayed on error as well, but we will deal with it in a separate patch. Amit Kapila, reviewed and tested by Rafia Sabih commit a461ec04e06d1675d635904caa3d993370c7ef0f Author: Amit Kapila Date: 2017-10-25 17:44:57 +0530 Update prevlen in undo record header with the length of previous record. This is required to traverse the undo chains from the last undo record to the first. Dilip Kumar. commit 66c00153e2de48f9b8ab5c8d2454f87cb2efea6d Author: Kuntal Ghosh Date: 2017-10-23 17:57:39 +0530 Bug fix in UndoLogAdvance During recovery, MyUndoLogState is uninitialized. Hence, it can't be used to fetch the undo control pointer while replaying undo records. Reviewed by Amit Kapila, Dilip Kumar commit 08870389179383469196b243b6b5662bc8d41cb6 Author: Kuntal Ghosh Date: 2017-10-18 14:21:19 +0530 Restrict altering non-empty zheap tables During an ALTER command, sometimes we scan and rewrite a table. For zheap, we have to implement the same in ATRewriteTable. For now, just throw an appropriate error. 
Reviewed by Amit Kapila commit 1207f60fb21542e8b6675335848d373e6aafed3d Author: Kuntal Ghosh Date: 2017-10-18 15:04:01 +0530 ALTER should restrict partitioning/inheritance to same storage_engine commit d43e896ba1e66d0a8246bbb3df58c4f38010fcbd Author: Rafia Sabih Date: 2017-10-17 17:07:05 +0530 Hibernate undo-worker when the system is inactive The undo-worker hibernates for a minimum of 100ms and a maximum of 10s, based on the time the system remained idle. RM42373, reviewed by Mithun Cy commit 9c612f44ad2204510ab13a620a003460889c41db Author: dilip Date: 2017-10-17 16:42:38 +0530 Bug fix in multi-insert uur_xid was not set properly in multi-insert, which was making undo-discard operate based on an invalid xid and was causing a segmentation fault. Reported by Ashutosh, fixed by me Reviewed by Kuntal commit 0023cc82949adc74fdb18c7825cf7e5d1988da49 Author: Kuntal Ghosh Date: 2017-10-04 15:20:38 +0530 Restrict partitioning/inheritance to same storage_engine The descendant/partitioned table inherits the same storage engine as its ancestors. Users should not specify any storage_engine for the descendant/partitioned table. Also, this commit doesn't include any sanity checks for the scenarios where a heap table inherits a zheap table and vice versa. Reviewed by Amit Kapila, Rafia Sabih commit 99623dfb48a055ae2b43db76d9e87747915fec09 Author: Rafia Sabih Date: 2017-10-17 10:39:51 +0530 Avoid iteration over all the backends in UndoDiscard Instead of iterating over all the backends, now iterate over only the active logs to determine the next log for undoDiscard. RM42373, reviewed by Dilip Kumar commit 8f40d7b7802c1904f4c845c744f62a366d2f9a8a Author: Thomas Munro Date: 2017-10-17 13:58:11 +1300 Fix undolog wordsize thinkos that broke 32 bit builds. commit 927c309586f38da7c737594c3beef65f7c51e5cc Author: dilip Date: 2017-10-16 21:11:17 +0530 Bug fix in update transaction start record In recovery, start transaction info was not updated properly: it was fetching prev_xact_start from a local variable, which is fine during a normal run, but InRecovery there is only one process, so we cannot depend upon the local variable; instead we should always fetch that from the log meta data. commit ea6bf762f2a6f5673e8978118edfee99767b844f Author: Thomas Munro Date: 2017-10-12 05:53:59 +1300 Implement UndoLogNextActiveLog(). Provide a simple way to iterate through the set of active undo log numbers. This interface might need some refinement to support UndoPersistence levels in future. RM42373, Rafia Sabih commit 712f332b5ac98b22110929ae2136e4b0c38dffab Author: Kuntal Ghosh Date: 2017-10-11 14:47:58 +0530 Don't discard undo if a transaction rolls back commit 1c2daa5b5a47856a227ad66c7cc08e761e147871 Author: dilip Date: 2017-10-11 09:48:37 +0530 bug fix in Synchronizing undo update with undo discard Update undo record was just checking whether the undo is discarded or not; instead it should acquire the discard lock and read the undo under the lock. Fixed the same. commit 8f17025103fc2c344f56b05770550e50e79b5bb3 Author: Kuntal Ghosh Date: 2017-10-05 16:48:57 +0530 Wrong undorecord expected size for the first record of an undolog When we attach a new undo log, a new transaction is definitely started. Hence, we should set the is_first_rec flag in the undo meta and calculate the undorecord expected size.
Reviewed by Dilip Kumar commit 23be6c5939487c691e77c03daffbdb04dff9afb4 Author: Kuntal Ghosh Date: 2017-10-04 11:36:04 +0530 Fix some compiler warnings commit 21c53521b5a1e98ba21a83ec0e8abd5fc97fa866 Author: ashu Date: 2017-09-28 15:11:33 +0530 Add missing null terminator to the dest string after memcpy(). memcpy() copies the specified bytes of a string from memory area src to memory area dest, but doesn't add a null terminator to the dest string. This patch handles the same in FindLatestUndoCheckpointFile(). Patch by me, reported by Amit Kapila. commit cee1d83ad2888cfe451ec6bf7c31c31ae79b694b Author: dilip Date: 2017-09-28 14:16:05 +0530 Fixed pending issues for WAL recovery Fixed the recovery for the subtransaction and properly updated the start transaction header record. commit 4df8c94e347ee6291f9310840164125e3c57a5fe Author: Amit Kapila Date: 2017-09-28 09:25:43 +0530 WAL log Insert operation in zheap We need to ensure that we log enough information so that during recovery we can construct the unpacked undo record and then insert it in the same way as we insert it for the DO operation. Note that we don't write full page images for undo logs as those are written serially, and we always apply all the WAL records for the undo log, unlike for zheap pages where we don't apply a record if the LSN on the page is greater than the LSN of the WAL record. This is to avoid problems like: if the page header is written after the last checkpoint but the other part is not, then the LSN in the page header will be the latest but not all the data. Patch by me, reviewed by Dilip Kumar. commit 604100e9ff28172ef1c4372745fafd5501e03a98 Author: ashu Date: 2017-09-27 16:45:24 +0530 Throw an error message if index-only scan is performed for zheap tables. Patch by me, per suggestions from Amit Kapila. commit 91630e7f7129b76fe46f29637f4117fca7ef3963 Author: ashu Date: 2017-09-26 16:30:52 +0530 Make the macro 'RelationStorageIsZHeap' (used to determine if the relation storage format is zheap or not) more robust. Patch by me, quickly reviewed by Dilip. commit 31a1c21f3e38e1935a8e8d9061ea307273ead50c Author: ashu Date: 2017-09-26 10:34:17 +0530 Add macro 'UndoCheckPointFilenamePrecedes' to find the oldest undo checkpoint file in the pg_undo directory. RM42357, Patch by me, as per the suggestions from Thomas Munro. commit fb101c53f1c750c6ca37b049694db50118637b61 Author: ashu Date: 2017-09-26 10:29:28 +0530 pg_upgrade: Rename undo checkpoint file after pg_resetwal. During pg_upgrade, the pg_resetwal module is invoked, which resets all WAL related information in the Controlfile, including the last checkpoint LSN, and then a server process is started. But, as the undo checkpoint filename is based on the last checkpoint LSN, the server startup process fails as it is unable to find the undo checkpoint file corresponding to the last checkpoint LSN as mentioned in the Controlfile. This patch fixes that issue and also copies the undo data and checkpoint files from the old to the new cluster during pg_upgrade. TODO: Currently, this patch just adds the logic to copy undo data files from the default location, i.e. 'base/undo', which means if undo data files are present in a non-default location it won't work. In future, when the support for storing undo data files in a non-default location is added, that needs to be handled in pg_upgrade as well. RM42357, Patch by me, reviewed by Thomas Munro.
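As background for the 'WAL log Insert operation in zheap' entry above: the redo split it describes (always replay the undo-log portion, but skip the zheap data page when its LSN is newer than the record) matches the standard XLogReadBufferForRedo idiom applied to the data page only. The following is a rough sketch of that shape, not the actual zheap redo code; the two helper functions and the record layout are assumptions for illustration.

/*
 * Sketch only: contrast between unconditional replay of the undo-log part of
 * a record and LSN-gated replay of the zheap data-page part.  The two static
 * helpers are assumed stand-ins; XLogReadBufferForRedo, PageSetLSN,
 * MarkBufferDirty and UnlockReleaseBuffer are the usual PostgreSQL calls.
 */
#include "postgres.h"
#include "access/xlogutils.h"
#include "storage/bufmgr.h"

static void replay_undo_portion(XLogReaderState *record);				/* assumed */
static void apply_insert_to_page(XLogReaderState *record, Page page);	/* assumed */

static void
zheap_xlog_insert_sketch(XLogReaderState *record)
{
	Buffer		buffer;

	/* Undo pages get no full-page images, so this part always replays. */
	replay_undo_portion(record);

	/* The data page replays only if it is older than this WAL record. */
	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
	{
		Page		page = BufferGetPage(buffer);

		apply_insert_to_page(record, page);
		PageSetLSN(page, record->EndRecPtr);
		MarkBufferDirty(buffer);
	}
	if (BufferIsValid(buffer))
		UnlockReleaseBuffer(buffer);
}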
commit c95f0fda591c3d1e1b3e520570c8e616d4fb42ea Author: dilip Date: 2017-09-25 11:51:06 +0530 Added trigger support for zheap Currently zheap does not support triggers because the whole trigger mechanism is dependent on HeapTuple; this patch provides a conversion between heap and zheap tuples and also an API to fetch a tuple from zheap for triggers. commit 13b7d1735e5444fca61368df9ab515267eadf4be Author: dilip Date: 2017-09-22 15:00:13 +0530 Bugfix in undo discard If a log is not attached to any session it will have an invalid transaction id. In such a case just discard till the current insert location. commit ef93ec6350767d863b30cd8a11e632a1442a22af Author: Kuntal Ghosh Date: 2017-09-21 16:50:30 +0530 Fix update of zheap columns with null For in-place updates, we should copy non-visibility flags of infomask in the updated tuple. Reviewed by Amit Kapila commit 19c54fb49faecbde387be5ed6bde2bc94cbf67bf Author: dilip Date: 2017-09-22 09:49:29 +0530 Bugfix in undo discard mechanism Prior to this, undo was not discarded properly for the last transaction in the undolog, hence oldestxidhavingundo was also not getting updated properly. commit 5e45ce2025a7024937763e71dcde007941057c3f Author: ashu Date: 2017-09-21 20:03:44 +0530 Replace unwanted Assert statement in ZHeapTupleSatisfiesDirty with if-check. Patch by me, reviewed by Amit Kapila, reported by Neha Sharma. commit 943e365fc06f42c264b9c39d78d9a8a94b3754b6 Author: Amit Kapila Date: 2017-09-21 15:00:10 +0530 Use oldestXidHavingUndo to determine if the record is visible Previously we were using RecentGlobalXmin to determine if the record is all visible; that is okay till we have rollbacks. Use the more appropriate xid, i.e. oldestXidHavingUndo, to determine if the record is all visible. Patch by me, reviewed by Dilip and Kuntal. commit 6715131fd875709a5b4d89808117b2e17bb7cc32 Author: Amit Kapila Date: 2017-09-21 14:02:57 +0530 Code refactoring to eliminate duplicate code. We were using similar code to fetch transaction information from undo records in multiple places. Expose a new function to fetch that information and use it in all places. Patch by Amit Kapila, reviewed and tested by Dilip and Kuntal. commit a69d8bb6cc49b4bfb3ef0ad8321508007f7e7d66 Author: ashu Date: 2017-09-20 16:18:19 +0530 Allow INSERT/UPDATE/DELETE RETURNING to work with zheap relations. Patch by me, reviewed by Amit Kapila, reported by Tushar Ahuja. commit 9c886fcb2e34d1ec073ce260a032c5fe3a484695 Author: Kuntal Ghosh Date: 2017-09-19 18:53:55 +0530 Implement Bitmap Scan for zheap Reviewed by Amit Kapila commit 113966f8755edabf26ce62eb54798a5bdb419911 Author: Kuntal Ghosh Date: 2017-09-19 23:44:47 +0530 Implement COPY-TO for zheap relations Reviewed by Amit Kapila commit 414f84fd6026b0e282c10b9fe01f40898d5a22a4 Author: Kuntal Ghosh Date: 2017-09-20 00:12:16 +0530 Index Scan for spgist,gist,hash Reviewed by Amit Kapila commit 95e893242410969de851bdf746b54b69292134ec Author: Kuntal Ghosh Date: 2017-09-20 13:57:24 +0530 Skip vacuum for zheap relations Tell the autovacuum worker to skip zheap relations for vacuum. Also, throw an appropriate error if one intends to vacuum zheap relations with the VACUUM command.
Reviewed by Amit Kapila, Dilip Kumar commit e33f25249fc55d24c919d345b233a12b27e00db2 Author: Kuntal Ghosh Date: 2017-09-18 16:24:18 +0530 Handle MinimalTuple for ztuple Reviewed by Amit Kapila, Ashutosh Sharma commit f0d0850e99b3303a913a3f06998e120f052b0440 Author: Kuntal Ghosh Date: 2017-09-19 22:33:23 +0530 Fix Oid crash in ExecInsert commit aa0eefadeee806f7691cbe2374a942e771ca8d30 Author: dilip Date: 2017-09-19 12:14:37 +0530 zheap mvcc routine was not set properly in RestoreSnapshot which was causing problem in parallel query. This commit fixes the issue commit e43b9eeeb33b7cce358e4a1bede6623151bb4101 Author: ashu Date: 2017-09-18 22:06:22 +0530 Correct the XLOG record type used in UndoLogSetLastXactStartPoint(). RM42384, Ashutosh Sharma, reviewed by Dilip Kumar and reported by Mithun CY. commit 2edda475a0c19b293d61f657c02a4b38221765f5 Author: Amit Kapila Date: 2017-09-18 14:49:31 +0530 Retrieve CTID from undo record when requested. Reported by Neha Sharma. commit f34620869854458a3a671c2112b38109fc240af2 Author: Kuntal Ghosh Date: 2017-09-15 12:11:23 +0530 Move BulkInsertStateData def to genham.h commit 7b30e2d13b1ef23d19ba0d1c521e901d50be21ec Author: Kuntal Ghosh Date: 2017-09-14 14:41:27 +0530 Fix comment style in zheap_multi_insert commit 91094f3776e5aa18b71fb0360d5e754b128998ae Author: Kuntal Ghosh Date: 2017-09-14 13:51:35 +0530 Implement COPY FROM command for zheap relations Reviewed by Amit Kapila, tested by Ashutosh Sharma commit e7de2741292bfce5ddb7c69c102010f1dc81e2ed Author: Kuntal Ghosh Date: 2017-07-21 12:00:38 +0530 Add options in zheap_prepare_insert like heap_prepare_insert. Reviewed by Amit Kapila commit 1d9f79cf415be06824f8893c5e6cdb58bdde3f18 Author: Kuntal Ghosh Date: 2017-09-13 13:40:44 +0530 Fix warnings in zheapgettup_pagemode commit fba570c012c02b96bb3817e33763f03dde1248a1 Author: Kuntal Ghosh Date: 2017-09-12 15:37:26 +0530 Isolation tests for non-inplace updates Reviewed by Amit Kapila commit 51fb8188e05d92a973ee14a057c956fec2caab93 Author: Kuntal Ghosh Date: 2017-09-12 15:42:22 +0530 Regression tests for non-inplace updates Reviewed by Amit Kapila commit 039efbc45bb68a11db71ed0137d619af07951b54 Author: Amit Kapila Date: 2017-09-11 17:40:21 +0530 Fix usage of tuple in visibility API. ZHeapTupleSatisfiesVisibility API doesn't guarantee that the tuple passed won't be freed, so we can't reuse it. commit ff452720310a3a8cbb27ff85625a5276110f9de2 Author: Amit Kapila Date: 2017-09-11 17:26:27 +0530 Support non-inplace-updates Allow tuples to be updated such that a newer version of tuple will be stored separately if any index column is updated or length of new tuple is greater than old tuple. For undo generation, we always generate two undo records one for the deletion of old tuple and another for addition of new tuple. We need separate undo record for new tuple because during visibility checks we sometimes need commandid and that is always stored in undo record. To reach new tuple from old tuple, we need ctid which is stored in old tuples undo record. Patch by me, review and test by Dilip Kumar and Kuntal Ghosh. commit 918aea9d904dd4f2ea82c8e4b0aefd7006cedce8 Author: dilip Date: 2017-09-11 16:44:26 +0530 Support multi-prepare for undo record This will support preparing multiple undo record and all of them can be inserted with one call of InsertPreparedUndo. Reviewed by Amit Kapila. 
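The multi-prepare entry above implies a particular call pattern: prepare all undo records for an operation up front, then copy them into the undo buffers with a single InsertPreparedUndo call inside the critical section. The sketch below illustrates that pattern only; the argument lists of PrepareUndoInsert and InsertPreparedUndo, and the header path for UnpackedUndoRecord, are not shown in this log and are therefore assumptions, not the real zheap signatures.

/*
 * Sketch only: the prepare-several / insert-once pattern for undo records,
 * e.g. for a non-inplace update that needs one record for the old tuple and
 * one for the new tuple.  Signatures below are assumed for illustration.
 */
#include "postgres.h"
#include "miscadmin.h"				/* START_CRIT_SECTION / END_CRIT_SECTION */
#include "access/undorecord.h"		/* assumed header for UnpackedUndoRecord */

extern void PrepareUndoInsert(UnpackedUndoRecord *urec);	/* assumed signature */
extern void InsertPreparedUndo(void);						/* assumed signature */

static void
record_update_undo_sketch(UnpackedUndoRecord *undo_for_old,
						  UnpackedUndoRecord *undo_for_new)
{
	/*
	 * Reserve undo space and pin/lock the needed undo buffers before
	 * entering the critical section; nothing is written yet.
	 */
	PrepareUndoInsert(undo_for_old);
	PrepareUndoInsert(undo_for_new);

	START_CRIT_SECTION();

	/* One call copies every prepared record into the undo buffers. */
	InsertPreparedUndo();

	/* The caller would WAL-log the operation and dirty the buffers here. */

	END_CRIT_SECTION();
}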
commit 4df6058209a1028f328b5976315b9373d8385afc Author: dilip Date: 2017-09-07 10:57:14 +0530 Fix compilation warning commit 3d9616d314008d2c34942d50d81d40558a771eda Author: Kuntal Ghosh Date: 2017-09-06 18:19:48 +0530 CREATE INDEX on non-empty zheap relations This extends the implementation for index creation on zheap relations to non-empty relations as well. Patch by me, reviewed by Amit Kapila commit 295f464ede2c243c7872cd3bc48b4d022b4cf9d8 Author: Amit Kapila Date: 2017-09-06 15:15:30 +0530 Implement ZHeapTupleSatisfiesAny and ZHeapTupleSatisfiesOldestXmin Both these API's are required by the upcoming patch to create an index. These API's implement functionality similar to the heap API's HeapTupleSatisfiesAny and HeapTupleSatisfiesVacuum, but for zheap tuples. commit a1bbc57456a2c0985aad6ca14042ff8690275b9b Author: Kuntal Ghosh Date: 2017-09-01 17:07:30 +0530 storage_engine cannot be modified after table creation storage_engine can be set during CREATE TABLE. But, it can't be set/reset using ALTER TABLE later. Patch by me, reviewed by Mithun CY commit 01e757ed5ff8e86fdece673c2dcee9d737628c5f Author: dilip Date: 2017-09-04 16:59:52 +0530 RM42314: Fixes one part of the problem Implements a better way to know whether the undo record is discarded or not, instead of adding the TRY_CATCH as done in the current code. commit fc5e48996af7ea9398aed4adfce093ba7bba3c53 Author: mithun Date: 2017-09-04 11:58:16 +0530 RM41618: Asynchronous Undo Management - Undo BgWorker Part. Improved the log messages. commit 7231d04084c1d305c374c91c8ac9328186ae90e2 Author: Amit Kapila Date: 2017-09-01 15:30:39 +0530 Fix typo in error message commit b9746b046d25e8db6c36cd28bd49e72f650e3552 Author: mithun Date: 2017-09-01 11:39:53 +0530 RM41618: Asynchronous Undo Management - Undo BgWorker Part. Registered a SIGTERM handler for the undoworker. Review by Kuntal. commit fec2d893806c3b163d4c96e6c2864361ad2eccae Author: ashu Date: 2017-08-31 17:15:20 +0530 Implement PRIMARY KEY for zheap relations. This patch allows PRIMARY KEY to be created on zheap relations. For now, all it does is throw an error message if a user tries to add a NULL or duplicate value on a primary-key column (i.e. when the primary key constraint is violated), but it doesn't protect it from getting inserted into a zheap table as currently ROLLBACK is not implemented in zheap. Therefore, if a user tries to violate a primary key constraint on a zheap table, it will result in data corruption. RM42103, Ashutosh Sharma, reviewed by Kuntal Ghosh and Amit Kapila. Few adjustments done by Amit Kapila. commit 5b704a091e300d12f18e0bba4d6e2083fd652d0b Author: mithun Date: 2017-08-30 08:25:40 +0530 RM41618: Asynchronous Undo Management - Undo BgWorker Part. One undoworker launcher process which runs a loop to periodically examine and discard undos of transactions which are visible to all. This is a very basic form of undoworker which can be used by the undo subsystem for discarding undos. This code will later evolve into a more complex undoworker subsystem. commit 1c16e16158d83850c20602acce9b6ab5e8833b7c Author: dilip Date: 2017-08-29 18:20:28 +0530 Discard Interfaces for undo worker Provide an interface to discard all the undo which was inserted by transaction ids smaller than the input xid. This will also calculate the oldestXidHavingUndo. Reviewed by Amit Kapila commit 5e7384fc7d2684ca0714147a0a1c60fdeed8d7fd Author: dilip Date: 2017-08-29 18:19:07 +0530 Inserting a transaction header in the undo record.
Insert a transaction header in the first undo record written by the transaction. This header will contain the undo record ptr of the next transaction's first undo log. This will be used by the discard api's to process an undolog transaction by transaction. Reviewed by Amit Kapila commit 72ca11a0fe330abd69958c361ecf8fde2369802f Author: dilip Date: 2017-08-29 17:58:11 +0530 Protection against accessing the discarded undo. This is a dirty way to handle the problem; we may need to find some better way for the same. Reviewed by Amit Kapila commit 35c917a1e21c1c17c7de0d026c0820ac4086c0a2 Author: mithun Date: 2017-08-29 16:16:55 +0530 RM42266: Support no movement scans for zheap. It seems NoMovementScanDirection is dead code for heap itself. So got rid of the related code in zheap and added an Assert(false) to indicate the same. commit 38a4d0037a5636e08e00a4e3a352a4b1ded8ba61 Author: Kuntal Ghosh Date: 2017-08-29 15:27:29 +0530 Make PageSetUNDO less chatty commit 44c19cc1a6db1c8bd6a4d21b17d1549c059c77d1 Author: Thomas Munro Date: 2017-08-28 23:00:23 +1200 Fix build problem on Windows. On Windows we have a macro wrapping open() that always needs three arguments. Per complaint from Amit Kapila. commit 1302ec61e5bb482789db809c344b3fed5a9d6dae Author: Amit Kapila Date: 2017-08-28 15:57:26 +0530 Change ValidateTuplesXact so that it always gets a fresh copy of the tuple. If the tuple is updated such that its transaction slot has been changed, then we will never be able to get the correct tuple from undo. To avoid that, we get the latest tuple from the page rather than relying on its in-memory copy. Amit Kapila and Dilip Kumar commit 531f8bf739266d5d0e40512a983cc5b757689810 Author: Amit Kapila Date: 2017-08-28 15:11:39 +0530 Add comments in code to explain the usage of commandid while traversing the undo chain. commit bd9d5215d2deff90615d8108cbdad07d8d4894df Author: Amit Kapila Date: 2017-08-28 14:46:32 +0530 Improve comments in code. commit 91416a25cd2f950a7de078d7661f18e411962c15 Author: Amit Kapila Date: 2017-08-28 14:43:11 +0530 Always write complete tuple in undo of delete. This will ensure that we can reuse the space of deleted tuples as soon as the transaction commits. commit e63edb162c3a18acc7dc10e9510b24b53d2ddfda Author: Amit Kapila Date: 2017-08-28 14:23:54 +0530 Fixed a few warnings. commit ed04b0bc3a955f0abfe712626c13dd3734a3676c Author: ashu Date: 2017-08-25 13:18:05 +0530 Remove isolation test for REPEATABLE READ mode and rename mvcc.spec file with zheap prefix. Commit 0fd69d86 unintentionally added an isolation test for REPEATABLE READ mode. But unfortunately, zheap is currently just restricted to READ COMMITTED mode. Hence, this commit removes the test added for REPEATABLE READ mode and it also renames the test file with a zheap prefix. Patch by Kuntal Ghosh, reviewed by me. commit 953868285ef6d57c83a246603d5e3b615b837b3a Author: Amit Kapila Date: 2017-08-24 21:17:31 +0530 Allow transaction slots to be reused after the transaction is finished. To reuse the slot, it needs to ensure that (a) the xid is committed and all-visible, and (b) if the xid is committed, then write an undo record for all the tuples that belong to that slot. The undo record will cover the transaction information of the committed transaction. The basic idea is that after reuse of the slot, if someone still needs to find the information of the prior transaction that has modified the tuple, the same can be retrieved from undo.
We can reuse the slot even after the transaction has aborted, but for that we need to have the Rollback facility, which is currently not implemented, so we can't reuse slots of aborted transactions. Amit Kapila and Dilip Kumar. commit 13de424924220e65ec7348687210e3752ad0fd50 Author: Kuntal Ghosh Date: 2017-08-22 17:13:52 +0530 Fix crash in EvalPlanQualStart during isolation check We can't use es_result_relation_info in EvalPlanQualStart to decide the type of the underlying relation. Instead, we can use es_range_table. Reported by Amit Kapila. commit 10c6bff9873d3ce203eb0368ee262a67d6f43d79 Author: Kuntal Ghosh Date: 2017-08-11 15:54:28 +0530 Deform zheap tuples for aggregate nodes Aggregate nodes don't recognize zheap tuples. To fix this, we deform a zheap tuple in ExecCopySlotTuple and form a heap tuple from the same. commit 61d3d5e7178203d3070a11872bcdc64d365fce16 Author: Kuntal Ghosh Date: 2017-08-21 17:25:12 +0530 Fix type of urec_fork in UndoRecordRelationDetails commit 3d390e874b93b8f265520d2381cd5227f6f8f60c Author: ashu Date: 2017-08-18 16:26:43 +0530 Add some isolation tests to validate MVCC model with zheap. Thomas Munro and Ashutosh Sharma (The original patch here was by Thomas Munro and it was for heap storage, but I ported it to zheap.) commit 89466c9b213f939809de73ae21e8ec939cec1925 Author: dilip Date: 2017-08-07 14:35:11 +0530 Mark buffer dirty after inserting the undo record in the undo buffer. Reported by Thomas and Amit. commit 8fec492113e3f03eeea42063b23b530b689bfd8e Author: Thomas Munro Date: 2017-08-05 21:10:32 +1200 Correct comment. commit e5818f2fd0b55cab123926db0afc52e832d44dd6 Author: Thomas Munro Date: 2017-08-05 21:08:17 +1200 Define a macro UndoRecPtrFormat for use in printf-style functions. commit 5eedbd8378b0b32cee25ff882a78082a273ac9a6 Author: Thomas Munro Date: 2017-08-04 16:53:50 +1200 Get rid of last_size tracking from the undo storage layer. This problem belongs higher up. commit 93d94231ef7ff2b30a2c8ae5d795ca9129a0f00d Author: Thomas Munro Date: 2017-08-04 16:47:41 +1200 Report all UndoRecPtr values as hex in pg_stat_undo_logs. Previously raw offsets were shown. Let's not confuse ourselves with more than one representation. commit 1bc609f9f3943c70e94a3cd409c32c536871f309 Author: Thomas Munro Date: 2017-08-04 16:47:14 +1200 Raise an error if you try to access a non-existent undo segment. commit 799820348f683ff7b4120fdd3356ea75e98b757f Author: Thomas Munro Date: 2017-08-04 16:34:31 +1200 Make test_undo module more cromulent. Also change the type used for space allocation to size_t rather than anything smaller so that we can test insertion of large amounts of data at once. commit def8e64b77cb8e50ac12768f930ee6dbc3785a38 Author: Amit Kapila Date: 2017-07-26 17:21:29 +0530 Update the transaction id in undorecord header Separately store the latest transaction id that has changed the tuple in the undo record header to ensure that we don't try to process a tuple in the undo chain that is already discarded. As of now, we are using RecentGlobalXmin to prevent it, but upcoming patches that have the Discard undo mechanism will replace it with oldest_xid_that_has_undo. commit 5126ec94af589fb56cfcad9f014f6a8c20b16fff Author: dilip Date: 2017-07-26 15:30:05 +0530 Adding xid in undo header Xid will be used by zheap to traverse the undo chain.
It will not traverse the chain once it gets the undo which has an xid smaller than OldestXidHavingUndo. commit 7f994a6806858bcc2069004fa16f50d37263dd51 Author: Amit Kapila Date: 2017-07-26 15:06:49 +0530 Support EvalPlanQual mechanism for zheap To support the evalplanqual mechanism we need to support locking a tuple (as of now only in LockTupleExclusive mode) to ensure that nobody else can update it once the tuple is qualified by this mechanism. We also need dirty snapshots to fetch the tuple by tupleid. We also need to update ExecScanFetch APIs so that they can recognise zheaptuples. We also need to adjust other visibility routines so that they can understand locked tuples; similarly, we need to adjust zheap_update and zheap_delete to account for locked tuples. Patch by me, reviewed and tested by Dilip. commit a4ac11945ae4201bc1c45ddd4ea676841f281cd3 Author: Thomas Munro Date: 2017-07-25 05:16:51 +1200 Improve loop. Nobody likes degenerate for loops. commit 4980d572117fc14583af00006e741eedc269a7b3 Author: Thomas Munro Date: 2017-07-25 05:13:52 +1200 Fix UndoLogDiscard(). Commit d12ea1cec176c3793cb53b6133d413418c338f62 changed the way that the offset part of an UndoRecPtr maps to a block number from header exclusive to header inclusive, but I failed to change UndoLogDiscard(). Thanks to Dilip Kumar for diagnosis & fix. commit fbd05d66d941cb5410c8cc4b3c0bfe1fa4b9f3a6 Author: Kuntal Ghosh Date: 2017-07-17 14:49:36 +0530 Implement Index Scan using btree on zheap relations This doesn't include the implementation required for fetching multiple versions of a tuple in case of a non-MVCC snapshot. commit 5d9386b049a6dc61286c4140457ee1d728938fa6 Author: Kuntal Ghosh Date: 2017-07-18 10:43:19 +0530 Implement insertion of zheap tuples in btree index It doesn't include the implementation for handling updates on a key column. commit f849ed2274727a5c04988fcace73361f79b0346f Author: Kuntal Ghosh Date: 2017-07-17 14:18:16 +0530 Implement CREATE INDEX on zheap relations USING BTREE. This allows creation of a btree index on *empty* zheap relations. commit 0bab411eeb75256cbcb6e58cf99b493d59158ce3 Author: Amit Kapila Date: 2017-07-14 10:12:17 +0530 Define and use InvalidUndoRecPtr. commit 64433f17dd08fa2db21af689ff673abf16d863ff Author: Kuntal Ghosh Date: 2017-07-11 17:04:22 +0530 Fix an assert failure in PrepareUndoInsert commit 80878d1097ae747de6b257e8be911c16fd57196b Author: Kuntal Ghosh Date: 2017-07-04 12:32:00 +0530 Fix 'storage_engine' reloption to support case-insensitive inputs commit 9d4097fad8db35add630b7da9b98b3d0e95dcfbe Author: Thomas Munro Date: 2017-06-30 23:56:14 +1200 Change the way that undolog.c accounts for page headers. Previously, the offset component of UndoRecPtr was a counter of usable bytes within the undo log. Now it's a straight physical offset into the undo log. This means that it's now possible to have a (corrupt) UndoRecPtr that points at header data, which is nonsensical, but it makes a lot of things simpler. Now segment filenames, insert, discard, end offsets and UndoRecPtr values are all based on the same scale: number of raw bytes from the start of the undo log, regardless of whether those are header or data bytes. Previously, UndoLogAllocate advanced the insert pointer by the exact number of bytes requested in the 'size' argument. That was unusable because it required the caller to allocate header space this way too, and do a bunch of math that required knowing where the insert point already was. Now 'size' means usable bytes; header bytes are automatically accounted for.
Based on a complaint from Dilip. Let's try it this way and see if it makes more sense. commit 717961ad5ae25a6cfad73f7843e718a36b0f889b Author: dilip Date: 2017-06-30 10:29:32 +0530 Fix defect in undo record API In the earlier version there was an assumption that at least the undo record header would fit into the buffer, but actually that was not true: the undo record header can also be split across 2 buffers. The same has been fixed. commit c3cd40c69a91cee5d71fb04432393e49b8b62302 Author: dilip Date: 2017-06-29 16:26:48 +0530 Fix compilation warning in undolog.c commit a56bc2ce68de378fd73f31e48a127f0620c8fc48 Author: Amit Kapila Date: 2017-06-29 16:13:38 +0530 Store command id in undo records. Till now, there was a primitive implementation of command id. We were storing the command id in transaction slots, but that won't work for tuples stored in undo records. Now, store the command id in the undo record and fetch it from undo records during visibility checks. commit abef75d8885d21857afbc9d4487628807e913cb4 Author: dilip Date: 2017-06-29 15:44:44 +0530 Support for command ID in undorecord header commit 779cb3080bf4e4862cd008aa902b55029f545f4e Author: dilip Date: 2017-06-27 10:09:24 +0530 Bug fix in undo buffer list: Reset the undo buffer list commit 99f1adcf642e102df9f87f4763ca299ff0a57733 Author: Amit Kapila Date: 2017-06-23 20:45:43 +0530 Support reuse of transaction slots Currently the zheap page has four transaction slots, which means that there can't be more than four active transactions operating on a page. After four transactions the system will start waiting; this commit will allow the transaction slots to be reused. A transaction slot can be reused if the transaction is committed and all-visible, or if it is aborted and the undo has been applied. As of now, we don't have rollback support so we just rely on the first check to reuse the slot. commit 3d850d24e8aa04087ca55ebf9319e6b56ec5ec63 Author: Amit Kapila Date: 2017-06-23 20:32:51 +0530 Support Zheap in-place update and delete operations This allows zheap tuples to be updated in-place and marked as deleted. This also supports visibility checking for snapshots and traversing the undo chains for non-visible tuples. As of this commit, the support for snapshot visibility with respect to command id is not complete and can give wrong answers. A future commit will support the same. commit 9fe80770ed32020cac9487f93cdf542cd070c4fa Author: dilip Date: 2017-06-23 16:53:26 +0530 Fix warning in undolog and undorecord commit 3075d2186e45ca0fe14add50d1a8bdf82de7e85f Author: dilip Date: 2017-06-23 16:48:07 +0530 Undo API for inserting and fetching the undo records from undo storage Undo API is an interface on top of the undo log storage which provides a way to insert and fetch records from undo logs. The API internally manages the buffers and performs encoding and decoding of the undo records. commit 21d2703d218f05c28638e1ec98559d357e4b94f8 Author: Thomas Munro Date: 2017-06-22 10:27:23 +1200 Changed interface of UndoLogDiscard based on feedback. Previously it took the old pointer and size. Now it just takes the new pointer. In other words you call it with the address of the oldest byte you would like to keep. Also updated various comments, changed the pg_stat_undo_logs view a bit and fixed the rules.out to reflect that. commit 10ccdce79fed7a4dd42321198c8f8a1e9bd26dda Author: Thomas Munro Date: 2017-06-21 13:59:31 +1200 Implemented UndoLogDiscard and related things. Now able to discard and recycle segment files on master and standby.
Assorted other changes: using an LWLock per undo log, instead of a SpinLock. Using base/undo rather than base/9 to hold undo segment files. Renamed segment files as logno.offset, so that they also tell you the UndoRecPtr of the first byte in the segment. Got rid of the separate tracking of 'mvcc' and 'rollback' discard pointers; I don't think that's my job, instead there is just a single 'discard' pointer. Renamed 'capacity' to 'end'. There is plenty more work to be done here... commit 502d0afd95e3b7a6bb089be6efed35ed24cec96e Author: Thomas Munro Date: 2017-06-19 15:45:41 +1200 Fix double lock release bug in ForgetBuffer Don't try to release the partition lock twice if we failed to find the buffer. commit 32ca66260db1675f7fce94e59dbc55d9057bdfdd Author: Thomas Munro Date: 2017-06-01 17:07:43 +1200 Early prototype code for undo log storage. Includes a backport of 767bc028e5f001351feb498acef9a87c123093d6 because we need to be able to create pinned DSM segments without creating a bogus ResourceOwner first. commit 80bb5103f41d28052c8cd11f5393de6eaf1522cd Author: Kuntal Ghosh Date: 2017-06-06 21:47:46 +0530 Fix return type for zheap_insert commit 51f5293a475d6e2175f4e5d87978308e3d7b0a60 Author: Kuntal Ghosh Date: 2017-06-04 19:36:14 +0530 Fix NULL pointer access in RelationStorageIsZHeap commit 7c4f86a14b93151c9281a7cab3408b9d040fee88 Author: Amit Kapila Date: 2017-06-02 16:16:23 +0530 Change UNDO Insert tuple contents. As per current understanding, we don't need to store any information in the undo insert as the command id (cid) is also stored in the transaction slot. commit 31142f2fcde9e84baf531feb4187505dbaf2259d Author: Amit Kapila Date: 2017-06-02 16:12:42 +0530 Few defines for zheap tuple Define infomask flags required for zheap tuples. One of these flags is required for the already committed code in commit a3bb8c3411d1a17fb190b5757bf181d38900800a. commit feea62deb7528b7525090e6623cfc2482c6d20e0 Author: Amit Kapila Date: 2017-06-02 15:40:44 +0530 Implement getsysattribute function for zheap. This is required for the upcoming zheap delete operation patch. Attributes related to transaction information won't give correct information for all cases. I have added a Fixme in the code which needs to be fixed at a later time when we need to use those attributes. commit 7d7bb25644ad0a3fd234e6185d21151e7a7e8af0 Author: Amit Kapila Date: 2017-06-02 07:22:28 +0530 Initialize Zheap page with appropriate special space size. commit 7598316c1645db0e8d7eee55c45a529d5d426544 Author: Kuntal Ghosh Date: 2017-06-01 18:13:11 +0530 Add a 'storage_engine' reloption. It adds the code that decides which storage should be used for a relation. Allowed values are 'heap' and 'zheap'. The default value for this option is 'heap'. Also, it adds a macro 'RelationStorageIsZHeap(relation)' which can be used to decide whether a relation uses zheap or not. Patch by me, reviewed by Mithun C Y commit 2d8452096418348c6f406b595cc75903fde38b61 Author: Kuntal Ghosh Date: 2017-05-31 12:00:05 +0530 Support query quals for ZHeap It implements the qualification check of column names for a ZHeap tuple. commit a0ae8ceb08757b0cdce77780c87267b6dbc7de1f Author: Amit Kapila Date: 2017-05-30 16:31:54 +0530 ZHeap Tuple visibility checks Implements the basic zheap tuple visibility skeleton wherein we can verify if the inserted tuple is visible to our command. The basic idea used is to have a few transaction slots (as of now four; this needs some testing to determine the exact number) in the special space of the zheap page.
Each write operation on a page first needs to reserve a transaction slot in the page and update the same in the tuple header, and then proceed with the actual operation. We need to reserve two bits in the tuple header to remember the information of the transaction slot. There is a need of an additional bit to indicate no transaction slot, in which case the tuple will be all-visible. This usage of the additional bit will be implemented as a separate commit. As of now, each transaction slot contains xid, cid and undo_rec_ptr which helps us to determine tuple visibility. We might not need cid in the transaction slot as that can be retrieved from undo; however, keeping it in the page saves us from fetching the undo record pointer in many cases and, due to alignment, it doesn't cost us much space wise. commit 74983024bedce424427bec359f8f0c0f0615a774 Author: Amit Kapila Date: 2017-05-17 10:59:55 +0530 Handle cases for data_alignment = 4. Currently for four-byte alignment, we align everything to a four-byte boundary, which is not what we want for typalign 'c' or 's'. So align the given offset as per attalign for char and short alignment and at a four-byte boundary for other values of attalign. Patch by me, reported by Robert Haas. commit a802bb024f8041c0122419729c86611f592a262e Author: Amit Kapila Date: 2017-05-11 12:12:49 +0530 Support basic seqscan operation for ZHeap. Make HeapScanDesc aware of ZHeapTuple; this is required to scan. We could create a separate ZHeapScanDesc, but at this stage, I don't see the need of the same. I have written Zheap scan API's like zheap_getnext, zheap_beginscan, etc. As for inserts, it doesn't seem advisable to add a lot of if..else checks in heap scan API's to support the Zheap tuple format. Apart from tuple format, we can't directly use heap visibility routines (fetching transaction info from the tuple will be different), although I have yet to implement the same. Also, we might want to copy the zheap tuple before releasing the buffer lock as zheap tuples can be updated in-place. Note that with this patch only select * can work; to make the where clause work, we need to change a few other *get_attr api's so that they can understand the zheaptuple format. commit ffbc3b0a84451e365457e7a1d1d31a15cd1366cd Author: Amit Kapila Date: 2017-05-02 08:41:19 +0530 Set Tag needs to use zheap tuple when zheap is enabled. commit 63ba474f0d69a4123129ffa4849548e9d53a0146 Author: Amit Kapila Date: 2017-05-02 08:22:29 +0530 Mark data_alignment_zheap as PGDLLIMPORT. This is so that extensions can use it. commit 058618921b6f6319eb4f86909058a5a274fcabd0 Author: Amit Kapila Date: 2017-05-02 08:17:16 +0530 Remove spurious whitespace. commit 171bfaeeca96c2186ba45e5f9930df2cf7eeb5ee Author: amitkapila Date: 2017-05-02 07:54:16 +0530 Add missing Makefile in zheap directory. Commit 53643b6589a6b963b8621e3173d505383d510bb5 forgot to add the Makefile for the zheap directory. commit e70c0c4404021ece54e8bdcb9ce2ec6d3c2d87eb Author: amitkapila Date: 2017-04-28 15:38:56 +0530 Change max heap tuples per page. With shorter tuple headers, the maximum number of tuples that can fit on a page has significantly increased. So changed the calculation in the API's required for the insert operation. I have chosen to expose new API's as doing checks for the type of heap in such lower level API's doesn't seem sensible. For now, I have added new page level api's in zheapam.c. I think later we might want to split them into a separate file. commit 22449b025419c176ed74464ee26bbcc46125a63e Author: amitkapila Date: 2017-04-28 14:56:44 +0530 Support unaligned inserts in zheap.
A new guc data_alignment has been introduced to decide the alignment of tuple data. The data_alignment value 0 indicates no alignment, value 4 indicates align everything at a 4 byte boundary, and any other value indicates align as per the typalign of the attribute. The main objective of this commit is to test the size of data with different alignments. Based on the test results we might want to retain one type of alignment and remove this GUC and associated code. commit eaf9f4b054e9d8e7bfd2fc86bf7c7f4e51a04e20 Author: amitkapila Date: 2017-04-28 12:24:29 +0530 Support Zheap Insert Operation. The main idea of this patch is to support a short tuple header (3 bytes instead of 24 bytes). As of this patch, I have kept the tuple header aligned to 8 bytes and tuple data alignment works as for the main heap. This doesn't include support for toast inserts or speculative inserts (Insert .. On Conflict). Also, we don't support selects. As of now, the support for undo records is dummy, which means records are formed but not stored. Similarly for WAL, there is support for XLOG record insertion, but replay is not done as that also needs some support from undo. I have added a guc enable_zheap to perform zheap inserts which needs to be changed to a reloption or something else, but it serves the purpose as of now. commit ba5b2b31bbd9d8053be2416fb0cad0fda64df053 Author: Robert Haas Date: 2017-03-05 10:18:45 +0530 Throw-away test code for UndoRecordInsert. commit daadef317fa3b8453115ac24c351faed5f277ffd Author: Robert Haas Date: 2017-03-05 08:07:52 +0530 Implement UndoRecordExpectedSize and InsertUndoRecord. commit fe886f35e8b257c6a24fa3729df26432dfb52881 Author: Amit Kapila Date: 2017-02-24 11:42:16 +0530 Update comment to reflect the function name changed in commit 29f4db6d7a158c75fc152351f874b8d4e6af63a0. commit 0c79dde27afa0d432ed53ad7c353204560500611 Author: Thomas Munro Date: 2017-02-23 16:19:10 +0530 Added bootstrap interface; tidied commit b05429f48e3e7104e84428a81884055569e1006d Author: Thomas Munro Date: 2017-02-23 14:57:08 +0530 An early draft of undolog.h. commit cc1c4927fbc236559c6831ab583f1600c96a543d Author: Robert Haas Date: 2017-02-23 14:41:05 +0530 Rename DropBuffer to ForgetBuffer, change API a bit, implement. Also, implement ForgetLocalBuffer. commit 954bf571a3e332e5d0dd16f6ec2eb4c11d7962cf Author: Amit Kapila Date: 2017-02-23 14:02:03 +0530 Expose the DropBuffer API. We don't need a new header file just to expose one API. So move the API to bufmgr.h and remove undobuf.h. commit 28ca1b57f1d509657d6c77eaf07fd0af395e2e5b Author: Amit Kapila Date: 2017-02-23 12:04:24 +0530 Draft header files for undoworker, undoloop and undobuf. --- diff --git a/README.md b/README.md new file mode 100644 index 0000000000..f677606ec3 --- /dev/null +++ b/README.md @@ -0,0 +1,69 @@ +The purpose of this document is to let users know how they can use zheap (a new +storage format for PostgreSQL) and the work that is still pending. This new +storage format provides better control over bloat, reduces the tuple size +and reduces write amplification. The detailed design of zheap is present in +the zheap design document (src/backend/access/zheap/README). + +How do I use zheap? +=================== + +We have provided a storage engine option which you can set when creating a table. +For example: + +create table t_zheap(c1 int, c2 varchar) USING zheap; + +Index creation for zheap tables doesn't need any special syntax. + +You can also set the GUC parameter default_table_access_method. The +default value is "heap", but you can set it to "zheap".
If you do, +all subsequently-created tables will use zheap. + +These interfaces will probably change once the storage format API work is +integrated into PostgreSQL. We’ll adjust this code to use whatever interfaces +are agreed by the PostgreSQL community. + +We have also provided a GUC called data_alignment, which sets the alignment +used for zheap tuples. 0 indicates no alignment, 4 uses a maximum of 4 byte +alignment, and any other value indicates align as per attalign. This also +controls the padding between tuples. This parameter is just for some +experiments to see the impact of alignment on database size. This parameter +will be removed later; we’ll align as described in the zheap design document. + +Each zheap page has a fixed set of transaction slots, each of which contains the +transaction information (transaction id and epoch) and the latest undo record +pointer for that transaction. By default, we have four transaction slots per +page, but this can be changed by setting --with-trans_slots_per_zheap_page=value +while configuring zheap. + +What doesn’t work yet? +====================== +- Logical decoding +- Snapshot too old - We might want to implement this after the first version is +committed as this will work differently for zheap. +- Alter Table Set Tablespace - For this feature to work +correctly in zheap, while copying pages, we need to ensure that pending aborts +get applied before copying the page. + +Tools +- pg_undo_dump, similar to pg_waldump: We would like to develop this utility +as it can be used to view undo record contents and can help us debug problems +related to undo chains. +- We also want to develop tools like pgstattuple, pgrowlocks that +allow us to inspect the contents of database pages at a low level. +- wal consistency checker: This will be used to check for bugs in the WAL redo +routines. Currently, it is quite similar to what we have in the current heap, but +we want to extend it to check the consistency of undo pages similar to how it +checks for data and index pages. + +Open Issues +=========== +- Currently, the TPD pages are not added to FSM even if they can be completely +reused. +- Single user mode: This needs some investigation as to what exactly is required. +I think we need to ensure that undo gets applied without the need to invoke the undo +worker. + +The other pending code related items are tracked on the zheap wiki page: +https://wiki.postgresql.org/wiki/Zheap + +You can find the overall design of zheap in the README: src/backend/access/zheap/README diff --git a/configure b/configure index dce6d98cf6..db72375f20 100755 --- a/configure +++ b/configure @@ -838,6 +838,7 @@ enable_tap_tests with_blocksize with_segsize with_wal_blocksize +with_trans_slots_per_zheap_page with_CC with_llvm enable_depend @@ -1540,6 +1541,8 @@ Optional Packages: --with-segsize=SEGSIZE set table segment size in GB [1] --with-wal-blocksize=BLOCKSIZE set WAL block size in kB [8] + --with-trans_slots_per_zheap_page=SLOTS + set transaction slots per zheap page [4] --with-CC=CMD set compiler (deprecated) --with-llvm build with LLVM based JIT support --with-icu build with ICU support @@ -3800,6 +3803,51 @@ cat >>confdefs.h <<_ACEOF _ACEOF +# +# transaction slots per zheap page +# +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for transaction slots per zheap page" >&5 +$as_echo_n "checking for transaction slots per zheap page... " >&6; } + + + +# Check whether --with-trans_slots_per_zheap_page was given.
+if test "${with_trans_slots_per_zheap_page+set}" = set; then : + withval=$with_trans_slots_per_zheap_page; + case $withval in + yes) + as_fn_error $? "argument required for --with-trans_slots_per_zheap_page option" "$LINENO" 5 + ;; + no) + as_fn_error $? "argument required for --with-trans_slots_per_zheap_page option" "$LINENO" 5 + ;; + *) + trans_slots_per_page=$withval + ;; + esac + +else + trans_slots_per_page=4 +fi + + +case ${trans_slots_per_page} in + 2) ZHEAP_PAGE_TRANS_SLOTS=2;; + 4) ZHEAP_PAGE_TRANS_SLOTS=4;; + 8) ZHEAP_PAGE_TRANS_SLOTS=8;; + 16) ZHEAP_PAGE_TRANS_SLOTS=16;; + 31) ZHEAP_PAGE_TRANS_SLOTS=31;; + *) as_fn_error $? "Invalid transaction slots per zheap page. Allowed values are 2,4,8,16,31." "$LINENO" 5 +esac +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: ${trans_slots_per_page}" >&5 +$as_echo "${trans_slots_per_page}" >&6; } + + +cat >>confdefs.h <<_ACEOF +#define ZHEAP_PAGE_TRANS_SLOTS ${ZHEAP_PAGE_TRANS_SLOTS} +_ACEOF + + # # C compiler # diff --git a/configure.in b/configure.in index e5123ac122..e411047550 100644 --- a/configure.in +++ b/configure.in @@ -343,6 +343,29 @@ AC_DEFINE_UNQUOTED([XLOG_BLCKSZ], ${XLOG_BLCKSZ}, [ Changing XLOG_BLCKSZ requires an initdb. ]) +# +# transaction slots per zheap page +# +AC_MSG_CHECKING([for transaction slots per zheap page]) +PGAC_ARG_REQ(with, trans_slots_per_zheap_page, [SLOTS], [set transaction slots per zheap page [4]], + [trans_slots_per_page=$withval], + [trans_slots_per_page=4]) +case ${trans_slots_per_page} in + 2) ZHEAP_PAGE_TRANS_SLOTS=2;; + 4) ZHEAP_PAGE_TRANS_SLOTS=4;; + 8) ZHEAP_PAGE_TRANS_SLOTS=8;; + 16) ZHEAP_PAGE_TRANS_SLOTS=16;; + 31) ZHEAP_PAGE_TRANS_SLOTS=31;; + *) AC_MSG_ERROR([Invalid transaction slots per zheap page. Allowed values are 2,4,8,16,31.]) +esac +AC_MSG_RESULT([${trans_slots_per_page}]) + +AC_DEFINE_UNQUOTED([ZHEAP_PAGE_TRANS_SLOTS], ${ZHEAP_PAGE_TRANS_SLOTS}, [ + transaction slots per zheap page. By default, it is set to 4. + + Changing ZHEAP_PAGE_TRANS_SLOTS requires an initdb. 
+]) + # # C compiler # diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index e5a581f141..18fca394c9 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,17 +2,17 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o zheapfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.6--1.7.sql \ +DATA = pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \ pageinspect--1.5.sql pageinspect--1.5--1.6.sql \ pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" -REGRESS = page btree brin gin hash +REGRESS = page btree brin gin hash zheap ifdef USE_PGXS PG_CONFIG = pg_config diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out index 3fcd9fbe6d..9c8b26709c 100644 --- a/contrib/pageinspect/expected/page.out +++ b/contrib/pageinspect/expected/page.out @@ -1,5 +1,5 @@ CREATE EXTENSION pageinspect; -CREATE TABLE test1 (a int, b int); +CREATE TABLE test1 (a int, b int) USING heap; INSERT INTO test1 VALUES (16777217, 131584); VACUUM test1; -- set up FSM -- The page contents can vary, so just test that it can be read diff --git a/contrib/pageinspect/expected/zheap.out b/contrib/pageinspect/expected/zheap.out new file mode 100644 index 0000000000..e740f648bc --- /dev/null +++ b/contrib/pageinspect/expected/zheap.out @@ -0,0 +1,64 @@ +CREATE TABLE test_zheap (a int, b int) USING zheap; +INSERT INTO test_zheap VALUES (16777217, 131584); +-- The page contents can vary, so just test that it can be read +-- successfully, but don't keep the output. +SELECT pagesize, version FROM page_header(get_raw_page('test_zheap', 1)); + pagesize | version +----------+--------- + 8192 | 4 +(1 row) + +SELECT page_checksum(get_raw_page('test_zheap', 1), 1) IS NOT NULL AS silly_checksum_test; + silly_checksum_test +--------------------- + t +(1 row) + +DROP TABLE test_zheap; +-- check that using any of these functions with a partitioned table would fail +create table test_partitioned (a int) partition by range (a) USING zheap; +select get_raw_page('test_partitioned', 1); -- error about partitioned table +ERROR: cannot get raw page from partitioned table "test_partitioned" +-- a regular table which is a member of a partition set should work though +create table test_part1 partition of test_partitioned for values from ( 1 ) to (100) USING zheap; +select get_raw_page('test_part1', 1); -- get farther and error about empty table +ERROR: block number 1 is out of range for relation "test_part1" +drop table test_partitioned; +-- The tuple contents can vary, so we perform some basic testing of zheap_page_items. +-- We perform all the tuple modifications in a single transaction so that t_slot +-- doesn't change if we change trancsation slots in page during compile time. +-- Because of the same reason, we cannot check for all possibile output for +-- t_infomask_info (for example: slot-reused, multilock, l-nokey-ex etc). 
+create table test_zheap (a int, b text) USING zheap WITH (autovacuum_enabled=false); +begin; +insert into test_zheap (a) select generate_series(1,6); +update test_zheap set a=10 where a=2; +update test_zheap set b='abcd' where a=3; +delete from test_zheap where a=4; +select * from test_zheap where a=5 for share; + a | b +---+--- + 5 | +(1 row) + +select * from test_zheap where a=6 for update; + a | b +---+--- + 6 | +(1 row) + +commit; +select lp,lp_flags,t_slot,t_infomask2,t_infomask,t_hoff,t_bits, + t_infomask_info from zheap_page_items(get_raw_page('test_zheap', 1)); + lp | lp_flags | t_slot | t_infomask2 | t_infomask | t_hoff | t_bits | t_infomask_info +----+----------+--------+-------------+------------+--------+----------+----------------- + 1 | 1 | 1 | 2050 | 1 | 6 | 10000000 | + 2 | 1 | 1 | 2050 | 33 | 6 | 10000000 | {in-updated} + 3 | 1 | 1 | 2050 | 65 | 6 | 10000000 | {updated} + 4 | 1 | 1 | 2050 | 1041 | 6 | 10000000 | {deleted,l-ex} + 5 | 1 | 1 | 2050 | 897 | 6 | 10000000 | {l-share} + 6 | 1 | 1 | 2050 | 1153 | 6 | 10000000 | {l-ex} + 7 | 1 | 1 | 2050 | 2 | 5 | | +(7 rows) + +drop table test_zheap; diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql new file mode 100644 index 0000000000..b538af7508 --- /dev/null +++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql @@ -0,0 +1,39 @@ +/* contrib/pageinspect/pageinspect--1.7--1.8.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pageinspect UPDATE TO '1.8'" to load this file. \quit + +-- +-- zheap functions +-- + +-- +-- zheap_page_items() +-- +CREATE FUNCTION zheap_page_items(IN page bytea, + OUT lp smallint, + OUT lp_off smallint, + OUT lp_flags smallint, + OUT lp_len smallint, + OUT t_slot smallint, + OUT t_infomask2 integer, + OUT t_infomask integer, + OUT t_hoff smallint, + OUT t_bits text, + OUT t_data bytea, + OUT t_infomask_info text[]) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'zheap_page_items' +LANGUAGE C STRICT PARALLEL SAFE; + +-- +-- zheap_page_slots() +-- +CREATE FUNCTION zheap_page_slots(IN page bytea, + OUT slot_id smallint, + OUT epoch int4, + OUT xid int4, + OUT undoptr int8) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'zheap_page_slots' +LANGUAGE C STRICT PARALLEL SAFE; diff --git a/contrib/pageinspect/pageinspect.control b/contrib/pageinspect/pageinspect.control index dcfc61f22d..f8cdf526c6 100644 --- a/contrib/pageinspect/pageinspect.control +++ b/contrib/pageinspect/pageinspect.control @@ -1,5 +1,5 @@ # pageinspect extension comment = 'inspect the contents of database pages at a low level' -default_version = '1.7' +default_version = '1.8' module_pathname = '$libdir/pageinspect' relocatable = true diff --git a/contrib/pageinspect/sql/page.sql b/contrib/pageinspect/sql/page.sql index 8ac9991837..8f0ef62cdc 100644 --- a/contrib/pageinspect/sql/page.sql +++ b/contrib/pageinspect/sql/page.sql @@ -1,6 +1,6 @@ CREATE EXTENSION pageinspect; -CREATE TABLE test1 (a int, b int); +CREATE TABLE test1 (a int, b int) USING heap; INSERT INTO test1 VALUES (16777217, 131584); VACUUM test1; -- set up FSM diff --git a/contrib/pageinspect/sql/zheap.sql b/contrib/pageinspect/sql/zheap.sql new file mode 100644 index 0000000000..0b7b6bd4ae --- /dev/null +++ b/contrib/pageinspect/sql/zheap.sql @@ -0,0 +1,38 @@ +CREATE TABLE test_zheap (a int, b int) USING zheap; +INSERT INTO test_zheap VALUES (16777217, 131584); + +-- The page contents can vary, so just test that it can be read +-- successfully, but don't keep the 
output. + +SELECT pagesize, version FROM page_header(get_raw_page('test_zheap', 1)); + +SELECT page_checksum(get_raw_page('test_zheap', 1), 1) IS NOT NULL AS silly_checksum_test; + +DROP TABLE test_zheap; + +-- check that using any of these functions with a partitioned table would fail +create table test_partitioned (a int) partition by range (a) USING zheap; +select get_raw_page('test_partitioned', 1); -- error about partitioned table + +-- a regular table which is a member of a partition set should work though +create table test_part1 partition of test_partitioned for values from ( 1 ) to (100) USING zheap; +select get_raw_page('test_part1', 1); -- get farther and error about empty table +drop table test_partitioned; + +-- The tuple contents can vary, so we perform some basic testing of zheap_page_items. +-- We perform all the tuple modifications in a single transaction so that t_slot +-- doesn't change if we change trancsation slots in page during compile time. +-- Because of the same reason, we cannot check for all possibile output for +-- t_infomask_info (for example: slot-reused, multilock, l-nokey-ex etc). +create table test_zheap (a int, b text) USING zheap WITH (autovacuum_enabled=false); +begin; +insert into test_zheap (a) select generate_series(1,6); +update test_zheap set a=10 where a=2; +update test_zheap set b='abcd' where a=3; +delete from test_zheap where a=4; +select * from test_zheap where a=5 for share; +select * from test_zheap where a=6 for update; +commit; +select lp,lp_flags,t_slot,t_infomask2,t_infomask,t_hoff,t_bits, + t_infomask_info from zheap_page_items(get_raw_page('test_zheap', 1)); +drop table test_zheap; diff --git a/contrib/pageinspect/zheapfuncs.c b/contrib/pageinspect/zheapfuncs.c new file mode 100644 index 0000000000..834bfce917 --- /dev/null +++ b/contrib/pageinspect/zheapfuncs.c @@ -0,0 +1,429 @@ +/*------------------------------------------------------------------------- + * + * zheapfuncs.c + * Functions to investigate zheap pages + * + * We check the input to these functions for corrupt pointers etc. that + * might cause crashes, but at the same time we try to print out as much + * information as possible, even if it's nonsense. That's because if a + * page is corrupt, we don't know why and how exactly it is corrupt, so we + * let the user judge it. + * + * These functions are restricted to superusers for the fear of introducing + * security holes if the input checking isn't as water-tight as it should be. + * You'd need to be superuser to obtain a raw page image anyway, so + * there's hardly any use case for using these without superuser-rights + * anyway. + * + * Copyright (c) 2007-2018, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/zheapfuncs.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "pageinspect.h" + +#include "access/htup_details.h" +#include "access/zheap.h" +#include "funcapi.h" +#include "catalog/pg_type.h" +#include "miscadmin.h" +#include "utils/array.h" +#include "utils/builtins.h" +#include "utils/rel.h" + +static void decode_infomask(ZHeapTupleHeader ztuphdr, Datum *values, bool *nulls); + +/* + * bits_to_text + * + * Converts a bits8-array of 'len' bits to a human-readable + * c-string representation. + */ +static char * +bits_to_text(bits8 *bits, int len) +{ + int i; + char *str; + + str = palloc(len + 1); + + for (i = 0; i < len; i++) + str[i] = (bits[(i / 8)] & (1 << (i % 8))) ? 
'1' : '0'; + + str[i] = '\0'; + + return str; +} + +/* + * decode_infomask + * + * Converts tuple infomask into an array describing the flags marked in + * tuple infomask. + */ +static void +decode_infomask(ZHeapTupleHeader ztuphdr, Datum *values, bool *nulls) +{ + ArrayBuildState *raw_attrs; + raw_attrs = initArrayResult(TEXTOID, CurrentMemoryContext, false); + if (ZHeapTupleHasMultiLockers(ztuphdr->t_infomask) || + IsZHeapTupleModified(ztuphdr->t_infomask) || + ZHeapTupleHasInvalidXact(ztuphdr->t_infomask)) + { + if (ZHeapTupleHasInvalidXact(ztuphdr->t_infomask)) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("slot-reused"), + false, TEXTOID, CurrentMemoryContext); + } + if (ZHeapTupleHasMultiLockers(ztuphdr->t_infomask)) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("multilock"), + false, TEXTOID, CurrentMemoryContext); + } + if (ztuphdr->t_infomask & ZHEAP_DELETED) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("deleted"), + false, TEXTOID, CurrentMemoryContext); + } + if (ztuphdr->t_infomask & ZHEAP_UPDATED) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("updated"), + false, TEXTOID, CurrentMemoryContext); + } + if (ztuphdr->t_infomask & ZHEAP_INPLACE_UPDATED) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("in-updated"), + false, TEXTOID, CurrentMemoryContext); + } + if ((ztuphdr->t_infomask & ZHEAP_XID_SHR_LOCK) == ZHEAP_XID_SHR_LOCK) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("l-share"), + false, TEXTOID, CurrentMemoryContext); + } + else if (ztuphdr->t_infomask & ZHEAP_XID_NOKEY_EXCL_LOCK) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("l-nokey-ex"), + false, TEXTOID, CurrentMemoryContext); + } + else if (ztuphdr->t_infomask & ZHEAP_XID_KEYSHR_LOCK) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("l-keyshare"), + false, TEXTOID, CurrentMemoryContext); + } + if (ztuphdr->t_infomask & ZHEAP_XID_EXCL_LOCK) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("l-ex"), + false, TEXTOID, CurrentMemoryContext); + } + *values = makeArrayResult(raw_attrs, CurrentMemoryContext); + } + else + *nulls = true; +} + +/* + * zheap_page_items + * + * Allows inspection of line pointers and tuple headers of a zheap page. 
+ */ +PG_FUNCTION_INFO_V1(zheap_page_items); + +typedef struct zheap_page_items_state +{ + TupleDesc tupd; + Page page; + uint16 offset; +} zheap_page_items_state; + +Datum +zheap_page_items(PG_FUNCTION_ARGS) +{ + bytea *raw_page = PG_GETARG_BYTEA_P(0); + zheap_page_items_state *inter_call_data = NULL; + FuncCallContext *fctx; + int raw_page_size; + + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + (errmsg("must be superuser to use raw page functions")))); + + raw_page_size = VARSIZE(raw_page) - VARHDRSZ; + + if (SRF_IS_FIRSTCALL()) + { + TupleDesc tupdesc; + MemoryContext mctx; + int num_trans_slots; + + if (raw_page_size < SizeOfPageHeaderData) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("input page too small (%d bytes)", raw_page_size))); + + fctx = SRF_FIRSTCALL_INIT(); + mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx); + + inter_call_data = palloc(sizeof(zheap_page_items_state)); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + inter_call_data->tupd = tupdesc; + + inter_call_data->offset = FirstOffsetNumber; + inter_call_data->page = VARDATA(raw_page); + + fctx->max_calls = PageGetMaxOffsetNumber(inter_call_data->page); + fctx->user_fctx = inter_call_data; + + /* + * We cannot check whether this is a zheap page or not. But, we can + * check whether pd_special is set correctly so that it contains the + * expected number of transaction slots in the special space. + */ + num_trans_slots = (raw_page_size - ((PageHeader) + (inter_call_data->page))->pd_special) + / sizeof(ZHeapPageOpaqueData); + + if (num_trans_slots != ZHEAP_PAGE_TRANS_SLOTS) + elog(ERROR, "zheap page contains unexpected number of transaction" + "slots: %d, expecting %d", num_trans_slots, ZHEAP_PAGE_TRANS_SLOTS); + + MemoryContextSwitchTo(mctx); + } + + fctx = SRF_PERCALL_SETUP(); + inter_call_data = fctx->user_fctx; + + if (fctx->call_cntr < fctx->max_calls) + { + Page page = inter_call_data->page; + HeapTuple resultTuple; + Datum result; + ItemId id; + Datum values[11]; + bool nulls[11]; + uint16 lp_offset; + uint16 lp_flags; + uint16 lp_len; + + memset(nulls, 0, sizeof(nulls)); + + /* Extract information from the line pointer */ + + id = PageGetItemId(page, inter_call_data->offset); + + lp_offset = ItemIdGetOffset(id); + lp_flags = ItemIdGetFlags(id); + lp_len = ItemIdGetLength(id); + + values[0] = UInt16GetDatum(inter_call_data->offset); + values[1] = UInt16GetDatum(lp_offset); + values[2] = UInt16GetDatum(lp_flags); + values[3] = UInt16GetDatum(lp_len); + + /* + * We do just enough validity checking to make sure we don't reference + * data outside the page passed to us. The page could be corrupt in + * many other ways, but at least we won't crash. + */ + if (ItemIdHasStorage(id) && + lp_len >= MinZHeapTupleSize && + lp_offset + lp_len <= raw_page_size) + { + ZHeapTupleHeader ztuphdr; + bytea *tuple_data_bytea; + int tuple_data_len; + + /* Extract information from the tuple header */ + ztuphdr = (ZHeapTupleHeader) PageGetItem(page, id); + + values[4] = UInt16GetDatum(ZHeapTupleHeaderGetXactSlot(ztuphdr)); + + values[5] = UInt32GetDatum(ztuphdr->t_infomask2); + values[6] = UInt32GetDatum(ztuphdr->t_infomask); + values[7] = UInt8GetDatum(ztuphdr->t_hoff); + + /* + * We already checked that the item is completely within the raw + * page passed to us, with the length given in the line pointer. 
+ * Let's check that t_hoff doesn't point over lp_len, before using + * it to access t_bits. + */ + if (ztuphdr->t_hoff >= SizeofZHeapTupleHeader && + ztuphdr->t_hoff <= lp_len) + { + if (ztuphdr->t_infomask & ZHEAP_HASNULL) + { + int bits_len; + + bits_len = + BITMAPLEN(ZHeapTupleHeaderGetNatts(ztuphdr)) * BITS_PER_BYTE; + values[8] = CStringGetTextDatum( + bits_to_text(ztuphdr->t_bits, bits_len)); + } + else + nulls[8] = true; + + } + else + { + nulls[8] = true; + } + + /* Copy raw tuple data into bytea attribute */ + tuple_data_len = lp_len - ztuphdr->t_hoff; + tuple_data_bytea = (bytea *) palloc(tuple_data_len + VARHDRSZ); + SET_VARSIZE(tuple_data_bytea, tuple_data_len + VARHDRSZ); + memcpy(VARDATA(tuple_data_bytea), (char *) ztuphdr + ztuphdr->t_hoff, + tuple_data_len); + values[9] = PointerGetDatum(tuple_data_bytea); + + decode_infomask(ztuphdr, &values[10], &nulls[10]); + } + else + { + /* + * The line pointer is not used, or it's invalid. Set the remaining + * output columns (4 through 10) to NULL. + */ + int i; + + for (i = 4; i <= 10; i++) + nulls[i] = true; + } + + /* Build and return the result tuple. */ + resultTuple = heap_form_tuple(inter_call_data->tupd, values, nulls); + result = HeapTupleGetDatum(resultTuple); + + inter_call_data->offset++; + + SRF_RETURN_NEXT(fctx, result); + } + else + SRF_RETURN_DONE(fctx); +} + +/* + * zheap_page_slots + * + * Allows inspection of transaction slots of a zheap page. + */ +PG_FUNCTION_INFO_V1(zheap_page_slots); + +typedef struct zheap_page_slots_state +{ + TupleDesc tupd; + Page page; + uint16 slot_id; +} zheap_page_slots_state; + +Datum +zheap_page_slots(PG_FUNCTION_ARGS) +{ + bytea *raw_page = PG_GETARG_BYTEA_P(0); + zheap_page_slots_state *inter_call_data = NULL; + FuncCallContext *fctx; + int raw_page_size; + + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + (errmsg("must be superuser to use raw page functions")))); + + raw_page_size = VARSIZE(raw_page) - VARHDRSZ; + + if (SRF_IS_FIRSTCALL()) + { + TupleDesc tupdesc; + MemoryContext mctx; + int num_trans_slots; + + if (raw_page_size < SizeOfPageHeaderData) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("input page too small (%d bytes)", raw_page_size))); + + fctx = SRF_FIRSTCALL_INIT(); + mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx); + + inter_call_data = palloc(sizeof(zheap_page_slots_state)); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + inter_call_data->tupd = tupdesc; + + inter_call_data->slot_id = 0; + inter_call_data->page = VARDATA(raw_page); + + fctx->user_fctx = inter_call_data; + + /* + * We cannot check whether this is a zheap page or not. But, we can + * check whether pd_special is set correctly so that it contains the + * expected number of transaction slots in the special space. + */ + num_trans_slots = (raw_page_size - ((PageHeader) + (inter_call_data->page))->pd_special) + / sizeof(ZHeapPageOpaqueData); + + if (num_trans_slots != ZHEAP_PAGE_TRANS_SLOTS) + elog(ERROR, "zheap page contains unexpected number of transaction " + "slots: %d, expecting %d", num_trans_slots, ZHEAP_PAGE_TRANS_SLOTS); + + /* + * If the page has a TPD slot, the last slot is used as the TPD slot. + * In that case, it will not have any information about transactions.
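+ * The TPD slot itself is therefore not reported by this function; only
+ * the in-page transaction slots are returned. An illustrative call on a
+ * raw page image (output naturally depends on the page contents) is:
+ *
+ *     SELECT * FROM zheap_page_slots(get_raw_page('test_zheap', 1));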
+ */ + if (ZHeapPageHasTPDSlot((PageHeader) inter_call_data->page)) + num_trans_slots--; + fctx->max_calls = num_trans_slots; + + MemoryContextSwitchTo(mctx); + } + + fctx = SRF_PERCALL_SETUP(); + inter_call_data = fctx->user_fctx; + + if (fctx->call_cntr < fctx->max_calls) + { + Page page = inter_call_data->page; + HeapTuple resultTuple; + Datum result; + Datum values[4]; + bool nulls[4]; + ZHeapPageOpaque opaque; + TransInfo transinfo; + + memset(nulls, 0, sizeof(nulls)); + + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + transinfo = opaque->transinfo[inter_call_data->slot_id]; + + /* Fetch transaction and undo information from slot */ + values[0] = UInt16GetDatum(inter_call_data->slot_id + 1); + values[1] = UInt32GetDatum(transinfo.xid_epoch); + values[2] = UInt32GetDatum(transinfo.xid); + values[3] = UInt64GetDatum(transinfo.urec_ptr); + + /* Build and return the result tuple. */ + resultTuple = heap_form_tuple(inter_call_data->tupd, values, nulls); + result = HeapTupleGetDatum(resultTuple); + + inter_call_data->slot_id++; + + SRF_RETURN_NEXT(fctx, result); + } + else + SRF_RETURN_DONE(fctx); +} diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 4a7121a51f..ca15739f73 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -7245,6 +7245,41 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; + + undo_tablespaces (string) + + undo_tablespaces configuration parameter + + tablespacetemporary + + + + This variable specifies tablespaces in which to store undo data, when + undo-aware storage managers (initially "zheap") perform writes. + + + + The value is a list of names of tablespaces. When there is more than + one name in the list, PostgreSQL chooses an + arbitrary one. If the name doesn't correspond to an existing + tablespace, the next name is tried, and so on until all names have + been tried. If no valid tablespace is specified, an error is raised. + The validation of the name doesn't happen until the first attempt to + write undo data. + + + + The variable can only be changed before the first statement is + executed in a transaction. + + + + The default value is an empty string, which results in all undo data + being stored in the default tablespace. + + + + check_function_bodies (boolean) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b3336ea9be..a3614e7394 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -18463,6 +18463,11 @@ SELECT collation for ('foo' COLLATE "de_DE"); timestamp with time zone + + oldest_xid_with_epoch_having_undo + xid + + diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 96bcc3a63b..eeeb3f9391 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -332,6 +332,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser + + pg_stat_undo_logspg_stat_undo_logs + One row for each undo log, showing current pointers, + transactions and backends. + See for details. + + + @@ -549,7 +557,6 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser into the kernel's handling of I/O. - <structname>pg_stat_activity</structname> View @@ -1646,6 +1653,30 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser TwophaseFileWriteWaiting for a write of a two phase state file. + + UndoCheckpointRead + Waiting for a read from an undo checkpoint file. + + + UndoCheckpointSync + Waiting for changes to an undo checkpoint file to reach stable storage.
+ + + UndoCheckpointWrite + Waiting for a write to an undo checkpoint file. + + + UndoFileRead + Waiting for a read from an undo data file. + + + UndoFileSync + Waiting for changes to an undo data file to reach stable storage. + + + UndoFileWrite + Waiting for a write to an undo data file. + WALBootstrapSync Waiting for WAL to reach stable storage during bootstrapping. @@ -1722,6 +1753,80 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i +
+ <structname>pg_stat_undo_logs</structname> View + + + + + Column + Type + Description + + + + + + log_number + oid + Identifier of this undo log + + + persistence + text + Persistence level of data stored in this undo log; one of + permanent, unlogged or + temporary. + + + tablespace + text + Tablespace that holds physical storage of this undo log. + + + discard + text + Location of the oldest data in this undo log. + + + insert + text + Location where the next data will be written in this undo + log. + + + end + text + Location one byte past the end of the allocated physical storage + backing this undo log. + + + xid + xid + Transaction currently attached to this undo log + for writing. + + + pid + integer + Process ID of the backend currently attached to this undo log + for writing. + + + +
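As a minimal illustration of how this view fits together with the undo_tablespaces setting documented above (a sketch only: the tablespace name undo_ts is assumed to already exist, and the reported values will vary):

    SET undo_tablespaces = 'undo_ts';
    CREATE TABLE zt (a int) USING zheap;
    INSERT INTO zt VALUES (1);
    SELECT log_number, persistence, tablespace, xid, pid FROM pg_stat_undo_logs;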
+ + + The pg_stat_undo_logs view will have one row for + each undo log that exists. Undo logs are extents within a contiguous + addressing space that have their own head and tail pointers. + Each backend that has written undo data is associated with one or more undo + logs, and is the only backend that is allowed to write data to those undo + logs. Backends can be associated with up to three undo logs at a time, + because different undo logs are used for the undo data associated with + permanent, unlogged and temporary relations. + + <structname>pg_stat_replication</structname> View diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml index 8ef2ac8010..a693182f72 100644 --- a/doc/src/sgml/storage.sgml +++ b/doc/src/sgml/storage.sgml @@ -141,6 +141,11 @@ Item Subdirectory containing state files for prepared transactions + + pg_undo + Subdirectory containing undo log meta-data files + + pg_wal Subdirectory containing WAL (Write Ahead Log) files @@ -686,6 +691,57 @@ erased (they will be recreated automatically as needed). + + +Undo Logs + + + Undo Logs + + + +Undo logs hold data that is used for rolling back and for implementing +MVCC in access managers that are undo-aware (currently "zheap"). The storage +format of undo logs is optimized for reusing existing files. + + + +Undo data exists in a 64-bit address space broken up into numbered undo logs +that represent 1TB extents, for efficient management. The space is further +broken up into 1MB segment files, for physical storage. The name of each file +is the address of the first byte in the file, with a period inserted after +the part that indicates the undo log number. + + + +Each undo log is created in a particular tablespace and stores data for a +particular persistence level. +Undo logs are global in the sense that they don't belong to any particular +database and may contain undo data from relations in any database. +Undo files backing undo logs in the default tablespace are stored under +PGDATA/base/undo, and for other +tablespaces under undo in the appropriate tablespace +directory. The system view can be +used to see the cluster's current list of undo logs along with their +tablespaces and persistence levels. + + + +Just as relations can have one of the three persistence levels permanent, +unlogged or temporary, the undo data that is generated by modifying them must +be stored in an undo log of the same persistence level. This enables the +undo data to be discarded at appropriate times along with the relations that +reference it. + + + +Undo log files contain standard page headers as described in the next section, +but the format of the rest of the page is determined by the undo-aware +access method that reads and writes it. + + + + Database Page Layout diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile index 0880e0a8bb..42f1beedff 100644 --- a/src/backend/access/Makefile +++ b/src/backend/access/Makefile @@ -9,6 +9,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \ - table tablesample transam + table tablesample transam undo zheap include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c index 06dd628a5b..a04bfb7d3b 100644 --- a/src/backend/access/common/heaptuple.c +++ b/src/backend/access/common/heaptuple.c @@ -64,16 +64,6 @@ #include "utils/expandeddatum.h" -/* Does att's datatype allow packing into the 1-byte-header varlena format? */ -#define ATT_IS_PACKABLE(att) \ - ((att)->attlen == -1 && (att)->attstorage != 'p') -/* Use this if it's already known varlena */ -#define VARLENA_ATT_IS_PACKABLE(att) \ - ((att)->attstorage != 'p') - -static Datum getmissingattr(TupleDesc tupleDesc, int attnum, bool *isnull); - - /* ---------------------------------------------------------------- * misc support routines * ---------------------------------------------------------------- @@ -82,7 +72,7 @@ static Datum getmissingattr(TupleDesc tupleDesc, int attnum, bool *isnull); /* * Return the missing value of an attribute, or NULL if there isn't one. */ -static Datum +Datum getmissingattr(TupleDesc tupleDesc, int attnum, bool *isnull) { diff --git a/src/backend/access/common/tupconvert.c b/src/backend/access/common/tupconvert.c index fc88aa376a..6daa76893e 100644 --- a/src/backend/access/common/tupconvert.c +++ b/src/backend/access/common/tupconvert.c @@ -21,6 +21,7 @@ #include "access/htup_details.h" #include "access/tupconvert.h" #include "executor/tuptable.h" +#include "access/zheap.h" #include "utils/builtins.h" diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index ab5aaff156..bd603d9069 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -21,6 +21,7 @@ #include "access/xlogutils.h" #include "access/xlog.h" #include "access/transam.h" +#include "access/zheap.h" #include "storage/procarray.h" #include "miscadmin.h" @@ -1070,26 +1071,72 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid)); hitemid = PageGetItemId(hpage, hoffnum); - /* - * Follow any redirections until we find something useful. - */ - while (ItemIdIsRedirected(hitemid)) + if (!(xlrec->flags & XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP)) { - hoffnum = ItemIdGetRedirect(hitemid); - hitemid = PageGetItemId(hpage, hoffnum); - CHECK_FOR_INTERRUPTS(); + /* + * Follow any redirections until we find something useful. + */ + while (ItemIdIsRedirected(hitemid)) + { + hoffnum = ItemIdGetRedirect(hitemid); + hitemid = PageGetItemId(hpage, hoffnum); + CHECK_FOR_INTERRUPTS(); + } } /* * If the heap item has storage, then read the header and use that to * set latestRemovedXid. * + * We have special handling for zheap tuples that are deleted and + * don't have storage. + * * Some LP_DEAD items may not be accessible, so we ignore them. 
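+ * For zheap, an item that is marked deleted may have no tuple storage
+ * left at all; in that case t_data is set to NULL and the deleting
+ * transaction's XID is fetched from the page's transaction slot (or the
+ * undo chain) via ZHeapTupleGetTransInfo.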
*/ - if (ItemIdHasStorage(hitemid)) + if ((xlrec->flags & XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP) && + ItemIdIsDeleted(hitemid)) { - htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid); - HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid); + TransactionId xid; + ZHeapTupleData ztup; + + ztup.t_self = itup->t_tid; + ztup.t_len = ItemIdGetLength(hitemid); + ztup.t_tableOid = InvalidOid; + ztup.t_data = NULL; + ZHeapTupleGetTransInfo(&ztup, hbuffer, NULL, NULL, &xid, NULL, NULL, + false); + if (TransactionIdDidCommit(xid) && + TransactionIdFollows(xid, latestRemovedXid)) + latestRemovedXid = xid; + } + else if (ItemIdHasStorage(hitemid)) + { + if ((xlrec->flags & XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP) != 0) + { + ZHeapTupleHeader ztuphdr; + ZHeapTupleData ztup; + + ztuphdr = (ZHeapTupleHeader) PageGetItem(hpage, hitemid); + ztup.t_self = itup->t_tid; + ztup.t_len = ItemIdGetLength(hitemid); + ztup.t_tableOid = InvalidOid; + ztup.t_data = ztuphdr; + + if (ztuphdr->t_infomask & ZHEAP_DELETED + || ztuphdr->t_infomask & ZHEAP_UPDATED) + { + TransactionId xid; + + ZHeapTupleGetTransInfo(&ztup, hbuffer, NULL, NULL, &xid, + NULL, NULL, false); + ZHeapTupleHeaderAdvanceLatestRemovedXid(ztuphdr, xid, &latestRemovedXid); + } + } + else + { + htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid); + HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid); + } } else if (ItemIdIsDead(hitemid)) { diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c index 3eb722ce26..a2f9693cce 100644 --- a/src/backend/access/hash/hashinsert.c +++ b/src/backend/access/hash/hashinsert.c @@ -25,7 +25,7 @@ #include "storage/predicate.h" static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, - RelFileNode hnode); + Relation heapRel); /* * _hash_doinsert() -- Handle insertion of a single index tuple. @@ -138,7 +138,7 @@ restart_insert: if (IsBufferCleanupOK(buf)) { - _hash_vacuum_one_page(rel, metabuf, buf, heapRel->rd_node); + _hash_vacuum_one_page(rel, metabuf, buf, heapRel); if (PageGetFreeSpace(page) >= itemsz) break; /* OK, now we have enough space */ @@ -337,7 +337,7 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups, static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, - RelFileNode hnode) + Relation heapRel) { OffsetNumber deletable[MaxOffsetNumber]; int ndeletable = 0; @@ -394,8 +394,10 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, xl_hash_vacuum_one_page xlrec; XLogRecPtr recptr; - xlrec.hnode = hnode; + xlrec.hnode = heapRel->rd_node; xlrec.ntuples = ndeletable; + xlrec.flags = RelationStorageIsZHeap(heapRel) ? 
+ XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP : 0; XLogBeginInsert(); XLogRegisterBuffer(0, buf, REGBUF_STANDARD); diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index f769d828ff..8c3427c476 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -89,9 +89,6 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf, static Bitmapset *HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols, HeapTuple oldtup, HeapTuple newtup); -static bool heap_acquire_tuplock(Relation relation, ItemPointer tid, - LockTupleMode mode, LockWaitPolicy wait_policy, - bool *have_tuple_lock); static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask, uint16 old_infomask2, TransactionId add_to_xmax, LockTupleMode mode, bool is_update, @@ -127,36 +124,7 @@ static bool ProjIndexIsUnchanged(Relation relation, HeapTuple oldtup, HeapTuple * Don't look at lockstatus/updstatus directly! Use get_mxact_status_for_lock * instead. */ -static const struct -{ - LOCKMODE hwlock; - int lockstatus; - int updstatus; -} - tupleLockExtraInfo[MaxLockTupleMode + 1] = -{ - { /* LockTupleKeyShare */ - AccessShareLock, - MultiXactStatusForKeyShare, - -1 /* KeyShare does not allow updating tuples */ - }, - { /* LockTupleShare */ - RowShareLock, - MultiXactStatusForShare, - -1 /* Share does not allow updating tuples */ - }, - { /* LockTupleNoKeyExclusive */ - ExclusiveLock, - MultiXactStatusForNoKeyUpdate, - MultiXactStatusNoKeyUpdate - }, - { /* LockTupleExclusive */ - AccessExclusiveLock, - MultiXactStatusForUpdate, - MultiXactStatusUpdate - } -}; /* Get the LOCKMODE for a given MultiXactStatus */ #define LOCKMODE_from_mxstatus(status) \ @@ -169,8 +137,6 @@ static const struct */ #define LockTupleTuplock(rel, tup, mode) \ LockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) -#define UnlockTupleTuplock(rel, tup, mode) \ - UnlockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) #define ConditionalLockTupleTuplock(rel, tup, mode) \ ConditionalLockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) @@ -433,7 +399,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber page) else valid = HeapTupleSatisfies(&loctup, snapshot, buffer); - CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, &loctup, + CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, + (void *) &loctup, buffer, snapshot); if (valid) @@ -648,7 +615,8 @@ heapgettup(HeapScanDesc scan, */ valid = HeapTupleSatisfies(tuple, snapshot, scan->rs_cbuf); - CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, tuple, + CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, + (void *) tuple, scan->rs_cbuf, snapshot); if (valid && key != NULL) @@ -1768,9 +1736,10 @@ heap_fetch(Relation relation, valid = HeapTupleSatisfies(tuple, snapshot, buffer); if (valid) - PredicateLockTuple(relation, tuple, snapshot); + PredicateLockTid(relation, &(tuple->t_self), snapshot, + HeapTupleHeaderGetXmin(tuple->t_data)); - CheckForSerializableConflictOut(valid, relation, tuple, buffer, snapshot); + CheckForSerializableConflictOut(valid, relation, (void *) tuple, buffer, snapshot); LockBuffer(buffer, BUFFER_LOCK_UNLOCK); @@ -1908,7 +1877,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, /* If it's visible per the snapshot, we must return it */ valid = HeapTupleSatisfies(heapTuple, snapshot, buffer); - CheckForSerializableConflictOut(valid, relation, heapTuple, + CheckForSerializableConflictOut(valid, relation, (void *) heapTuple, buffer, 
snapshot); /* reset to original, non-redirected, tid */ heapTuple->t_self = *tid; @@ -1916,7 +1885,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, if (valid) { ItemPointerSetOffsetNumber(tid, offnum); - PredicateLockTuple(relation, heapTuple, snapshot); + PredicateLockTid(relation, &(heapTuple)->t_self, snapshot, + HeapTupleHeaderGetXmin(heapTuple->t_data)); if (all_dead) *all_dead = false; return true; @@ -2082,7 +2052,7 @@ heap_get_latest_tid(Relation relation, * result candidate. */ valid = HeapTupleSatisfies(&tp, snapshot, buffer); - CheckForSerializableConflictOut(valid, relation, &tp, buffer, snapshot); + CheckForSerializableConflictOut(valid, relation, (void *) &tp, buffer, snapshot); if (valid) *tid = ctid; @@ -3036,7 +3006,7 @@ l1: * being visible to the scan (i.e., an exclusive buffer content lock is * continuously held from this point until the tuple delete is visible). */ - CheckForSerializableConflictIn(relation, &tp, buffer); + CheckForSerializableConflictIn(relation, &(tp.t_self), buffer); /* replace cid with a combo cid if necessary */ HeapTupleHeaderAdjustCmax(tp.t_data, &cid, &iscombo); @@ -3962,7 +3932,7 @@ l2: * will include checking the relation level, there is no benefit to a * separate check for the new tuple. */ - CheckForSerializableConflictIn(relation, &oldtup, buffer); + CheckForSerializableConflictIn(relation, &(oldtup.t_self), buffer); /* * At this point newbuf and buffer are both pinned and locked, and newbuf @@ -5114,7 +5084,7 @@ out_unlocked: * Returns false if it was unable to obtain the lock; this can only happen if * wait_policy is Skip. */ -static bool +bool heap_acquire_tuplock(Relation relation, ItemPointer tid, LockTupleMode mode, LockWaitPolicy wait_policy, bool *have_tuple_lock) { diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 95513dfec8..80c6e5ed82 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -1399,9 +1399,10 @@ heapam_scan_bitmap_pagescan(TableScanDesc sscan, if (valid) { scan->rs_vistuples[ntup++] = offnum; - PredicateLockTuple(scan->rs_scan.rs_rd, &loctup, snapshot); + PredicateLockTid(scan->rs_scan.rs_rd, &loctup.t_self, snapshot, + HeapTupleHeaderGetXmin(loctup.t_data)); } - CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, &loctup, + CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, (void *) &loctup, buffer, snapshot); } } @@ -1627,7 +1628,7 @@ heapam_scan_sample_next_tuple(TableScanDesc sscan, struct SampleScanState *scans /* in pagemode, heapgetpage did this for us */ if (!pagemode) - CheckForSerializableConflictOut(visible, scan->rs_scan.rs_rd, tuple, + CheckForSerializableConflictOut(visible, scan->rs_scan.rs_rd, (void *) tuple, scan->rs_cbuf, scan->rs_scan.rs_snapshot); /* Try next tuple from same page. 
*/ diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c index 1ac1a20c1d..eb03fae9a4 100644 --- a/src/backend/access/heap/heapam_visibility.c +++ b/src/backend/access/heap/heapam_visibility.c @@ -760,6 +760,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot, Assert(htup->t_tableOid != InvalidOid); snapshot->xmin = snapshot->xmax = InvalidTransactionId; + snapshot->subxid = InvalidSubTransactionId; snapshot->speculativeToken = 0; if (!HeapTupleHeaderXminCommitted(tuple)) @@ -1842,3 +1843,73 @@ HeapTupleSatisfies(HeapTuple stup, Snapshot snapshot, Buffer buffer) return false; /* keep compiler quiet */ } + +/* + * This is a helper function for CheckForSerializableConflictOut. + * + * Check to see whether the tuple has been written to by a concurrent + * transaction, either to create it not visible to us, or to delete it + * while it is visible to us. The "visible" bool indicates whether the + * tuple is visible to us, while HeapTupleSatisfiesVacuum checks what else + * is going on with it. The caller should have a share lock on the buffer. + */ +bool +HeapTupleHasSerializableConflictOut(bool visible, HeapTuple tuple, Buffer buffer, + TransactionId *xid) +{ + HTSV_Result htsvResult; + htsvResult = HeapTupleSatisfiesVacuum(tuple, TransactionXmin, buffer); + switch (htsvResult) + { + case HEAPTUPLE_LIVE: + if (visible) + return false; + *xid = HeapTupleHeaderGetXmin(tuple->t_data); + break; + case HEAPTUPLE_RECENTLY_DEAD: + if (!visible) + return false; + *xid = HeapTupleHeaderGetUpdateXid(tuple->t_data); + break; + case HEAPTUPLE_DELETE_IN_PROGRESS: + *xid = HeapTupleHeaderGetUpdateXid(tuple->t_data); + break; + case HEAPTUPLE_INSERT_IN_PROGRESS: + *xid = HeapTupleHeaderGetXmin(tuple->t_data); + break; + case HEAPTUPLE_DEAD: + return false; + default: + + /* + * The only way to get to this default clause is if a new value is + * added to the enum type without adding it to this switch + * statement. That's a bug, so elog. + */ + elog(ERROR, "unrecognized return value from HeapTupleSatisfiesVacuum: %u", htsvResult); + + /* + * In spite of having all enum values covered and calling elog on + * this default, some compilers think this is a code path which + * allows xid to be used below without initialization. Silence + * that warning. + */ + *xid = InvalidTransactionId; + } + Assert(TransactionIdIsValid(*xid)); + Assert(TransactionIdFollowsOrEquals(*xid, TransactionXmin)); + + /* + * Find top level xid. Bail out if xid is too early to be a conflict, or + * if it's our own xid. + */ + if (TransactionIdEquals(*xid, GetTopTransactionIdIfAny())) + return false; + *xid = SubTransGetTopmostTransaction(*xid); + if (TransactionIdPrecedes(*xid, TransactionXmin)) + return false; + if (TransactionIdEquals(*xid, GetTopTransactionIdIfAny())) + return false; + + return true; +} diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index b8b5871559..856a14c90b 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -19,6 +19,8 @@ #include "access/hio.h" #include "access/htup_details.h" #include "access/visibilitymap.h" +#include "access/zheap.h" +#include "access/zhtup.h" #include "storage/bufmgr.h" #include "storage/freespace.h" #include "storage/lmgr.h" @@ -76,7 +78,7 @@ RelationPutHeapTuple(Relation relation, /* * Read in a buffer, using bulk-insert strategy if bistate isn't NULL. 
*/ -static Buffer +Buffer ReadBufferBI(Relation relation, BlockNumber targetBlock, BulkInsertState bistate) { @@ -118,7 +120,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock, * must not be InvalidBuffer. If both buffers are specified, buffer1 must * be less than buffer2. */ -static void +void GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, BlockNumber block1, BlockNumber block2, Buffer *vmbuffer1, Buffer *vmbuffer2) @@ -174,7 +176,7 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, * amount which ramps up as the degree of contention ramps up, but limiting * the result to some sane overall value. */ -static void +void RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) { BlockNumber blockNum, @@ -216,7 +218,17 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) BufferGetBlockNumber(buffer), RelationGetRelationName(relation)); - PageInit(page, BufferGetPageSize(buffer), 0); + if (RelationStorageIsZHeap(relation)) + { + Assert(BufferGetBlockNumber(buffer) != ZHEAP_METAPAGE); + ZheapInitPage(page, BufferGetPageSize(buffer)); + freespace = PageGetZHeapFreeSpace(page); + } + else + { + PageInit(page, BufferGetPageSize(buffer), 0); + freespace = PageGetHeapFreeSpace(page); + } /* * We mark all the new buffers dirty, but do nothing to write them @@ -227,8 +239,6 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) /* we'll need this info below */ blockNum = BufferGetBlockNumber(buffer); - freespace = PageGetHeapFreeSpace(page); - UnlockReleaseBuffer(buffer); /* Remember first block number thus added. */ diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c index 486cde4aff..385f1b20f5 100644 --- a/src/backend/access/heap/tuptoaster.c +++ b/src/backend/access/heap/tuptoaster.c @@ -71,20 +71,10 @@ typedef struct toast_compress_header static void toast_delete_datum(Relation rel, Datum value, bool is_speculative); static Datum toast_save_datum(Relation rel, Datum value, struct varlena *oldexternal, int options); -static bool toastrel_valueid_exists(Relation toastrel, Oid valueid); -static bool toastid_valueid_exists(Oid toastrelid, Oid valueid); static struct varlena *toast_fetch_datum(struct varlena *attr); static struct varlena *toast_fetch_datum_slice(struct varlena *attr, int32 sliceoffset, int32 length); static struct varlena *toast_decompress_datum(struct varlena *attr); -static int toast_open_indexes(Relation toastrel, - LOCKMODE lock, - Relation **toastidxs, - int *num_indexes); -static void toast_close_indexes(Relation *toastidxs, int num_indexes, - LOCKMODE lock); -static void init_toast_snapshot(Snapshot toast_snapshot); - /* ---------- * heap_tuple_fetch_attr - @@ -1787,7 +1777,7 @@ toast_delete_datum(Relation rel, Datum value, bool is_speculative) * toast rows with that ID; see notes for GetNewOidWithIndex(). * ---------- */ -static bool +bool toastrel_valueid_exists(Relation toastrel, Oid valueid) { bool result = false; @@ -1835,7 +1825,7 @@ toastrel_valueid_exists(Relation toastrel, Oid valueid) * As above, but work from toast rel's OID not an open relation * ---------- */ -static bool +bool toastid_valueid_exists(Oid toastrelid, Oid valueid) { bool result; @@ -2289,7 +2279,7 @@ toast_decompress_datum(struct varlena *attr) * relation in this array. It is the responsibility of the caller of this * function to close the indexes as well as free them. 
*/ -static int +int toast_open_indexes(Relation toastrel, LOCKMODE lock, Relation **toastidxs, @@ -2348,7 +2338,7 @@ toast_open_indexes(Relation toastrel, * Close an array of indexes for a toast relation and free it. This should * be called for a set of indexes opened previously with toast_open_indexes. */ -static void +void toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock) { int i; @@ -2367,7 +2357,7 @@ toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock) * just use the oldest one. This is safe: at worst, we will get a "snapshot * too old" error that might have been avoided otherwise. */ -static void +void init_toast_snapshot(Snapshot toast_snapshot) { Snapshot snapshot = GetOldestSnapshot(); diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 429f9ad52a..36139e39f4 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -44,6 +44,7 @@ #include "access/transam.h" #include "access/visibilitymap.h" #include "access/xlog.h" +#include "access/zhtup.h" #include "catalog/storage.h" #include "commands/dbcommands.h" #include "commands/progress.h" @@ -83,15 +84,6 @@ #define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */ #define VACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */ -/* - * When a table has no indexes, vacuum the FSM after every 8GB, approximately - * (it won't be exact because we only vacuum FSM after processing a heap page - * that has some removable tuples). When there are indexes, this is ignored, - * and we vacuum FSM after each index/heap cleaning pass. - */ -#define VACUUM_FSM_EVERY_PAGES \ - ((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ)) - /* * Guesstimation of number of dead tuples per page. This is used to * provide an upper limit to memory allocated when vacuuming small @@ -111,35 +103,6 @@ */ #define PREFETCH_SIZE ((BlockNumber) 32) -typedef struct LVRelStats -{ - /* hasindex = true means two-pass strategy; false means one-pass */ - bool hasindex; - /* Overall statistics about rel */ - BlockNumber old_rel_pages; /* previous value of pg_class.relpages */ - BlockNumber rel_pages; /* total number of pages */ - BlockNumber scanned_pages; /* number of pages we examined */ - BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */ - BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */ - BlockNumber tupcount_pages; /* pages whose tuples we counted */ - double old_live_tuples; /* previous value of pg_class.reltuples */ - double new_rel_tuples; /* new estimated total # of tuples */ - double new_live_tuples; /* new estimated total # of live tuples */ - double new_dead_tuples; /* new estimated total # of dead tuples */ - BlockNumber pages_removed; - double tuples_deleted; - BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */ - /* List of TIDs of tuples we intend to delete */ - /* NB: this list is ordered by TID address */ - int num_dead_tuples; /* current # of entries */ - int max_dead_tuples; /* # slots allocated in array */ - ItemPointer dead_tuples; /* array of ItemPointerData */ - int num_index_scans; - TransactionId latestRemovedXid; - bool lock_waiter_detected; -} LVRelStats; - - /* A few variables that don't seem worth passing around as parameters */ static int elevel = -1; @@ -156,21 +119,12 @@ static void lazy_scan_heap(Relation onerel, int options, bool aggressive); static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); static bool lazy_check_needs_freeze(Buffer buf, bool *hastup); 
-static void lazy_vacuum_index(Relation indrel, - IndexBulkDeleteResult **stats, - LVRelStats *vacrelstats); -static void lazy_cleanup_index(Relation indrel, - IndexBulkDeleteResult *stats, - LVRelStats *vacrelstats); static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer, int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer); -static bool should_attempt_truncation(LVRelStats *vacrelstats); -static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats); static BlockNumber count_nondeletable_pages(Relation onerel, - LVRelStats *vacrelstats); + LVRelStats *vacrelstats, + BufferAccessStrategy vac_strategy); static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks); -static void lazy_record_dead_tuple(LVRelStats *vacrelstats, - ItemPointer itemptr); static bool lazy_tid_reaped(ItemPointer itemptr, void *state); static int vac_cmp_itemptr(const void *left, const void *right); static bool heap_page_is_all_visible(Relation rel, Buffer buf, @@ -287,7 +241,7 @@ heap_vacuum_rel(Relation onerel, int options, VacuumParams *params, * Optionally truncate the relation. */ if (should_attempt_truncation(vacrelstats)) - lazy_truncate_heap(onerel, vacrelstats); + lazy_truncate_heap(onerel, vacrelstats, vac_strategy); /* Report that we are now doing final cleanup */ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, @@ -746,7 +700,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, for (i = 0; i < nindexes; i++) lazy_vacuum_index(Irel[i], &indstats[i], - vacrelstats); + vacrelstats, + vac_strategy); /* * Report that we are now vacuuming the heap. We also increase @@ -1385,7 +1340,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, for (i = 0; i < nindexes; i++) lazy_vacuum_index(Irel[i], &indstats[i], - vacrelstats); + vacrelstats, + vac_strategy); /* Report that we are now vacuuming the heap */ hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP; @@ -1413,7 +1369,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, /* Do post-vacuum cleanup and statistics update for each index */ for (i = 0; i < nindexes; i++) - lazy_cleanup_index(Irel[i], indstats[i], vacrelstats); + lazy_cleanup_index(Irel[i], indstats[i], vacrelstats, vac_strategy); /* If no indexes, make log report that lazy_vacuum_heap would've made */ if (vacuumed_pages) @@ -1682,10 +1638,11 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup) * Delete all the index entries pointing to tuples listed in * vacrelstats->dead_tuples, and update running statistics. */ -static void +void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, - LVRelStats *vacrelstats) + LVRelStats *vacrelstats, + BufferAccessStrategy vac_strategy) { IndexVacuumInfo ivinfo; PGRUsage ru0; @@ -1714,10 +1671,11 @@ lazy_vacuum_index(Relation indrel, /* * lazy_cleanup_index() -- do post-vacuum cleanup for one index relation. */ -static void +void lazy_cleanup_index(Relation indrel, IndexBulkDeleteResult *stats, - LVRelStats *vacrelstats) + LVRelStats *vacrelstats, + BufferAccessStrategy vac_strategy) { IndexVacuumInfo ivinfo; PGRUsage ru0; @@ -1790,7 +1748,7 @@ lazy_cleanup_index(Relation indrel, * called for before we actually do it. If you change the logic here, be * careful to depend only on fields that lazy_scan_heap updates on-the-fly. 
*/ -static bool +bool should_attempt_truncation(LVRelStats *vacrelstats) { BlockNumber possibly_freeable; @@ -1808,8 +1766,9 @@ should_attempt_truncation(LVRelStats *vacrelstats) /* * lazy_truncate_heap - try to truncate off any empty pages at the end */ -static void -lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats) +void +lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats, + BufferAccessStrategy vac_strategy) { BlockNumber old_rel_pages = vacrelstats->rel_pages; BlockNumber new_rel_pages; @@ -1889,7 +1848,8 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats) * other backends could have added tuples to these pages whilst we * were vacuuming. */ - new_rel_pages = count_nondeletable_pages(onerel, vacrelstats); + new_rel_pages = count_nondeletable_pages(onerel, vacrelstats, + vac_strategy); if (new_rel_pages >= old_rel_pages) { @@ -1937,7 +1897,8 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats) * Returns number of nondeletable pages (last nonempty page + 1). */ static BlockNumber -count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats) +count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats, + BufferAccessStrategy vac_strategy) { BlockNumber blkno; BlockNumber prefetchedUntil; @@ -2050,6 +2011,17 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats) * this page. We formerly thought that DEAD tuples could be * thrown away, but that's not so, because we'd not have cleaned * out their index entries. + * + * XXX - This function is used by both heap and zheap and the + * behavior must be same in both the cases. However, for zheap, + * there could be some unused items that contain pending xact + * information for the current transaction. It is okay to + * truncate such pages as even if the transaction rolled back + * after this point, we won't be reclaiming the truncated pages + * or making the unused items back to dead. We can add Assert + * to check if the pending xact is the current transaction, but to + * do that we need some storage engine specific check which seems + * too much for the purpose for which it is required. 
*/ if (ItemIdIsUsed(itemid)) { @@ -2113,7 +2085,7 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks) /* * lazy_record_dead_tuple - remember one deletable tuple */ -static void +void lazy_record_dead_tuple(LVRelStats *vacrelstats, ItemPointer itemptr) { diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c index 695567b4b0..0ed915a58e 100644 --- a/src/backend/access/heap/visibilitymap.c +++ b/src/backend/access/heap/visibilitymap.c @@ -87,6 +87,8 @@ #include "access/heapam_xlog.h" #include "access/visibilitymap.h" +#include "access/zheapam_xlog.h" +#include "access/zheap.h" #include "access/xlog.h" #include "miscadmin.h" #include "storage/bufmgr.h" @@ -285,7 +287,10 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf, #endif Assert(InRecovery || XLogRecPtrIsInvalid(recptr)); - Assert(InRecovery || BufferIsValid(heapBuf)); + + /* For zheap we do not set heapBuf's status hence can be invalid */ + Assert(RelationStorageIsZHeap(rel) || + (InRecovery || BufferIsValid(heapBuf))); Assert(flags & VISIBILITYMAP_VALID_BITS); /* Check that we have the right heap page pinned, if present */ @@ -312,20 +317,32 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf, if (XLogRecPtrIsInvalid(recptr)) { Assert(!InRecovery); - recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf, - cutoff_xid, flags); - - /* - * If data checksums are enabled (or wal_log_hints=on), we - * need to protect the heap page from being torn. - */ - if (XLogHintBitIsNeeded()) + if (RelationStorageIsZHeap(rel)) { - Page heapPage = BufferGetPage(heapBuf); - - /* caller is expected to set PD_ALL_VISIBLE first */ - Assert(PageIsAllVisible(heapPage)); - PageSetLSN(heapPage, recptr); + recptr = log_zheap_visible(rel->rd_node, heapBuf, vmBuf, + cutoff_xid, flags); + /* + * We do not have a page wise visibility flag in zheap. + * So no need to set LSN on zheap page. + */ + } + else + { + recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf, + cutoff_xid, flags); + + /* + * If data checksums are enabled (or wal_log_hints=on), we + * need to protect the heap page from being torn. 
+ */ + if (XLogHintBitIsNeeded()) + { + Page heapPage = BufferGetPage(heapBuf); + + /* caller is expected to set PD_ALL_VISIBLE first */ + Assert(PageIsAllVisible(heapPage)); + PageSetLSN(heapPage, recptr); + } } } PageSetLSN(page, recptr); diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index fe5af31f87..23271d0cde 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -206,7 +206,7 @@ index_insert(Relation indexRelation, if (!(indexRelation->rd_amroutine->ampredlocks)) CheckForSerializableConflictIn(indexRelation, - (HeapTuple) NULL, + (ItemPointer) NULL, InvalidBuffer); return indexRelation->rd_amroutine->aminsert(indexRelation, values, isnull, diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index b2ad95f970..ed7c3163af 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -57,7 +57,7 @@ static TransactionId _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel, Buffer buf, OffsetNumber offset, ScanKey itup_scankey, IndexUniqueCheck checkUnique, bool *is_unique, - uint32 *speculativeToken); + uint32 *speculativeToken, SubTransactionId *subxid); static void _bt_findinsertloc(Relation rel, Buffer *bufptr, OffsetNumber *offsetptr, @@ -250,10 +250,12 @@ top: { TransactionId xwait; uint32 speculativeToken; + SubTransactionId subxid = InvalidSubTransactionId; offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false); xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey, - checkUnique, &is_unique, &speculativeToken); + checkUnique, &is_unique, &speculativeToken, + &subxid); if (TransactionIdIsValid(xwait)) { @@ -267,9 +269,12 @@ top: */ if (speculativeToken) SpeculativeInsertionWait(xwait, speculativeToken); + else if (subxid != InvalidSubTransactionId) + SubXactLockTableWait(xwait, subxid, rel, &itup->t_tid, + XLTW_InsertIndex); else - XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex); - + XactLockTableWait(xwait, rel, &itup->t_tid, + XLTW_InsertIndex); /* start over... */ if (stack) _bt_freestack(stack); @@ -331,7 +336,7 @@ static TransactionId _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel, Buffer buf, OffsetNumber offset, ScanKey itup_scankey, IndexUniqueCheck checkUnique, bool *is_unique, - uint32 *speculativeToken) + uint32 *speculativeToken, SubTransactionId *subxid) { TupleDesc itupdesc = RelationGetDescr(rel); int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); @@ -449,6 +454,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel, _bt_relbuf(rel, nbuf); /* Tell _bt_doinsert to wait... */ *speculativeToken = SnapshotDirty.speculativeToken; + *subxid = SnapshotDirty.subxid; return xwait; } diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index 4082103fe2..d893a00ed5 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -1067,6 +1067,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, xlrec_delete.hnode = heapRel->rd_node; xlrec_delete.nitems = nitems; + xlrec_delete.flags = RelationStorageIsZHeap(heapRel) ? 
+ XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP : 0; XLogBeginInsert(); XLogRegisterBuffer(0, buf, REGBUF_STANDARD); diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 67a94cb80a..395e642501 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -21,6 +21,7 @@ #include "access/transam.h" #include "access/xlog.h" #include "access/xlogutils.h" +#include "access/zheap.h" #include "storage/procarray.h" #include "miscadmin.h" @@ -543,7 +544,6 @@ btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record) ItemId iitemid, hitemid; IndexTuple itup; - HeapTupleHeader htuphdr; BlockNumber hblkno; OffsetNumber hoffnum; TransactionId latestRemovedXid = InvalidTransactionId; @@ -622,27 +622,75 @@ btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record) hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid)); hitemid = PageGetItemId(hpage, hoffnum); - /* - * Follow any redirections until we find something useful. - */ - while (ItemIdIsRedirected(hitemid)) + if (!(xlrec->flags & XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP)) { - hoffnum = ItemIdGetRedirect(hitemid); - hitemid = PageGetItemId(hpage, hoffnum); - CHECK_FOR_INTERRUPTS(); + /* + * Follow any redirections until we find something useful. + */ + while (ItemIdIsRedirected(hitemid)) + { + hoffnum = ItemIdGetRedirect(hitemid); + hitemid = PageGetItemId(hpage, hoffnum); + CHECK_FOR_INTERRUPTS(); + } } /* * If the heap item has storage, then read the header and use that to * set latestRemovedXid. * + * We have special handling for zheap tuples that are deleted and + * don't have storage. + * * Some LP_DEAD items may not be accessible, so we ignore them. */ - if (ItemIdHasStorage(hitemid)) + if ((xlrec->flags & XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP) && + ItemIdIsDeleted(hitemid)) { - htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid); + TransactionId xid; + ZHeapTupleData ztup; + + ztup.t_self = itup->t_tid; + ztup.t_len = ItemIdGetLength(hitemid); + ztup.t_tableOid = InvalidOid; + ztup.t_data = NULL; + ZHeapTupleGetTransInfo(&ztup, hbuffer, NULL, NULL, &xid, NULL, NULL, + false); + if (TransactionIdDidCommit(xid) && + TransactionIdFollows(xid, latestRemovedXid)) + latestRemovedXid = xid; + } + else if (ItemIdHasStorage(hitemid)) + { + if ((xlrec->flags & XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP) != 0) + { + ZHeapTupleHeader ztuphdr; + ZHeapTupleData ztup; + + ztuphdr = (ZHeapTupleHeader) PageGetItem(hpage, hitemid); + ztup.t_self = itup->t_tid; + ztup.t_len = ItemIdGetLength(hitemid); + ztup.t_tableOid = InvalidOid; + ztup.t_data = ztuphdr; + + if (ztuphdr->t_infomask & ZHEAP_DELETED + || ztuphdr->t_infomask & ZHEAP_UPDATED) + { + TransactionId xid; + + ZHeapTupleGetTransInfo(&ztup, hbuffer, NULL, NULL, &xid, + NULL, NULL, false); + elog(DEBUG1, "TransactionId: %d",xid); + ZHeapTupleHeaderAdvanceLatestRemovedXid(ztuphdr, xid, &latestRemovedXid); + } + } + else + { + HeapTupleHeader htuphdr; + htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid); - HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid); + HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid); + } } else if (ItemIdIsDead(hitemid)) { diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile index 5514db1dda..9d3cdf8233 100644 --- a/src/backend/access/rmgrdesc/Makefile +++ b/src/backend/access/rmgrdesc/Makefile @@ -11,6 +11,7 @@ include $(top_builddir)/src/Makefile.global OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \ 
gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \ mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \ - smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o + smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o tpddesc.o undoactiondesc.o \ + undologdesc.o xactdesc.o xlogdesc.o zheapamdesc.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/rmgrdesc/tpddesc.c b/src/backend/access/rmgrdesc/tpddesc.c new file mode 100644 index 0000000000..a41c2ddff6 --- /dev/null +++ b/src/backend/access/rmgrdesc/tpddesc.c @@ -0,0 +1,73 @@ +/*------------------------------------------------------------------------- + * + * tpddesc.c + * rmgr descriptor routines for access/undo/tpdxlog.c + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/rmgrdesc/tpddesc.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd_xlog.h" + +void +tpd_desc(StringInfo buf, XLogReaderState *record) +{ + char *rec = XLogRecGetData(record); + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + info &= XLOG_TPD_OPMASK; + if (info == XLOG_ALLOCATE_TPD_ENTRY) + { + xl_tpd_allocate_entry *xlrec = (xl_tpd_allocate_entry *) rec; + + appendStringInfo(buf, "prevblk %u nextblk %u offset %u", + xlrec->prevblk, xlrec->nextblk, xlrec->offnum); + } + else if (info == XLOG_TPD_FREE_PAGE) + { + xl_tpd_free_page *xlrec = (xl_tpd_free_page *) rec; + + appendStringInfo(buf, "prevblk %u nextblk %u", + xlrec->prevblkno, xlrec->nextblkno); + } +} + +const char * +tpd_identify(uint8 info) +{ + const char *id = NULL; + + switch (info & ~XLR_INFO_MASK) + { + case XLOG_ALLOCATE_TPD_ENTRY: + id = "ALLOCATE TPD ENTRY"; + break; + case XLOG_ALLOCATE_TPD_ENTRY | XLOG_TPD_INIT_PAGE: + id = "ALLOCATE TPD ENTRY+INIT"; + break; + case XLOG_TPD_CLEAN: + id = "TPD CLEAN"; + break; + case XLOG_TPD_CLEAR_LOCATION: + id = "TPD CLEAR LOCATION"; + break; + case XLOG_INPLACE_UPDATE_TPD_ENTRY: + id = "INPLACE UPDATE TPD ENTRY"; + break; + case XLOG_TPD_FREE_PAGE: + id = "TPD FREE PAGE"; + break; + case XLOG_TPD_CLEAN_ALL_ENTRIES: + id = "TPD CLEAN ALL ENTRIES"; + break; + } + + return id; +} diff --git a/src/backend/access/rmgrdesc/undoactiondesc.c b/src/backend/access/rmgrdesc/undoactiondesc.c new file mode 100644 index 0000000000..0a23fece4a --- /dev/null +++ b/src/backend/access/rmgrdesc/undoactiondesc.c @@ -0,0 +1,67 @@ +/*------------------------------------------------------------------------- + * + * undoactiondesc.c + * rmgr descriptor routines for access/undo/undoactionxlog.c + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/rmgrdesc/undoactiondesc.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/undoaction_xlog.h" + +void +undoaction_desc(StringInfo buf, XLogReaderState *record) +{ + char *rec = XLogRecGetData(record); + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + if (info == XLOG_UNDO_PAGE) + { + uint8 *flags = (uint8 *) rec; + + appendStringInfo(buf, "page_contains_tpd_slot: %c ", + (*flags & XLU_PAGE_CONTAINS_TPD_SLOT) ? 'T' : 'F'); + appendStringInfo(buf, "is_page_initialized: %c ", + (*flags & XLU_INIT_PAGE) ? 
'T' : 'F'); + if (*flags & XLU_PAGE_CONTAINS_TPD_SLOT) + { + xl_undoaction_page *xlrec = + (xl_undoaction_page *) ((char *) flags + sizeof(uint8)); + + appendStringInfo(buf, "urec_ptr %lu xid %u trans_slot_id %u", + xlrec->urec_ptr, xlrec->xid, xlrec->trans_slot_id); + } + } + else if (info == XLOG_UNDO_RESET_SLOT) + { + xl_undoaction_reset_slot *xlrec = (xl_undoaction_reset_slot *) rec; + + appendStringInfo(buf, "urec_ptr %lu trans_slot_id %u", + xlrec->urec_ptr, xlrec->trans_slot_id); + } +} + +const char * +undoaction_identify(uint8 info) +{ + const char *id = NULL; + + switch (info & ~XLR_INFO_MASK) + { + case XLOG_UNDO_PAGE: + id = "UNDO PAGE"; + break; + case XLOG_UNDO_RESET_SLOT: + id = "UNDO RESET SLOT"; + break; + } + + return id; +} diff --git a/src/backend/access/rmgrdesc/undologdesc.c b/src/backend/access/rmgrdesc/undologdesc.c new file mode 100644 index 0000000000..5855b9b49e --- /dev/null +++ b/src/backend/access/rmgrdesc/undologdesc.c @@ -0,0 +1,104 @@ +/*------------------------------------------------------------------------- + * + * undologdesc.c + * rmgr descriptor routines for access/undo/undolog.c + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/rmgrdesc/undologdesc.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/undolog.h" +#include "access/undolog_xlog.h" + +void +undolog_desc(StringInfo buf, XLogReaderState *record) +{ + char *rec = XLogRecGetData(record); + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + if (info == XLOG_UNDOLOG_CREATE) + { + xl_undolog_create *xlrec = (xl_undolog_create *) rec; + + appendStringInfo(buf, "logno %u", xlrec->logno); + } + else if (info == XLOG_UNDOLOG_EXTEND) + { + xl_undolog_extend *xlrec = (xl_undolog_extend *) rec; + + appendStringInfo(buf, "logno %u end " UndoLogOffsetFormat, + xlrec->logno, xlrec->end); + } + else if (info == XLOG_UNDOLOG_ATTACH) + { + xl_undolog_attach *xlrec = (xl_undolog_attach *) rec; + + appendStringInfo(buf, "logno %u xid %u", xlrec->logno, xlrec->xid); + } + else if (info == XLOG_UNDOLOG_META) + { + xl_undolog_meta *xlrec = (xl_undolog_meta *) rec; + + appendStringInfo(buf, "logno %u xid %u insert " UndoLogOffsetFormat + " last_xact_start " UndoLogOffsetFormat + " prevlen=%d" + " is_first_record=%d", + xlrec->logno, xlrec->xid, xlrec->meta.insert, + xlrec->meta.last_xact_start, + xlrec->meta.prevlen, + xlrec->meta.is_first_rec); + } + else if (info == XLOG_UNDOLOG_DISCARD) + { + xl_undolog_discard *xlrec = (xl_undolog_discard *) rec; + + appendStringInfo(buf, "logno %u discard " UndoLogOffsetFormat " end " + UndoLogOffsetFormat, + xlrec->logno, xlrec->discard, xlrec->end); + } + else if (info == XLOG_UNDOLOG_REWIND) + { + xl_undolog_rewind *xlrec = (xl_undolog_rewind *) rec; + + appendStringInfo(buf, "logno %u insert " UndoLogOffsetFormat " prevlen %d", + xlrec->logno, xlrec->insert, xlrec->prevlen); + } + +} + +const char * +undolog_identify(uint8 info) +{ + const char *id = NULL; + + switch (info & ~XLR_INFO_MASK) + { + case XLOG_UNDOLOG_CREATE: + id = "CREATE"; + break; + case XLOG_UNDOLOG_EXTEND: + id = "EXTEND"; + break; + case XLOG_UNDOLOG_ATTACH: + id = "ATTACH"; + break; + case XLOG_UNDOLOG_META: + id = "UNDO_META"; + break; + case XLOG_UNDOLOG_DISCARD: + id = "DISCARD"; + break; + case XLOG_UNDOLOG_REWIND: + id = "REWIND"; + break; + } + + return id; +} 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c index 00741c7b09..987e39c830 100644 --- a/src/backend/access/rmgrdesc/xlogdesc.c +++ b/src/backend/access/rmgrdesc/xlogdesc.c @@ -47,7 +47,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record) "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; " "oldest xid %u in DB %u; oldest multi %u in DB %u; " "oldest/newest commit timestamp xid: %u/%u; " - "oldest running xid %u; %s", + "oldest running xid %u; " + "oldest xid with epoch having undo " UINT64_FORMAT "; %s", (uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo, checkpoint->ThisTimeLineID, checkpoint->PrevTimeLineID, @@ -63,6 +64,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record) checkpoint->oldestCommitTsXid, checkpoint->newestCommitTsXid, checkpoint->oldestActiveXid, + checkpoint->oldestXidWithEpochHavingUndo, (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online"); } else if (info == XLOG_NEXTOID) diff --git a/src/backend/access/rmgrdesc/zheapamdesc.c b/src/backend/access/rmgrdesc/zheapamdesc.c new file mode 100644 index 0000000000..a5d88c21b0 --- /dev/null +++ b/src/backend/access/rmgrdesc/zheapamdesc.c @@ -0,0 +1,185 @@ +/*------------------------------------------------------------------------- + * + * zheapamdesc.c + * rmgr descriptor routines for access/zheap/zheapamxlog.c + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/rmgrdesc/zheapamdesc.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/zheapam_xlog.h" + +void +zheap_desc(StringInfo buf, XLogReaderState *record) +{ + char *rec = XLogRecGetData(record); + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + info &= XLOG_ZHEAP_OPMASK; + if (info == XLOG_ZHEAP_CLEAN) + { + xl_zheap_clean *xlrec = (xl_zheap_clean *) rec; + + appendStringInfo(buf, "remxid %u", xlrec->latestRemovedXid); + } + else if (info == XLOG_ZHEAP_INSERT) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_insert *xlrec = (xl_zheap_insert *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "off %u, blkprev %lu", xlrec->offnum, xlundohdr->blkprev); + } + else if(info == XLOG_ZHEAP_MULTI_INSERT) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_multi_insert *xlrec = (xl_zheap_multi_insert *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "%d tuples", xlrec->ntuples); + } + else if (info == XLOG_ZHEAP_DELETE) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_delete *xlrec = (xl_zheap_delete *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "off %u, trans_slot %u, hasUndoTuple: %c, blkprev %lu", + xlrec->offnum, xlrec->trans_slot_id, + (xlrec->flags & XLZ_HAS_DELETE_UNDOTUPLE) ? 'T' : 'F', + xlundohdr->blkprev); + } + else if (info == XLOG_ZHEAP_UPDATE) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_update *xlrec = (xl_zheap_update *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "oldoff %u, trans_slot %u, hasUndoTuple: %c, newoff: %u, blkprev %lu", + xlrec->old_offnum, xlrec->old_trans_slot_id, + (xlrec->flags & XLZ_HAS_UPDATE_UNDOTUPLE) ? 
'T' : 'F', + xlrec->new_offnum, + xlundohdr->blkprev); + } + else if (info == XLOG_ZHEAP_FREEZE_XACT_SLOT) + { + xl_zheap_freeze_xact_slot *xlrec = (xl_zheap_freeze_xact_slot *) rec; + + appendStringInfo(buf, "latest frozen xid %u nfrozen %u", + xlrec->lastestFrozenXid, xlrec->nFrozen); + } + else if (info == XLOG_ZHEAP_INVALID_XACT_SLOT) + { + uint16 nCompletedSlots = *(uint16 *) rec; + + appendStringInfo(buf, "completed_slots %u", nCompletedSlots); + } + else if (info == XLOG_ZHEAP_LOCK) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_lock *xlrec = (xl_zheap_lock *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "off %u, xid %u, trans_slot_id %u", + xlrec->offnum, xlrec->prev_xid, xlrec->trans_slot_id); + } +} + +void +zheap2_desc(StringInfo buf, XLogReaderState *record) +{ + char *rec = XLogRecGetData(record); + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + info &= XLOG_ZHEAP_OPMASK; + if (info == XLOG_ZHEAP_CONFIRM) + { + xl_zheap_confirm *xlrec = (xl_zheap_confirm *) rec; + + appendStringInfo(buf, "off %u: flags %u", xlrec->offnum, xlrec->flags); + } + else if (info == XLOG_ZHEAP_UNUSED) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_unused *xlrec = (xl_zheap_unused *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "remxid %u, trans_slot_id %u, blkprev %lu", + xlrec->latestRemovedXid, xlrec->trans_slot_id, + xlundohdr->blkprev); + } + else if (info == XLOG_ZHEAP_VISIBLE) + { + xl_zheap_visible *xlrec = (xl_zheap_visible *) rec; + + appendStringInfo(buf, "cutoff xid %u flags %d", + xlrec->cutoff_xid, xlrec->flags); + } +} + +const char * +zheap_identify(uint8 info) +{ + const char *id = NULL; + + switch (info & ~XLR_INFO_MASK) + { + case XLOG_ZHEAP_CLEAN: + id = "CLEAN"; + break; + case XLOG_ZHEAP_INSERT: + id = "INSERT"; + break; + case XLOG_ZHEAP_INSERT | XLOG_ZHEAP_INIT_PAGE: + id = "INSERT+INIT"; + break; + case XLOG_ZHEAP_DELETE: + id = "DELETE"; + break; + case XLOG_ZHEAP_UPDATE: + id = "UPDATE"; + break; + case XLOG_ZHEAP_UPDATE | XLOG_ZHEAP_INIT_PAGE: + id = "UPDATE+INIT"; + break; + case XLOG_ZHEAP_FREEZE_XACT_SLOT: + id = "FREEZE_XACT_SLOT"; + break; + case XLOG_ZHEAP_INVALID_XACT_SLOT: + id = "INVALID_XACT_SLOT"; + break; + case XLOG_ZHEAP_LOCK: + id = "LOCK"; + break; + case XLOG_ZHEAP_MULTI_INSERT: + id = "MULTI_INSERT"; + break; + case XLOG_ZHEAP_MULTI_INSERT | XLOG_ZHEAP_INIT_PAGE: + id = "MULTI_INSERT+INIT"; + break; + } + + return id; +} + +const char * +zheap2_identify(uint8 info) +{ + const char *id = NULL; + + switch (info & ~XLR_INFO_MASK) + { + case XLOG_ZHEAP_CONFIRM: + id = "CONFIRM"; + break; + case XLOG_ZHEAP_UNUSED: + id = "UNUSED"; + break; + case XLOG_ZHEAP_VISIBLE: + id = "VISIBLE"; + break; + } + + return id; +} diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c index 9368b56c4c..00aa180aee 100644 --- a/src/backend/access/transam/rmgr.c +++ b/src/backend/access/transam/rmgr.c @@ -18,8 +18,12 @@ #include "access/multixact.h" #include "access/nbtxlog.h" #include "access/spgxlog.h" +#include "access/tpd_xlog.h" +#include "access/undoaction_xlog.h" +#include "access/undolog_xlog.h" #include "access/xact.h" #include "access/xlog_internal.h" +#include "access/zheapam_xlog.h" #include "catalog/storage_xlog.h" #include "commands/dbcommands_xlog.h" #include "commands/sequence.h" diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index e65dccc6a2..46ec852ca5 100644 --- 
a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -915,6 +915,13 @@ typedef struct TwoPhaseFileHeader uint16 gidlen; /* length of the GID - GID follows the header */ XLogRecPtr origin_lsn; /* lsn of this record at origin node */ TimestampTz origin_timestamp; /* time of prepare at origin node */ + + /* + * We need the locations of start and end undo record pointers when rollbacks + * are to be performed for prepared transactions using zheap relations. + */ + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr end_urec_ptr[UndoPersistenceLevels]; } TwoPhaseFileHeader; /* @@ -989,7 +996,8 @@ save_state_data(const void *data, uint32 len) * Initializes data structure and inserts the 2PC file header record. */ void -StartPrepare(GlobalTransaction gxact) +StartPrepare(GlobalTransaction gxact, UndoRecPtr *start_urec_ptr, + UndoRecPtr *end_urec_ptr) { PGPROC *proc = &ProcGlobal->allProcs[gxact->pgprocno]; PGXACT *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno]; @@ -1020,6 +1028,11 @@ StartPrepare(GlobalTransaction gxact) hdr.database = proc->databaseId; hdr.prepared_at = gxact->prepared_at; hdr.owner = gxact->owner; + + /* save the start and end undo record pointers */ + memcpy(hdr.start_urec_ptr, start_urec_ptr, sizeof(hdr.start_urec_ptr)); + memcpy(hdr.end_urec_ptr, end_urec_ptr, sizeof(hdr.end_urec_ptr)); + hdr.nsubxacts = xactGetCommittedChildren(&children); hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels); hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels); @@ -1452,6 +1465,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit) RelFileNode *delrels; int ndelrels; SharedInvalidationMessage *invalmsgs; + int i; + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr end_urec_ptr[UndoPersistenceLevels]; /* * Validate the GID, and lock the GXACT to ensure that two backends do not @@ -1489,6 +1505,38 @@ FinishPreparedTransaction(const char *gid, bool isCommit) invalmsgs = (SharedInvalidationMessage *) bufptr; bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage)); + /* save the start and end undo record pointers */ + memcpy(start_urec_ptr, hdr->start_urec_ptr, sizeof(start_urec_ptr)); + memcpy(end_urec_ptr, hdr->end_urec_ptr, sizeof(end_urec_ptr)); + + /* + * Perform undo actions, if there are undologs for this transaction. + * We need to perform undo actions while we are still in transaction. + * Never push rollbacks of temp tables to undo worker. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (end_urec_ptr[i] != InvalidUndoRecPtr && !isCommit) + { + bool result = false; + uint64 rollback_size = 0; + + if (i != UNDO_TEMP) + rollback_size = end_urec_ptr[i] - start_urec_ptr[i]; + + if (rollback_size >= rollback_overflow_size * 1024 * 1024) + result = PushRollbackReq(end_urec_ptr[i], start_urec_ptr[i], InvalidOid); + + /* + * ZBORKED: set rellock = true, as we do *not* actually have all + * the locks, but that'll probably deadlock? 
+ */ + if (!result) + execute_undo_actions(end_urec_ptr[i], start_urec_ptr[i], true, + true, true); + } + } + /* compute latestXid among all children */ latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children); diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c index a5eb29e01a..79a217e25c 100644 --- a/src/backend/access/transam/varsup.c +++ b/src/backend/access/transam/varsup.c @@ -284,9 +284,21 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid) TransactionId xidStopLimit; TransactionId xidWrapLimit; TransactionId curXid; + TransactionId oldestXidHavingUndo; Assert(TransactionIdIsNormal(oldest_datfrozenxid)); + /* + * To determine the last safe xid that can be allocated, we need to + * consider oldestXidHavingUndo. The oldestXidHavingUndo will be only + * valid for zheap storage engine, so it won't impact any other storage + * engine. + */ + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + if (TransactionIdIsValid(oldestXidHavingUndo)) + oldest_datfrozenxid = Min(oldest_datfrozenxid, oldestXidHavingUndo); + /* * The place where we actually get into deep trouble is halfway around * from the oldest potentially-existing XID. (This calculation is @@ -354,6 +366,13 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid) curXid = ShmemVariableCache->nextXid; LWLockRelease(XidGenLock); + /* + * Fixme - The messages in below code need some adjustment for zheap. + * They should reflect that the system needs to discard the undo. We + * can add it once we have a pluggable storage API which might provide + * us some way to distinguish among differnt storage engines. + */ + /* Log the info */ ereport(DEBUG1, (errmsg("transaction ID wrap limit is %u, limited by database with OID %u", diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index d967400384..766ae2c3b5 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -30,6 +30,7 @@ #include "access/xlog.h" #include "access/xloginsert.h" #include "access/xlogutils.h" +#include "access/tpd.h" #include "catalog/namespace.h" #include "catalog/pg_enum.h" #include "catalog/storage.h" @@ -41,6 +42,7 @@ #include "libpq/pqsignal.h" #include "miscadmin.h" #include "pgstat.h" +#include "postmaster/undoloop.h" #include "replication/logical.h" #include "replication/logicallauncher.h" #include "replication/origin.h" @@ -66,6 +68,8 @@ #include "utils/timestamp.h" #include "pg_trace.h" +#define AtAbort_ResetUndoBuffers ResetUndoBuffers() +#define AtAbort_ResetTPDBuffers ResetTPDBuffers() /* * User-tweakable parameters @@ -188,8 +192,12 @@ typedef struct TransactionStateData bool prevXactReadOnly; /* entry-time xact r/o state */ bool startedInRecovery; /* did we start in recovery? */ bool didLogXid; /* has xid been included in WAL record? */ - int parallelModeLevel; /* Enter/ExitParallelMode counter */ - struct TransactionStateData *parent; /* back link to parent */ + int parallelModeLevel; /* Enter/ExitParallelMode counter */ + bool subXactLock; /* has lock created for subtransaction? 
*/ + /* start and end undo record location for each persistence level */ + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels]; + struct TransactionStateData *parent; /* back link to parent */ } TransactionStateData; typedef TransactionStateData *TransactionState; @@ -274,6 +282,20 @@ typedef struct SubXactCallbackItem static SubXactCallbackItem *SubXact_callbacks = NULL; +/* Location in undo log from where to start applying the undo actions. */ +static UndoRecPtr UndoActionStartPtr[UndoPersistenceLevels] = + {InvalidUndoRecPtr, + InvalidUndoRecPtr, + InvalidUndoRecPtr}; + +/* Location in undo log up to which undo actions need to be applied. */ +static UndoRecPtr UndoActionEndPtr[UndoPersistenceLevels] = + {InvalidUndoRecPtr, + InvalidUndoRecPtr, + InvalidUndoRecPtr}; + +/* Do we need to perform any undo actions? */ +static bool PerformUndoActions = false; /* local function prototypes */ static void AssignTransactionId(TransactionState s); @@ -616,6 +638,28 @@ AssignTransactionId(TransactionState s) } } +/* + * SetCurrentSubTransactionLocked + */ +void +SetCurrentSubTransactionLocked() +{ + TransactionState s = CurrentTransactionState; + + s->subXactLock = true; +} + +/* + * HasCurrentSubTransactionLock + */ +bool +HasCurrentSubTransactionLock() +{ + TransactionState s = CurrentTransactionState; + + return s->subXactLock; +} + /* * GetCurrentSubTransactionId */ @@ -627,6 +671,17 @@ GetCurrentSubTransactionId(void) return s->subTransactionId; } +/* + * GetCurrentTransactionResOwner + */ +ResourceOwner +GetCurrentTransactionResOwner(void) +{ + TransactionState s = CurrentTransactionState; + + return s->curTransactionOwner; +} + /* * SubTransactionIsActive * @@ -675,6 +730,15 @@ GetCurrentCommandId(bool used) return currentCommandId; } +/* + * GetCurrentCommandIdUsed + */ +bool +GetCurrentCommandIdUsed(void) +{ + return currentCommandIdUsed; +} + /* * SetParallelStartTimestamps * @@ -911,6 +975,24 @@ IsInParallelMode(void) return CurrentTransactionState->parallelModeLevel != 0; } +/* + * SetCurrentUndoLocation + */ +void +SetCurrentUndoLocation(UndoRecPtr urec_ptr) +{ + UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr)); + UndoPersistence upersistence = log->meta.persistence; + /* + * Set the start undo record pointer for first undo record in a + * subtransaction. + */ + if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence])) + CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr; + CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr; + +} + /* * CommandCounterIncrement */ @@ -1800,6 +1882,7 @@ StartTransaction(void) { TransactionState s; VirtualTransactionId vxid; + int i; /* * Let's just make sure the state stack is empty @@ -1878,6 +1961,14 @@ StartTransaction(void) nUnreportedXids = 0; s->didLogXid = false; + /* initialize undo record locations for the transaction */ + for(i = 0; i < UndoPersistenceLevels; i++) + { + s->start_urec_ptr[i] = InvalidUndoRecPtr; + s->latest_urec_ptr[i] = InvalidUndoRecPtr; + } + s->subXactLock = false; + /* * must initialize resource-management stuff first */ @@ -2152,6 +2243,10 @@ CommitTransaction(void) AtEOXact_ApplyLauncher(true); pgstat_report_xact_timestamp(0); + /* In single user mode, discard all the undo logs, once committed. 
*/ + if (!IsUnderPostmaster) + UndoLogDiscardAll(); + CurrentResourceOwner = NULL; ResourceOwnerDelete(TopTransactionResourceOwner); s->curTransactionOwner = NULL; @@ -2187,7 +2282,7 @@ CommitTransaction(void) * NB: if you change this routine, better look at CommitTransaction too! */ static void -PrepareTransaction(void) +PrepareTransaction(UndoRecPtr *start_urec_ptr, UndoRecPtr *end_urec_ptr) { TransactionState s = CurrentTransactionState; TransactionId xid = GetCurrentTransactionId(); @@ -2335,7 +2430,7 @@ PrepareTransaction(void) * PREPARED; in particular, pay attention to whether things should happen * before or after releasing the transaction's locks. */ - StartPrepare(gxact); + StartPrepare(gxact, start_urec_ptr, end_urec_ptr); AtPrepare_Notify(); AtPrepare_Locks(); @@ -2632,6 +2727,8 @@ AbortTransaction(void) AtEOXact_PgStat(false); AtEOXact_ApplyLauncher(false); pgstat_report_xact_timestamp(0); + AtAbort_ResetUndoBuffers; + AtAbort_ResetTPDBuffers; } /* @@ -2767,6 +2864,12 @@ void CommitTransactionCommand(void) { TransactionState s = CurrentTransactionState; + UndoRecPtr end_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + int i; + + memcpy(start_urec_ptr, s->start_urec_ptr, sizeof(start_urec_ptr)); + memcpy(end_urec_ptr, s->latest_urec_ptr, sizeof(end_urec_ptr)); switch (s->blockState) { @@ -2856,7 +2959,7 @@ CommitTransactionCommand(void) * return to the idle state. */ case TBLOCK_PREPARE: - PrepareTransaction(); + PrepareTransaction(start_urec_ptr, end_urec_ptr); s->blockState = TBLOCK_DEFAULT; break; @@ -2902,6 +3005,24 @@ CommitTransactionCommand(void) { CommitSubTransaction(); s = CurrentTransactionState; /* changed by pop */ + + /* + * Update the end undo record pointer if it's not valid with + * the currently popped transaction's end undo record pointer. + * This is particularly required when the first command of + * the transaction is of type which does not require an undo, + * e.g. savepoint x. + * Accordingly, update the start undo record pointer. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(end_urec_ptr[i])) + end_urec_ptr[i] = s->latest_urec_ptr[i]; + + if (UndoRecPtrIsValid(s->start_urec_ptr[i])) + start_urec_ptr[i] = s->start_urec_ptr[i]; + } + } while (s->blockState == TBLOCK_SUBCOMMIT); /* If we had a COMMIT command, finish off the main xact too */ if (s->blockState == TBLOCK_END) @@ -2913,7 +3034,7 @@ CommitTransactionCommand(void) else if (s->blockState == TBLOCK_PREPARE) { Assert(s->parent == NULL); - PrepareTransaction(); + PrepareTransaction(start_urec_ptr, end_urec_ptr); s->blockState = TBLOCK_DEFAULT; } else @@ -3007,7 +3128,18 @@ void AbortCurrentTransaction(void) { TransactionState s = CurrentTransactionState; + int i; + /* + * The undo actions are allowed to be executed at the end of statement + * execution when we are not in transaction block, otherwise they are + * executed when user explicitly ends the transaction. + * + * So if we are in a transaction block don't set the PerformUndoActions + * because this flag will be set when user explicitly issue rollback or + * rollback to savepoint. 
+ */ + PerformUndoActions = false; switch (s->blockState) { case TBLOCK_DEFAULT: @@ -3041,6 +3173,16 @@ AbortCurrentTransaction(void) AbortTransaction(); CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; + + /* + * We are outside the transaction block so remember the required + * information to perform undo actions and also set the + * PerformUndoActions so that we execute it before completing this + * command. + */ + PerformUndoActions = true; + memcpy (UndoActionStartPtr, s->latest_urec_ptr, sizeof(UndoActionStartPtr)); + memcpy (UndoActionEndPtr, s->start_urec_ptr, sizeof(UndoActionEndPtr)); break; /* @@ -3077,6 +3219,9 @@ AbortCurrentTransaction(void) AbortTransaction(); CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; + + /* Failed during commit, so we need to perform the undo actions. */ + PerformUndoActions = true; break; /* @@ -3096,6 +3241,9 @@ AbortCurrentTransaction(void) case TBLOCK_ABORT_END: CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; + + /* Failed during commit, so we need to perform the undo actions. */ + PerformUndoActions = true; break; /* @@ -3106,6 +3254,12 @@ AbortCurrentTransaction(void) AbortTransaction(); CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; + + /* + * Failed while executing the rollback command, need perform any + * pending undo actions. + */ + PerformUndoActions = true; break; /* @@ -3117,6 +3271,12 @@ AbortCurrentTransaction(void) AbortTransaction(); CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; + + /* + * Perform any pending actions if failed while preparing the + * transaction. + */ + PerformUndoActions = true; break; /* @@ -3139,6 +3299,17 @@ AbortCurrentTransaction(void) case TBLOCK_SUBCOMMIT: case TBLOCK_SUBABORT_PENDING: case TBLOCK_SUBRESTART: + /* + * If we are here and still UndoActionStartPtr is valid that means + * the subtransaction failed while executing the undo action, so + * store its undo action start point in parent so that parent can + * start its undo action from this point. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(UndoActionStartPtr[i])) + s->parent->latest_urec_ptr[i] = UndoActionStartPtr[i]; + } AbortSubTransaction(); CleanupSubTransaction(); AbortCurrentTransaction(); @@ -3155,6 +3326,109 @@ AbortCurrentTransaction(void) } } +/* + * XactPerformUndoActionsIfPending - Execute pending undo actions. + * + * If the parent transaction state is valid (when there is an error in the + * subtransaction and rollback to savepoint is executed), then allow to + * perform undo actions in it, otherwise perform them in a new transaction. + */ +void +XactPerformUndoActionsIfPending() +{ + TransactionState s = CurrentTransactionState; + uint64 rollback_size = 0; + bool new_xact = true, result = false, no_pending_action = true; + UndoRecPtr parent_latest_urec_ptr[UndoPersistenceLevels]; + int i = 0; + + if (!PerformUndoActions) + return; + + /* If there is no undo log for any persistence level, then return. */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(UndoActionStartPtr[i])) + { + no_pending_action = false; + break; + } + } + + if (no_pending_action) + { + PerformUndoActions = false; + return; + } + + /* + * Execute undo actions under parent transaction, if any. Otherwise start + * a new transaction. 
+ */ + if (GetTopTransactionIdIfAny() != InvalidTransactionId) + { + memcpy(parent_latest_urec_ptr, s->latest_urec_ptr, + sizeof (parent_latest_urec_ptr)); + new_xact = false; + } + + /* + * If this is a large rollback request then push it to undo-worker + * through RollbackHT, undo-worker will perform it's undo actions later. + * Never push the rollbacks for temp tables. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(UndoActionStartPtr[i])) + continue; + + if (i == UNDO_TEMP) + goto perform_rollback; + else + rollback_size = UndoActionStartPtr[i] - UndoActionEndPtr[i]; + + if (new_xact && rollback_size > rollback_overflow_size * 1024 * 1024) + result = PushRollbackReq(UndoActionStartPtr[i], UndoActionEndPtr[i], InvalidOid); + + if (!result) + { +perform_rollback: + if (new_xact) + { + TransactionState xact; + + /* Start a new transaction for performing the rollback */ + StartTransactionCommand(); + xact = CurrentTransactionState; + + /* + * Store the previous transactions start and end undo record + * pointers into this transaction's state so that if there is + * some error while performing undo actions we can restart + * from begining. + */ + memcpy(xact->start_urec_ptr, UndoActionEndPtr, + sizeof(UndoActionEndPtr)); + memcpy(xact->latest_urec_ptr, UndoActionStartPtr, + sizeof(UndoActionStartPtr)); + } + + execute_undo_actions(UndoActionStartPtr[i], UndoActionEndPtr[i], + new_xact, true, true); + + if (new_xact) + CommitTransactionCommand(); + else + { + /* Restore parent's state. */ + s->latest_urec_ptr[i] = parent_latest_urec_ptr[i]; + } + } + } + + PerformUndoActions = false; +} + /* * PreventInTransactionBlock * @@ -3556,6 +3830,10 @@ EndTransactionBlock(void) { TransactionState s = CurrentTransactionState; bool result = false; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels]; + int i ; + + memcpy(latest_urec_ptr, s->latest_urec_ptr, sizeof(latest_urec_ptr)); switch (s->blockState) { @@ -3601,6 +3879,16 @@ EndTransactionBlock(void) elog(FATAL, "EndTransactionBlock: unexpected state %s", BlockStateAsString(s->blockState)); s = s->parent; + + /* + * We are calculating latest_urec_ptr, even though its a commit + * case. This is to handle any error during the commit path. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(latest_urec_ptr[i])) + latest_urec_ptr[i] = s->latest_urec_ptr[i]; + } } if (s->blockState == TBLOCK_INPROGRESS) s->blockState = TBLOCK_END; @@ -3626,6 +3914,12 @@ EndTransactionBlock(void) elog(FATAL, "EndTransactionBlock: unexpected state %s", BlockStateAsString(s->blockState)); s = s->parent; + + for (i = 0; i < UndoPersistenceLevels; i++) + { + if(!UndoRecPtrIsValid(latest_urec_ptr[i])) + latest_urec_ptr[i] = s->latest_urec_ptr[i]; + } } if (s->blockState == TBLOCK_INPROGRESS) s->blockState = TBLOCK_ABORT_PENDING; @@ -3678,6 +3972,18 @@ EndTransactionBlock(void) break; } + /* + * We need to perform undo actions if the transaction is failed. Remember + * the required information to perform undo actions at the end of + * statement execution. 
+ */ + if (!result) + PerformUndoActions = true; + + memcpy(UndoActionStartPtr, latest_urec_ptr, sizeof(UndoActionStartPtr)); + memcpy(UndoActionEndPtr, TopTransactionStateData.start_urec_ptr, + sizeof(UndoActionEndPtr)); + return result; } @@ -3691,6 +3997,10 @@ void UserAbortTransactionBlock(void) { TransactionState s = CurrentTransactionState; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels]; + int i ; + + memcpy(latest_urec_ptr, s->latest_urec_ptr, sizeof(latest_urec_ptr)); switch (s->blockState) { @@ -3729,6 +4039,12 @@ UserAbortTransactionBlock(void) elog(FATAL, "UserAbortTransactionBlock: unexpected state %s", BlockStateAsString(s->blockState)); s = s->parent; + for(i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(latest_urec_ptr[i])) + latest_urec_ptr[i] = s->latest_urec_ptr[i]; + } + } if (s->blockState == TBLOCK_INPROGRESS) s->blockState = TBLOCK_ABORT_PENDING; @@ -3786,6 +4102,54 @@ UserAbortTransactionBlock(void) BlockStateAsString(s->blockState)); break; } + + /* + * Remember the required information for performing undo actions. So that + * if there is any failure in executing the undo action we can execute + * it later. + */ + memcpy (UndoActionStartPtr, latest_urec_ptr, sizeof(UndoActionStartPtr)); + memcpy (UndoActionEndPtr, s->start_urec_ptr, sizeof(UndoActionEndPtr)); + + /* + * If we are in a valid transaction state then execute the undo action here + * itself, otherwise we have already stored the required information for + * executing the undo action later. + */ + if (CurrentTransactionState->state == TRANS_INPROGRESS) + { + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (latest_urec_ptr[i]) + { + if (i == UNDO_TEMP) + execute_undo_actions(UndoActionStartPtr[i], UndoActionEndPtr[i], + false, true, true); + else + { + uint64 size = latest_urec_ptr[i] - s->start_urec_ptr[i]; + bool result = false; + + /* + * If this is a large rollback request then push it to undo-worker + * through RollbackHT, undo-worker will perform it's undo actions + * later. + */ + if (size >= rollback_overflow_size * 1024 * 1024) + result = PushRollbackReq(UndoActionStartPtr[i], UndoActionEndPtr[i], InvalidOid); + + if (!result) + { + execute_undo_actions(UndoActionStartPtr[i], UndoActionEndPtr[i], + true, true, true); + UndoActionStartPtr[i] = InvalidUndoRecPtr; + } + } + } + } + } + else + PerformUndoActions = true; } /* @@ -3935,6 +4299,12 @@ ReleaseSavepoint(const char *name) TransactionState s = CurrentTransactionState; TransactionState target, xact; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + int i = 0; + + memcpy(latest_urec_ptr, s->latest_urec_ptr, sizeof(latest_urec_ptr)); + memcpy(start_urec_ptr, s->start_urec_ptr, sizeof(start_urec_ptr)); /* * Workers synchronize transaction state at the beginning of each parallel @@ -4028,8 +4398,34 @@ ReleaseSavepoint(const char *name) if (xact == target) break; xact = xact->parent; + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(latest_urec_ptr[i])) + latest_urec_ptr[i] = xact->latest_urec_ptr[i]; + + if (UndoRecPtrIsValid(xact->start_urec_ptr[i])) + start_urec_ptr[i] = xact->start_urec_ptr[i]; + } + + Assert(PointerIsValid(xact)); } + + /* + * Before cleaning up the current sub transaction state, overwrite parent + * transaction's latest_urec_ptr with current transaction's latest_urec_ptr + * so that in case parent transaction get aborted we will not skip + * performing undo for this transaction. 
Also set the start_urec_ptr if + * parent start_urec_ptr is not valid. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(latest_urec_ptr[i])) + xact->parent->latest_urec_ptr[i] = latest_urec_ptr[i]; + if (!UndoRecPtrIsValid(xact->parent->start_urec_ptr[i])) + xact->parent->start_urec_ptr[i] = start_urec_ptr[i]; + } + } /* @@ -4044,6 +4440,12 @@ RollbackToSavepoint(const char *name) TransactionState s = CurrentTransactionState; TransactionState target, xact; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + int i = 0; + + memcpy(latest_urec_ptr, s->latest_urec_ptr, sizeof(latest_urec_ptr)); + memcpy(start_urec_ptr, s->start_urec_ptr, sizeof(start_urec_ptr)); /* * Workers synchronize transaction state at the beginning of each parallel @@ -4143,6 +4545,15 @@ RollbackToSavepoint(const char *name) BlockStateAsString(xact->blockState)); xact = xact->parent; Assert(PointerIsValid(xact)); + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(latest_urec_ptr[i])) + latest_urec_ptr[i] = xact->latest_urec_ptr[i]; + + if (UndoRecPtrIsValid(xact->start_urec_ptr[i])) + start_urec_ptr[i] = xact->start_urec_ptr[i]; + } + } /* And mark the target as "restart pending" */ @@ -4153,6 +4564,34 @@ RollbackToSavepoint(const char *name) else elog(FATAL, "RollbackToSavepoint: unexpected state %s", BlockStateAsString(xact->blockState)); + + /* + * Remember the required information for performing undo actions. So that + * if there is any failure in executing the undo action we can execute + * it later. + */ + memcpy (UndoActionStartPtr, latest_urec_ptr, sizeof(UndoActionStartPtr)); + memcpy (UndoActionEndPtr, start_urec_ptr, sizeof(UndoActionEndPtr)); + + /* + * If we are in a valid transaction state then execute the undo action here + * itself, otherwise we have already stored the required information for + * executing the undo action later. + */ + if (s->state == TRANS_INPROGRESS) + { + for ( i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(latest_urec_ptr[i])) + { + execute_undo_actions(latest_urec_ptr[i], start_urec_ptr[i], false, true, false); + xact->latest_urec_ptr[i] = InvalidUndoRecPtr; + UndoActionStartPtr[i] = InvalidUndoRecPtr; + } + } + } + else + PerformUndoActions = true; } /* @@ -4240,6 +4679,7 @@ void ReleaseCurrentSubTransaction(void) { TransactionState s = CurrentTransactionState; + int i; /* * Workers synchronize transaction state at the beginning of each parallel @@ -4258,6 +4698,22 @@ ReleaseCurrentSubTransaction(void) BlockStateAsString(s->blockState)); Assert(s->state == TRANS_INPROGRESS); MemoryContextSwitchTo(CurTransactionContext); + + /* + * Before cleaning up the current sub transaction state, overwrite parent + * transaction's latest_urec_ptr with current transaction's latest_urec_ptr + * so that in case parent transaction get aborted we will not skip + * performing undo for this transaction. 
+ */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(s->latest_urec_ptr[i])) + s->parent->latest_urec_ptr[i] = s->latest_urec_ptr[i]; + + if (!UndoRecPtrIsValid(s->parent->start_urec_ptr[i])) + s->parent->start_urec_ptr[i] = s->start_urec_ptr[i]; + } + CommitSubTransaction(); s = CurrentTransactionState; /* changed by pop */ Assert(s->state == TRANS_INPROGRESS); @@ -4274,6 +4730,14 @@ void RollbackAndReleaseCurrentSubTransaction(void) { TransactionState s = CurrentTransactionState; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels] = {InvalidUndoRecPtr, + InvalidUndoRecPtr, + InvalidUndoRecPtr}; + UndoRecPtr start_urec_ptr[UndoPersistenceLevels] = {InvalidUndoRecPtr, + InvalidUndoRecPtr, + InvalidUndoRecPtr}; + UndoRecPtr parent_latest_urec_ptr[UndoPersistenceLevels]; + int i; /* * Unlike ReleaseCurrentSubTransaction(), this is nominally permitted @@ -4320,6 +4784,19 @@ RollbackAndReleaseCurrentSubTransaction(void) if (s->blockState == TBLOCK_SUBINPROGRESS) AbortSubTransaction(); + /* + * Remember the required information to perform undo actions before + * cleaning up the subtransaction state. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(s->latest_urec_ptr[i])) + { + latest_urec_ptr[i] = s->latest_urec_ptr[i]; + start_urec_ptr[i] = s->start_urec_ptr[i]; + } + } + /* And clean it up, too */ CleanupSubTransaction(); @@ -4328,6 +4805,30 @@ RollbackAndReleaseCurrentSubTransaction(void) s->blockState == TBLOCK_INPROGRESS || s->blockState == TBLOCK_IMPLICIT_INPROGRESS || s->blockState == TBLOCK_STARTED); + + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(latest_urec_ptr[i])) + { + parent_latest_urec_ptr[i] = s->latest_urec_ptr[i]; + + /* + * Store the undo action start point in the parent state so that + * we can apply undo actions these undos also during rollback of + * parent transaction in case of error while applying the undo + * actions. + */ + s->latest_urec_ptr[i] = latest_urec_ptr[i]; + execute_undo_actions(latest_urec_ptr[i], start_urec_ptr[i], false, + true, true); + + /* Restore parent state. */ + s->latest_urec_ptr[i] = parent_latest_urec_ptr[i]; + } + } + + /* Successfully performed undo actions so reset the flag. 
*/ + PerformUndoActions = false; } /* @@ -4541,6 +5042,7 @@ static void StartSubTransaction(void) { TransactionState s = CurrentTransactionState; + int i; if (s->state != TRANS_DEFAULT) elog(WARNING, "StartSubTransaction while in %s state", @@ -4558,6 +5060,14 @@ StartSubTransaction(void) AtSubStart_Notify(); AfterTriggerBeginSubXact(); + /* initialize undo record locations for the transaction */ + for(i = 0; i < UndoPersistenceLevels; i++) + { + s->start_urec_ptr[i] = InvalidUndoRecPtr; + s->latest_urec_ptr[i] = InvalidUndoRecPtr; + } + + s->subXactLock = false; s->state = TRANS_INPROGRESS; /* diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c80b14ed97..5b37702b7c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -31,6 +31,7 @@ #include "access/transam.h" #include "access/tuptoaster.h" #include "access/twophase.h" +#include "access/undolog.h" #include "access/xact.h" #include "access/xlog_internal.h" #include "access/xloginsert.h" @@ -972,6 +973,7 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt); XLogRecPtr XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn, + XLogRecPtr OldRedoRecPtr, uint8 flags) { XLogCtlInsert *Insert = &XLogCtl->Insert; @@ -1066,6 +1068,21 @@ XLogInsertRecord(XLogRecData *rdata, return InvalidXLogRecPtr; } + /* + * If the redo point is changed and wal need to include the undo attach + * information i.e. (this is the first WAL which after the checkpoint). + * then return from here so that the caller can restart. + */ + if (rechdr->xl_rmid == RM_ZHEAP_ID && + OldRedoRecPtr != InvalidXLogRecPtr && + OldRedoRecPtr != RedoRecPtr && + NeedUndoMetaLog(RedoRecPtr)) + { + WALInsertLockRelease(); + END_CRIT_SECTION(); + return InvalidXLogRecPtr; + } + /* * Reserve space for the record in the WAL. This also sets the xl_prev * pointer. 
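The early return above lets a zheap caller notice that the redo point moved after it decided whether to include undo meta-data in the record, and restart record assembly. A hedged sketch of that caller-side pattern, using the XLogInsertExtended() wrapper added later in this patch; the registration comments stand in for whatever data the specific operation logs:

    uint8       info = XLOG_ZHEAP_INSERT;   /* whichever zheap opcode is being logged */
    XLogRecPtr  recptr;
    XLogRecPtr  redo_ptr;
    bool        do_page_writes;

    do
    {
        /* Capture the redo point and full-page-write setting to assemble against. */
        GetFullPageWriteInfo(&redo_ptr, &do_page_writes);

        XLogBeginInsert();
        /*
         * ... XLogRegisterBuffer()/XLogRegisterData() calls go here; the undo
         * log meta-data is registered only when NeedUndoMetaLog(redo_ptr)
         * reports that this is the first such record since the redo point ...
         */

        /* Returns InvalidXLogRecPtr if the redo point changed underneath us. */
        recptr = XLogInsertExtended(RM_ZHEAP_ID, info, redo_ptr, do_page_writes);
    } while (recptr == InvalidXLogRecPtr);

This mirrors the retry loop inside XLogInsert() itself, except that the decision about what to register is pushed out to the caller.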
@@ -4493,6 +4510,8 @@ WriteControlFile(void) ControlFile->float4ByVal = FLOAT4PASSBYVAL; ControlFile->float8ByVal = FLOAT8PASSBYVAL; + ControlFile->zheap_page_trans_slots = ZHEAP_PAGE_TRANS_SLOTS; + /* Contents are protected with a CRC */ INIT_CRC32C(ControlFile->crc); COMP_CRC32C(ControlFile->crc, @@ -4725,6 +4744,13 @@ ReadControlFile(void) " but the server was compiled without USE_FLOAT8_BYVAL."), errhint("It looks like you need to recompile or initdb."))); #endif + if (ControlFile->zheap_page_trans_slots != ZHEAP_PAGE_TRANS_SLOTS) + ereport(FATAL, + (errmsg("database files are incompatible with server"), + errdetail("The database cluster was initialized with ZHEAP_PAGE_TRANS_SLOTS %d," + " but the server was compiled with ZHEAP_PAGE_TRANS_SLOTS %d.", + ControlFile->zheap_page_trans_slots, (int) ZHEAP_PAGE_TRANS_SLOTS), + errhint("It looks like you need to recompile or initdb."))); wal_segment_size = ControlFile->xlog_seg_size; @@ -5169,6 +5195,7 @@ BootStrapXLOG(void) checkPoint.newestCommitTsXid = InvalidTransactionId; checkPoint.time = (pg_time_t) time(NULL); checkPoint.oldestActiveXid = InvalidTransactionId; + checkPoint.oldestXidWithEpochHavingUndo = InvalidTransactionId; ShmemVariableCache->nextXid = checkPoint.nextXid; ShmemVariableCache->nextOid = checkPoint.nextOid; @@ -6603,6 +6630,10 @@ StartupXLOG(void) (errmsg_internal("commit timestamp Xid oldest/newest: %u/%u", checkPoint.oldestCommitTsXid, checkPoint.newestCommitTsXid))); + ereport(DEBUG1, + (errmsg_internal("oldest xid with epoch having undo: " UINT64_FORMAT, + checkPoint.oldestXidWithEpochHavingUndo))); + if (!TransactionIdIsNormal(checkPoint.nextXid)) ereport(PANIC, (errmsg("invalid next transaction ID"))); @@ -6620,6 +6651,10 @@ StartupXLOG(void) XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch; XLogCtl->ckptXid = checkPoint.nextXid; + /* Read oldest xid having undo from checkpoint and set in proc global. */ + pg_atomic_write_u64(&ProcGlobal->oldestXidWithEpochHavingUndo, + checkPoint.oldestXidWithEpochHavingUndo); + /* * Initialize replication slots, before there's a chance to remove * required resources. @@ -6693,6 +6728,9 @@ StartupXLOG(void) */ restoreTwoPhaseData(); + /* Recover undo log meta data corresponding to this checkpoint. */ + StartupUndoLogs(ControlFile->checkPointCopy.redo); + lastFullPageWrites = checkPoint.fullPageWrites; RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo; @@ -7315,7 +7353,13 @@ StartupXLOG(void) * end-of-recovery steps fail. */ if (InRecovery) + { ResetUnloggedRelations(UNLOGGED_RELATION_INIT); + ResetUndoLogs(UNDO_UNLOGGED); + } + + /* Always reset temporary undo logs. */ + ResetUndoLogs(UNDO_TEMP); /* * We don't need the latch anymore. It's not strictly necessary to disown @@ -8312,6 +8356,35 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch) *epoch = ckptXidEpoch; } +/* + * GetEpochForXid - get the epoch associated with the xid + */ +uint32 +GetEpochForXid(TransactionId xid) +{ + uint32 ckptXidEpoch; + TransactionId ckptXid; + + SpinLockAcquire(&XLogCtl->info_lck); + ckptXidEpoch = XLogCtl->ckptXidEpoch; + ckptXid = XLogCtl->ckptXid; + SpinLockRelease(&XLogCtl->info_lck); + + /* Xid can be on either side when near wrap-around. Xid is certainly + * logically later than ckptXid. So if it's numerically less, it must + * have wrapped into the next epoch. OTOH, if it is numerically more, + * but logically lesser, then it belongs to previous epoch. 
+ */ + if (xid > ckptXid && + TransactionIdPrecedes(xid, ckptXid)) + ckptXidEpoch--; + else if (xid < ckptXid && + TransactionIdFollows(xid, ckptXid)) + ckptXidEpoch++; + + return ckptXidEpoch; +} + /* * This must be called ONCE during postmaster or standalone-backend shutdown */ @@ -8752,6 +8825,9 @@ CreateCheckPoint(int flags) checkPoint.nextOid += ShmemVariableCache->oidCount; LWLockRelease(OidGenLock); + checkPoint.oldestXidWithEpochHavingUndo = + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo); + MultiXactGetCheckptMulti(shutdown, &checkPoint.nextMulti, &checkPoint.nextMultiOffset, @@ -9020,6 +9096,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags) CheckPointSnapBuild(); CheckPointLogicalRewriteHeap(); CheckPointBuffers(flags); /* performs all required fsyncs */ + CheckPointUndoLogs(checkPointRedo, ControlFile->checkPointCopy.redo); CheckPointReplicationOrigin(); /* We deliberately delay 2PC checkpointing as long as possible */ CheckPointTwoPhase(checkPointRedo); @@ -9661,6 +9738,9 @@ xlog_redo(XLogReaderState *record) MultiXactAdvanceOldest(checkPoint.oldestMulti, checkPoint.oldestMultiDB); + pg_atomic_write_u64(&ProcGlobal->oldestXidWithEpochHavingUndo, + checkPoint.oldestXidWithEpochHavingUndo); + /* * No need to set oldestClogXid here as well; it'll be set when we * redo an xl_clog_truncate if it changed since initialization. @@ -9719,6 +9799,8 @@ xlog_redo(XLogReaderState *record) /* ControlFile->checkPointCopy always tracks the latest ckpt XID */ ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch; ControlFile->checkPointCopy.nextXid = checkPoint.nextXid; + ControlFile->checkPointCopy.oldestXidWithEpochHavingUndo = + checkPoint.oldestXidWithEpochHavingUndo; /* Update shared-memory copy of checkpoint XID/epoch */ SpinLockAcquire(&XLogCtl->info_lck); @@ -9726,6 +9808,12 @@ xlog_redo(XLogReaderState *record) XLogCtl->ckptXid = checkPoint.nextXid; SpinLockRelease(&XLogCtl->info_lck); + ControlFile->checkPointCopy.oldestXidWithEpochHavingUndo = + checkPoint.oldestXidWithEpochHavingUndo; + + /* Write an undo log metadata snapshot. */ + CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo); + /* * We should've already switched to the new TLI before replaying this * record. @@ -9765,6 +9853,9 @@ xlog_redo(XLogReaderState *record) MultiXactAdvanceNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset); + pg_atomic_write_u64(&ProcGlobal->oldestXidWithEpochHavingUndo, + checkPoint.oldestXidWithEpochHavingUndo); + /* * NB: This may perform multixact truncation when replaying WAL * generated by an older primary. @@ -9785,6 +9876,9 @@ xlog_redo(XLogReaderState *record) XLogCtl->ckptXid = checkPoint.nextXid; SpinLockRelease(&XLogCtl->info_lck); + /* Write an undo log metadata snapshot. 
*/ + CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo); + /* TLI should not change in an on-line checkpoint */ if (checkPoint.ThisTimeLineID != ThisTimeLineID) ereport(PANIC, diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index 034d5b3b62..ce6a8c1a83 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -459,7 +459,8 @@ XLogInsert(RmgrId rmid, uint8 info) rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites, &fpw_lsn); - EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags); + EndPos = XLogInsertRecord(rdt, fpw_lsn, InvalidXLogRecPtr, + curinsert_flags); } while (EndPos == InvalidXLogRecPtr); XLogResetInsertion(); @@ -467,6 +468,63 @@ XLogInsert(RmgrId rmid, uint8 info) return EndPos; } +/* + * XLogInsertExtended + * Like XLogInsert, but with extra options. + * + * The internal logic of this function is almost same as XLogInsert, but there + * are some differences: unlike XLogInsert, this function will not retry for WAL + * insert if the page image inclusion decision got changed instead it will + * return immediately, and it will not calculate the latest value of RedoRecPtr + * like XLogInsert, instead it will take that as input from caller so that if + * the caller has not included the tuple info (because page image is not present + * in the WAL) it can start over again if including page image decision got + * changed later during WAL insertion. + */ +XLogRecPtr +XLogInsertExtended(RmgrId rmid, uint8 info, XLogRecPtr RedoRecPtr, + bool doPageWrites) +{ + XLogRecPtr EndPos; + XLogRecPtr fpw_lsn; + XLogRecData *rdt; + + /* XLogBeginInsert() must have been called. */ + if (!begininsert_called) + elog(ERROR, "XLogBeginInsert was not called"); + + /* + * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and + * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me. + */ + if ((info & ~(XLR_RMGR_INFO_MASK | + XLR_SPECIAL_REL_UPDATE | + XLR_CHECK_CONSISTENCY)) != 0) + elog(PANIC, "invalid xlog info mask %02X", info); + + TRACE_POSTGRESQL_WAL_INSERT(rmid, info); + + /* + * In bootstrap mode, we don't actually log anything but XLOG resources; + * return a phony record pointer. + */ + if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID) + { + XLogResetInsertion(); + EndPos = SizeOfXLogLongPHD; /* start of 1st chkpt record */ + return EndPos; + } + + rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites, + &fpw_lsn); + + EndPos = XLogInsertRecord(rdt, fpw_lsn, RedoRecPtr, curinsert_flags); + + XLogResetInsertion(); + + return EndPos; +} + /* * Assemble a WAL record from the registered data and buffers into an * XLogRecData chain, ready for insertion with XLogInsertRecord(). @@ -783,8 +841,14 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, * Fill in the fields in the record header. Prev-link is filled in later, * once we know where in the WAL the record will be inserted. The CRC does * not include the record header yet. + * + * Since zheap storage always use TopTransactionId, if this xlog is for the + * zheap then get the TopTransactionId. 
*/ - rechdr->xl_xid = GetCurrentTransactionIdIfAny(); + if (rmid == RM_ZHEAP_ID) + rechdr->xl_xid = GetTopTransactionIdIfAny(); + else + rechdr->xl_xid = GetCurrentTransactionIdIfAny(); rechdr->xl_tot_len = total_len; rechdr->xl_info = info; rechdr->xl_rmid = rmid; diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 4ecdc9220f..8718a4fdfc 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -462,7 +462,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, { /* page exists in file */ buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, - mode, NULL); + mode, NULL, RELPERSISTENCE_PERMANENT); } else { @@ -487,7 +487,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, ReleaseBuffer(buffer); } buffer = ReadBufferWithoutRelcache(rnode, forknum, - P_NEW, mode, NULL); + P_NEW, mode, NULL, + RELPERSISTENCE_PERMANENT); } while (BufferGetBlockNumber(buffer) < blkno); /* Handle the corner case that P_NEW returns non-consecutive pages */ @@ -497,7 +498,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, LockBuffer(buffer, BUFFER_LOCK_UNLOCK); ReleaseBuffer(buffer); buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, - mode, NULL); + mode, NULL, + RELPERSISTENCE_PERMANENT); } } diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile new file mode 100644 index 0000000000..585f15cdb1 --- /dev/null +++ b/src/backend/access/undo/Makefile @@ -0,0 +1,17 @@ +#------------------------------------------------------------------------- +# +# Makefile-- +# Makefile for access/undo +# +# IDENTIFICATION +# src/backend/access/undo/Makefile +# +#------------------------------------------------------------------------- + +subdir = src/backend/access/undo +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global + +OBJS = undodiscard.o undoinsert.o undolog.o undorecord.o undoaction.o undoactionxlog.o + +include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/undo/README b/src/backend/access/undo/README new file mode 100644 index 0000000000..9ba81f960d --- /dev/null +++ b/src/backend/access/undo/README @@ -0,0 +1,169 @@ +src/backend/access/undo/README + +Undo Logs +========= + +The undo log subsystem provides a way to store data that is needed for +a limited time. Undo data is generated whenever zheap relations are +modified, but it is only useful until (1) the generating transaction +is committed or rolled back and (2) there is no snapshot that might +need it for MVCC purposes. See src/backend/access/zheap/README for +more information on zheap. The undo log subsystem is concerned with +raw storage optimized for efficient recycling and buffered random +access. + +Like redo data (the WAL), undo data consists of records identified by +their location within a 64 bit address space. Unlike redo data, the +addressing space is internally divided up unto multiple numbered logs. +The first 24 bits of an UndoRecPtr identify the undo log number, and +the remaining 40 bits address the space within that undo log. Higher +level code (zheap) is largely oblivious to this internal structure and +deals mostly in opaque UndoRecPtr values. + +Using multiple undo logs instead of a single uniform space avoids the +contention that would result from a single insertion point, since each +session can be given sole access to write data into a given undo log. +It also allows for parallelized space reclamation. 
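To make the 24/40 bit split described above concrete, here is a minimal sketch of how an UndoRecPtr decomposes. The real accessors are macros in undolog.h (UndoRecPtrGetLogNo is used elsewhere in this patch); the helper names and constant names below are illustrative only.

    #define UNDO_LOG_NUMBER_BITS   24
    #define UNDO_LOG_OFFSET_BITS   40      /* 24 + 40 = 64 */

    /* High-order 24 bits: which undo log the address lives in. */
    static inline uint32
    sketch_urp_logno(UndoRecPtr urp)
    {
        return (uint32) (urp >> UNDO_LOG_OFFSET_BITS);
    }

    /* Low-order 40 bits: byte position within that undo log. */
    static inline uint64
    sketch_urp_offset(UndoRecPtr urp)
    {
        return urp & ((UINT64_C(1) << UNDO_LOG_OFFSET_BITS) - 1);
    }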
+
+Like redo data, undo data is stored on disk in numbered segment files
+that are recycled as required.  Unlike redo data, undo data is
+accessed through the buffer pool.  In this respect it is similar to
+regular relation data.  Buffer content is written out to disk during
+checkpoints and whenever it is evicted to make space for another page.
+However, unlike regular relation data, undo data has a chance of never
+being written to disk at all: if a page is allocated and then
+later discarded without an intervening checkpoint and without an
+eviction provoked by memory pressure, then no disk IO is generated.
+
+Keeping the undo data physically separate from redo data and accessing
+it through the existing shared buffers mechanism allows it to be
+accessed efficiently for MVCC purposes.
+
+Meta-Data
+=========
+
+At any given time the set of undo logs that exists is tracked in
+shared memory and can be inspected in the pg_stat_undo_logs view.  For
+each undo log, a set of properties called the undo log's meta-data is
+tracked:
+
+* the tablespace that holds its segment files
+* the persistence level (permanent, unlogged, temporary)
+* the "discard" pointer; data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point
+
+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted.  At all times discard <=
+insert <= end.  When discard == insert, the undo log is empty
+(everything that has ever been inserted has since been discarded).
+The insert pointer advances when regular backends allocate new space,
+and the discard pointer usually advances when an undo worker process
+determines that no session could need the data either for rollback or
+for finding old versions of tuples to satisfy a snapshot.  In some
+special cases, including single-user mode and temporary undo logs, the
+discard pointer might also be advanced synchronously by a foreground
+session.
+
+In order to provide constant time access to undo log meta-data given
+an UndoRecPtr, there is conceptually an array of UndoLogControl
+objects indexed by undo log number.  Since that array would be too
+large and since we expect the set of active undo log numbers to be
+small and clustered, we only keep small ranges of that logical array
+in memory at a time.  We use the higher order bits of the undo log
+number to identify a 'bank' (array fragment), and then the lower order
+bits to identify a slot within the bank.  Each bank is backed by a DSM
+segment.  We expect to need just 1 or 2 such DSM segments to exist at
+any time.
+
+The meta-data for all undo logs is written to disk at every
+checkpoint.  It is stored in files under PGDATA/pg_undo/, using the
+checkpoint's redo point (a WAL LSN) as its filename.  At startup time,
+the redo point's file can be used to restore all undo logs' meta-data
+as of the moment of the redo point into shared memory.  Changes to the
+discard pointer and end pointer are WAL-logged by undolog.c and will
+bring the in-memory meta-data up to date in the event of recovery
+after a crash.  Changes to insert pointers are included in other WAL
+records (see below).
+
+Responsibility for creating, deleting and recycling undo log segment
+files and WAL logging the associated meta-data changes lies with
+src/backend/storage/undo/undolog.c.
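Taken together with the xl_undolog_meta fields printed by undologdesc.c earlier in this patch (insert, last_xact_start, prevlen, is_first_rec), the per-log meta-data described in this section amounts to roughly the following; the layout and field types are assumed, the authoritative definition lives in undolog.h:

    typedef struct UndoLogMetaDataSketch
    {
        Oid         tablespace;         /* tablespace holding the segment files */
        char        persistence;        /* permanent, unlogged or temporary */
        uint64      discard;            /* everything before this has been discarded */
        uint64      insert;             /* new data is written here */
        uint64      end;                /* a new segment file is needed beyond this */
        uint64      last_xact_start;    /* as carried in xl_undolog_meta */
        uint16      prevlen;            /* as carried in xl_undolog_meta */
        bool        is_first_rec;       /* as carried in xl_undolog_meta */
    } UndoLogMetaDataSketch;

    /*
     * All three pointers move strictly forwards and always satisfy:
     *     discard <= insert <= end
     */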
+
+Persistence Levels and Tablespaces
+==================================
+
+When new undo log space is requested by client code, the persistence
+level of the relation being modified and the current value of the GUC
+"undo_tablespaces" control which undo log is selected.  If the
+session is already attached to a suitable undo log and it hasn't run
+out of address space, it can be used immediately.  Otherwise a
+suitable undo log must be either found or created.  The system should
+stabilize on one undo log per active writing backend (or more if
+different tablespaces or persistence levels are used).
+
+When an unlogged relation is modified, undo data generated by the
+operation must be stored in an unlogged undo log.  This causes the
+undo data to be deleted along with all unlogged relations during
+recovery from a non-shutdown checkpoint.  Likewise, temporary
+relations require special treatment: their buffers are backend-local
+and they cannot be accessed by other backends, including undo workers.
+
+Non-empty undo logs in a tablespace prevent the tablespace from being
+dropped.
+
+Undo Log Contents
+=================
+
+Undo log contents are written into 1MB segment files under
+PGDATA/base/undo/ or PGDATA/pg_tblspc/VERSION/undo/ using filenames
+that encode the address (UndoRecPtr) of their first byte.  A period
+'.' separates the undo log number part from the offset part, for the
+benefit of human administrators.
+
+Undo logs are page-oriented and use regular PostgreSQL page headers
+including checksums (if enabled) and LSNs.  An UndoRecPtr can be used
+to obtain a buffer and an offset within the buffer, and then regular
+buffer locking and page LSN rules apply.  While space is allocated by
+asking for a given number of usable bytes (not including page
+headers), client code is responsible for stepping over the page
+headers and advancing to the next page.
+
+Responsibility for WAL-logging the contents of the undo log lies with
+client code (i.e. zheap).  While undolog.c WAL-logs all meta-data
+changes except insert points, and checkpoints all meta-data including
+insert points, client code is responsible for allocating undo log
+space in the same sequence at recovery time.  This avoids having to
+WAL-log insertion points explicitly and separately for every insertion
+into an undo log, greatly reducing WAL traffic.  (WAL is still
+generated by undolog.c whenever a 1MB segment boundary is crossed,
+since that also advances the end pointer.)
+
+One complication of this scheme for implicit insert pointer movement
+is that recovery doesn't naturally have access to the association
+between transactions and undo logs.  That is, while 'do' sessions have
+a currently attached undo log from which they allocate new space,
+recovery is performed by a single startup process which has no concept
+of the sessions that generated the WAL it is replaying.  For that
+reason, an xid->undo log number map is maintained at recovery time.
+At 'do' time, a WAL record is emitted the first time any permanent
+undo log is used in a given transaction, so that the mapping can be
+recovered at redo time.  That allows a stream of allocations to be
+directed to the appropriate undo logs so that the same resulting
+stream of undo log pointers can be produced.  (Unlogged and temporary
+undo logs don't have this problem since they aren't used at recovery
+time.)
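As a sketch of the segment naming rule just described: the 1MB segment size, the directory layout and the period separator come from this README, while the field widths and the hexadecimal formatting below are assumptions.

    #define UNDO_SEGMENT_SIZE   (1024 * 1024)   /* 1MB segments, per this README */

    /*
     * Build the path of the segment containing a given in-log offset, for an
     * undo log in the default tablespace, e.g. "base/undo/000001.0000700000".
     */
    static void
    sketch_undo_segment_path(uint32 logno, uint64 offset, char *buf, size_t len)
    {
        uint64      seg_start = offset - (offset % UNDO_SEGMENT_SIZE);

        snprintf(buf, len, "base/undo/%06X.%010llX",
                 logno, (unsigned long long) seg_start);
    }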
+ +Another complication is that the checkpoint files written under pg_undo +may contain inconsistent data during recovery from an online checkpoint +(after a crash or base backup). To compensate for this, client code +must arrange to log an undo log meta-data record when inserting the +first WAL record that might cause undo log access during recovery. +This is conceptually similar to full page images after checkpoints, +but limited to one meta-data WAL record per undo log per checkpoint. + +src/backend/storage/buffer/bufmgr.c is unaware of the existence of +undo log as a separate category of buffered data. Reading and writing +of buffered undo log pages is handled by a new storage manager in +src/backend/storage/smgr/undo_file.c. See +src/backend/storage/smgr/README for more details. diff --git a/src/backend/access/undo/undoaction.c b/src/backend/access/undo/undoaction.c new file mode 100644 index 0000000000..16eb1ef04d --- /dev/null +++ b/src/backend/access/undo/undoaction.c @@ -0,0 +1,1631 @@ +/*------------------------------------------------------------------------- + * + * undoaction.c + * execute undo actions + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/backend/access/undo/undoaction.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/tpd.h" +#include "access/undoaction_xlog.h" +#include "access/undolog.h" +#include "access/undorecord.h" +#include "access/visibilitymap.h" +#include "access/xact.h" +#include "access/zheap.h" +#include "nodes/pg_list.h" +#include "pgstat.h" +#include "postmaster/undoloop.h" +#include "storage/block.h" +#include "storage/buf.h" +#include "storage/bufmgr.h" +#include "utils/relfilenodemap.h" +#include "utils/syscache.h" +#include "miscadmin.h" +#include "storage/shmem.h" +#include "access/undodiscard.h" + +#define ROLLBACK_HT_SIZE 1024 + +static bool execute_undo_actions_page(List *luinfo, UndoRecPtr urec_ptr, + Oid reloid, TransactionId xid, BlockNumber blkno, + bool blk_chain_complete, bool norellock); +static inline void undo_action_insert(Relation rel, Page page, OffsetNumber off, + TransactionId xid); +static void RollbackHTRemoveEntry(UndoRecPtr start_urec_ptr); + +/* This is the hash table to store all the rollabck requests. */ +static HTAB *RollbackHT; + +/* undo record information */ +typedef struct UndoRecInfo +{ + UndoRecPtr urp; /* undo recptr (undo record location). */ + UnpackedUndoRecord *uur; /* actual undo record. */ +} UndoRecInfo; + +/* + * execute_undo_actions - Execute the undo actions + * + * from_urecptr - undo record pointer from where to start applying undo action. + * to_urecptr - undo record pointer upto which point apply undo action. + * nopartial - true if rollback is for complete transaction. + * rewind - whether to rewind the insert location of the undo log or not. + * Only the backend executed the transaction can rewind, but + * any other process e.g. undo worker should not rewind it. + * Because, if the backend have already inserted new undo records + * for the next transaction and if we rewind then we will loose + * the undo record inserted for the new transaction. + * rellock - if the caller already has the lock on the required relation, + * then this flag is false, i.e. we do not need to acquire any + * lock here. If the flag is true then we need to acquire lock + * here itself, because caller will not be having any lock. 
+ * When we are performing undo actions for prepared transactions, + * or for rollback to savepoint, we need not to lock as we already + * have the lock on the table. In cases like error or when + * rollbacking from the undo worker we need to have proper locks. + */ +void +execute_undo_actions(UndoRecPtr from_urecptr, UndoRecPtr to_urecptr, + bool nopartial, bool rewind, bool rellock) +{ + UnpackedUndoRecord *uur = NULL; + UndoRecPtr urec_ptr, prev_urec_ptr, prev_blkprev; + UndoRecPtr save_urec_ptr; + Oid prev_reloid = InvalidOid; + ForkNumber prev_fork = InvalidForkNumber; + BlockNumber prev_block = InvalidBlockNumber; + List *luinfo = NIL; + bool more_undo; + TransactionId xid = InvalidTransactionId; + UndoRecInfo *urec_info; + + Assert(from_urecptr != InvalidUndoRecPtr); + /* + * If the location upto which rollback need to be done is not provided, + * then rollback the complete transaction. + * FIXME: this won't work if undolog crossed the limit of 1TB, because + * then from_urecptr and to_urecptr will be from different lognos. + */ + if (to_urecptr == InvalidUndoRecPtr) + { + UndoLogNumber logno = UndoRecPtrGetLogNo(from_urecptr); + to_urecptr = UndoLogGetLastXactStartPoint(logno); + } + + prev_blkprev = save_urec_ptr = urec_ptr = from_urecptr; + + if (nopartial) + { + uur = UndoFetchRecord(urec_ptr, InvalidBlockNumber, InvalidOffsetNumber, + InvalidTransactionId, NULL, NULL); + if (uur == NULL) + return; + + xid = uur->uur_xid; + UndoRecordRelease(uur); + uur = NULL; + + /* + * Grab the undo action apply lock before start applying the undo action + * this will prevent applying undo actions concurrently. If we do not + * get the lock that mean its already being applied concurrently or the + * discard worker might be pushing its request to the rollback hash + * table + */ + if (!ConditionTransactionUndoActionLock(xid)) + return; + } + + prev_urec_ptr = InvalidUndoRecPtr; + while (prev_urec_ptr != to_urecptr) + { + Oid reloid = InvalidOid; + uint16 urec_prevlen; + + more_undo = true; + + prev_urec_ptr = urec_ptr; + + /* Fetch the undo record for given undo_recptr. */ + uur = UndoFetchRecord(urec_ptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, NULL, NULL); + + if (uur != NULL) + reloid = uur->uur_reloid; + + /* + * If the record is already discarded by undo worker or if the relation + * is dropped or truncated, then we cannot fetch record successfully. + * Hence, exit quietly. + * + * Note: reloid remains InvalidOid for a discarded record. + */ + if (!OidIsValid(reloid)) + { + /* release the undo records for which action has been replayed */ + while (luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) linitial(luinfo); + + UndoRecordRelease(urec_info->uur); + pfree(urec_info); + luinfo = list_delete_first(luinfo); + } + + /* Release the undo action lock before returning. */ + if (nopartial) + TransactionUndoActionLockRelease(xid); + + /* Release the just-fetched record */ + if (uur != NULL) + UndoRecordRelease(uur); + + return; + } + + xid = uur->uur_xid; + + /* Collect the undo records that belong to the same page. */ + if (!OidIsValid(prev_reloid) || + (prev_reloid == reloid && + prev_fork == uur->uur_fork && + prev_block == uur->uur_block && + prev_blkprev == urec_ptr)) + { + prev_reloid = reloid; + prev_fork = uur->uur_fork; + prev_block = uur->uur_block; + + /* Prepare an undo record information element. 
*/ + urec_info = palloc(sizeof(UndoRecInfo)); + urec_info->urp = urec_ptr; + urec_info->uur = uur; + + luinfo = lappend(luinfo, urec_info); + urec_prevlen = uur->uur_prevlen; + save_urec_ptr = uur->uur_blkprev; + + /* The undo chain must continue till we reach to_urecptr */ + if (urec_prevlen > 0 && urec_ptr != to_urecptr) + { + urec_ptr = UndoGetPrevUndoRecptr(urec_ptr, urec_prevlen); + prev_blkprev = uur->uur_blkprev; + continue; + } + else + more_undo = false; + } + else + { + more_undo = true; + } + + /* + * If no more undo is left to be processed and we are rolling back the + * complete transaction, then we can consider that the undo chain for a + * block is complete. + * If the previous undo pointer in the page is invalid, then also the + * undo chain for the current block is completed. + */ + if ((!more_undo && nopartial) || !UndoRecPtrIsValid(save_urec_ptr)) + { + execute_undo_actions_page(luinfo, save_urec_ptr, prev_reloid, + xid, prev_block, true, rellock); + } + else + { + execute_undo_actions_page(luinfo, save_urec_ptr, prev_reloid, + xid, prev_block, false, rellock); + } + + /* release the undo records for which action has been replayed */ + while (luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) linitial(luinfo); + + UndoRecordRelease(urec_info->uur); + pfree(urec_info); + luinfo = list_delete_first(luinfo); + } + + /* + * There are still more records to process, so keep moving backwards + * in the chain. + */ + if (more_undo) + { + /* Prepare an undo record information element. */ + urec_info = palloc(sizeof(UndoRecInfo)); + urec_info->urp = urec_ptr; + urec_info->uur = uur; + luinfo = lappend(luinfo, urec_info); + + prev_reloid = reloid; + prev_fork = uur->uur_fork; + prev_block = uur->uur_block; + save_urec_ptr = uur->uur_blkprev; + + /* + * Continue to process the records if this is not the last undo + * record in chain. + */ + urec_prevlen = uur->uur_prevlen; + if (urec_prevlen > 0 && urec_ptr != to_urecptr) + urec_ptr = UndoGetPrevUndoRecptr(urec_ptr, urec_prevlen); + else + break; + } + else + break; + } + + /* Apply the undo actions for the remaining records. */ + if (list_length(luinfo)) + { + execute_undo_actions_page(luinfo, save_urec_ptr, prev_reloid, + xid, prev_block, nopartial ? true : false, + rellock); + + /* release the undo records for which action has been replayed */ + while (luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) linitial(luinfo); + + UndoRecordRelease(urec_info->uur); + pfree(urec_info); + luinfo = list_delete_first(luinfo); + } + } + + if (rewind) + { + /* Read the current log from undo */ + UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(to_urecptr)); + + /* Read the prevlen from the first record of this transaction. */ + uur = UndoFetchRecord(to_urecptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, + NULL, NULL); + /* + * If undo is already discarded before we rewind, then do nothing. + */ + if (uur == NULL) + return; + + + /* + * In ZGetMultiLockMembers we fetch the undo record without a + * buffer lock so it's possible that a transaction in the slot + * can rollback and rewind the undo record pointer. To prevent + * that we acquire the rewind lock before rewinding the undo record + * pointer and the same lock will be acquire by ZGetMultiLockMembers + * in shared mode. Other places where we fetch the undo record we + * don't need this lock as we are doing that under the buffer lock. 
+ * So remember to acquire the rewind lock in shared mode wherever we + * are fetching the undo record of non commited transaction without + * buffer lock. + */ + LWLockAcquire(&log->rewind_lock, LW_EXCLUSIVE); + UndoLogRewind(to_urecptr, uur->uur_prevlen); + LWLockRelease(&log->rewind_lock); + + UndoRecordRelease(uur); + } + + if (nopartial) + { + /* + * Set undo action apply completed in the transaction header if this is + * a main transaction and we have not rewound its undo. + */ + if (!rewind) + { + /* + * Undo action is applied so delete the hash table entry and release + * the undo action lock. + */ + RollbackHTRemoveEntry(from_urecptr); + + /* + * Prepare and update the progress of the undo action apply in the + * transaction header. + */ + PrepareUpdateUndoActionProgress(to_urecptr, 1); + + START_CRIT_SECTION(); + + /* Update the progress in the transaction header. */ + UndoRecordUpdateTransInfo(); + + /* WAL log the undo apply progress. */ + { + xl_undoapply_progress xlrec; + + xlrec.urec_ptr = to_urecptr; + xlrec.progress = 1; + + /* + * FIXME : We need to register undo buffers and set LSN for them + * that will be required for FPW of the undo buffers. + */ + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + + (void) XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_APPLY_PROGRESS); + } + + END_CRIT_SECTION(); + UnlockReleaseUndoBuffers(); + } + + TransactionUndoActionLockRelease(xid); + } +} + +/* + * process_and_execute_undo_actions_page + * + * Collect all the undo for the input buffer and execute. Here, we don't know + * the to_urecptr and we can not collect from undo meta data also like we do in + * execute_undo_actions, because we might be applying undo of some old + * transaction and may be from different undo log as well. + * + * from_urecptr - undo record pointer from where to start applying the undo. + * rel - relation descriptor for which undo to be applied. + * buffer - buffer for which unto to be processed. + * epoch - epoch of the xid passed. + * xid - aborted transaction id whose effects needs to be reverted. + * slot_no - transaction slot number of xid. + */ +void +process_and_execute_undo_actions_page(UndoRecPtr from_urecptr, Relation rel, + Buffer buffer, uint32 epoch, + TransactionId xid, int slot_no) +{ + UnpackedUndoRecord *uur = NULL; + UndoRecPtr urec_ptr = from_urecptr; + List *luinfo = NIL; + Page page; + UndoRecInfo *urec_info; + bool actions_applied = false; + + /* + * Process and collect the undo for the block until we reach the first + * record of the transaction. + * + * Fixme: This can lead to unbounded use of memory, so we should collect + * the undo in chunks based on work_mem or some other memory unit. + */ + do + { + /* Fetch the undo record for given undo_recptr. */ + uur = UndoFetchRecord(urec_ptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, + NULL, NULL); + /* + * If the record is already discarded by undo worker, or the xid we + * want to rollback has already applied its undo actions then just + * cleanup the slot and exit. + */ + if(uur == NULL || uur->uur_xid != xid) + { + if (uur != NULL) + UndoRecordRelease(uur); + break; + } + + /* Prepare an undo element. */ + urec_info = palloc(sizeof(UndoRecInfo)); + urec_info->urp = urec_ptr; + urec_info->uur = uur; + + /* Collect the undo records. */ + luinfo = lappend(luinfo, urec_info); + urec_ptr = uur->uur_blkprev; + + /* + * If we have exhausted the undo chain for the slot, then we are done. 
+ */ + if (!UndoRecPtrIsValid(urec_ptr)) + break; + } while (true); + + if (list_length(luinfo)) + actions_applied = execute_undo_actions_page(luinfo, urec_ptr, + rel->rd_id, xid, + BufferGetBlockNumber(buffer), + true, + false); + /* Release undo records and undo elements*/ + while (luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) linitial(luinfo); + + UndoRecordRelease(urec_info->uur); + pfree(urec_info); + luinfo = list_delete_first(luinfo); + } + + /* + * Clear the transaction id from the slot. We expect that if the undo + * actions are applied by execute_undo_actions_page then it would have + * cleared the xid, otherwise we will clear it here. + */ + if (!actions_applied) + { + int slot_no; + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + page = BufferGetPage(buffer); + slot_no = PageGetTransactionSlotId(rel, buffer, epoch, xid, &urec_ptr, + true, false, NULL); + /* + * If someone has already cleared the transaction info, then we don't + * need to do anything. + */ + if (slot_no != InvalidXactSlotId) + { + START_CRIT_SECTION(); + + /* Clear the epoch and xid from the slot. */ + PageSetTransactionSlotInfo(buffer, slot_no, 0, + InvalidTransactionId, urec_ptr); + MarkBufferDirty(buffer); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + xl_undoaction_reset_slot xlrec; + + xlrec.flags = 0; + xlrec.urec_ptr = urec_ptr; + xlrec.trans_slot_id = slot_no; + + XLogBeginInsert(); + + XLogRegisterData((char *) &xlrec, SizeOfUndoActionResetSlot); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + /* Register tpd buffer if the slot belongs to tpd page. */ + if (slot_no > ZHEAP_PAGE_TRANS_SLOTS) + { + xlrec.flags |= XLU_RESET_CONTAINS_TPD_SLOT; + RegisterTPDBuffer(page, 1); + } + + recptr = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_RESET_SLOT); + + PageSetLSN(page, recptr); + if (xlrec.flags & XLU_RESET_CONTAINS_TPD_SLOT) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + } + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + UnlockReleaseTPDBuffers(); + } +} + +/* + * undo_action_insert - perform the undo action for insert + * + * This will mark the tuple as dead so that the future access to it can't see + * this tuple. We mark it as unused if there is no other index pointing to + * it, otherwise mark it as dead. + */ +static inline void +undo_action_insert(Relation rel, Page page, OffsetNumber off, + TransactionId xid) +{ + ItemId lp; + bool relhasindex; + + /* + * This will mark the tuple as dead so that the future + * access to it can't see this tuple. We mark it as + * unused if there is no other index pointing to it, + * otherwise mark it as dead. + */ + relhasindex = RelationGetForm(rel)->relhasindex; + lp = PageGetItemId(page, off); + Assert(ItemIdIsNormal(lp)); + if (relhasindex) + { + ItemIdSetDead(lp); + } + else + { + ItemIdSetUnused(lp); + /* Set hint bit for ZPageAddItem */ + PageSetHasFreeLinePointers(page); + } + + ZPageSetPrunable(page, xid); +} + +/* + * execute_undo_actions_page - Execute the undo actions for a page + * + * After applying all the undo actions for a page, we clear the transaction + * slot on a page if the undo chain for block is complete, otherwise just + * rewind the undo pointer to the last record for that block that precedes + * the last undo record for which action is replayed. + * + * luinfo - list of undo records (along with their location) for which undo + * action needs to be replayed. + * urec_ptr - undo record pointer to which we need to rewind. + * reloid - OID of relation on which undo actions needs to be applied. 
+ * blkno - block number on which undo actions needs to be applied. + * blk_chain_complete - indicates whether the undo chain for block is + * complete. + * nopartial - true if rollback is for complete transaction. If we are not + * rolling back the complete transaction then we need to apply the + * undo action for UNDO_INVALID_XACT_SLOT also because in such + * case we will rewind the insert undo location. + * rellock - if the caller already has the lock on the required relation, + * then this flag is false, i.e. we do not need to acquire any + * lock here. If the flag is true then we need to acquire lock + * here itself, because caller will not be having any lock. + * When we are performing undo actions for prepared transactions, + * or for rollback to savepoint, we need not to lock as we already + * have the lock on the table. In cases like error or when + * rollbacking from the undo worker we need to have proper locks. + * + * returns true, if successfully applied the undo actions, otherwise, false. + */ +static bool +execute_undo_actions_page(List *luinfo, UndoRecPtr urec_ptr, Oid reloid, + TransactionId xid, BlockNumber blkno, + bool blk_chain_complete, bool rellock) +{ + ListCell *l_iter; + Relation rel; + Buffer buffer; + Page page; + UndoRecPtr slot_urec_ptr; + uint32 epoch; + int slot_no = 0; + int tpd_map_size = 0; + char *tpd_offset_map = NULL; + UndoRecInfo *urec_info = (UndoRecInfo *) linitial(luinfo); + Buffer vmbuffer = InvalidBuffer; + bool need_init = false; + bool tpd_page_locked = false; + bool is_tpd_map_updated = false; + + /* + * FIXME: If reloid is not valid then we have nothing to do. In future, + * we might want to do it differently for transactions that perform both + * DDL and DML operations. + */ + if (!OidIsValid(reloid)) + { + elog(LOG, "ignoring undo for invalid reloid"); + return false; + } + + if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(reloid))) + return false; + + /* + * If the action is executed by backend as a result of rollback, we must + * already have an appropriate lock on relation. + */ + if (rellock) + rel = heap_open(reloid, RowExclusiveLock); + else + rel = heap_open(reloid, NoLock); + + if (RelationGetNumberOfBlocks(rel) <= blkno) + { + /* + * This is possible if the underlying relation is truncated just before + * taking the relation lock above. + */ + heap_close(rel, NoLock); + return false; + } + + buffer = ReadBuffer(rel, blkno); + + /* + * If there is a undo action of type UNDO_ITEMID_UNUSED then might need + * to clear visibility_map. Since we cannot call visibilitymap_pin or + * visibilitymap_status within a critical section it shall be called + * here and let it be before taking the buffer lock on page. + */ + foreach(l_iter, luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) lfirst(l_iter); + UnpackedUndoRecord *uur = urec_info->uur; + + if (uur->uur_type == UNDO_ITEMID_UNUSED) + { + visibilitymap_pin(rel, blkno, &vmbuffer); + break; + } + } + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + page = BufferGetPage(buffer); + + /* + * Identify the slot number for this transaction. As we never allow undo + * more than 2-billion transactions, we can compute epoch from xid. + * + * Here, we will always take a lock on the tpd_page, if there is a tpd + * slot on the page. This is required because sometimes we only come to + * know that we need to update the tpd page after applying the undo record. 
+ * Now, the case where this can happen is when during DO operation the + * slot of previous updater is a non-TPD slot, but by the time we came for + * rollback it became a TPD slot which means this information won't be even + * recorded in undo. + */ + epoch = GetEpochForXid(xid); + slot_no = PageGetTransactionSlotId(rel, buffer, epoch, xid, + &slot_urec_ptr, true, true, + &tpd_page_locked); + + /* + * If undo action has been already applied for this page then skip + * the process altogether. If we didn't find a slot corresponding to + * xid, we consider the transaction is already rolledback. + * + * The logno of slot's undo record pointer must be same as the logno + * of undo record to be applied. + */ + if (slot_no == InvalidXactSlotId || + (UndoRecPtrGetLogNo(slot_urec_ptr) != UndoRecPtrGetLogNo(urec_info->urp)) || + (UndoRecPtrGetLogNo(slot_urec_ptr) == UndoRecPtrGetLogNo(urec_ptr) && + slot_urec_ptr <= urec_ptr)) + { + UnlockReleaseBuffer(buffer); + heap_close(rel, NoLock); + + UnlockReleaseTPDBuffers(); + + return false; + } + + /* + * We might need to update the TPD offset map while applying undo actions, + * so get the size of the TPD offset map and allocate the memory to fetch + * that outside the critical section. It is quite possible that the TPD + * entry is already pruned by this time, in which case, we will mark the + * slot as frozen. + * + * XXX It would have been better if we fetch the tpd map only when + * required, but that won't be possible in all cases. Sometimes + * we will come to know only during processing particular undo record. + * Now, we can process the undo records partially outside critical section + * such that we know whether we need TPD map or not, but that seems to + * be overkill. + */ + if (tpd_page_locked) + { + tpd_map_size = TPDPageGetOffsetMapSize(buffer); + if (tpd_map_size > 0) + tpd_offset_map = palloc(tpd_map_size); + } + + START_CRIT_SECTION(); + + foreach(l_iter, luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) lfirst(l_iter); + UnpackedUndoRecord *uur = urec_info->uur; + + /* Skip already applied undo. */ + if (slot_urec_ptr < urec_info->urp) + continue; + + switch (uur->uur_type) + { + case UNDO_INSERT: + { + int i, + nline; + ItemId lp; + + undo_action_insert(rel, page, uur->uur_offset, xid); + + nline = PageGetMaxOffsetNumber(page); + need_init = true; + for (i = FirstOffsetNumber; i <= nline; i++) + { + lp = PageGetItemId(page, i); + if (ItemIdIsUsed(lp) || ItemIdHasPendingXact(lp)) + { + need_init = false; + break; + } + } + } + break; + case UNDO_MULTI_INSERT: + { + OffsetNumber start_offset; + OffsetNumber end_offset; + OffsetNumber iter_offset; + int i, + nline; + ItemId lp; + + start_offset = ((OffsetNumber *) uur->uur_payload.data)[0]; + end_offset = ((OffsetNumber *) uur->uur_payload.data)[1]; + + for (iter_offset = start_offset; + iter_offset <= end_offset; + iter_offset++) + { + undo_action_insert(rel, page, iter_offset, xid); + } + + nline = PageGetMaxOffsetNumber(page); + need_init = true; + for (i = FirstOffsetNumber; i <= nline; i++) + { + lp = PageGetItemId(page, i); + if (ItemIdIsUsed(lp) || ItemIdHasPendingXact(lp)) + { + need_init = false; + break; + } + } + } + break; + case UNDO_DELETE: + case UNDO_UPDATE: + case UNDO_INPLACE_UPDATE: + { + ItemId lp; + ZHeapTupleHeader zhtup; + TransactionId slot_xid; + Size offset = 0; + uint32 undo_tup_len; + int trans_slot; + uint16 infomask; + int prev_trans_slot; + + /* Copy the entire tuple from undo. 
*/ + lp = PageGetItemId(page, uur->uur_offset); + Assert(ItemIdIsNormal(lp)); + zhtup = (ZHeapTupleHeader) PageGetItem(page, lp); + infomask = zhtup->t_infomask; + trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + + undo_tup_len = *((uint32 *) &uur->uur_tuple.data[offset]); + ItemIdChangeLen(lp, undo_tup_len); + /* skip ctid and tableoid stored in undo tuple */ + offset += sizeof(uint32) + sizeof(ItemPointerData) + + sizeof(Oid); + memcpy(zhtup, + (ZHeapTupleHeader) &uur->uur_tuple.data[offset], + undo_tup_len); + + /* + * Fetch previous transaction slot on tuple formed from + * undo record. + */ + prev_trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + + /* + * If the previous version of the tuple points to a TPD + * slot then we need to update the slot in the offset map + * of the TPD entry. But, only if we still have a valid + * TPD entry for the page otherwise the old tuple version + * must be all visible and we can mark the slot as frozen. + */ + if (uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT && + tpd_offset_map) + { + TransactionId prev_slot_xid; + + /* Fetch TPD slot from the undo. */ + if (uur->uur_type == UNDO_UPDATE) + prev_trans_slot = *(int *) ((char *) uur->uur_payload.data + + sizeof(ItemPointerData)); + else + prev_trans_slot = *(int *) uur->uur_payload.data; + + /* + * If the previous transaction slot points to a TPD + * slot then we need to update the slot in the offset + * map of the TPD entry. + * + * This is the case where during DO operation the + * previous updater belongs to a non-TPD slot whereas + * now the same slot has become a TPD slot. In such + * cases, we need to update offset-map. + */ + GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + prev_trans_slot, + NULL, + &prev_slot_xid, + NULL, + false, + true); + + TPDPageSetOffsetMapSlot(buffer, prev_trans_slot, + uur->uur_offset); + + /* Here, we updated TPD offset map, so need to log. */ + if (!is_tpd_map_updated) + is_tpd_map_updated = true; + + /* + * If transaction slot to which tuple point is not + * same as the previous transaction slot, so that we + * need to mark the tuple with a special flag. + */ + if (uur->uur_prevxid != prev_slot_xid) + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + else if (uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + { + ZHeapTupleHeaderSetXactSlot(zhtup, ZHTUP_SLOT_FROZEN); + } + else if (prev_trans_slot == ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) page)) + { + TransactionId prev_slot_xid; + + /* TPD page must be locked by now. */ + Assert(tpd_page_locked); + + /* + * If the previous transaction slot points to a TPD + * slot then we need to update the slot in the offset + * map of the TPD entry. + * + * This is the case where during DO operation the + * previous updater belongs to a non-TPD slot whereas + * now the same slot has become a TPD slot. In such + * cases, we need to update offset-map. + */ + GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + prev_trans_slot, + NULL, + &prev_slot_xid, + NULL, + false, + true); + TPDPageSetOffsetMapSlot(buffer, + ZHEAP_PAGE_TRANS_SLOTS + 1, + uur->uur_offset); + + /* Here, we updated TPD offset map, so need to log. */ + if (!is_tpd_map_updated) + is_tpd_map_updated = true; + + /* + * If transaction slot to which tuple point is not + * same as the previous transaction slot, so that we + * need to mark the tuple with a special flag. 
+ */ + if (uur->uur_prevxid != prev_slot_xid) + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + else + { + trans_slot = GetTransactionSlotInfo(buffer, + uur->uur_offset, + trans_slot, + NULL, + &slot_xid, + NULL, + false, + false); + + if (TransactionIdEquals(uur->uur_prevxid, + FrozenTransactionId)) + { + /* + * If the previous xid is frozen, then we can + * safely mark the tuple as frozen. + */ + ZHeapTupleHeaderSetXactSlot(zhtup, + ZHTUP_SLOT_FROZEN); + } + else if (trans_slot != ZHTUP_SLOT_FROZEN && + uur->uur_prevxid != slot_xid) + { + /* + * If the transaction slot to which tuple point got + * reused by this time, then we need to mark the + * tuple with a special flag. See comments atop + * PageFreezeTransSlots. + */ + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + } + + /* + * We always need to retain the strongest locker + * information on the the tuple (as part of infomask and + * infomask2), if there are multiple lockers on a tuple. + * This is because the conflict detection mechanism works + * based on strongest locker. See + * zheap_update/zheap_delete. We have allowed to override + * the transaction slot information with whatever is + * present in undo as we have taken care during DO + * operation that it contains previous strongest locker + * information. See compute_new_xid_infomask. + */ + if (ZHeapTupleHasMultiLockers(infomask)) + { + /* ZHeapTupleHeaderSetXactSlot(zhtup, trans_slot); */ + zhtup->t_infomask |= ZHEAP_MULTI_LOCKERS; + zhtup->t_infomask &= ~(zhtup->t_infomask & + ZHEAP_LOCK_MASK); + zhtup->t_infomask |= infomask & ZHEAP_LOCK_MASK; + + /* + * If the tuple originally has INVALID_XACT_SLOT set, + * then we need to retain it as that must be the + * information of strongest locker. + */ + if (ZHeapTupleHasInvalidXact(infomask)) + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + } + break; + case UNDO_XID_LOCK_ONLY: + case UNDO_XID_LOCK_FOR_UPDATE: + { + ItemId lp; + ZHeapTupleHeader zhtup, undo_tup_hdr; + uint16 infomask; + + /* Copy the entire tuple from undo. */ + lp = PageGetItemId(page, uur->uur_offset); + Assert(ItemIdIsNormal(lp)); + zhtup = (ZHeapTupleHeader) PageGetItem(page, lp); + infomask = zhtup->t_infomask; + + /* + * Override the tuple header values with values retrieved + * from undo record except when there are multiple + * lockers. In such cases, we want to retain the strongest + * locker information present in infomask and infomask2. + */ + undo_tup_hdr = (ZHeapTupleHeader) uur->uur_tuple.data; + zhtup->t_hoff = undo_tup_hdr->t_hoff; + + if (!(ZHeapTupleHasMultiLockers(infomask))) + { + int trans_slot; + int prev_trans_slot PG_USED_FOR_ASSERTS_ONLY; + TransactionId slot_xid; + + zhtup->t_infomask2 = undo_tup_hdr->t_infomask2; + zhtup->t_infomask = undo_tup_hdr->t_infomask; + + /* + * We need to set the previous slot for tuples that are + * locked for update as such tuples changed the slot + * while acquiring the lock. + */ + if (uur->uur_type == UNDO_XID_LOCK_ONLY) + { + /* + * Set the slot in the tpd offset map. for detailed + * comments refer undo actions of update/delete. + */ + if ((uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) && + tpd_offset_map) + { + TransactionId prev_slot_xid; + + prev_trans_slot = *(int *)((char *)uur->uur_payload.data + + sizeof(LockTupleMode)); + /* + * If the previous transaction slot points to a TPD + * slot then we need to update the slot in the offset + * map of the TPD entry. 
+ * + * This is the case where during DO operation the + * previous updater belongs to a non-TPD slot whereas + * now the same slot has become a TPD slot. In such + * cases, we need to update offset-map. + */ + GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + prev_trans_slot, + NULL, + &prev_slot_xid, + NULL, + false, + true); + + TPDPageSetOffsetMapSlot(buffer, prev_trans_slot, + uur->uur_offset); + + /* + * Here, we updated TPD offset map, so need to + * log. + */ + if (!is_tpd_map_updated) + is_tpd_map_updated = true; + + /* + * If transaction slot to which tuple point is not + * same as the previous transaction slot, so that we + * need to mark the tuple with a special flag. + */ + if (prev_slot_xid != uur->uur_prevxid) + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + else if (uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + prev_trans_slot = ZHTUP_SLOT_FROZEN; + else + prev_trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + + trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + trans_slot = GetTransactionSlotInfo(buffer, + uur->uur_offset, + trans_slot, + NULL, + &slot_xid, + NULL, + false, + false); + + /* + * For a non multi locker case, the slot in undo (and + * hence on tuple) must be either a frozen slot or the + * previous slot. Generally, we always set the multi-locker + * bit on the tuple whenever the tuple slot is not frozen. + * But, if the tuple is inserted/modified by the same + * transaction that later takes a lock on it, we keep the + * transaction slot as it is. + * See compute_new_xid_infomask for details. + */ + Assert(trans_slot == ZHTUP_SLOT_FROZEN || + trans_slot == prev_trans_slot); + } + else + { + /* + * Fetch previous transaction slot on tuple formed from + * undo record. + */ + prev_trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + + /* + * If the previous version of the tuple points to a TPD + * slot then we need to update the slot in the offset map + * of the TPD entry. But, only if we still have a valid + * TPD entry for the page otherwise the old tuple version + * must be all visible and we can mark the slot as frozen. + */ + if (uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT && + tpd_offset_map) + { + TransactionId prev_slot_xid; + + prev_trans_slot = *(int *)((char *)uur->uur_payload.data + sizeof(LockTupleMode)); + + /* + * If the previous transaction slot points to a TPD + * slot then we need to update the slot in the offset + * map of the TPD entry. + * + * This is the case where during DO operation the + * previous updater belongs to a non-TPD slot whereas + * now the same slot has become a TPD slot. In such + * cases, we need to update offset-map. + */ + GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + prev_trans_slot, + NULL, + &prev_slot_xid, + NULL, + false, + true); + + TPDPageSetOffsetMapSlot(buffer, prev_trans_slot, + uur->uur_offset); + + /* Here, we updated TPD offset map, so need to + * log. + */ + if (!is_tpd_map_updated) + is_tpd_map_updated = true; + + /* + * If transaction slot to which tuple point is not + * same as the previous transaction slot, so that we + * need to mark the tuple with a special flag. + */ + if (prev_slot_xid != uur->uur_prevxid) + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + else if (uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + { + ZHeapTupleHeaderSetXactSlot(zhtup, ZHTUP_SLOT_FROZEN); + } + else if (prev_trans_slot == ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) page)) + { + TransactionId prev_slot_xid; + + /* TPD page must be locked by now. 
*/ + Assert(tpd_page_locked); + + /* + * If the previous transaction slot points to a TPD + * slot then we need to update the slot in the offset + * map of the TPD entry. + * + * This is the case where during DO operation the + * previous updater belongs to a non-TPD slot whereas + * now the same slot has become a TPD slot. In such + * cases, we need to update offset-map. + */ + GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + prev_trans_slot, + NULL, + &prev_slot_xid, + NULL, + false, + true); + + TPDPageSetOffsetMapSlot(buffer, + ZHEAP_PAGE_TRANS_SLOTS + 1, + uur->uur_offset); + + /* Here, we updated TPD offset map, so need to + * log. + */ + if (!is_tpd_map_updated) + is_tpd_map_updated = true; + + if (prev_slot_xid != uur->uur_prevxid) + { + /* + * Here, transaction slot to which tuple point is not + * same as the previous transaction slot, so that we + * need to mark the tuple with a special flag. + */ + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + } + else + { + trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + trans_slot = GetTransactionSlotInfo(buffer, + uur->uur_offset, + trans_slot, + NULL, + &slot_xid, + NULL, + false, + false); + + if (TransactionIdEquals(uur->uur_prevxid, + FrozenTransactionId)) + { + /* + * If the previous xid is frozen, then we can + * safely mark the tuple as frozen. + */ + ZHeapTupleHeaderSetXactSlot(zhtup, + ZHTUP_SLOT_FROZEN); + } + else if (trans_slot != ZHTUP_SLOT_FROZEN && + uur->uur_prevxid != slot_xid) + { + /* + * If the transaction slot to which tuple point got + * reused by this time, then we need to mark the + * tuple with a special flag. See comments atop + * PageFreezeTransSlots. + */ + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + } + } + } + } + break; + case UNDO_XID_MULTI_LOCK_ONLY: + break; + case UNDO_ITEMID_UNUSED: + { + int item_count, i; + OffsetNumber *unused; + + unused = ((OffsetNumber *) uur->uur_payload.data); + item_count = (uur->uur_payload.len / sizeof(OffsetNumber)); + + /* + * We need to preserve all the unused items in zheap so + * that they can't be reused till the corresponding index + * entries are removed. So, marking them dead is + * a sufficient indication for the index to remove the + * entry in index. + */ + for (i = 0; i < item_count; i++) + { + ItemId itemid; + + itemid = PageGetItemId(page, unused[i]); + ItemIdSetDead(itemid); + } + + /* clear visibility map */ + Assert(BufferIsValid(vmbuffer)); + visibilitymap_clear(rel, blkno, vmbuffer, + VISIBILITYMAP_VALID_BITS); + + } + break; + default: + elog(ERROR, "unsupported undo record type"); + } + } + + /* + * If the undo chain for the block is complete then set the xid in the slot + * as InvalidTransactionId. But, rewind the slot urec_ptr to the previous + * urec_ptr in the slot. This is to make sure if any transaction reuse the + * transaction slot and rollback then put back the previous transaction's + * urec_ptr. + */ + if (blk_chain_complete) + { + epoch = 0; + xid = InvalidTransactionId; + } + + PageSetTransactionSlotInfo(buffer, slot_no, epoch, xid, urec_ptr); + + MarkBufferDirty(buffer); + + /* + * We are logging the complete page for undo actions, so we don't need to + * record the data for individual operations. We can optimize it by + * recording the data for individual operations, but again if there are + * multiple operations, then it might be better to log the complete page. + * So we can have some threshold above which we always log the complete + * page. 
+ */ + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + uint8 flags = 0; + + if (slot_no > ZHEAP_PAGE_TRANS_SLOTS) + flags |= XLU_PAGE_CONTAINS_TPD_SLOT; + if (BufferIsValid(vmbuffer)) + flags |= XLU_PAGE_CLEAR_VISIBILITY_MAP; + if (is_tpd_map_updated) + { + /* TPD page must be locked. */ + Assert(tpd_page_locked); + /* tpd_offset_map must be non-null. */ + Assert(tpd_offset_map); + flags |= XLU_CONTAINS_TPD_OFFSET_MAP; + } + if (need_init) + flags |= XLU_INIT_PAGE; + + XLogBeginInsert(); + + XLogRegisterData((char *) &flags, sizeof(uint8)); + XLogRegisterBuffer(0, buffer, REGBUF_FORCE_IMAGE | REGBUF_STANDARD); + + /* Log the TPD details, if the transaction slot belongs to TPD. */ + if (flags & XLU_PAGE_CONTAINS_TPD_SLOT) + { + xl_undoaction_page xlrec; + + xlrec.urec_ptr = urec_ptr; + xlrec.xid = xid; + xlrec.trans_slot_id = slot_no; + XLogRegisterData((char *) &xlrec, SizeOfUndoActionPage); + } + + /* + * Log the TPD offset map if we have modified it. + * + * XXX Another option could be that we track all the offset map entries + * of TPD which got modified while applying the undo and only log those + * information into the WAL. + */ + if (is_tpd_map_updated) + { + /* Fetch the TPD offset map and write into the WAL record. */ + TPDPageGetOffsetMap(buffer, tpd_offset_map, tpd_map_size); + XLogRegisterData((char *) tpd_offset_map, tpd_map_size); + } + + if (flags & XLU_PAGE_CONTAINS_TPD_SLOT || + flags & XLU_CONTAINS_TPD_OFFSET_MAP) + { + RegisterTPDBuffer(page, 1); + } + + recptr = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_PAGE); + + PageSetLSN(page, recptr); + if (flags & XLU_PAGE_CONTAINS_TPD_SLOT || + flags & XLU_CONTAINS_TPD_OFFSET_MAP) + TPDPageSetLSN(page, recptr); + } + + /* + * During rollback, if all the itemids are marked as unused, we need + * to initialize the page, so that the next insertion can see the + * page as initialized. This serves two purposes (a) On next insertion, + * we can safely set the XLOG_ZHEAP_INIT_PAGE flag in WAL (OTOH, if we + * don't initialize the page here and set the flag, wal consistency + * checker can complain), (b) we don't accumulate the dead space in the + * page. + * + * Note that we initialize the page after writing WAL because the TPD + * routines use last slot in page to determine TPD block number. + */ + if (need_init) + ZheapInitPage(page, (Size) BLCKSZ); + + END_CRIT_SECTION(); + + /* Free TPD offset map memory. */ + if (tpd_offset_map) + pfree(tpd_offset_map); + + /* + * Release any remaining pin on visibility map page. + */ + if (BufferIsValid(vmbuffer)) + ReleaseBuffer(vmbuffer); + + UnlockReleaseBuffer(buffer); + UnlockReleaseTPDBuffers(); + + heap_close(rel, NoLock); + + return true; +} + +/* + * To return the size of the hash-table for rollbacks. + */ +int +RollbackHTSize(void) +{ + return hash_estimate_size(ROLLBACK_HT_SIZE, sizeof(RollbackHashEntry)); +} + +/* + * To initialize the hash-table for rollbacks in shared memory + * for the given size. + */ +void +InitRollbackHashTable(void) +{ + HASHCTL info; + MemSet(&info, 0, sizeof(info)); + + info.keysize = sizeof(UndoRecPtr); + info.entrysize = sizeof(RollbackHashEntry); + info.hash = tag_hash; + + RollbackHT = ShmemInitHash("Undo actions Lookup Table", + ROLLBACK_HT_SIZE, ROLLBACK_HT_SIZE, &info, + HASH_ELEM | HASH_FUNCTION | HASH_FIXED_SIZE); +} + +/* + * To push the rollback requests from backend to the hash-table. + * Return true if the request is successfully added, else false + * and the caller may execute undo actions itself. 
+ */ +bool +PushRollbackReq(UndoRecPtr start_urec_ptr, UndoRecPtr end_urec_ptr, Oid dbid) +{ + bool found = false; + RollbackHashEntry *rh; + + /* Do not push any rollback request if working in single user-mode */ + if (!IsUnderPostmaster) + return false; + /* + * If the location upto which rollback need to be done is not provided, + * then rollback the complete transaction. + */ + if (start_urec_ptr == InvalidUndoRecPtr) + { + UndoLogNumber logno = UndoRecPtrGetLogNo(end_urec_ptr); + start_urec_ptr = UndoLogGetLastXactStartPoint(logno); + } + + Assert(UndoRecPtrIsValid(start_urec_ptr)); + + /* If there is no space to accomodate new request, then we can't proceed. */ + if (RollbackHTIsFull()) + return false; + + if(!UndoRecPtrIsValid(end_urec_ptr)) + { + UndoLogNumber logno = UndoRecPtrGetLogNo(start_urec_ptr); + end_urec_ptr = UndoLogGetLastXactStartPoint(logno); + } + + LWLockAcquire(RollbackHTLock, LW_EXCLUSIVE); + + rh = (RollbackHashEntry *) hash_search(RollbackHT, &start_urec_ptr, + HASH_ENTER_NULL, &found); + if (!rh) + { + LWLockRelease(RollbackHTLock); + return false; + } + /* We shouldn't try to push the same rollback request again. */ + if (!found) + { + rh->start_urec_ptr = start_urec_ptr; + rh->end_urec_ptr = end_urec_ptr; + rh->dbid = (dbid == InvalidOid) ? MyDatabaseId : dbid; + } + LWLockRelease(RollbackHTLock); + + return true; +} + +/* + * To perform the undo actions for the transactions whose rollback + * requests are in hash table. Sequentially, scan the hash-table + * and perform the undo-actions for the respective transactions. + * Once, the undo-actions are applied, remove the entry from the + * hash table. + */ +void +RollbackFromHT(Oid dbid) +{ + UndoRecPtr start[ROLLBACK_HT_SIZE]; + UndoRecPtr end[ROLLBACK_HT_SIZE]; + RollbackHashEntry *rh; + HASH_SEQ_STATUS status; + int i = 0; + + /* Fetch the rollback requests */ + LWLockAcquire(RollbackHTLock, LW_SHARED); + + Assert(hash_get_num_entries(RollbackHT) <= ROLLBACK_HT_SIZE); + hash_seq_init(&status, RollbackHT); + while (RollbackHT != NULL && + (rh = (RollbackHashEntry *) hash_seq_search(&status)) != NULL) + { + if (rh->dbid == dbid) + { + start[i] = rh->start_urec_ptr; + end[i] = rh->end_urec_ptr; + i++; + } + } + + LWLockRelease(RollbackHTLock); + + /* Execute the rollback requests */ + while(--i >= 0) + { + Assert(UndoRecPtrIsValid(start[i])); + Assert(UndoRecPtrIsValid(end[i])); + + StartTransactionCommand(); + execute_undo_actions(start[i], end[i], true, false, true); + CommitTransactionCommand(); + } +} + +/* + * Remove the rollback request entry from the rollback hash table. + */ +static void +RollbackHTRemoveEntry(UndoRecPtr start_urec_ptr) +{ + LWLockAcquire(RollbackHTLock, LW_EXCLUSIVE); + + hash_search(RollbackHT, &start_urec_ptr, HASH_REMOVE, NULL); + + LWLockRelease(RollbackHTLock); +} + +/* + * To check if the rollback requests in the hash table are all + * completed or not. This is required because we don't not want to + * expose RollbackHT in xact.c, where it is required to ensure + * that we push the resuests only when there is some space in + * the hash-table. + */ +bool +RollbackHTIsFull(void) +{ + bool result = false; + + LWLockAcquire(RollbackHTLock, LW_SHARED); + + if (hash_get_num_entries(RollbackHT) >= ROLLBACK_HT_SIZE) + result = true; + + LWLockRelease(RollbackHTLock); + + return result; +} + +/* + * Get database list from the rollback hash table. 
+ */ +List * +RollbackHTGetDBList() +{ + HASH_SEQ_STATUS status; + RollbackHashEntry *rh; + List *dblist = NIL; + + /* Fetch the rollback requests */ + LWLockAcquire(RollbackHTLock, LW_SHARED); + + hash_seq_init(&status, RollbackHT); + while (RollbackHT != NULL && + (rh = (RollbackHashEntry *) hash_seq_search(&status)) != NULL) + dblist = list_append_unique_oid(dblist, rh->dbid); + + LWLockRelease(RollbackHTLock); + + return dblist; +} + +/* + * ConditionTransactionUndoActionLock + * + * Insert a lock showing that the undo action for given transaction is in + * progress. This is only done for the main transaction not for the + * sub-transaction. + */ +bool +ConditionTransactionUndoActionLock(TransactionId xid) +{ + LOCKTAG tag; + + SET_LOCKTAG_TRANSACTION_UNDOACTION(tag, xid); + + if (LOCKACQUIRE_NOT_AVAIL == LockAcquire(&tag, ExclusiveLock, false, true)) + return false; + else + return true; +} + +/* + * TransactionUndoActionLockRelease + * + * Delete the lock showing that the undo action given transaction ID is in + * progress. + */ +void +TransactionUndoActionLockRelease(TransactionId xid) +{ + LOCKTAG tag; + + SET_LOCKTAG_TRANSACTION_UNDOACTION(tag, xid); + + LockRelease(&tag, ExclusiveLock, false); +} diff --git a/src/backend/access/undo/undoactionxlog.c b/src/backend/access/undo/undoactionxlog.c new file mode 100644 index 0000000000..1b8c6306ba --- /dev/null +++ b/src/backend/access/undo/undoactionxlog.c @@ -0,0 +1,233 @@ +/*------------------------------------------------------------------------- + * + * undoactionxlog.c + * WAL replay logic for undo actions. + * + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/undo/undoactionxlog.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/undoaction_xlog.h" +#include "access/visibilitymap.h" +#include "access/xlog.h" +#include "access/xlogutils.h" +#include "access/zheap.h" + +#if 0 +static void +undo_xlog_insert(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_insert *xlrec = (xl_undo_insert *) XLogRecGetData(record); + Buffer buffer; + Page page; + ItemId lp; + XLogRedoAction action; + + action = XLogReadBufferForRedo(record, 0, &buffer); + if (action == BLK_NEEDS_REDO) + { + page = BufferGetPage(buffer); + + lp = PageGetItemId(page, xlrec->offnum); + if (xlrec->relhasindex) + { + ItemIdSetDead(lp); + } + else + { + ItemIdSetUnused(lp); + /* Set hint bit for ZPageAddItem */ + /*PageSetHasFreeLinePointers(page);*/ + } + + PageSetLSN(BufferGetPage(buffer), lsn); + MarkBufferDirty(buffer); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); +} +#endif + +/* + * replay of undo page operation + */ +static void +undo_xlog_page(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer buf; + xl_undoaction_page *xlrec = NULL; + char *offsetmap = NULL, + *data = NULL; + XLogRedoAction action; + uint8 *flags = (uint8 *) XLogRecGetData(record); + + if (*flags & XLU_PAGE_CONTAINS_TPD_SLOT || + *flags & XLU_CONTAINS_TPD_OFFSET_MAP) + { + data = (char *) flags + sizeof(uint8); + if (*flags & XLU_PAGE_CONTAINS_TPD_SLOT) + { + xlrec = (xl_undoaction_page *) data; + data += sizeof(xl_undoaction_page); + } + if (*flags & XLU_CONTAINS_TPD_OFFSET_MAP) + offsetmap = data; + } + + if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED) + elog(ERROR, 
"Undo page record did not contain a full-page image"); + + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + uint32 xid_epoch = 0; + + /* + * We need to replay the record for TPD only when this record contains + * slot from TPD. + */ + Assert(*flags & XLU_PAGE_CONTAINS_TPD_SLOT || + *flags & XLU_CONTAINS_TPD_OFFSET_MAP); + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + if (*flags & XLU_PAGE_CONTAINS_TPD_SLOT) + { + if (TransactionIdIsValid(xlrec->xid)) + xid_epoch = GetEpochForXid(xlrec->xid); + TPDPageSetTransactionSlotInfo(buf, xlrec->trans_slot_id, + xid_epoch, + xlrec->xid, xlrec->urec_ptr); + } + + if (offsetmap) + TPDPageSetOffsetMap(buf, offsetmap); + + TPDPageSetLSN(BufferGetPage(buf), lsn); + } + } + + if (*flags & XLU_PAGE_CLEAR_VISIBILITY_MAP) + { + Relation reln; + Buffer vmbuffer = InvalidBuffer; + RelFileNode target_node; + BlockNumber blkno; + + XLogRecGetBlockTag(record, 0, &target_node, NULL, &blkno); + reln = CreateFakeRelcacheEntry(target_node); + visibilitymap_pin(reln, blkno, &vmbuffer); + visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + FreeFakeRelcacheEntry(reln); + } + + /* + * Reset Page only at the end if asked, page level flag + * PD_PAGE_HAS_TPD_SLOT and TPD slot are needed before that TPD routines. + */ + if (*flags & XLU_INIT_PAGE) + ZheapInitPage(BufferGetPage(buf), (Size) BLCKSZ); + + UnlockReleaseBuffer(buf); + UnlockReleaseTPDBuffers(); +} + +/* + * replay of undo reset slot operation + */ +static void +undo_xlog_reset_xid(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undoaction_reset_slot *xlrec = (xl_undoaction_reset_slot *) XLogRecGetData(record); + Buffer buf; + XLogRedoAction action; + + action = XLogReadBufferForRedo(record, 0, &buf); + + /* + * Reseting the TPD slot is handled separately so only handle the page + * slot here. + */ + if (action == BLK_NEEDS_REDO && + xlrec->trans_slot_id <= ZHEAP_PAGE_TRANS_SLOTS) + { + Page page; + ZHeapPageOpaque opaque; + int slot_no = xlrec->trans_slot_id; + + page = BufferGetPage(buf); + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + opaque->transinfo[slot_no - 1].xid_epoch = 0; + opaque->transinfo[slot_no - 1].xid = InvalidTransactionId; + opaque->transinfo[slot_no - 1].urec_ptr = xlrec->urec_ptr; + + PageSetLSN(page, lsn); + MarkBufferDirty(buf); + } + + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + Assert(xlrec->flags & XLU_RESET_CONTAINS_TPD_SLOT); + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + TPDPageSetTransactionSlotInfo(buf, xlrec->trans_slot_id, + 0, InvalidTransactionId, + xlrec->urec_ptr); + TPDPageSetLSN(BufferGetPage(buf), lsn); + } + } + + if (BufferIsValid(buf)) + UnlockReleaseBuffer(buf); + UnlockReleaseTPDBuffers(); +} + +/* + * Replay of undo apply progress. + */ +static void +undo_xlog_apply_progress(XLogReaderState *record) +{ + xl_undoapply_progress *xlrec = (xl_undoapply_progress *) XLogRecGetData(record); + + /* Update the progress in the transaction header. 
*/ + PrepareUpdateUndoActionProgress(xlrec->urec_ptr, xlrec->progress); + UndoRecordUpdateTransInfo(); + UnlockReleaseUndoBuffers(); +} + +void +undoaction_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info) + { + case XLOG_UNDO_PAGE: + undo_xlog_page(record); + break; + case XLOG_UNDO_RESET_SLOT: + undo_xlog_reset_xid(record); + break; + case XLOG_UNDO_APPLY_PROGRESS: + undo_xlog_apply_progress(record); + break; + default: + elog(PANIC, "undoaction_redo: unknown op code %u", info); + } +} diff --git a/src/backend/access/undo/undodiscard.c b/src/backend/access/undo/undodiscard.c new file mode 100644 index 0000000000..2464fb6dc2 --- /dev/null +++ b/src/backend/access/undo/undodiscard.c @@ -0,0 +1,469 @@ +/*------------------------------------------------------------------------- + * + * undodiscard.c + * discard undo records + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/backend/access/undo/undodiscard.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/xact.h" +#include "access/xlog.h" +#include "access/undolog.h" +#include "access/undodiscard.h" +#include "catalog/pg_tablespace.h" +#include "miscadmin.h" +#include "storage/block.h" +#include "storage/buf.h" +#include "storage/bufmgr.h" +#include "storage/shmem.h" +#include "storage/proc.h" +#include "utils/resowner.h" +#include "postmaster/undoloop.h" + +static UndoRecPtr FetchLatestUndoPtrForXid(UndoRecPtr urecptr, + UnpackedUndoRecord *uur_start, + UndoLogControl *log); + +/* + * Discard the undo for the log + * + * Search the undo log, get the start record for each transaction until we get + * the transaction with xid >= xmin or an invalid xid. Then call undolog + * routine to discard upto that point and update the memory structure for the + * log slot. We set the hibernate flag if we do not have any undo logs, this + * flag is passed to the undo worker wherein it determines if system is idle + * and it should sleep for sometime. + * + * Return the oldest xid remaining in this undo log (which should be >= xmin, + * since we'll discard everything older). Return InvalidTransactionId if the + * undo log is empty. + */ +static TransactionId +UndoDiscardOneLog(UndoLogControl *log, TransactionId xmin, bool *hibernate) +{ + UndoRecPtr undo_recptr, next_insert, from_urecptr; + UndoRecPtr next_urecptr = InvalidUndoRecPtr; + UnpackedUndoRecord *uur = NULL; + bool need_discard = false; + bool log_complete = false; + TransactionId undoxid = InvalidTransactionId; + TransactionId xid = log->oldest_xid; + TransactionId latest_discardxid = InvalidTransactionId; + uint32 epoch = 0; + + undo_recptr = log->oldest_data; + + /* There might not be any undo log and hibernation might be needed. */ + *hibernate = true; + + /* Loop until we run out of discardable transactions. */ + do + { + bool pending_abort = false; + + next_insert = UndoLogGetNextInsertPtr(log->logno, xid); + + /* + * If the next insert location in the undo log is same as the oldest + * data for the log then there is nothing more to discard in this log + * so discard upto this point. + */ + if (next_insert == undo_recptr) + { + /* + * If the discard location and the insert location is same then + * there is nothing to discard. 
+ */ + if (undo_recptr == log->oldest_data) + break; + else + log_complete = true; + } + else + { + /* Fetch the undo record for given undo_recptr. */ + uur = UndoFetchRecord(undo_recptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, + NULL, NULL); + + Assert(uur != NULL); + + if (!TransactionIdDidCommit(uur->uur_xid) && + TransactionIdPrecedes(uur->uur_xid, xmin) && + uur->uur_progress == 0) + { + /* + * At the time of recovery, we might not have a valid next undo + * record pointer and in that case we'll calculate the location + * of from pointer using the last record of next insert + * location. + */ + if (ConditionTransactionUndoActionLock(uur->uur_xid)) + { + TransactionId xid = uur->uur_xid; + UndoLogControl *log = NULL; + UndoLogNumber logno; + + logno = UndoRecPtrGetLogNo(undo_recptr); + log = UndoLogGet(logno); + + /* + * If the corresponding log got rewinded to a location + * prior to undo_recptr, the undo actions are already + * applied. + */ + if (MakeUndoRecPtr(logno, log->meta.insert) > undo_recptr) + { + UndoRecordRelease(uur); + + /* Fetch the undo record under undo action lock. */ + uur = UndoFetchRecord(undo_recptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, + NULL, NULL); + /* + * If the undo actions for the aborted transaction is + * already applied then continue discarding the undo log + * otherwise discard till current point and stop processing + * this undo log. + * Also, check this is indeed the transaction id we're + * looking for. It is possible that after rewinding + * some other transaction has inserted an undo record. + */ + if (uur->uur_xid == xid && uur->uur_progress == 0) + { + from_urecptr = FetchLatestUndoPtrForXid(undo_recptr, uur, log); + (void)PushRollbackReq(from_urecptr, undo_recptr, uur->uur_dbid); + pending_abort = true; + } + } + + TransactionUndoActionLockRelease(xid); + } + else + pending_abort = true; + } + + next_urecptr = uur->uur_next; + undoxid = uur->uur_xid; + xid = undoxid; + epoch = uur->uur_xidepoch; + } + + /* we can discard upto this point. */ + if (TransactionIdFollowsOrEquals(undoxid, xmin) || + next_urecptr == InvalidUndoRecPtr || + UndoRecPtrGetLogNo(next_urecptr) != log->logno || + log_complete || pending_abort) + { + /* Hey, I got some undo log to discard, can not hibernate now. */ + *hibernate = false; + + if (uur != NULL) + UndoRecordRelease(uur); + + /* + * If Transaction id is smaller than the xmin that means this must + * be the last transaction in this undo log, so we need to get the + * last insert point in this undo log and discard till that point. + * Also, if the transaction has pending abort, we stop discarding + * undo from the same location. + */ + if (TransactionIdPrecedes(undoxid, xmin) && !pending_abort) + { + UndoRecPtr next_insert = InvalidUndoRecPtr; + + /* + * Get the last insert location for this transaction Id, if it + * returns invalid pointer that means there is new transaction + * has started for this undolog. So we need to refetch the undo + * and continue the process. + */ + next_insert = UndoLogGetNextInsertPtr(log->logno, undoxid); + if (!UndoRecPtrIsValid(next_insert)) + continue; + + undo_recptr = next_insert; + need_discard = true; + epoch = 0; + latest_discardxid = undoxid; + undoxid = InvalidTransactionId; + } + + LWLockAcquire(&log->discard_lock, LW_EXCLUSIVE); + + /* + * If no more pending undo logs then set the oldest transaction to + * InvalidTransactionId. 
+ */ + if (log_complete) + { + log->oldest_xid = InvalidTransactionId; + log->oldest_xidepoch = 0; + } + else + { + log->oldest_xid = undoxid; + log->oldest_xidepoch = epoch; + } + + log->oldest_data = undo_recptr; + LWLockRelease(&log->discard_lock); + + if (need_discard) + UndoLogDiscard(undo_recptr, latest_discardxid); + + break; + } + + /* + * This transaction is smaller than the xmin so lets jump to the next + * transaction. + */ + undo_recptr = next_urecptr; + latest_discardxid = undoxid; + + if(uur != NULL) + { + UndoRecordRelease(uur); + uur = NULL; + } + + need_discard = true; + } while (true); + + return undoxid; +} + +/* + * Discard the undo for all the transaction whose xid is smaller than xmin + * + * Check the DiscardInfo memory array for each slot (every undo log) , process + * the undo log for all the slot which have xid smaller than xmin or invalid + * xid. Fetch the record from the undo log transaction by transaction until we + * find the xid which is not smaller than xmin. + */ +void +UndoDiscard(TransactionId oldestXmin, bool *hibernate) +{ + TransactionId oldestXidHavingUndo = oldestXmin; + uint64 epoch = GetEpochForXid(oldestXmin); + UndoLogControl *log = NULL; + + /* + * TODO: Ideally we'd arrange undo logs so that we can efficiently find + * those with oldest_xid < oldestXmin, but for now we'll just scan all of + * them. + */ + while ((log = UndoLogNext(log))) + { + TransactionId oldest_xid = InvalidTransactionId; + + /* We can't process temporary undo logs. */ + if (log->meta.persistence == UNDO_TEMP) + continue; + + /* + * If the first xid of the undo log is smaller than the xmin the try + * to discard the undo log. + */ + if (TransactionIdPrecedes(log->oldest_xid, oldestXmin)) + { + /* + * If the XID in the discard entry is invalid then start scanning + * from the first valid undorecord in the log. + */ + if (!TransactionIdIsValid(log->oldest_xid)) + { + UndoRecPtr urp = UndoLogGetFirstValidRecord(log->logno); + + if (!UndoRecPtrIsValid(urp)) + continue; + + LWLockAcquire(&log->discard_lock, LW_SHARED); + log->oldest_data = urp; + LWLockRelease(&log->discard_lock); + } + + /* Process the undo log. */ + oldest_xid = UndoDiscardOneLog(log, oldestXmin, hibernate); + } + + if (TransactionIdIsValid(oldest_xid) && + TransactionIdPrecedes(oldest_xid, oldestXidHavingUndo)) + { + oldestXidHavingUndo = oldest_xid; + epoch = GetEpochForXid(oldest_xid); + } + } + + /* + * Update the oldestXidWithEpochHavingUndo in the shared memory. + * + * XXX In future if multiple worker can perform discard then we may need + * to use compare and swap for updating the shared memory value. + */ + pg_atomic_write_u64(&ProcGlobal->oldestXidWithEpochHavingUndo, + MakeEpochXid(epoch, oldestXidHavingUndo)); +} + +/* + * To discard all the logs. Particularly required in single user mode. + * At the commit time, discard all the undo logs. + */ +void +UndoLogDiscardAll() +{ + UndoLogControl *log = NULL; + + Assert(!IsUnderPostmaster); + + while ((log = UndoLogNext(log))) + { + /* + * Process the undo log. No locks are required for discard, + * since this called only in single-user mode. Similarly, + * no transaction id is required here because WAL-logging the + * xid till whom the undo is discarded will not be required + * for single user mode. + */ + UndoLogDiscard(MakeUndoRecPtr(log->logno, log->meta.insert), + InvalidTransactionId); + } + +} +/* + * Fetch the latest urec pointer for the transaction. 
+ */ +UndoRecPtr +FetchLatestUndoPtrForXid(UndoRecPtr urecptr, UnpackedUndoRecord *uur_start, + UndoLogControl *log) +{ + UndoRecPtr next_urecptr, from_urecptr; + uint16 prevlen; + UndoLogOffset next_insert; + UnpackedUndoRecord *uur; + bool refetch = false; + + uur = uur_start; + + while (true) + { + /* Fetch the undo record again if required. */ + if (refetch) + { + uur = UndoFetchRecord(urecptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, + NULL, NULL); + refetch = false; + } + + next_urecptr = uur->uur_next; + prevlen = UndoLogGetPrevLen(log->logno); + + /* + * If this is the last transaction in the log then calculate the latest + * urec pointer using the next insert location of the undo log. Otherwise, + * calculate it using the next transaction's start pointer. + */ + if (uur->uur_next == InvalidUndoRecPtr) + { + /* + * If a new transaction has already started in this log while we + * were fetching the next insert location, re-fetch the undo + * record. + */ + next_insert = UndoLogGetNextInsertPtr(log->logno, uur->uur_xid); + if (!UndoRecPtrIsValid(next_insert)) + { + if (uur != uur_start) + UndoRecordRelease(uur); + refetch = true; + continue; + } + + from_urecptr = UndoGetPrevUndoRecptr(next_insert, prevlen); + break; + } + else if ((UndoRecPtrGetLogNo(next_urecptr) != log->logno) && + UndoLogIsDiscarded(next_urecptr)) + { + /* + * If next_urecptr is in a different undo log and is already + * discarded, the undo actions of this transaction that live in the + * next log have already been executed, and we only need to execute + * the ones remaining in this log. + */ + next_insert = UndoLogGetNextInsertPtr(log->logno, uur->uur_xid); + + Assert(UndoRecPtrIsValid(next_insert)); + from_urecptr = UndoGetPrevUndoRecptr(next_insert, prevlen); + break; + } + else + { + UnpackedUndoRecord *next_uur; + + next_uur = UndoFetchRecord(next_urecptr, + InvalidBlockNumber, + InvalidOffsetNumber, + InvalidTransactionId, + NULL, NULL); + /* + * If the next_urecptr is in the same log then calculate the + * from pointer using prevlen. + */ + if (UndoRecPtrGetLogNo(next_urecptr) == log->logno) + { + from_urecptr = + UndoGetPrevUndoRecptr(next_urecptr, next_uur->uur_prevlen); + UndoRecordRelease(next_uur); + break; + } + else + { + /* + * The transaction overflowed into the next log, so restart + * the processing from the next log. + */ + log = UndoLogGet(UndoRecPtrGetLogNo(next_urecptr)); + if (uur != uur_start) + UndoRecordRelease(uur); + uur = next_uur; + continue; + } + } + } + + if (uur != uur_start) + UndoRecordRelease(uur); + + return from_urecptr; +} + +/* + * Discard the undo logs for temp tables. + */ +void +TempUndoDiscard(UndoLogNumber logno) +{ + UndoLogControl *log = UndoLogGet(logno); + + /* + * Discard the undo log for temp tables only. Ensure that there is + * something to be discarded there. + */ + Assert(log->meta.persistence == UNDO_TEMP); + + /* Process the undo log.
*/ + UndoLogDiscard(MakeUndoRecPtr(log->logno, log->meta.insert), + InvalidTransactionId); +} diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c new file mode 100644 index 0000000000..ccb8f6b351 --- /dev/null +++ b/src/backend/access/undo/undoinsert.c @@ -0,0 +1,1245 @@ +/*------------------------------------------------------------------------- + * + * undoinsert.c + * entry points for inserting undo records + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/backend/access/undo/undoinsert.c + * + * NOTES: + * Undo record layout: + * + * Undo records are stored in sequential order in the undo log. Each + * transaction's first undo record (a.k.a. the transaction header) points to + * the next transaction's start header. Transaction headers are linked so that + * the discard worker can read the undo log transaction by transaction and + * avoid reading each undo record. + * + * Handling multiple logs: + * + * It is possible that the undo records of a transaction are spread across + * multiple undo logs, and we need some special handling while inserting the + * undo so that discard and rollback work sanely. + * + * If an undo record goes to the next log, then we insert a transaction header + * for the first record in the new log and update the transaction header with + * this new log's location. This allows us to connect transactions across logs + * when the same transaction spans multiple logs (for this we keep track of the + * previous logno in the undo log meta), which is required to find the latest + * undo record pointer of an aborted transaction for executing the undo actions + * before discard. If the next log gets processed first, we don't need to trace + * back the actual start pointer of the transaction; in that case we can + * execute the undo actions from the current log only, because the undo pointer + * in the slot will be rewound and that is enough to avoid executing the same + * actions again. However, it is possible that after executing the undo actions + * the undo pointer gets discarded; later, while processing the previous log, + * we might try to fetch an undo record from the discarded log while chasing + * the transaction header chain. To avoid this situation we first check whether + * the next_urec of the transaction is already discarded; if so, there is no + * need to access it and we start executing from the last undo record in the + * current log. + * + * We only connect to the next log if the same transaction spreads to the next + * log; otherwise we don't. + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/subtrans.h" +#include "access/xact.h" +#include "access/xlog.h" +#include "access/undorecord.h" +#include "access/undoinsert.h" +#include "catalog/pg_tablespace.h" +#include "storage/block.h" +#include "storage/buf.h" +#include "storage/buf_internals.h" +#include "storage/bufmgr.h" +#include "miscadmin.h" +#include "commands/tablecmds.h" + +/* + * XXX Do we want to support an undo tuple size that is more than BLCKSZ? + * If not, then an undo record can spread across 2 buffers at the most. + */ +#define MAX_BUFFER_PER_UNDO 2 + +/* + * This defines the number of undo records that can be prepared before + * calling insert by default. If you need to prepare more than + * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize + * first.
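+ * For example (an illustrative sketch), an operation such as multi-insert that needs one undo record per tuple and therefore more than + * MAX_PREPARED_UNDO records must call UndoSetPrepareSize() with the total count before making its PrepareUndoInsert() calls.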
+ */ +#define MAX_PREPARED_UNDO 2 + +/* + * Consider buffers needed for updating previous transaction's + * starting undo record. Hence increased by 1. + */ +#define MAX_UNDO_BUFFERS (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO + +/* + * Previous top transaction id which inserted the undo. Whenever a new main + * transaction try to prepare an undo record we will check if its txid not the + * same as prev_txid then we will insert the start undo record. + */ +static TransactionId prev_txid[UndoPersistenceLevels] = { 0 }; + +/* Undo block number to buffer mapping. */ +typedef struct UndoBuffers +{ + UndoLogNumber logno; /* Undo log number */ + BlockNumber blk; /* block number */ + Buffer buf; /* buffer allocated for the block */ +} UndoBuffers; + +static UndoBuffers def_buffers[MAX_UNDO_BUFFERS]; +static int buffer_idx; + +/* + * Structure to hold the prepared undo information. + */ +typedef struct PreparedUndoSpace +{ + UndoRecPtr urp; /* undo record pointer */ + UnpackedUndoRecord *urec; /* undo record */ + int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array + * index */ +} PreparedUndoSpace; + +static PreparedUndoSpace def_prepared[MAX_PREPARED_UNDO]; +static int prepare_idx; +static int max_prepared_undo = MAX_PREPARED_UNDO; +static UndoRecPtr prepared_urec_ptr = InvalidUndoRecPtr; +static bool update_prev_header = false; + +/* + * By default prepared_undo and undo_buffer points to the static memory. + * In case caller wants to support more than default max_prepared undo records + * then the limit can be increased by calling UndoSetPrepareSize function. + * Therein, dynamic memory will be allocated and prepared_undo and undo_buffer + * will start pointing to newly allocated memory, which will be released by + * UnlockReleaseUndoBuffers and these variables will again set back to their + * default values. + */ +static PreparedUndoSpace *prepared_undo = def_prepared; +static UndoBuffers *undo_buffer = def_buffers; + +/* + * Structure to hold the previous transaction's undo update information. This + * is populated while current transaction is updating its undo record pointer + * in previous transactions first undo record. + */ +typedef struct XactUndoRecordInfo +{ + UndoRecPtr urecptr; /* txn's start urecptr */ + int idx_undo_buffers[MAX_BUFFER_PER_UNDO]; + UnpackedUndoRecord uur; /* undo record header */ +} XactUndoRecordInfo; + +static XactUndoRecordInfo xact_urec_info; + +/* Prototypes for static functions. */ +static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec, + UndoRecPtr urp, RelFileNode rnode, + UndoPersistence persistence); +static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr, + bool log_switched); +static int UndoGetBufferSlot(RelFileNode rnode, BlockNumber blk, + ReadBufferMode rbm, + UndoPersistence persistence); +static bool UndoRecordIsValid(UndoLogControl * log, + UndoRecPtr urp); + +/* + * Check whether the undo record is discarded or not. If it's already discarded + * return false otherwise return true. + * + * Caller must hold lock on log->discard_lock. This function will release the + * lock if return false otherwise lock will be held on return and the caller + * need to release it. 
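+ * + * Typical caller pattern (an illustrative sketch; it mirrors the callers in this file): + * + * LWLockAcquire(&log->discard_lock, LW_SHARED); + * if (!UndoRecordIsValid(log, urp)) + * return; + * ... read the undo record ... + * LWLockRelease(&log->discard_lock); + * + * where the early return relies on UndoRecordIsValid having already released the lock.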
+ */ +static bool +UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp) +{ + Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED)); + + if (log->oldest_data == InvalidUndoRecPtr) + { + /* + * oldest_data is only initialized when the DiscardWorker first time + * attempts to discard undo logs so we can not rely on this value to + * identify whether the undo record pointer is already discarded or + * not so we can check it by calling undo log routine. If its not yet + * discarded then we have to reacquire the log->discard_lock so that + * the doesn't get discarded concurrently. + */ + LWLockRelease(&log->discard_lock); + if (UndoLogIsDiscarded(urp)) + return false; + LWLockAcquire(&log->discard_lock, LW_SHARED); + } + + /* Check again if it's already discarded. */ + if (urp < log->oldest_data) + { + LWLockRelease(&log->discard_lock); + return false; + } + + return true; +} + +/* + * Prepare to update the previous transaction's next undo pointer to maintain + * the transaction chain in the undo. This will read the header of the first + * undo record of the previous transaction and lock the necessary buffers. + * The actual update will be done by UndoRecordUpdateTransInfo under the + * critical section. + */ +static void +UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched) +{ + UndoRecPtr xact_urp; + Buffer buffer = InvalidBuffer; + BlockNumber cur_blk; + RelFileNode rnode; + UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr); + UndoLogControl *log; + Page page; + int already_decoded = 0; + int starting_byte; + int bufidx; + int index = 0; + + log = UndoLogGet(logno); + + if (log_switched) + { + Assert(log->meta.prevlogno != InvalidUndoLogNumber); + log = UndoLogGet(log->meta.prevlogno); + } + + /* + * Temporary undo logs are discarded on transaction commit so we don't + * need to do anything. + */ + if (log->meta.persistence == UNDO_TEMP) + return; + + /* + * We can read the previous transaction's location without locking, + * because only the backend attached to the log can write to it (or we're + * in recovery). + */ + Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched); + + if (log->meta.last_xact_start == 0) + xact_urp = InvalidUndoRecPtr; + else + xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start); + + /* + * The absence of previous transaction's undo indicate that this backend + * is preparing its first undo in which case we have nothing to update. + */ + if (!UndoRecPtrIsValid(xact_urp)) + return; + + /* + * Acquire the discard lock before accessing the undo record so that + * discard worker doesn't remove the record while we are in process of + * reading it. + */ + LWLockAcquire(&log->discard_lock, LW_SHARED); + + /* + * The absence of previous transaction's undo indicate that this backend + * is preparing its first undo in which case we have nothing to update. + * UndoRecordIsValid will release the lock if it returns false. + */ + if (!UndoRecordIsValid(log, xact_urp)) + return; + + UndoRecPtrAssignRelFileNode(rnode, xact_urp); + cur_blk = UndoRecPtrGetBlockNum(xact_urp); + starting_byte = UndoRecPtrGetPageOffset(xact_urp); + + /* + * Read undo record header in by calling UnpackUndoRecord, if the undo + * record header is split across buffers then we need to read the complete + * header by invoking UnpackUndoRecord multiple times. 
+ */ + while (true) + { + bufidx = UndoGetBufferSlot(rnode, cur_blk, + RBM_NORMAL, + log->meta.persistence); + xact_urec_info.idx_undo_buffers[index++] = bufidx; + buffer = undo_buffer[bufidx].buf; + page = BufferGetPage(buffer); + + if (UnpackUndoRecord(&xact_urec_info.uur, page, starting_byte, + &already_decoded, true)) + break; + + /* Could not fetch the complete header so go to the next block. */ + starting_byte = UndoLogBlockHeaderSize; + cur_blk++; + } + + xact_urec_info.uur.uur_next = urecptr; + xact_urec_info.urecptr = xact_urp; + LWLockRelease(&log->discard_lock); +} + +/* + * Update the progress of the undo record in the transaction header. + */ +void +PrepareUpdateUndoActionProgress(UndoRecPtr urecptr, int progress) +{ + Buffer buffer = InvalidBuffer; + BlockNumber cur_blk; + RelFileNode rnode; + UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr); + UndoLogControl *log; + Page page; + int already_decoded = 0; + int starting_byte; + int bufidx; + int index = 0; + + log = UndoLogGet(logno); + + if (log->meta.persistence == UNDO_TEMP) + return; + + UndoRecPtrAssignRelFileNode(rnode, urecptr); + cur_blk = UndoRecPtrGetBlockNum(urecptr); + starting_byte = UndoRecPtrGetPageOffset(urecptr); + + while (true) + { + bufidx = UndoGetBufferSlot(rnode, cur_blk, + RBM_NORMAL, + log->meta.persistence); + xact_urec_info.idx_undo_buffers[index++] = bufidx; + buffer = undo_buffer[bufidx].buf; + page = BufferGetPage(buffer); + + if (UnpackUndoRecord(&xact_urec_info.uur, page, starting_byte, + &already_decoded, true)) + break; + + starting_byte = UndoLogBlockHeaderSize; + cur_blk++; + } + + xact_urec_info.urecptr = urecptr; + xact_urec_info.uur.uur_progress = progress; +} + +/* + * Overwrite the first undo record of the previous transaction to update its + * next pointer. This will just insert the already prepared record by + * UndoRecordPrepareTransInfo. This must be called under the critical section. + * This will just overwrite the undo header not the data. + */ +void +UndoRecordUpdateTransInfo(void) +{ + UndoLogNumber logno = UndoRecPtrGetLogNo(xact_urec_info.urecptr); + Page page; + int starting_byte; + int already_written = 0; + int idx = 0; + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + UndoLogControl *log; + + log = UndoLogGet(logno); + urec_ptr = xact_urec_info.urecptr; + + /* + * Acquire the discard lock before accessing the undo record so that + * discard worker can't remove the record while we are in process of + * reading it. + */ + LWLockAcquire(&log->discard_lock, LW_SHARED); + + if (!UndoRecordIsValid(log, urec_ptr)) + return; + + /* + * Update the next transactions start urecptr in the transaction header. + */ + starting_byte = UndoRecPtrGetPageOffset(urec_ptr); + + do + { + Buffer buffer; + int buf_idx; + + buf_idx = xact_urec_info.idx_undo_buffers[idx]; + buffer = undo_buffer[buf_idx].buf; + page = BufferGetPage(buffer); + + /* Overwrite the previously written undo. */ + if (InsertUndoRecord(&xact_urec_info.uur, page, starting_byte, &already_written, true)) + { + MarkBufferDirty(buffer); + break; + } + + MarkBufferDirty(buffer); + starting_byte = UndoLogBlockHeaderSize; + idx++; + + Assert(idx < MAX_BUFFER_PER_UNDO); + } while (true); + + LWLockRelease(&log->discard_lock); +} + +/* + * Find the block number in undo buffer array, if it's present then just return + * its index otherwise search the buffer and insert an entry and lock the buffer + * in exclusive mode. + * + * Undo log insertions are append-only. 
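+ * (For example, two undo records prepared in the same operation that land on the same block share a single pinned and locked buffer: the loop + * below matches on (logno, blkno) and returns the existing slot's index instead of reading the block again.)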
If the caller is writing new data + * that begins exactly at the beginning of a page, then there cannot be any + * useful data after that point. In that case RBM_ZERO can be passed in as + * rbm so that we can skip a useless read of a disk block. In all other + * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't + * happen to be already in the buffer pool. + */ +static int +UndoGetBufferSlot(RelFileNode rnode, + BlockNumber blk, + ReadBufferMode rbm, + UndoPersistence persistence) +{ + int i; + Buffer buffer; + + /* Don't do anything, if we already have a buffer pinned for the block. */ + for (i = 0; i < buffer_idx; i++) + { + /* + * It's not enough to just compare the block number because the + * undo_buffer might holds the undo from different undo logs (e.g when + * previous transaction start header is in previous undo log) so + * compare (logno + blkno). + */ + if ((blk == undo_buffer[i].blk) && + (undo_buffer[i].logno == rnode.relNode)) + { + /* caller must hold exclusive lock on buffer */ + Assert(BufferIsLocal(undo_buffer[i].buf) || + LWLockHeldByMeInMode(BufferDescriptorGetContentLock( + GetBufferDescriptor(undo_buffer[i].buf - 1)), + LW_EXCLUSIVE)); + break; + } + } + + /* + * We did not find the block so allocate the buffer and insert into the + * undo buffer array + */ + if (i == buffer_idx) + { + /* + * Fetch the buffer in which we want to insert the undo record. + */ + buffer = ReadBufferWithoutRelcache(rnode, + UndoLogForkNum, + blk, + rbm, + NULL, + RelPersistenceForUndoPersistence(persistence)); + + /* Lock the buffer */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + undo_buffer[buffer_idx].buf = buffer; + undo_buffer[buffer_idx].blk = blk; + undo_buffer[buffer_idx].logno = rnode.relNode; + buffer_idx++; + } + + return i; +} + +/* + * Calculate total size required by nrecords and allocate them in bulk. This is + * required for some operation which can allocate multiple undo record in one + * WAL operation e.g multi-insert. If we don't allocate undo space for all the + * record (which are inserted under one WAL) together than there is possibility + * that both of them go under different undo log. And, currently during + * recovery we don't have mechanism to map xid to multiple log number during one + * WAL operation. So in short all the operation under one WAL must allocate + * their undo from the same undo log. + */ +static UndoRecPtr +UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords, + TransactionId txid, UndoPersistence upersistence, + xl_undolog_meta *undometa) +{ + UnpackedUndoRecord *urec = NULL; + UndoLogControl *log; + UndoRecordSize size; + UndoRecPtr urecptr; + bool need_xact_hdr = false; + bool log_switched = false; + int i; + + /* There must be at least one undo record. */ + if (nrecords <= 0) + elog(ERROR, "cannot allocate space for zero undo records"); + + /* Is this the first undo record of the transaction? */ + if ((InRecovery && IsTransactionFirstRec(txid)) || + (!InRecovery && prev_txid[upersistence] != txid)) + need_xact_hdr = true; + +resize: + size = 0; + + for (i = 0; i < nrecords; i++) + { + urec = undorecords + i; + + /* + * Prepare the transacion header for the first undo record of + * transaction. XXX there is also an option that instead of adding the + * information to this record we can prepare a new record which only + * contain transaction informations. 
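+ * For example (an illustrative sketch): a multi-insert producing ten undo records must allocate space for all ten here in one call; if six of + * them went to log A and four to log B, replaying the single WAL record could not recreate that split, because during recovery the xid can be + * mapped to only one undo log per WAL operation.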
+ */ + if (need_xact_hdr && i == 0) + { + urec->uur_next = InvalidUndoRecPtr; + urec->uur_xidepoch = GetEpochForXid(txid); + urec->uur_progress = 0; + + /* During recovery, get the database id from the undo log state. */ + if (InRecovery) + urec->uur_dbid = UndoLogStateGetDatabaseId(); + else + urec->uur_dbid = MyDatabaseId; + + /* Set uur_info to include the transaction header. */ + urec->uur_info |= UREC_INFO_TRANSACTION; + } + else + { + /* + * It is okay to initialize these variables with invalid values as + * these are used only with the first record of transaction. + */ + urec->uur_next = InvalidUndoRecPtr; + urec->uur_xidepoch = 0; + urec->uur_dbid = 0; + urec->uur_progress = 0; + } + + /* Calculate the size of the undo record based on the info required. */ + UndoRecordSetInfo(urec); + size += UndoRecordExpectedSize(urec); + } + + if (InRecovery) + urecptr = UndoLogAllocateInRecovery(txid, size, upersistence); + else + urecptr = UndoLogAllocate(size, upersistence); + + log = UndoLogGet(UndoRecPtrGetLogNo(urecptr)); + + /* + * By now, we must be attached to some undo log unless we are in recovery. + */ + Assert(AmAttachedToUndoLog(log) || InRecovery); + + /* + * We can consider the log as switched if this is the first record of the + * log and not the first record of the transaction i.e. same transaction + * continued from the previous log. + */ + if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) && + log->meta.prevlogno != InvalidUndoLogNumber) + log_switched = true; + + /* + * If we've rewound all the way back to the start of the transaction by + * rolling back the first subtransaction (which we can't detect until + * after we've allocated some space), we'll need a new transaction header. + * If we weren't already generating one, then do it now. + */ + if (!need_xact_hdr && + (log->meta.insert == log->meta.last_xact_start || + UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize)) + { + need_xact_hdr = true; + urec->uur_info = 0; /* force recomputation of info bits */ + goto resize; + } + + /* Update the previous transaction's start undo record, if required. */ + if (need_xact_hdr || log_switched) + { + /* Don't update our own start header. */ + if (log->meta.last_xact_start != log->meta.insert) + UndoRecordPrepareTransInfo(urecptr, log_switched); + + /* Remember the current transaction's xid. */ + prev_txid[upersistence] = txid; + + /* Store the current transaction's start undorecptr in the undo log. */ + UndoLogSetLastXactStartPoint(urecptr); + update_prev_header = false; + } + + /* Copy undometa before advancing the insert location. */ + if (undometa) + { + undometa->meta = log->meta; + undometa->logno = log->logno; + undometa->xid = log->xid; + } + + /* + * If the insertion is for temp table then register an on commit + * action for discarding the undo logs. + */ + if (upersistence == UNDO_TEMP) + { + /* + * We only need to register when we are inserting in temp undo logs + * for the first time after the discard. + */ + if (log->meta.insert == log->meta.discard) + { + /* + * XXX Here, we are overriding the first parameter of function + * which is a unsigned int with an integer argument, that should + * work fine because logno will always be positive. + */ + register_on_commit_action(log->logno, ONCOMMIT_TEMP_DISCARD); + } + } + + UndoLogAdvance(urecptr, size, upersistence); + + return urecptr; +} + +/* + * Call UndoSetPrepareSize to set the value of how many undo records can be + * prepared before we can insert them. 
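+ * + * Typical multi-record usage (an illustrative sketch; the local variable names here are hypothetical): + * + * UndoSetPrepareSize(undorecords, nrecords, xid, persistence, &undometa); + * for (i = 0; i < nrecords; i++) + * urecptr[i] = PrepareUndoInsert(&undorecords[i], xid, persistence, NULL); + * START_CRIT_SECTION(); + * InsertPreparedUndo(); + * ... XLogInsert() the WAL record describing the operation ... + * END_CRIT_SECTION(); + * UnlockReleaseUndoBuffers(); + *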
If the size is greater than + * MAX_PREPARED_UNDO then it will allocate extra memory to hold the extra + * prepared undo. + * + * This is normally used when more than one undo record needs to be prepared. + */ +void +UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords, + TransactionId xid, UndoPersistence upersistence, + xl_undolog_meta *undometa) +{ + TransactionId txid; + + /* Get the top transaction id. */ + if (xid == InvalidTransactionId) + { + Assert(!InRecovery); + txid = GetTopTransactionId(); + } + else + { + Assert(InRecovery || (xid == GetTopTransactionId())); + txid = xid; + } + + prepared_urec_ptr = UndoRecordAllocate(undorecords, nrecords, txid, + upersistence, undometa); + if (nrecords <= MAX_PREPARED_UNDO) + return; + + prepared_undo = palloc0(nrecords * sizeof(PreparedUndoSpace)); + + /* + * Consider buffers needed for updating previous transaction's starting + * undo record. Hence increased by 1. + */ + undo_buffer = palloc0((nrecords + 1) * MAX_BUFFER_PER_UNDO * + sizeof(UndoBuffers)); + max_prepared_undo = nrecords; +} + +/* + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you + * intended to insert. Upon return, the necessary undo buffers are pinned and + * locked. + * + * This should be done before any critical section is established, since it + * can fail. + * + * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise, + * it refers to the top transaction id because undo log only stores mapping + * for the top most transactions. + */ +UndoRecPtr +PrepareUndoInsert(UnpackedUndoRecord *urec, TransactionId xid, + UndoPersistence upersistence, xl_undolog_meta *undometa) +{ + UndoRecordSize size; + UndoRecPtr urecptr; + RelFileNode rnode; + UndoRecordSize cur_size = 0; + BlockNumber cur_blk; + TransactionId txid; + int starting_byte; + int index = 0; + int bufidx; + ReadBufferMode rbm; + + /* Already reached maximum prepared limit. */ + if (prepare_idx == max_prepared_undo) + elog(ERROR, "already reached the maximum prepared limit"); + + + if (xid == InvalidTransactionId) + { + /* During recovery, we must have a valid transaction id. */ + Assert(!InRecovery); + txid = GetTopTransactionId(); + } + else + { + /* + * Assign the top transaction id because undo log only stores mapping + * for the top most transactions. + */ + Assert(InRecovery || (xid == GetTopTransactionId())); + txid = xid; + } + + if (!UndoRecPtrIsValid(prepared_urec_ptr)) + urecptr = UndoRecordAllocate(urec, 1, txid, upersistence, undometa); + else + urecptr = prepared_urec_ptr; + + /* advance the prepared ptr location for next record. */ + size = UndoRecordExpectedSize(urec); + if (UndoRecPtrIsValid(prepared_urec_ptr)) + { + UndoLogOffset insert = UndoRecPtrGetOffset(prepared_urec_ptr); + + insert = UndoLogOffsetPlusUsableBytes(insert, size); + prepared_urec_ptr = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert); + } + + cur_blk = UndoRecPtrGetBlockNum(urecptr); + UndoRecPtrAssignRelFileNode(rnode, urecptr); + starting_byte = UndoRecPtrGetPageOffset(urecptr); + + /* + * If we happen to be writing the very first byte into this page, then + * there is no need to read from disk. + */ + if (starting_byte == UndoLogBlockHeaderSize) + rbm = RBM_ZERO; + else + rbm = RBM_NORMAL; + + do + { + bufidx = UndoGetBufferSlot(rnode, cur_blk, rbm, upersistence); + if (cur_size == 0) + cur_size = BLCKSZ - starting_byte; + else + cur_size += BLCKSZ - UndoLogBlockHeaderSize; + + /* undo record can't use buffers more than MAX_BUFFER_PER_UNDO. 
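+ * (For example, a record that begins near the end of one page simply spills onto the next page; assuming an undo record never exceeds BLCKSZ, + * as noted at the definition of MAX_BUFFER_PER_UNDO, it can never need a third buffer.)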
*/ + Assert(index < MAX_BUFFER_PER_UNDO); + + /* Keep track of the buffers we have pinned and locked. */ + prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx; + + /* + * If we need more pages, they'll all be new, so we can definitely skip + * reading from disk. + */ + rbm = RBM_ZERO; + cur_blk++; + } while (cur_size < size); + + /* + * Save the undo record information to be used later by InsertPreparedUndo + * to insert the prepared record. + */ + prepared_undo[prepare_idx].urec = urec; + prepared_undo[prepare_idx].urp = urecptr; + prepare_idx++; + + return urecptr; +} + +/* + * Insert a previously-prepared undo record. This will write the actual undo + * record into the buffers already pinned and locked in PrepareUndoInsert, + * and mark them dirty. This step should be performed after entering a + * critical section; it should never fail. + */ +void +InsertPreparedUndo(void) +{ + Page page; + int starting_byte; + int already_written; + int bufidx = 0; + int idx; + uint16 undo_len = 0; + UndoRecPtr urp; + UnpackedUndoRecord *uur; + UndoLogOffset offset; + UndoLogControl *log; + + /* There must be at least one prepared undo record. */ + Assert(prepare_idx > 0); + + /* + * This must be called under a critical section or we must be in recovery. + */ + Assert(InRecovery || CritSectionCount > 0); + + for (idx = 0; idx < prepare_idx; idx++) + { + uur = prepared_undo[idx].urec; + urp = prepared_undo[idx].urp; + + already_written = 0; + bufidx = 0; + starting_byte = UndoRecPtrGetPageOffset(urp); + offset = UndoRecPtrGetOffset(urp); + + log = UndoLogGet(UndoRecPtrGetLogNo(urp)); + Assert(AmAttachedToUndoLog(log) || InRecovery); + + /* + * Store the previous undo record length in the header. We can read + * meta.prevlen without locking, because only we can write to it. + */ + uur->uur_prevlen = log->meta.prevlen; + + /* + * If starting a new log then there is no prevlen to store, except when + * the same transaction is continuing from the previous undo log; read + * the detailed comment atop this file. + */ + if (offset == UndoLogBlockHeaderSize) + { + if (log->meta.prevlogno != InvalidUndoLogNumber) + { + UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno); + uur->uur_prevlen = prevlog->meta.prevlen; + } + else + uur->uur_prevlen = 0; + } + + /* + * If starting from a new page then consider the block header size in + * the prevlen calculation. + */ + else if (starting_byte == UndoLogBlockHeaderSize) + uur->uur_prevlen += UndoLogBlockHeaderSize; + + undo_len = 0; + + do + { + PreparedUndoSpace undospace = prepared_undo[idx]; + Buffer buffer; + + buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf; + page = BufferGetPage(buffer); + + /* + * Initialize the page whenever we try to write the first record + * in a page. We start writing immediately after the block header. + */ + if (starting_byte == UndoLogBlockHeaderSize) + PageInit(page, BLCKSZ, 0); + + /* + * Try to insert the record into the current page. If it doesn't + * succeed, then call the routine again with the next page. + */ + if (InsertUndoRecord(uur, page, starting_byte, &already_written, false)) + { + undo_len += already_written; + MarkBufferDirty(buffer); + break; + } + + MarkBufferDirty(buffer); + + /* + * If we are switching to the next block, then count the header + * in the total undo length. + */ + starting_byte = UndoLogBlockHeaderSize; + undo_len += UndoLogBlockHeaderSize; + bufidx++; + + /* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers.
*/ + Assert(bufidx < MAX_BUFFER_PER_UNDO); + } while (true); + + UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), undo_len); + + /* + * Link the transactions in the same log so that we can discard all + * the transaction's undo log in one-shot. + */ + if (UndoRecPtrIsValid(xact_urec_info.urecptr)) + UndoRecordUpdateTransInfo(); + + /* + * Set the current undo location for a transaction. This is required + * to perform rollback during abort of transaction. + */ + SetCurrentUndoLocation(urp); + } +} + +/* + * Reset the global variables related to undo buffers. This is required at the + * transaction abort and while releasing the undo buffers. + */ +void +ResetUndoBuffers(void) +{ + int i; + + for (i = 0; i < buffer_idx; i++) + { + undo_buffer[i].blk = InvalidBlockNumber; + undo_buffer[i].buf = InvalidBuffer; + } + + xact_urec_info.urecptr = InvalidUndoRecPtr; + + /* Reset the prepared index. */ + prepare_idx = 0; + buffer_idx = 0; + prepared_urec_ptr = InvalidUndoRecPtr; + + /* + * max_prepared_undo limit is changed so free the allocated memory and + * reset all the variable back to their default value. + */ + if (max_prepared_undo > MAX_PREPARED_UNDO) + { + pfree(undo_buffer); + pfree(prepared_undo); + undo_buffer = def_buffers; + prepared_undo = def_prepared; + max_prepared_undo = MAX_PREPARED_UNDO; + } +} + +/* + * Unlock and release the undo buffers. This step must be performed after + * exiting any critical section where we have perfomed undo actions. + */ +void +UnlockReleaseUndoBuffers(void) +{ + int i; + + for (i = 0; i < buffer_idx; i++) + UnlockReleaseBuffer(undo_buffer[i].buf); + + ResetUndoBuffers(); +} + +/* + * Helper function for UndoFetchRecord. It will fetch the undo record pointed + * by urp and unpack the record into urec. This function will not release the + * pin on the buffer if complete record is fetched from one buffer, so caller + * can reuse the same urec to fetch the another undo record which is on the + * same block. Caller will be responsible to release the buffer inside urec + * and set it to invalid if it wishes to fetch the record from another block. + */ +static UnpackedUndoRecord * +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode, + UndoPersistence persistence) +{ + Buffer buffer = urec->uur_buffer; + Page page; + int starting_byte = UndoRecPtrGetPageOffset(urp); + int already_decoded = 0; + BlockNumber cur_blk; + bool is_undo_rec_split = false; + + cur_blk = UndoRecPtrGetBlockNum(urp); + + /* If we already have a buffer pin then no need to allocate a new one. */ + if (!BufferIsValid(buffer)) + { + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk, + RBM_NORMAL, NULL, + RelPersistenceForUndoPersistence(persistence)); + + urec->uur_buffer = buffer; + } + + while (true) + { + LockBuffer(buffer, BUFFER_LOCK_SHARE); + page = BufferGetPage(buffer); + + /* + * XXX This can be optimized to just fetch header first and only if + * matches with block number and offset then fetch the complete + * record. + */ + if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false)) + break; + + starting_byte = UndoLogBlockHeaderSize; + is_undo_rec_split = true; + + /* + * The record spans more than a page so we would have copied it (see + * UnpackUndoRecord). In such cases, we can release the buffer. + */ + urec->uur_buffer = InvalidBuffer; + UnlockReleaseBuffer(buffer); + + /* Go to next block. 
*/ + cur_blk++; + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk, + RBM_NORMAL, NULL, + RelPersistenceForUndoPersistence(persistence)); + } + + /* + * If we have copied the data then release the buffer, otherwise, just + * unlock it. + */ + if (is_undo_rec_split) + UnlockReleaseBuffer(buffer); + else + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + return urec; +} + +/* + * ResetUndoRecord - Helper function for UndoFetchRecord to reset the current + * record. + */ +static void +ResetUndoRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode *rnode, + RelFileNode *prevrec_rnode) +{ + /* + * If we have a valid buffer pinned then just ensure that we want to find + * the next tuple from the same block. Otherwise release the buffer and + * set it invalid + */ + if (BufferIsValid(urec->uur_buffer)) + { + /* + * Undo buffer will be changed if the next undo record belongs to a + * different block or undo log. + */ + if ((UndoRecPtrGetBlockNum(urp) != + BufferGetBlockNumber(urec->uur_buffer)) || + (prevrec_rnode->relNode != rnode->relNode)) + { + ReleaseBuffer(urec->uur_buffer); + urec->uur_buffer = InvalidBuffer; + } + } + else + { + /* + * If there is not a valid buffer in urec->uur_buffer that means we + * had copied the payload data and tuple data so free them. + */ + if (urec->uur_payload.data) + pfree(urec->uur_payload.data); + if (urec->uur_tuple.data) + pfree(urec->uur_tuple.data); + } + + /* Reset the urec before fetching the tuple */ + urec->uur_tuple.data = NULL; + urec->uur_tuple.len = 0; + urec->uur_payload.data = NULL; + urec->uur_payload.len = 0; +} + +/* + * Fetch the next undo record for given blkno, offset and transaction id (if + * valid). The same tuple can be modified by multiple transactions, so during + * undo chain traversal sometimes we need to distinguish based on transaction + * id. Callers that don't have any such requirement can pass + * InvalidTransactionId. + * + * Start the search from urp. Caller need to call UndoRecordRelease to release the + * resources allocated by this function. + * + * urec_ptr_out is undo record pointer of the qualified undo record if valid + * pointer is passed. + * + * callback function decides whether particular undo record satisfies the + * condition of caller. + */ +UnpackedUndoRecord * +UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset, + TransactionId xid, UndoRecPtr *urec_ptr_out, + SatisfyUndoRecordCallback callback) +{ + RelFileNode rnode, + prevrec_rnode = {0}; + UnpackedUndoRecord *urec = NULL; + int logno; + + if (urec_ptr_out) + *urec_ptr_out = InvalidUndoRecPtr; + + urec = palloc0(sizeof(UnpackedUndoRecord)); + UndoRecPtrAssignRelFileNode(rnode, urp); + + /* Find the undo record pointer we are interested in. */ + while (true) + { + UndoLogControl *log; + + logno = UndoRecPtrGetLogNo(urp); + log = UndoLogGet(logno); + if (log == NULL) + { + if (BufferIsValid(urec->uur_buffer)) + ReleaseBuffer(urec->uur_buffer); + return NULL; + } + + /* + * Prevent UndoDiscardOneLog() from discarding data while we try to + * read it. Usually we would acquire log->mutex to read log->meta + * members, but in this case we know that discard can't move without + * also holding log->discard_lock. + */ + LWLockAcquire(&log->discard_lock, LW_SHARED); + if (!UndoRecordIsValid(log, urp)) + { + if (BufferIsValid(urec->uur_buffer)) + ReleaseBuffer(urec->uur_buffer); + return NULL; + } + + /* Fetch the current undo record. 
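+ * (Illustrative note: for a caller walking a tuple's undo chain, blkno/offset identify the tuple and the callback decides whether this record + * is the one of interest; if blkno is InvalidBlockNumber we simply return the first record fetched, otherwise we keep following uur_blkprev.)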
*/ + urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence); + LWLockRelease(&log->discard_lock); + + if (blkno == InvalidBlockNumber) + break; + + /* Check whether the undorecord satisfies conditions */ + if (callback(urec, blkno, offset, xid)) + break; + + urp = urec->uur_blkprev; + prevrec_rnode = rnode; + + /* Get rnode for the current undo record pointer. */ + UndoRecPtrAssignRelFileNode(rnode, urp); + + /* Reset the current undorecord before fetching the next. */ + ResetUndoRecord(urec, urp, &rnode, &prevrec_rnode); + } + + if (urec_ptr_out) + *urec_ptr_out = urp; + return urec; +} + +/* + * Return the previous undo record pointer. + * + * This API can switch to the previous log if the current log is exhausted, + * so the caller shouldn't use it where that is not expected. + */ +UndoRecPtr +UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen) +{ + UndoLogNumber logno = UndoRecPtrGetLogNo(urp); + UndoLogOffset offset = UndoRecPtrGetOffset(urp); + + /* + * We have reached to the first undo record of this undo log, so fetch the + * previous undo record of the transaction from the previous log. + */ + if (offset == UndoLogBlockHeaderSize) + { + UndoLogControl *prevlog, + *log; + + log = UndoLogGet(logno); + + Assert(log->meta.prevlogno != InvalidUndoLogNumber); + + /* Fetch the previous log control. */ + prevlog = UndoLogGet(log->meta.prevlogno); + logno = log->meta.prevlogno; + offset = prevlog->meta.insert; + } + + /* calculate the previous undo record pointer */ + return MakeUndoRecPtr(logno, offset - prevlen); +} + +/* + * Release the resources allocated by UndoFetchRecord. + */ +void +UndoRecordRelease(UnpackedUndoRecord *urec) +{ + /* + * If the undo record has a valid buffer then just release the buffer + * otherwise free the tuple and payload data. + */ + if (BufferIsValid(urec->uur_buffer)) + { + ReleaseBuffer(urec->uur_buffer); + } + else + { + if (urec->uur_payload.data) + pfree(urec->uur_payload.data); + if (urec->uur_tuple.data) + pfree(urec->uur_tuple.data); + } + + pfree(urec); +} + +/* + * Called whenever we attach to a new undo log, so that we forget about our + * translation-unit private state relating to the log we were last attached + * to. + */ +void +UndoRecordOnUndoLogChange(UndoPersistence persistence) +{ + prev_txid[persistence] = InvalidTransactionId; +} diff --git a/src/backend/access/undo/undolog.c b/src/backend/access/undo/undolog.c new file mode 100644 index 0000000000..71be715bda --- /dev/null +++ b/src/backend/access/undo/undolog.c @@ -0,0 +1,2719 @@ +/*------------------------------------------------------------------------- + * + * undolog.c + * management of undo logs + * + * PostgreSQL undo log manager. This module is responsible for lifecycle + * management of undo logs and backing files, associating undo logs with + * backends, allocating and managing space within undo logs. + * + * For the code that reads and writes blocks of data, see undofile.c. 
+ * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/backend/access/undo/undolog.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/transam.h" +#include "access/undolog.h" +#include "access/undolog_xlog.h" +#include "access/xact.h" +#include "access/xlog.h" +#include "access/xlogreader.h" +#include "catalog/catalog.h" +#include "catalog/pg_tablespace.h" +#include "commands/tablespace.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "nodes/execnodes.h" +#include "pgstat.h" +#include "storage/buf.h" +#include "storage/bufmgr.h" +#include "storage/dsm.h" +#include "storage/fd.h" +#include "storage/ipc.h" +#include "storage/lwlock.h" +#include "storage/procarray.h" +#include "storage/shmem.h" +#include "storage/standby.h" +#include "storage/undofile.h" +#include "utils/builtins.h" +#include "utils/guc.h" +#include "utils/memutils.h" +#include "utils/varlena.h" + +#include +#include + +/* + * Number of bits of an undo log number used to identify a bank of + * UndoLogControl objects. This allows us to break up our array of + * UndoLogControl objects into many smaller arrays, called banks, and find our + * way to an UndoLogControl object in O(1) complexity in two steps. + */ +#define UndoLogBankBits 14 +#define UndoLogBanks (1 << UndoLogBankBits) + +/* Extract the undo bank number from an undo log number (upper bits). */ +#define UndoLogNoGetBankNo(logno) \ + ((logno) >> (UndoLogNumberBits - UndoLogBankBits)) + +/* Extract the slot within a bank from an undo log number (lower bits). */ +#define UndoLogNoGetSlotNo(logno) \ + ((logno) & ((1 << (UndoLogNumberBits - UndoLogBankBits)) - 1)) + +/* + * During recovery we maintain a mapping of transaction ID to undo logs + * numbers. We do this with another two-level array, so that we use memory + * only for chunks of the array that overlap with the range of active xids. + */ +#define UndoLogXidLowBits 16 + +/* + * Number of high bits. + */ +#define UndoLogXidHighBits \ + (sizeof(TransactionId) * CHAR_BIT - UndoLogXidLowBits) + +/* Extract the upper bits of an xid, for undo log mapping purposes. */ +#define UndoLogGetXidHigh(xid) ((xid) >> UndoLogXidLowBits) + +/* Extract the lower bits of an xid, for undo log mapping purposes. */ +#define UndoLogGetXidLow(xid) ((xid) & ((1 << UndoLogXidLowBits) - 1)) + +/* + * Main control structure for undo log management in shared memory. + */ +typedef struct UndoLogSharedData +{ + UndoLogNumber free_lists[UndoPersistenceLevels]; + int low_bankno; /* the lowest bank */ + int high_bankno; /* one past the highest bank */ + UndoLogNumber low_logno; /* the lowest logno */ + UndoLogNumber high_logno; /* one past the highest logno */ + + /* + * Array of DSM handles pointing to the arrays of UndoLogControl objects. + * We don't expect there to be many banks active at a time -- usually 1 or + * 2, but we need random access by log number so we arrange them into + * 'banks'. + */ + dsm_handle banks[UndoLogBanks]; +} UndoLogSharedData; + +/* + * Per-backend state for the undo log module. + * Backend-local pointers to undo subsystem state in shared memory. + */ +struct +{ + UndoLogSharedData *shared; + + /* + * The control object for the undo logs that this backend is currently + * attached to at each persistence level. 
+ */ + UndoLogControl *logs[UndoPersistenceLevels]; + + /* The DSM segments used to hold banks of control objects. */ + dsm_segment *bank_segments[UndoLogBanks]; + + /* + * The address where each bank of control objects is mapped into memory in + * this backend. We map banks into memory on demand, and (for now) they + * stay mapped in until every backend that mapped them exits. + */ + UndoLogControl *banks[UndoLogBanks]; + + /* + * The lowest log number that might currently be mapped into this backend. + */ + int low_logno; + + /* + * If the undo_tablespaces GUC changes we'll remember to examine it and + * attach to a new undo log using this flag. + */ + bool need_to_choose_tablespace; + + /* + * During recovery, the startup process maintains a mapping of xid to undo + * log number, instead of using 'log' above. This is not used in regular + * backends and can be in backend-private memory so long as recovery is + * single-process. This map references UNDO_PERMANENT logs only, since + * temporary and unlogged relations don't have WAL to replay. + */ + UndoLogNumber **xid_map; + + /* + * The slot for the oldest xids still running. We advance this during + * checkpoints to free up chunks of the map. + */ + uint16 xid_map_oldest_chunk; + + /* Current dbid. Used during recovery. */ + Oid dbid; +} MyUndoLogState; + +/* GUC variables */ +char *undo_tablespaces = NULL; + +static UndoLogControl *get_undo_log_by_number(UndoLogNumber logno); +static void ensure_undo_log_number(UndoLogNumber logno); +static void attach_undo_log(UndoPersistence level, Oid tablespace); +static void detach_current_undo_log(UndoPersistence level, bool exhausted); +static void extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end); +static void undo_log_before_exit(int code, Datum value); +static void forget_undo_buffers(int logno, UndoLogOffset old_discard, + UndoLogOffset new_discard, + bool drop_tail); +static bool choose_undo_tablespace(bool force_detach, Oid *oid); +static void undolog_xid_map_gc(void); +static void undolog_bank_gc(void); + +PG_FUNCTION_INFO_V1(pg_stat_get_undo_logs); + +/* + * Return the amount of traditional smhem required for undo log management. + * Extra shared memory will be managed using DSM segments. + */ +Size +UndoLogShmemSize(void) +{ + return sizeof(UndoLogSharedData); +} + +/* + * Initialize the undo log subsystem. Called in each backend. + */ +void +UndoLogShmemInit(void) +{ + bool found; + + MyUndoLogState.shared = (UndoLogSharedData *) + ShmemInitStruct("UndoLogShared", UndoLogShmemSize(), &found); + + if (!IsUnderPostmaster) + { + UndoLogSharedData *shared = MyUndoLogState.shared; + int i; + + Assert(!found); + + /* + * We start with no undo logs. StartUpUndoLogs() will recreate undo + * logs that were known at last checkpoint. + */ + memset(shared, 0, sizeof(*shared)); + for (i = 0; i < UndoPersistenceLevels; ++i) + shared->free_lists[i] = InvalidUndoLogNumber; + shared->low_bankno = 0; + shared->high_bankno = 0; + } + else + Assert(found); +} + +void +UndoLogInit(void) +{ + before_shmem_exit(undo_log_before_exit, 0); +} + +/* + * Figure out which directory holds an undo log based on tablespace. + */ +static void +UndoLogDirectory(Oid tablespace, char *dir) +{ + if (tablespace == DEFAULTTABLESPACE_OID || + tablespace == InvalidOid) + snprintf(dir, MAXPGPATH, "base/undo"); + else + snprintf(dir, MAXPGPATH, "pg_tblspc/%u/%s/undo", + tablespace, TABLESPACE_VERSION_DIRECTORY); + + /* XXX Should we use UndoLogDatabaseOid (9) instead of "undo"? 
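+ * + * (For example, with the current naming scheme, segment 0 of undo log 7 in the default tablespace lives at base/undo/000007.0000000000; see + * UndoLogSegmentPath below for how the name is built from the log number and starting byte offset.)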
*/ + + /* + * XXX Should we add an extra directory between log number and segment + * files? If all undo logs are in the same directory then + * fsync(directory) may create contention in the OS between unrelated + * backends that as they rotate segment files. + */ +} + +/* + * Compute the pathname to use for an undo log segment file. + */ +void +UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace, char *path) +{ + char dir[MAXPGPATH]; + + /* Figure out which directory holds the segment, based on tablespace. */ + UndoLogDirectory(tablespace, dir); + + /* + * Build the path from log number and offset. The pathname is the + * UndoRecPtr of the first byte in the segment in hexadecimal, with a + * period inserted between the components. + */ + snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno, + segno * UndoLogSegmentSize); +} + +/* + * Iterate through the set of currently active logs. + * + * TODO: This probably needs to be replaced. For the use of UndoDiscard, + * maybe we should instead have an ordered data structure organized by + * oldest_xid so that undo workers only have to consume logs from one end of + * the queue when they have an oldest xmin. For the use of undo_file.c we'll + * need something completely different anyway (watch this space). For now we + * just stupidly visit all undo logs in the range [log_logno, high_logno), + * which is obviously not ideal. + */ +UndoLogControl * +UndoLogNext(UndoLogControl *log) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + + if (log == NULL) + { + UndoLogNumber low_logno; + + LWLockAcquire(UndoLogLock, LW_SHARED); + low_logno = shared->low_logno; + LWLockRelease(UndoLogLock); + + return get_undo_log_by_number(low_logno); + } + else + { + UndoLogNumber high_logno; + + LWLockAcquire(UndoLogLock, LW_SHARED); + high_logno = shared->high_logno; + LWLockRelease(UndoLogLock); + + if (log->logno + 1 == high_logno) + return NULL; + + return get_undo_log_by_number(log->logno + 1); + } +} + +/* + * Check if an undo log position has been discarded. 'point' must be an undo + * log pointer that was allocated at some point in the past, otherwise the + * result is undefined. + */ +bool +UndoLogIsDiscarded(UndoRecPtr point) +{ + UndoLogControl *log = get_undo_log_by_number(UndoRecPtrGetLogNo(point)); + bool result; + + /* + * If we don't recognize the log number, it's either entirely discarded or + * it's never been allocated (ie from the future) and our result is + * undefined. + */ + if (log == NULL) + return true; + + /* + * XXX For a super cheap locked operation, it's better to use LW_EXLUSIVE + * even though we don't need exclusivity, right? + */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + result = UndoRecPtrGetOffset(point) < log->meta.discard; + LWLockRelease(&log->mutex); + + return result; +} + +/* + * Store latest transaction's start undo record point in undo meta data. It + * will fetched by the backend when it's reusing the undo log and preparing + * its first undo. + */ +void +UndoLogSetLastXactStartPoint(UndoRecPtr point) +{ + UndoLogNumber logno = UndoRecPtrGetLogNo(point); + UndoLogControl *log = get_undo_log_by_number(logno); + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.last_xact_start = UndoRecPtrGetOffset(point); + LWLockRelease(&log->mutex); +} + +/* + * Fetch the previous transaction's start undo record point. Return Invalid + * undo pointer if backend is not attached to any log. 
+ */ +UndoRecPtr +UndoLogGetLastXactStartPoint(UndoLogNumber logno) +{ + UndoLogControl *log = get_undo_log_by_number(logno); + uint64 last_xact_start = 0; + + if (unlikely(log == NULL)) + return InvalidUndoRecPtr; + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + last_xact_start = log->meta.last_xact_start; + LWLockRelease(&log->mutex); + + if (last_xact_start == 0) + return InvalidUndoRecPtr; + + return MakeUndoRecPtr(logno, last_xact_start); +} + +/* + * Store last undo record's length on undo meta so that it can be persistent + * across restart. + */ +void +UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen) +{ + UndoLogControl *log = get_undo_log_by_number(logno); + + Assert(log != NULL); + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.prevlen = prevlen; + LWLockRelease(&log->mutex); +} + +/* + * Get the last undo record's length. + */ +uint16 +UndoLogGetPrevLen(UndoLogNumber logno) +{ + UndoLogControl *log = get_undo_log_by_number(logno); + uint16 prevlen; + + Assert(log != NULL); + + LWLockAcquire(&log->mutex, LW_SHARED); + prevlen = log->meta.prevlen; + LWLockRelease(&log->mutex); + + return prevlen; +} + +/* + * Is this record is the first record for any transaction. + */ +bool +IsTransactionFirstRec(TransactionId xid) +{ + uint16 high_bits = UndoLogGetXidHigh(xid); + uint16 low_bits = UndoLogGetXidLow(xid); + UndoLogNumber logno; + UndoLogControl *log; + + Assert(InRecovery); + + if (MyUndoLogState.xid_map == NULL) + elog(ERROR, "xid to undo log number map not initialized"); + if (MyUndoLogState.xid_map[high_bits] == NULL) + elog(ERROR, "cannot find undo log number for xid %u", xid); + + logno = MyUndoLogState.xid_map[high_bits][low_bits]; + log = get_undo_log_by_number(logno); + if (log == NULL) + elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid); + + return log->meta.is_first_rec; +} + +/* + * Detach from the undo log we are currently attached to, returning it to the + * free list if it still has space. + */ +static void +detach_current_undo_log(UndoPersistence persistence, bool exhausted) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + UndoLogControl *log = MyUndoLogState.logs[persistence]; + + Assert(log != NULL); + + MyUndoLogState.logs[persistence] = NULL; + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->pid = InvalidPid; + log->xid = InvalidTransactionId; + if (exhausted) + log->meta.status = UNDO_LOG_STATUS_EXHAUSTED; + LWLockRelease(&log->mutex); + + /* Push back onto the appropriate freelist. */ + if (!exhausted) + { + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + log->next_free = shared->free_lists[persistence]; + shared->free_lists[persistence] = log->logno; + LWLockRelease(UndoLogLock); + } +} + +static void +undo_log_before_exit(int code, Datum arg) +{ + int i; + + for (i = 0; i < UndoPersistenceLevels; ++i) + { + if (MyUndoLogState.logs[i] != NULL) + detach_current_undo_log(i, false); + } +} + +/* + * Create a fully allocated empty segment file on disk for the byte starting + * at 'end'. + */ +static void +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace, + UndoLogOffset end) +{ + struct stat stat_buffer; + off_t size; + char path[MAXPGPATH]; + void *zeroes; + size_t nzeroes = 8192; + int fd; + + UndoLogSegmentPath(logno, end / UndoLogSegmentSize, tablespace, path); + + /* + * Create and fully allocate a new file. If we crashed and recovered + * then the file might already exist, so use flags that tolerate that. + * It's also possible that it exists but is too short, in which case + * we'll write the rest. 
We don't really care what's in the file, we + * just want to make sure that the filesystem has allocated physical + * blocks for it, so that non-COW filesystems will report ENOSPC now + * rather than later when the space is needed and we'll avoid creating + * files with holes. + */ + fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY); + if (fd < 0 && tablespace != 0) + { + char undo_path[MAXPGPATH]; + + /* Try creating the undo directory for this tablespace. */ + UndoLogDirectory(tablespace, undo_path); + if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST) + { + char *parentdir; + + if (errno != ENOENT || !InRecovery) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + undo_path))); + + /* + * In recovery, it's possible that the tablespace directory + * doesn't exist because a later WAL record removed the whole + * tablespace. In that case we create a regular directory to + * stand in for it. This is similar to the logic in + * TablespaceCreateDbspace(). + */ + + /* create two parents up if not exist */ + parentdir = pstrdup(undo_path); + get_parent_directory(parentdir); + get_parent_directory(parentdir); + /* Can't create parent and it doesn't already exist? */ + if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + parentdir))); + pfree(parentdir); + + /* create one parent up if not exist */ + parentdir = pstrdup(undo_path); + get_parent_directory(parentdir); + /* Can't create parent and it doesn't already exist? */ + if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + parentdir))); + pfree(parentdir); + + if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + undo_path))); + } + + fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY); + } + if (fd < 0) + elog(ERROR, "could not create new file \"%s\": %m", path); + if (fstat(fd, &stat_buffer) < 0) + elog(ERROR, "could not stat \"%s\": %m", path); + size = stat_buffer.st_size; + + /* A buffer full of zeroes we'll use to fill up new segment files. */ + zeroes = palloc0(nzeroes); + + while (size < UndoLogSegmentSize) + { + ssize_t written; + + written = write(fd, zeroes, Min(nzeroes, UndoLogSegmentSize - size)); + if (written < 0) + elog(ERROR, "cannot initialize undo log segment file \"%s\": %m", + path); + size += written; + } + + /* Flush the contents of the file to disk. */ + if (pg_fsync(fd) != 0) + elog(ERROR, "cannot fsync file \"%s\": %m", path); + CloseTransientFile(fd); + + pfree(zeroes); + + elog(LOG, "created undo segment \"%s\"", path); /* XXX: remove me */ +} + +/* + * Create a new undo segment, when it is unexpectedly not present. + */ +void +UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno) +{ + Assert(InRecovery); + allocate_empty_undo_segment(logno, tablespace, segno * UndoLogSegmentSize); +} + +/* + * Create and zero-fill a new segment for the undo log we are currently + * attached to. 
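+ * + * (Illustrative sketch of the flow below: if meta.end currently sits at a segment boundary below the requested new_end, we create zero-filled + * segment files until the log's end reaches new_end, fsync the parent directory, and then, outside recovery, WAL-log an XLOG_UNDOLOG_EXTEND + * record so that the same segments can be recreated after a crash.)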
+ */ +static void +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) +{ + UndoLogControl *log; + char dir[MAXPGPATH]; + size_t end; + + log = get_undo_log_by_number(logno); + + Assert(log != NULL); + Assert(log->meta.end % UndoLogSegmentSize == 0); + Assert(new_end % UndoLogSegmentSize == 0); + Assert(MyUndoLogState.logs[log->meta.persistence] == log || InRecovery); + + /* + * Create all the segments needed to increase 'end' to the requested + * size. This is quite expensive, so we will try to avoid it completely + * by renaming files into place in UndoLogDiscard instead. + */ + end = log->meta.end; + while (end < new_end) + { + allocate_empty_undo_segment(logno, log->meta.tablespace, end); + end += UndoLogSegmentSize; + } + + /* + * Flush the parent dir so that the directory metadata survives a crash + * after this point. + */ + UndoLogDirectory(log->meta.tablespace, dir); + fsync_fname(dir, true); + + /* + * If we're not in recovery, we need to WAL-log the creation of the new + * file(s). We do that after the above filesystem modifications, in + * violation of the data-before-WAL rule as exempted by + * src/backend/access/transam/README. This means that it's possible for + * us to crash having made some or all of the filesystem changes but + * before WAL logging, but in that case we'll eventually try to create the + * same segment(s) again which is tolerated. + */ + if (!InRecovery) + { + xl_undolog_extend xlrec; + XLogRecPtr ptr; + + xlrec.logno = logno; + xlrec.end = end; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND); + XLogFlush(ptr); + } + + /* + * We didn't need to acquire the mutex to read 'end' above because only + * we write to it. But we need the mutex to update it, because the + * checkpointer might read it concurrently. + * + * XXX It's possible for meta.end to be higher already during + * recovery, because of the timing of a checkpoint; in that case we did + * nothing above and we shouldn't update shmem here. That interaction + * needs more analysis. + */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + if (log->meta.end < end) + log->meta.end = end; + LWLockRelease(&log->mutex); +} + +/* + * Get an insertion point that is guaranteed to be backed by enough space to + * hold 'size' bytes of data. To actually write into the undo log, client + * code should call this first and then use bufmgr routines to access buffers + * and provide WAL logs and redo handlers. In other words, while this module + * looks after making sure the undo log has sufficient space and the undo meta + * data is crash safe, the *contents* of the undo log and (indirectly) the + * insertion point are the responsibility of client code. + * + * XXX As an optimization, we could take a third argument 'discard_last'. If + * the caller knows that the last transaction it committed is all visible and + * has its undo pointer, it could supply that value. Then while we hold + * log->mutex we could check if log->meta.discard == discard_last, and if it's + * in the same undo log segment as the current insert then it could cheaply + * update it in shmem and include the value in the existing + * XLOG_UNDOLOG_ATTACH WAL record. We'd be leaving the heavier lifting of + * dealing with segment roll-over to undo workers, but avoiding work for undo + * workers by folding a super cheap common case into the next foreground xact. + * (Not sure how we actually avoid waking up the undo work though...) 
+ * + * XXX Problem: if foreground processes can move the discard pointer as well + * as background processes (undo workers), then how is the undo worker + * supposed to access the undo data pointed to by the discard pointer so that + * it can read the xid? We certainly don't want to hold the undo log lock + * while doing stuff like that, because it would interfere with read-only + * sessions that need to check the discard pointer. Possible solution: we may + * need a way to 'pin' the discard pointer while the undo worker is + * considering what to do. If we add 'discard_last' as described in the + * previous paragraph, that optimisation would need to be skipped if the + * foreground process running UndoLogAllocate sees that the discard pointer is + * currently pinned by a background worker. Going to sit on this thought for + * a little while before writing any code... need to contemplate undo workers + * some more. + * + * Returns an undo log insertion point that can be converted to a buffer tag + * and an insertion point within a buffer page using the macros above. + */ +UndoRecPtr +UndoLogAllocate(size_t size, UndoPersistence persistence) +{ + UndoLogControl *log = MyUndoLogState.logs[persistence]; + UndoLogOffset new_insert; + UndoLogNumber prevlogno = InvalidUndoLogNumber; + TransactionId logxid; + + /* + * We may need to attach to an undo log, either because this is the first + * time this backend as needed to write to an undo log at all or because + * the undo_tablespaces GUC was changed. When doing that, we'll need + * interlocking against tablespaces being concurrently dropped. + */ + + retry: + /* See if we need to check the undo_tablespaces GUC. */ + if (unlikely(MyUndoLogState.need_to_choose_tablespace || log == NULL)) + { + Oid tablespace; + bool need_to_unlock; + + need_to_unlock = + choose_undo_tablespace(MyUndoLogState.need_to_choose_tablespace, + &tablespace); + attach_undo_log(persistence, tablespace); + if (need_to_unlock) + LWLockRelease(TablespaceCreateLock); + log = MyUndoLogState.logs[persistence]; + log->meta.prevlogno = prevlogno; + MyUndoLogState.need_to_choose_tablespace = false; + } + + /* + * If this is the first time we've allocated undo log space in this + * transaction, we'll record the xid->undo log association so that it can + * be replayed correctly. Before that, we set the first record flag to + * false. + */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.is_first_rec = false; + logxid = log->xid; + + if (logxid != GetTopTransactionId()) + { + xl_undolog_attach xlrec; + + /* + * While we have the lock, check if we have been forcibly detached by + * DROP TABLESPACE. That can only happen between transactions (see + * DetachUndoLogsInsTablespace()) so we only have to check for it + * in this branch. + */ + if (log->pid == InvalidPid) + { + LWLockRelease(&log->mutex); + log = NULL; + goto retry; + } + log->xid = GetTopTransactionId(); + log->meta.is_first_rec = true; + LWLockRelease(&log->mutex); + + /* Skip the attach record for unlogged and temporary tables. */ + if (persistence == UNDO_PERMANENT) + { + xlrec.xid = GetTopTransactionId(); + xlrec.logno = log->logno; + xlrec.dbid = MyDatabaseId; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_ATTACH); + } + } + else + { + LWLockRelease(&log->mutex); + } + + /* + * 'size' is expressed in usable non-header bytes. 
Figure out how far we + * have to move insert to create space for 'size' usable bytes (stepping + * over any intervening headers). + */ + Assert(log->meta.insert % BLCKSZ >= UndoLogBlockHeaderSize); + new_insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size); + Assert(new_insert % BLCKSZ >= UndoLogBlockHeaderSize); + + /* + * We don't need to acquire log->mutex to read log->meta.insert and + * log->meta.end, because this backend is the only one that can + * modify them. + */ + if (unlikely(new_insert > log->meta.end)) + { + if (new_insert > UndoLogMaxSize) + { + /* This undo log is entirely full. Get a new one. */ + if (logxid == GetTopTransactionId()) + { + /* + * If the same transaction is split over two undo logs then + * store the previous log number in new log. See detailed + * comments in undorecord.c file header. + */ + prevlogno = log->logno; + } + log = NULL; + detach_current_undo_log(persistence, true); + goto retry; + } + /* + * Extend the end of this undo log to cover new_insert (in other words + * round up to the segment size). + */ + extend_undo_log(log->logno, + new_insert + UndoLogSegmentSize - + new_insert % UndoLogSegmentSize); + Assert(new_insert <= log->meta.end); + } + + return MakeUndoRecPtr(log->logno, log->meta.insert); +} + +/* + * In recovery, we expect the xid to map to a known log which already has + * enough space in it. + */ +UndoRecPtr +UndoLogAllocateInRecovery(TransactionId xid, size_t size, + UndoPersistence level) +{ + uint16 high_bits = UndoLogGetXidHigh(xid); + uint16 low_bits = UndoLogGetXidLow(xid); + UndoLogNumber logno; + UndoLogControl *log; + + /* + * The sequence of calls to UndoLogAllocateRecovery during REDO (recovery) + * must match the sequence of calls to UndoLogAllocate during DO, for any + * given session. The XXX_redo code for any UNDO-generating operation + * must use UndoLogAllocateRecovery rather than UndoLogAllocate, because + * it must supply the extra 'xid' argument so that we can find out which + * undo log number to use. During DO, that's tracked per-backend, but + * during REDO the original backends/sessions are lost and we have only + * the Xids. + */ + Assert(InRecovery); + + /* + * Look up the undo log number for this xid. The mapping must already + * have been created by an XLOG_UNDOLOG_ATTACH record emitted during the + * first call to UndoLogAllocate for this xid after the most recent + * checkpoint. + */ + if (MyUndoLogState.xid_map == NULL) + elog(ERROR, "xid to undo log number map not initialized"); + if (MyUndoLogState.xid_map[high_bits] == NULL) + elog(ERROR, "cannot find undo log number for xid %u", xid); + logno = MyUndoLogState.xid_map[high_bits][low_bits]; + if (logno == InvalidUndoLogNumber) + elog(ERROR, "cannot find undo log number for xid %u", xid); + + /* + * This log must already have been created by XLOG_UNDOLOG_CREATE records + * emitted by UndoLogAllocate. + */ + log = get_undo_log_by_number(logno); + if (log == NULL) + elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid); + + /* + * This log must already have been extended to cover the requested size by + * XLOG_UNDOLOG_EXTEND records emitted by UndoLogAllocate, or by + * XLOG_UNDLOG_DISCARD records recycling segments. + */ + if (log->meta.end < UndoLogOffsetPlusUsableBytes(log->meta.insert, size)) + elog(ERROR, + "unexpectedly couldn't allocate %zu bytes in undo log number %d", + size, logno); + + /* + * By this time we have allocated a undo log in transaction so after this + * it will not be first undo record for the transaction. 
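+ *
+ * For illustration, the DO/REDO symmetry described above amounts to roughly
+ * the following calling pattern (a hypothetical sketch; buffer access and
+ * WAL details elided):
+ *
+ *    DO:   urp = UndoLogAllocate(size, persistence);
+ *          ... write and WAL-log the undo record ...
+ *          UndoLogAdvance(urp, size, persistence);
+ *
+ *    REDO: urp = UndoLogAllocateInRecovery(xid, size, persistence);
+ *          ... redo the undo record write ...
+ *          UndoLogAdvance(urp, size, persistence);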
+ */
+ log->meta.is_first_rec = false;
+
+ return MakeUndoRecPtr(logno, log->meta.insert);
+}
+
+/*
+ * Advance the insertion pointer by 'size' usable (non-header) bytes.
+ *
+ * Caller must WAL-log this operation first, and must replay it during
+ * recovery.
+ */
+void
+UndoLogAdvance(UndoRecPtr insertion_point, size_t size, UndoPersistence persistence)
+{
+ UndoLogControl *log = NULL;
+ UndoLogNumber logno = UndoRecPtrGetLogNo(insertion_point);
+
+ /*
+ * During recovery, MyUndoLogState is not initialized, so we have to look
+ * the undo log up by number instead of using the attached log.
+ */
+ log = (InRecovery) ? get_undo_log_by_number(logno)
+ : MyUndoLogState.logs[persistence];
+
+ Assert(log != NULL);
+ Assert(InRecovery || logno == log->logno);
+ Assert(UndoRecPtrGetOffset(insertion_point) == log->meta.insert);
+
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+ log->meta.insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+ LWLockRelease(&log->mutex);
+}
+
+/*
+ * Advance the discard pointer in one undo log, discarding all undo data
+ * relating to one or more whole transactions. The passed-in undo pointer is
+ * the address of the oldest data that the caller would like to keep, and the
+ * affected undo log is implied by this pointer, ie
+ * UndoRecPtrGetLogNo(discard_point).
+ *
+ * The caller asserts that there will be no attempts to access the undo log
+ * region being discarded after this moment. This operation will cause the
+ * relevant buffers to be dropped immediately, without writing any data out to
+ * disk. Any attempt to read the buffers (except a partial buffer at the end
+ * of this range which will remain) may result in IO errors, because the
+ * underlying segment file may have been physically removed.
+ *
+ * Only one backend should call this for a given undo log concurrently, or
+ * data structures will become corrupted. It is expected that the caller will
+ * be an undo worker; only one undo worker should be working on a given undo
+ * log at a time.
+ *
+ * XXX Special case for when we wrapped past the end of an undo log, spilling
+ * into a new one. How do we discard that? Essentially we'll be discarding
+ * the whole undo log, but not sure how the caller should know that or deal
+ * with it and how this code should handle it.
+ */
+void
+UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid)
+{
+ UndoLogNumber logno = UndoRecPtrGetLogNo(discard_point);
+ UndoLogControl *log = get_undo_log_by_number(logno);
+ UndoLogOffset old_discard;
+ UndoLogOffset discard = UndoRecPtrGetOffset(discard_point);
+ UndoLogOffset end;
+ int segno;
+ int new_segno;
+ bool need_to_flush_wal = false;
+
+ if (log == NULL)
+ elog(ERROR, "cannot advance discard pointer for unknown undo log %d",
+ logno);
+
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+ if (discard > log->meta.insert)
+ elog(ERROR, "cannot move discard point past insert point");
+ old_discard = log->meta.discard;
+ if (discard < old_discard)
+ elog(ERROR, "cannot move discard pointer backwards");
+ end = log->meta.end;
+ LWLockRelease(&log->mutex);
+
+ /*
+ * Drop all buffers holding this undo data out of the buffer pool (except
+ * the last one, if the new location is in the middle of it somewhere), so
+ * that the contained data doesn't ever touch the disk. The caller
+ * promises that this data will not be needed again. We have to drop the
+ * buffers from the buffer pool before removing files, otherwise a
+ * concurrent session might try to write the block to evict the buffer.
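+ *
+ * For illustration (hypothetical numbers, 8kB blocks): advancing the discard
+ * pointer from offset 24000 to offset 40000 drops blocks 24000/8192 = 2 and
+ * 3, while block 40000/8192 = 4 is kept because it still contains the byte
+ * at the new discard pointer.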
+ */ + forget_undo_buffers(logno, old_discard, discard, false); + + /* + * Check if we crossed a segment boundary and need to do some synchronous + * filesystem operations. + */ + segno = old_discard / UndoLogSegmentSize; + new_segno = discard / UndoLogSegmentSize; + if (segno < new_segno) + { + int recycle; + UndoLogOffset pointer; + + /* + * We always WAL-log discards, but we only need to flush the WAL if we + * have performed a filesystem operation. + */ + need_to_flush_wal = true; + + /* + * XXX When we rename or unlink a file, it's possible that some + * backend still has it open because it has recently read a page from + * it. smgr/undofile.c in any such backend will eventually close it, + * because it considers that fd to belong to the file with the name + * that we're unlinking or renaming and it doesn't like to keep more + * than one open at a time. No backend should ever try to read from + * such a file descriptor; that is what it means when we say that the + * caller of UndoLogDiscard() asserts that there will be no attempts + * to access the discarded range of undo log! In the case of a + * rename, if a backend were to attempt to read undo data in the range + * being discarded, it would read entirely the wrong data. + * + * XXX What defenses could we build against that happening due to + * bugs/corruption? One way would be for undofile.c to refuse to read + * buffers from before the current discard point, but currently + * undofile.c doesn't need to deal with shmem/locks. That may be + * false economy, but we really don't want reader to have to wait to + * acquire the undo log lock just to read undo data while we are doing + * filesystem stuff in here. + */ + + /* + * XXX Decide how many segments to recycle (= rename from tail + * position to head position). + * + * XXX For now it's always 1 unless there is already a spare one, but + * we could have an adaptive algorithm with the following goals: + * + * (1) handle future workload without having to create new segment + * files from scratch + * + * (2) reduce the rate of fsyncs require for recycling by doing + * several at once + */ + if (log->meta.end - log->meta.insert < UndoLogSegmentSize) + recycle = 1; + else + recycle = 0; + + /* Rewind to the start of the segment. */ + pointer = segno * UndoLogSegmentSize; + + while (pointer < new_segno * UndoLogSegmentSize) + { + char discard_path[MAXPGPATH]; + + /* + * Before removing the file, make sure that undofile_sync knows + * that it might be missing. + */ + undofile_forgetsync(log->logno, + log->meta.tablespace, + pointer / UndoLogSegmentSize); + + UndoLogSegmentPath(logno, pointer / UndoLogSegmentSize, + log->meta.tablespace, discard_path); + + /* Can we recycle the oldest segment? */ + if (recycle > 0) + { + char recycle_path[MAXPGPATH]; + + /* + * End points one byte past the end of the current undo space, + * ie to the first byte of the segment file we want to create. 
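+ *
+ * Illustrative example (assuming, say, 1MB segments): if the segment
+ * holding the old discard pointer starts at offset 3MB and end is
+ * currently 10MB, that segment's file is renamed to become the segment
+ * starting at 10MB and end advances to 11MB, so a later extension can be
+ * satisfied without zero-filling a brand new file.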
+ */ + UndoLogSegmentPath(logno, end / UndoLogSegmentSize, + log->meta.tablespace, recycle_path); + if (rename(discard_path, recycle_path) == 0) + { + elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */ + end += UndoLogSegmentSize; + --recycle; + } + else + { + elog(ERROR, "could not rename \"%s\" to \"%s\": %m", + discard_path, recycle_path); + } + } + else + { + if (unlink(discard_path) == 0) + elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */ + else + elog(ERROR, "could not unlink \"%s\": %m", discard_path); + } + pointer += UndoLogSegmentSize; + } + } + + /* WAL log the discard. */ + { + xl_undolog_discard xlrec; + XLogRecPtr ptr; + + xlrec.logno = logno; + xlrec.discard = discard; + xlrec.end = end; + xlrec.latestxid = xid; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_DISCARD); + + if (need_to_flush_wal) + XLogFlush(ptr); + } + + /* Update shmem to show the new discard and end pointers. */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.discard = discard; + log->meta.end = end; + LWLockRelease(&log->mutex); +} + +Oid +UndoRecPtrGetTablespace(UndoRecPtr ptr) +{ + UndoLogNumber logno = UndoRecPtrGetLogNo(ptr); + UndoLogControl *log = get_undo_log_by_number(logno); + + /* + * XXX What should the behaviour of this function be if you ask for the + * tablespace of a discarded log, where even the shmem bank is gone? + */ + + /* + * No need to acquire log->mutex, because log->meta.tablespace is constant + * for the lifetime of the log. TODO: will it always be? No I'm going to change that! + */ + if (log != NULL) + return log->meta.tablespace; + else + return InvalidOid; +} + +/* + * Return first valid UndoRecPtr for a given undo logno. If logno is invalid + * then return InvalidUndoRecPtr. + */ +UndoRecPtr +UndoLogGetFirstValidRecord(UndoLogNumber logno) +{ + UndoLogControl *log = get_undo_log_by_number(logno); + + if (log == NULL || log->meta.discard == log->meta.insert) + return InvalidUndoRecPtr; + + return MakeUndoRecPtr(logno, log->meta.discard); +} + +/* + * Return the Next insert location. This will also validate the input xid + * if latest insert point is not for the same transaction id then this will + * return Invalid Undo pointer. + */ +UndoRecPtr +UndoLogGetNextInsertPtr(UndoLogNumber logno, TransactionId xid) +{ + UndoLogControl *log = get_undo_log_by_number(logno); + TransactionId logxid; + UndoRecPtr insert; + + LWLockAcquire(&log->mutex, LW_SHARED); + insert = log->meta.insert; + logxid = log->xid; + LWLockRelease(&log->mutex); + + if (TransactionIdIsValid(logxid) && !TransactionIdEquals(logxid, xid)) + return InvalidUndoRecPtr; + + return MakeUndoRecPtr(logno, insert); +} + +/* + * Rewind the undo log insert position also set the prevlen in the mata + */ +void +UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen) +{ + UndoLogNumber logno = UndoRecPtrGetLogNo(insert_urp); + UndoLogControl *log = get_undo_log_by_number(logno); + UndoLogOffset insert = UndoRecPtrGetOffset(insert_urp); + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.insert = insert; + log->meta.prevlen = prevlen; + + /* + * Force the wal log on next undo allocation. So that during recovery undo + * insert location is consistent with normal allocation. + */ + log->need_attach_wal_record = true; + LWLockRelease(&log->mutex); + + /* WAL log the rewind. 
*/ + { + xl_undolog_rewind xlrec; + + xlrec.logno = logno; + xlrec.insert = insert; + xlrec.prevlen = prevlen; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_REWIND); + } +} + +/* + * Delete unreachable files under pg_undo. Any files corresponding to LSN + * positions before the previous checkpoint are no longer needed. + */ +static void +CleanUpUndoCheckPointFiles(XLogRecPtr checkPointRedo) +{ + DIR *dir; + struct dirent *de; + char path[MAXPGPATH]; + char oldest_path[MAXPGPATH]; + + /* + * If a base backup is in progress, we can't delete any checkpoint + * snapshot files because one of them corresponds to the backup label but + * there could be any number of checkpoints during the backup. + */ + if (BackupInProgress()) + return; + + /* Otherwise keep only those >= the previous checkpoint's redo point. */ + snprintf(oldest_path, MAXPGPATH, "%016" INT64_MODIFIER "X", + checkPointRedo); + dir = AllocateDir("pg_undo"); + while ((de = ReadDir(dir, "pg_undo")) != NULL) + { + /* + * Assume that fixed width uppercase hex strings sort the same way as + * the values they represent, so we can use strcmp to identify undo + * log snapshot files corresponding to checkpoints that we don't need + * anymore. This assumption holds for ASCII. + */ + if (!(strlen(de->d_name) == UNDO_CHECKPOINT_FILENAME_LENGTH)) + continue; + + if (UndoCheckPointFilenamePrecedes(de->d_name, oldest_path)) + { + snprintf(path, MAXPGPATH, "pg_undo/%s", de->d_name); + if (unlink(path) != 0) + elog(ERROR, "could not unlink file \"%s\": %m", path); + } + } + FreeDir(dir); +} + +/* + * Write out the undo log meta data to the pg_undo directory. The actual + * contents of undo logs is in shared buffers and therefore handled by + * CheckPointBuffers(), but here we record the table of undo logs and their + * properties. + */ +void +CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + UndoLogMetaData *serialized = NULL; + UndoLogNumber low_logno; + UndoLogNumber high_logno; + UndoLogNumber logno; + size_t serialized_size = 0; + char *data; + char path[MAXPGPATH]; + int num_logs; + int fd; + pg_crc32c crc; + + /* + * Take this opportunity to check if we can free up any DSM segments and + * also some entries in the checkpoint file by forgetting about entirely + * discarded undo logs. Otherwise both would eventually grow large. + */ + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + while (shared->low_logno < shared->high_logno) + { + UndoLogControl *log; + + log = get_undo_log_by_number(shared->low_logno); + if (log->meta.status != UNDO_LOG_STATUS_DISCARDED) + break; + + /* + * If this was the last slot in a bank, the bank is no longer needed. + * The shared memory will be given back to the operating system once + * every attached backend runs undolog_bank_gc(). + */ + if (UndoLogNoGetSlotNo(shared->low_logno + 1) == 0) + shared->banks[UndoLogNoGetBankNo(shared->low_logno)] = + DSM_HANDLE_INVALID; + + ++shared->low_logno; + } + LWLockRelease(UndoLogLock); + + /* Detach from any banks that we don't need if low_logno advanced. */ + undolog_bank_gc(); + + /* + * We acquire UndoLogLock to prevent any undo logs from being created or + * discarded while we build a snapshot of them. This isn't expected to + * take long on a healthy system because the number of active logs should + * be around the number of backends. 
Holding this lock won't prevent + * concurrent access to the undo log, except when segments need to be + * added or removed. + */ + LWLockAcquire(UndoLogLock, LW_SHARED); + + low_logno = shared->low_logno; + high_logno = shared->high_logno; + num_logs = high_logno - low_logno; + + /* + * Rather than doing the file IO while we hold the lock, we'll copy it + * into a palloc'd buffer. + */ + if (num_logs > 0) + { + serialized_size = sizeof(UndoLogMetaData) * num_logs; + serialized = (UndoLogMetaData *) palloc0(serialized_size); + + for (logno = low_logno; logno != high_logno; ++logno) + { + UndoLogControl *log; + + log = get_undo_log_by_number(logno); + if (log == NULL) /* XXX can this happen? */ + continue; + + /* Capture snapshot while holding the mutex. */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->need_attach_wal_record = true; + memcpy(&serialized[logno], &log->meta, sizeof(UndoLogMetaData)); + LWLockRelease(&log->mutex); + } + } + + LWLockRelease(UndoLogLock); + + /* Dump into a file under pg_undo. */ + snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X", + checkPointRedo); + pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE); + fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY); + if (fd < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create file \"%s\": %m", path))); + + /* Compute header checksum. */ + INIT_CRC32C(crc); + COMP_CRC32C(crc, &low_logno, sizeof(low_logno)); + COMP_CRC32C(crc, &high_logno, sizeof(high_logno)); + FIN_CRC32C(crc); + + /* Write out range of active log numbers + crc. */ + if ((write(fd, &low_logno, sizeof(low_logno)) != sizeof(low_logno)) || + (write(fd, &high_logno, sizeof(high_logno)) != sizeof(high_logno)) || + (write(fd, &crc, sizeof(crc)) != sizeof(crc))) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\": %m", path))); + + /* Write out the meta data for all undo logs in that range. */ + data = (char *) serialized; + INIT_CRC32C(crc); + while (serialized_size > 0) + { + ssize_t written; + + written = write(fd, data, serialized_size); + if (written < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\": %m", path))); + COMP_CRC32C(crc, data, written); + serialized_size -= written; + data += written; + } + FIN_CRC32C(crc); + + if (write(fd, &crc, sizeof(crc)) != sizeof(crc)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\": %m", path))); + + + /* Flush file and directory entry. */ + pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC); + pg_fsync(fd); + CloseTransientFile(fd); + fsync_fname("pg_undo", true); + pgstat_report_wait_end(); + + if (serialized) + pfree(serialized); + + CleanUpUndoCheckPointFiles(priorCheckPointRedo); + undolog_xid_map_gc(); +} + +void +StartupUndoLogs(XLogRecPtr checkPointRedo) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + char path[MAXPGPATH]; + int logno; + int fd; + pg_crc32c crc; + pg_crc32c new_crc; + + /* If initdb is calling, there is no file to read yet. */ + if (IsBootstrapProcessingMode()) + return; + + /* Open the pg_undo file corresponding to the given checkpoint. */ + snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X", + checkPointRedo); + pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_READ); + fd = OpenTransientFile(path, O_RDONLY | PG_BINARY); + if (fd < 0) + elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path); + + /* Read the active log number range. 
*/ + if ((read(fd, &shared->low_logno, sizeof(shared->low_logno)) + != sizeof(shared->low_logno)) || + (read(fd, &shared->high_logno, sizeof(shared->high_logno)) + != sizeof(shared->high_logno)) || + (read(fd, &crc, sizeof(crc)) != sizeof(crc))) + elog(ERROR, "pg_undo file \"%s\" is corrupted", path); + + /* Verify the header checksum. */ + INIT_CRC32C(new_crc); + COMP_CRC32C(new_crc, &shared->low_logno, sizeof(shared->low_logno)); + COMP_CRC32C(new_crc, &shared->high_logno, sizeof(shared->high_logno)); + FIN_CRC32C(new_crc); + + if (crc != new_crc) + elog(ERROR, + "pg_undo file \"%s\" has incorrect checksum", path); + + /* Initialize all the logs and set up the freelist. */ + INIT_CRC32C(new_crc); + for (logno = shared->low_logno; logno < shared->high_logno; ++logno) + { + UndoLogControl *log; + + /* Get a zero-initialized control objects. */ + ensure_undo_log_number(logno); + log = get_undo_log_by_number(logno); + + /* Read in the meta data for this undo log. */ + if (read(fd, &log->meta, sizeof(log->meta)) != sizeof(log->meta)) + elog(ERROR, "corrupted pg_undo meta data in file \"%s\": %m", + path); + COMP_CRC32C(new_crc, &log->meta, sizeof(log->meta)); + + /* + * At normal start-up, or during recovery, all active undo logs start + * out on the appropriate free list. + */ + log->pid = InvalidPid; + log->xid = InvalidTransactionId; + if (log->meta.status == UNDO_LOG_STATUS_ACTIVE) + { + log->next_free = shared->free_lists[log->meta.persistence]; + shared->free_lists[log->meta.persistence] = logno; + } + } + FIN_CRC32C(new_crc); + + /* Verify body checksum. */ + if (read(fd, &crc, sizeof(crc)) != sizeof(crc)) + elog(ERROR, "pg_undo file \"%s\" is corrupted", path); + if (crc != new_crc) + elog(ERROR, + "pg_undo file \"%s\" has incorrect checksum", path); + + CloseTransientFile(fd); + pgstat_report_wait_end(); +} + +/* + * WAL-LOG undo log meta data information before inserting the first WAL after + * the checkpoint for any undo log. + */ +void +LogUndoMetaData(xl_undolog_meta *xlrec) +{ + XLogRecPtr RedoRecPtr; + bool doPageWrites; + XLogRecPtr recptr; + +prepare_xlog: + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + + if (NeedUndoMetaLog(RedoRecPtr)) + { + XLogBeginInsert(); + XLogRegisterData((char *) xlrec, sizeof(xl_undolog_meta)); + recptr = XLogInsertExtended(RM_UNDOLOG_ID, XLOG_UNDOLOG_META, + RedoRecPtr, doPageWrites); + if (recptr == InvalidXLogRecPtr) + goto prepare_xlog; + + UndoLogSetLSN(recptr); + } +} + +/* + * Check whether we need to log undolog meta or not. + */ +bool +NeedUndoMetaLog(XLogRecPtr redo_point) +{ + UndoLogControl *log = MyUndoLogState.logs[UNDO_PERMANENT]; + + /* + * If the current session is not attached to any undo log then we don't + * need to log meta. It is quite possible that some operations skip + * writing undo, so those won't be attached to any undo log. + */ + if (log == NULL) + return false; + + Assert(AmAttachedToUndoLog(log)); + + if (log->lsn <= redo_point) + return true; + + return false; +} + +/* + * Update the WAL lsn in the undo. This is to test whether we need to include + * the xid to logno mapping information in the next WAL or not. + */ +void +UndoLogSetLSN(XLogRecPtr lsn) +{ + UndoLogControl *log = MyUndoLogState.logs[UNDO_PERMANENT]; + + Assert(AmAttachedToUndoLog(log)); + log->lsn = lsn; +} + +/* + * Get an UndoLogControl pointer for a given logno. This may require + * attaching to a DSM segment if it isn't already attached in this backend. + * Return NULL if there is no such logno because it has been entirely + * discarded. 
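+ *
+ * A typical caller therefore checks for NULL, along these lines (sketch
+ * based on the callers in this file):
+ *
+ *    UndoLogControl *log = get_undo_log_by_number(logno);
+ *
+ *    if (log == NULL)
+ *        elog(ERROR, "unknown undo log %d", logno);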
+ */ +static UndoLogControl * +get_undo_log_by_number(UndoLogNumber logno) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + int bankno = UndoLogNoGetBankNo(logno); + int slotno = UndoLogNoGetSlotNo(logno); + + /* See if we need to attach to the bank that holds logno. */ + if (unlikely(MyUndoLogState.banks[bankno] == NULL)) + { + dsm_segment *segment; + + if (shared->banks[bankno] != DSM_HANDLE_INVALID) + { + segment = dsm_attach(shared->banks[bankno]); + if (segment != NULL) + { + MyUndoLogState.bank_segments[bankno] = segment; + MyUndoLogState.banks[bankno] = dsm_segment_address(segment); + dsm_pin_mapping(segment); + } + } + + if (unlikely(MyUndoLogState.banks[bankno] == NULL)) + return NULL; + } + + return &MyUndoLogState.banks[bankno][slotno]; +} + +UndoLogControl * +UndoLogGet(UndoLogNumber logno) +{ + /* TODO just rename the above function */ + return get_undo_log_by_number(logno); +} + +/* + * We write the undo log number into each UndoLogControl object. + */ +static void +initialize_undo_log_bank(int bankno, UndoLogControl *bank) +{ + int i; + int logs_per_bank = 1 << (UndoLogNumberBits - UndoLogBankBits); + + for (i = 0; i < logs_per_bank; ++i) + { + bank[i].logno = logs_per_bank * bankno + i; + LWLockInitialize(&bank[i].mutex, LWTRANCHE_UNDOLOG); + LWLockInitialize(&bank[i].discard_lock, LWTRANCHE_UNDODISCARD); + LWLockInitialize(&bank[i].rewind_lock, LWTRANCHE_REWIND); + } +} + +/* + * Create shared memory space for a given undo log number, if it doesn't exist + * already. + */ +static void +ensure_undo_log_number(UndoLogNumber logno) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + int bankno = UndoLogNoGetBankNo(logno); + + /* In single-user mode, we have to use backend-private memory. */ + if (!IsUnderPostmaster) + { + if (MyUndoLogState.banks[bankno] == NULL) + { + size_t size; + + size = sizeof(UndoLogControl) * (1 << UndoLogBankBits); + MyUndoLogState.banks[bankno] = + MemoryContextAllocZero(TopMemoryContext, size); + initialize_undo_log_bank(bankno, MyUndoLogState.banks[bankno]); + } + return; + } + + /* Do we need to create a bank in shared memory for this undo log number? */ + if (shared->banks[bankno] == DSM_HANDLE_INVALID) + { + dsm_segment *segment; + size_t size; + + size = sizeof(UndoLogControl) * (1 << UndoLogBankBits); + segment = dsm_create(size, 0); + dsm_pin_mapping(segment); + dsm_pin_segment(segment); + memset(dsm_segment_address(segment), 0, size); + shared->banks[bankno] = dsm_segment_handle(segment); + MyUndoLogState.banks[bankno] = dsm_segment_address(segment); + initialize_undo_log_bank(bankno, MyUndoLogState.banks[bankno]); + } +} + +/* + * Attach to an undo log, possibly creating or recycling one. + */ +static void +attach_undo_log(UndoPersistence persistence, Oid tablespace) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + UndoLogControl *log = NULL; + UndoLogNumber logno; + UndoLogNumber *place; + + Assert(!InRecovery); + Assert(MyUndoLogState.logs[persistence] == NULL); + + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + + /* + * For now we have a simple linked list of unattached undo logs for each + * persistence level. We'll grovel though it to find something for the + * tablespace you asked for. If you're not using multiple tablespaces + * it'll be able to pop one off the front. We might need a hash table + * keyed by tablespace if this simple scheme turns out to be too slow when + * using many tablespaces and many undo logs, but that seems like an + * unusual use case not worth optimizing for. 
+ */ + place = &shared->free_lists[persistence]; + while (*place != InvalidUndoLogNumber) + { + UndoLogControl *candidate = get_undo_log_by_number(*place); + + if (candidate == NULL) + elog(ERROR, "corrupted undo log freelist"); + if (candidate->meta.tablespace == tablespace) + { + logno = *place; + log = candidate; + *place = candidate->next_free; + break; + } + place = &candidate->next_free; + } + + /* + * All existing undo logs for this tablespace and persistence level are + * busy, so we'll have to create a new one. + */ + if (log == NULL) + { + if (shared->high_logno > (1 << UndoLogNumberBits)) + { + /* + * You've used up all 16 exabytes of undo log addressing space. + * This is a difficult state to reach using only 16 exabytes of + * WAL. + */ + elog(ERROR, "cannot create new undo log"); + } + + logno = shared->high_logno; + ensure_undo_log_number(logno); + + /* Get new zero-filled UndoLogControl object. */ + log = get_undo_log_by_number(logno); + + Assert(log->meta.persistence == 0); + Assert(log->meta.tablespace == InvalidOid); + Assert(log->meta.discard == 0); + Assert(log->meta.insert == 0); + Assert(log->meta.end == 0); + Assert(log->pid == 0); + Assert(log->xid == 0); + + /* + * The insert and discard pointers start after the first block's + * header. XXX That means that insert is > end for a short time in a + * newly created undo log. Is there any problem with that? + */ + log->meta.insert = UndoLogBlockHeaderSize; + log->meta.discard = UndoLogBlockHeaderSize; + + log->meta.tablespace = tablespace; + log->meta.persistence = persistence; + log->meta.status = UNDO_LOG_STATUS_ACTIVE; + + /* Move the high log number pointer past this one. */ + ++shared->high_logno; + + /* WAL-log the creation of this new undo log. */ + { + xl_undolog_create xlrec; + + xlrec.logno = logno; + xlrec.tablespace = log->meta.tablespace; + xlrec.persistence = log->meta.persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE); + } + + /* + * This undo log has no segments. UndoLogAllocate will create the + * first one on demand. + */ + } + LWLockRelease(UndoLogLock); + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->pid = MyProcPid; + log->xid = InvalidTransactionId; + log->need_attach_wal_record = true; + LWLockRelease(&log->mutex); + + MyUndoLogState.logs[persistence] = log; +} + +/* + * Free chunks of the xid/undo log map that relate to transactions that are no + * longer running. This is run at each checkpoint. + */ +static void +undolog_xid_map_gc(void) +{ + UndoLogNumber **xid_map = MyUndoLogState.xid_map; + TransactionId oldest_xid; + uint16 new_oldest_chunk; + uint16 oldest_chunk; + + if (xid_map == NULL) + return; + + /* + * During crash recovery, it may not be possible to call GetOldestXmin() + * yet because latestCompletedXid is invalid. + */ + if (!TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid)) + return; + + oldest_xid = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT); + new_oldest_chunk = UndoLogGetXidHigh(oldest_xid); + oldest_chunk = MyUndoLogState.xid_map_oldest_chunk; + + while (oldest_chunk != new_oldest_chunk) + { + if (xid_map[oldest_chunk]) + { + pfree(xid_map[oldest_chunk]); + xid_map[oldest_chunk] = NULL; + } + oldest_chunk = (oldest_chunk + 1) % (1 << UndoLogXidHighBits); + } + MyUndoLogState.xid_map_oldest_chunk = new_oldest_chunk; +} + +/* + * Detach from shared memory banks that are no longer needed because they hold + * undo logs that are entirely discarded. 
This should ideally be called + * periodically in any backend that accesses undo data, so that they have a + * chance to detach from DSM segments that hold banks of entirely discarded + * undo log control objects. + */ +static void +undolog_bank_gc(void) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + UndoLogNumber low_logno = shared->low_logno; + + if (unlikely(MyUndoLogState.low_logno < low_logno)) + { + int low_bank = UndoLogNoGetBankNo(low_logno); + int bank = UndoLogNoGetBankNo(MyUndoLogState.low_logno); + + while (bank < low_bank) + { + Assert(shared->banks[bank] == DSM_HANDLE_INVALID); + if (MyUndoLogState.banks[bank] != NULL) + { + dsm_detach(MyUndoLogState.bank_segments[bank]); + MyUndoLogState.bank_segments[bank] = NULL; + MyUndoLogState.banks[bank] = NULL; + } + ++bank; + } + } + + MyUndoLogState.low_logno = low_logno; +} + +/* + * Associate a xid with an undo log, during recovery. In a primary server, + * this isn't necessary because backends know which undo log they're attached + * to. During recovery, the natural association between backends and xids is + * lost, so we need to manage that explicitly. + */ +static void +undolog_xid_map_add(TransactionId xid, UndoLogNumber logno) +{ + uint16 high_bits; + uint16 low_bits; + + high_bits = UndoLogGetXidHigh(xid); + low_bits = UndoLogGetXidLow(xid); + + if (unlikely(MyUndoLogState.xid_map == NULL)) + { + /* First time through. Create mapping array. */ + MyUndoLogState.xid_map = + MemoryContextAllocZero(TopMemoryContext, + sizeof(UndoLogNumber *) * + (1 << (32 - UndoLogXidLowBits))); + MyUndoLogState.xid_map_oldest_chunk = high_bits; + } + + if (unlikely(MyUndoLogState.xid_map[high_bits] == NULL)) + { + /* This bank of mappings doesn't exist yet. Create it. */ + MyUndoLogState.xid_map[high_bits] = + MemoryContextAllocZero(TopMemoryContext, + sizeof(UndoLogNumber) * + (1 << UndoLogXidLowBits)); + } + + /* Associate this xid with this undo log number. */ + MyUndoLogState.xid_map[high_bits][low_bits] = logno; +} + +/* check_hook: validate new undo_tablespaces */ +bool +check_undo_tablespaces(char **newval, void **extra, GucSource source) +{ + char *rawname; + List *namelist; + + /* Need a modifiable copy of string */ + rawname = pstrdup(*newval); + + /* + * Parse string into list of identifiers, just to check for + * well-formedness (unfortunateley we can't validate the names in the + * catalog yet). + */ + if (!SplitIdentifierString(rawname, ',', &namelist)) + { + /* syntax error in name list */ + GUC_check_errdetail("List syntax is invalid."); + pfree(rawname); + list_free(namelist); + return false; + } + + /* + * Make sure we aren't already in a transaction that has been assigned an + * XID. This ensures we don't detach from an undo log that we might have + * started writing undo data into for this transaction. 
+ */
+ if (GetTopTransactionIdIfAny() != InvalidTransactionId)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ (errmsg("undo_tablespaces cannot be changed while a transaction is in progress"))));
+ list_free(namelist);
+
+ return true;
+}
+
+/* assign_hook: do extra actions as needed */
+void
+assign_undo_tablespaces(const char *newval, void *extra)
+{
+ /*
+ * This is normally called only when GetTopTransactionIdIfAny() ==
+ * InvalidTransactionId (because you can't change undo_tablespaces in the
+ * middle of a transaction that's been assigned an xid), but we can't
+ * assert that because it's also called at the end of a transaction that's
+ * rolling back, to reset the GUC if it was set inside the transaction.
+ */
+
+ /* Tell UndoLogAllocate() to reexamine undo_tablespaces. */
+ MyUndoLogState.need_to_choose_tablespace = true;
+}
+
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+ char *rawname;
+ List *namelist;
+ bool need_to_unlock;
+ int length;
+ int i;
+
+ /* We need a modifiable copy of the string. */
+ rawname = pstrdup(undo_tablespaces);
+
+ /* Break the string into a list of identifiers. */
+ if (!SplitIdentifierString(rawname, ',', &namelist))
+ elog(ERROR, "undo_tablespaces is unexpectedly malformed");
+
+ length = list_length(namelist);
+ if (length == 0 ||
+ (length == 1 && ((char *) linitial(namelist))[0] == '\0'))
+ {
+ /*
+ * If it's an empty string, then we'll use the default tablespace. No
+ * locking is required because it can't be dropped.
+ */
+ *tablespace = DEFAULTTABLESPACE_OID;
+ need_to_unlock = false;
+ }
+ else
+ {
+ /*
+ * Choose an OID using our pid, so that if several backends have the
+ * same multi-tablespace setting they'll spread out. We could easily
+ * do better than this if more serious load balancing is judged
+ * useful.
+ */
+ int index = MyProcPid % length;
+ int first_index = index;
+ Oid oid = InvalidOid;
+
+ /*
+ * Take the tablespace create/drop lock while we look the name up.
+ * This prevents the tablespace from being dropped while we're trying
+ * to resolve the name, or while the caller is trying to create an
+ * undo log in it. The caller will have to release this lock.
+ */
+ LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+ for (;;)
+ {
+ const char *name = list_nth(namelist, index);
+
+ oid = get_tablespace_oid(name, true);
+ if (oid == InvalidOid)
+ {
+ /* Unknown tablespace, try the next one. */
+ index = (index + 1) % length;
+ /*
+ * But if we've tried them all, it's time to complain. We'll
+ * arbitrarily complain about the last one we tried in the
+ * error message.
+ */
+ if (index == first_index)
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("tablespace \"%s\" does not exist", name),
+ errhint("Create the tablespace or set undo_tablespaces to a valid or empty list.")));
+ continue;
+ }
+ if (oid == GLOBALTABLESPACE_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("undo logs cannot be placed in pg_global tablespace")));
+ /* If we got here we succeeded in finding one. */
+ break;
+ }
+
+ Assert(oid != InvalidOid);
+ *tablespace = oid;
+ need_to_unlock = true;
+ }
+
+ /*
+ * If we came here because the user changed undo_tablespaces, then detach
+ * from any undo logs we happen to be attached to.
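+ *
+ * (To illustrate the pid-based choice above with hypothetical values: with
+ * undo_tablespaces = 'ts1, ts2, ts3', a backend with pid 12345 starts its
+ * search at index 12345 % 3 = 0, i.e. ts1, and only moves on to ts2/ts3 if
+ * an earlier name cannot be resolved.)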
+ */ + if (force_detach) + { + for (i = 0; i < UndoPersistenceLevels; ++i) + { + UndoLogControl *log = MyUndoLogState.logs[i]; + UndoLogSharedData *shared = MyUndoLogState.shared; + + if (log != NULL) + { + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->pid = InvalidPid; + log->xid = InvalidTransactionId; + LWLockRelease(&log->mutex); + + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + log->next_free = shared->free_lists[i]; + shared->free_lists[i] = log->logno; + LWLockRelease(UndoLogLock); + + MyUndoLogState.logs[i] = NULL; + } + } + } + + return need_to_unlock; +} + +bool +DropUndoLogsInTablespace(Oid tablespace) +{ + DIR *dir; + char undo_path[MAXPGPATH]; + UndoLogNumber low_logno; + UndoLogNumber high_logno; + UndoLogNumber logno; + UndoLogSharedData *shared = MyUndoLogState.shared; + int i; + + Assert(LWLockHeldByMe(TablespaceCreateLock)); + Assert(tablespace != DEFAULTTABLESPACE_OID); + + LWLockAcquire(UndoLogLock, LW_SHARED); + low_logno = shared->low_logno; + high_logno = shared->high_logno; + LWLockRelease(UndoLogLock); + + /* First, try to kick everyone off any undo logs in this tablespace. */ + for (logno = low_logno; logno < high_logno; ++logno) + { + UndoLogControl *log = get_undo_log_by_number(logno); + bool ok; + bool return_to_freelist = false; + + /* Skip undo logs in other tablespaces. */ + if (log->meta.tablespace != tablespace) + continue; + + /* Check if this undo log can be forcibly detached. */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + if (log->meta.discard == log->meta.insert && + (log->xid == InvalidTransactionId || + !TransactionIdIsInProgress(log->xid))) + { + log->xid = InvalidTransactionId; + if (log->pid != InvalidPid) + { + log->pid = InvalidPid; + return_to_freelist = true; + } + ok = true; + } + else + { + /* + * There is data we need in this undo log. We can't force it to + * be detached. + */ + ok = false; + } + LWLockRelease(&log->mutex); + + /* If we failed, then give up now and report failure. */ + if (!ok) + return false; + + /* + * Put this undo log back on the appropriate free-list. No one can + * attach to it while we hold TablespaceCreateLock, but if we return + * earlier in a future go around this loop, we need the undo log to + * remain usable. We'll remove all appropriate logs from the + * free-lists in a separate step below. + */ + if (return_to_freelist) + { + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + log->next_free = shared->free_lists[log->meta.persistence]; + shared->free_lists[log->meta.persistence] = logno; + LWLockRelease(UndoLogLock); + } + } + + /* + * We detached all backends from undo logs in this tablespace, and no one + * can attach to any non-default-tablespace undo logs while we hold + * TablespaceCreateLock. We can now drop the undo logs. + */ + for (logno = low_logno; logno < high_logno; ++logno) + { + UndoLogControl *log = get_undo_log_by_number(logno); + + /* Skip undo logs in other tablespaces. */ + if (log->meta.tablespace != tablespace) + continue; + + /* + * Make sure no buffers remain. When that is done by UndoDiscard(), + * the final page is left in shared_buffers because it may contain + * data, or at least be needed again very soon. Here we need to drop + * even that page from the buffer pool. + */ + forget_undo_buffers(logno, log->meta.discard, log->meta.discard, true); + + /* + * TODO: For now we drop the undo log, meaning that it will never be + * used again. That wastes the rest of its address space. 
Instead, + * we should put it onto a special list of 'offline' undo logs, ready + * to be reactivated in some other tablespace. Then we can keep the + * unused portion of its address space. + */ + + /* Log the dropping operation. TODO: WAL */ + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.status = UNDO_LOG_STATUS_DISCARDED; + LWLockRelease(&log->mutex); + } + + /* TODO: flush WAL? revisit */ + /* Unlink all undo segment files in this tablespace. */ + UndoLogDirectory(tablespace, undo_path); + + dir = AllocateDir(undo_path); + if (dir != NULL) + { + struct dirent *de; + + while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL) + { + char segment_path[MAXPGPATH]; + + if (strcmp(de->d_name, ".") == 0 || + strcmp(de->d_name, "..") == 0) + continue; + snprintf(segment_path, sizeof(segment_path), "%s/%s", + undo_path, de->d_name); + if (unlink(segment_path) < 0) + elog(LOG, "couldn't unlink file \"%s\": %m", segment_path); + } + FreeDir(dir); + } + + /* Remove all dropped undo logs from the free-lists. */ + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + for (i = 0; i < UndoPersistenceLevels; ++i) + { + UndoLogControl *log; + UndoLogNumber *place; + + place = &shared->free_lists[i]; + while (*place != InvalidUndoLogNumber) + { + log = get_undo_log_by_number(*place); + if (log->meta.status == UNDO_LOG_STATUS_DISCARDED) + *place = log->next_free; + else + place = &log->next_free; + } + } + LWLockRelease(UndoLogLock); + + return true; +} + +void +ResetUndoLogs(UndoPersistence persistence) +{ + UndoLogNumber low_logno; + UndoLogNumber high_logno; + UndoLogNumber logno; + UndoLogSharedData *shared = MyUndoLogState.shared; + + LWLockAcquire(UndoLogLock, LW_SHARED); + low_logno = shared->low_logno; + high_logno = shared->high_logno; + LWLockRelease(UndoLogLock); + + /* TODO: figure out if locking is needed here */ + + for (logno = low_logno; logno < high_logno; ++logno) + { + UndoLogControl *log = get_undo_log_by_number(logno); + DIR *dir; + struct dirent *de; + char undo_path[MAXPGPATH]; + char segment_prefix[MAXPGPATH]; + size_t segment_prefix_size; + + if (log->meta.persistence != persistence) + continue; + + /* Scan the directory for files belonging to this undo log. */ + snprintf(segment_prefix, sizeof(segment_prefix), "%06X.", logno); + segment_prefix_size = strlen(segment_prefix); + UndoLogDirectory(log->meta.tablespace, undo_path); + dir = AllocateDir(undo_path); + if (dir == NULL) + continue; + while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL) + { + char segment_path[MAXPGPATH]; + + if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0) + continue; + snprintf(segment_path, sizeof(segment_path), "%s/%s", + undo_path, de->d_name); + elog(LOG, "unlinked undo segment \"%s\"", segment_path); /* XXX: remove me */ + if (unlink(segment_path) < 0) + elog(LOG, "couldn't unlink file \"%s\": %m", segment_path); + } + FreeDir(dir); + + /* + * We have no segment files. Set the pointers to indicate that there + * is no data. The discard and insert pointers point to the first + * usable byte in the segment we will create when we next try to + * allocate. This is a bit strange, because it means that they are + * past the end pointer. That's the same as when new undo logs are + * created. + * + * TODO: Should we rewind to zero instead, so we can reuse that (now) + * unreferenced address space? 
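+ *
+ * (Illustration with a hypothetical value: if meta.end is E when we get
+ * here, both insert and discard become E + UndoLogBlockHeaderSize, while
+ * end itself stays at E until the next allocation extends the log and
+ * creates the segment starting at E.)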
+ */ + log->meta.insert = log->meta.discard = log->meta.end + + UndoLogBlockHeaderSize; + + /* + * TODO: Here we need to call forget_undo_buffers() to nuke anything + * in shared buffers that might have resulted from replaying WAL, + * which will cause later checkpoints to fail when they can't find a + * file to write buffers to. But we can't, because we don't know the + * true discard and end pointers here. Ahh, that's not right. There + * can be no such WAL, because unlogged relations shouldn't be logging + * anything. So the fact that they are is a bug elsewhere in zheap + * code? + */ + } +} + +Datum +pg_stat_get_undo_logs(PG_FUNCTION_ARGS) +{ +#define PG_STAT_GET_UNDO_LOGS_COLS 9 + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + UndoLogNumber low_logno; + UndoLogNumber high_logno; + UndoLogNumber logno; + UndoLogSharedData *shared = MyUndoLogState.shared; + char *tablespace_name = NULL; + Oid last_tablespace = InvalidOid; + + /* check to see if caller supports us returning a tuplestore */ + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + oldcontext = MemoryContextSwitchTo(per_query_ctx); + + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + /* Find the range of active log numbers. */ + LWLockAcquire(UndoLogLock, LW_SHARED); + low_logno = shared->low_logno; + high_logno = shared->high_logno; + LWLockRelease(UndoLogLock); + + /* Scan all undo logs to build the results. */ + for (logno = low_logno; logno < high_logno; ++logno) + { + UndoLogControl *log = get_undo_log_by_number(logno); + char buffer[17]; + Datum values[PG_STAT_GET_UNDO_LOGS_COLS]; + bool nulls[PG_STAT_GET_UNDO_LOGS_COLS] = { false }; + Oid tablespace; + + if (log == NULL) + continue; + + /* + * This won't be a consistent result overall, but the values for each + * log will be consistent because we'll take the per-log lock while + * copying them. + */ + LWLockAcquire(&log->mutex, LW_SHARED); + + if (log->meta.status == UNDO_LOG_STATUS_DISCARDED) + { + LWLockRelease(&log->mutex); + continue; + } + + values[0] = ObjectIdGetDatum((Oid) logno); + values[1] = CStringGetTextDatum( + log->meta.persistence == UNDO_PERMANENT ? "permanent" : + log->meta.persistence == UNDO_UNLOGGED ? "unlogged" : + log->meta.persistence == UNDO_TEMP ? 
"temporary" : ""); + tablespace = log->meta.tablespace; + + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, + MakeUndoRecPtr(logno, log->meta.discard)); + values[3] = CStringGetTextDatum(buffer); + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, + MakeUndoRecPtr(logno, log->meta.insert)); + values[4] = CStringGetTextDatum(buffer); + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, + MakeUndoRecPtr(logno, log->meta.end)); + values[5] = CStringGetTextDatum(buffer); + if (log->xid == InvalidTransactionId) + nulls[6] = true; + else + values[6] = TransactionIdGetDatum(log->xid); + if (log->pid == InvalidPid) + nulls[7] = true; + else + values[7] = Int32GetDatum((int64) log->pid); + LWLockRelease(&log->mutex); + + /* + * Deal with potentially slow tablespace name lookup without the lock. + * Avoid making multiple calls to that expensive function for the + * common case of repeating tablespace. + */ + if (tablespace != last_tablespace) + { + if (tablespace_name) + pfree(tablespace_name); + tablespace_name = get_tablespace_name(tablespace); + last_tablespace = tablespace; + } + if (tablespace_name) + { + values[2] = CStringGetTextDatum(tablespace_name); + nulls[2] = false; + } + else + nulls[2] = true; + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + if (tablespace_name) + pfree(tablespace_name); + tuplestore_donestoring(tupstore); + + return (Datum) 0; +} + +/* + * replay the creation of a new undo log + */ +static void +undolog_xlog_create(XLogReaderState *record) +{ + xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record); + UndoLogControl *log; + UndoLogSharedData *shared = MyUndoLogState.shared; + + /* Create meta-data space in shared memory. */ + ensure_undo_log_number(xlrec->logno); + + log = get_undo_log_by_number(xlrec->logno); + log->meta.status = UNDO_LOG_STATUS_ACTIVE; + log->meta.persistence = xlrec->persistence; + log->meta.tablespace = xlrec->tablespace; + log->meta.insert = UndoLogBlockHeaderSize; + log->meta.discard = UndoLogBlockHeaderSize; + + LWLockAcquire(UndoLogLock, LW_SHARED); + shared->high_logno = Max(xlrec->logno + 1, shared->high_logno); + LWLockRelease(UndoLogLock); +} + +/* + * replay the addition of a new segment to an undo log + */ +static void +undolog_xlog_extend(XLogReaderState *record) +{ + xl_undolog_extend *xlrec = (xl_undolog_extend *) XLogRecGetData(record); + + /* Extend exactly as we would during DO phase. */ + extend_undo_log(xlrec->logno, xlrec->end); +} + +/* + * replay the association of an xid with a specific undo log + */ +static void +undolog_xlog_attach(XLogReaderState *record) +{ + xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record); + UndoLogControl *log; + + undolog_xid_map_add(xlrec->xid, xlrec->logno); + + /* Restore current dbid */ + MyUndoLogState.dbid = xlrec->dbid; + + /* + * Whatever follows is the first record for this transaction. Zheap will + * use this to add UREC_INFO_TRANSACTION. + */ + log = get_undo_log_by_number(xlrec->logno); + log->meta.is_first_rec = true; + log->xid = xlrec->xid; +} + +/* + * replay undo log meta-data image + */ +static void +undolog_xlog_meta(XLogReaderState *record) +{ + xl_undolog_meta *xlrec = (xl_undolog_meta *) XLogRecGetData(record); + UndoLogControl *log; + + undolog_xid_map_add(xlrec->xid, xlrec->logno); + + log = get_undo_log_by_number(xlrec->logno); + if (log == NULL) + elog(ERROR, "cannot attach to unknown undo log %u", xlrec->logno); + + /* + * Update the insertion point. 
While this races against a checkpoint, + * XLOG_UNDOLOG_META always wins because it must be correct for any + * subsequent data appended by this transaction, so we can simply + * overwrite it here. + */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta = xlrec->meta; + log->xid = xlrec->xid; + log->pid = MyProcPid; /* show as recovery process */ + LWLockRelease(&log->mutex); +} + +/* + * Drop all buffers for the given undo log, from the old_discard to up + * new_discard. If drop_tail is true, also drop the buffer that holds + * new_discard; this is used when dropping undo logs completely via DROP + * TABLESPACE. If it is false, then the final buffer is not dropped because + * it may contain data. + * + */ +static void +forget_undo_buffers(int logno, UndoLogOffset old_discard, + UndoLogOffset new_discard, bool drop_tail) +{ + BlockNumber old_blockno; + BlockNumber new_blockno; + RelFileNode rnode; + + UndoRecPtrAssignRelFileNode(rnode, MakeUndoRecPtr(logno, old_discard)); + old_blockno = old_discard / BLCKSZ; + new_blockno = new_discard / BLCKSZ; + if (drop_tail) + ++new_blockno; + while (old_blockno < new_blockno) + { + ForgetBuffer(rnode, UndoLogForkNum, old_blockno); + ForgetLocalBuffer(rnode, UndoLogForkNum, old_blockno++); + } +} +/* + * replay an undo segment discard record + */ +static void +undolog_xlog_discard(XLogReaderState *record) +{ + xl_undolog_discard *xlrec = (xl_undolog_discard *) XLogRecGetData(record); + UndoLogControl *log; + UndoLogOffset discard; + UndoLogOffset end; + UndoLogOffset old_segment_begin; + UndoLogOffset new_segment_begin; + RelFileNode rnode = {0}; + char dir[MAXPGPATH]; + + log = get_undo_log_by_number(xlrec->logno); + if (log == NULL) + elog(ERROR, "unknown undo log %d", xlrec->logno); + + /* + * We're about to discard undologs. In Hot Standby mode, ensure that + * there's no queries running which need to get tuple from discarded undo. + * + * XXX we are passing empty rnode to the conflict function so that it can + * check conflict in all the backend regardless of which database the + * backend is connected. + */ + if (InHotStandby && TransactionIdIsValid(xlrec->latestxid)) + ResolveRecoveryConflictWithSnapshot(xlrec->latestxid, rnode); + + /* + * See if we need to unlink or rename any files, but don't consider it an + * error if we find that files are missing. Since UndoLogDiscard() + * performs filesystem operations before WAL logging or updating shmem + * which could be checkpointed, a crash could have left files already + * deleted, but we could replay WAL that expects the files to be there. + */ + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + discard = log->meta.discard; + end = log->meta.end; + LWLockRelease(&log->mutex); + + /* Drop buffers before we remove/recycle any files. */ + forget_undo_buffers(xlrec->logno, discard, xlrec->discard, false); + + /* Rewind to the start of the segment. */ + old_segment_begin = discard - discard % UndoLogSegmentSize; + new_segment_begin = xlrec->discard - xlrec->discard % UndoLogSegmentSize; + + /* Unlink or rename segments that are no longer in range. */ + while (old_segment_begin < new_segment_begin) + { + char discard_path[MAXPGPATH]; + + /* + * Before removing the file, make sure that undofile_sync knows that + * it might be missing. + */ + undofile_forgetsync(log->logno, + log->meta.tablespace, + old_segment_begin / UndoLogSegmentSize); + + UndoLogSegmentPath(xlrec->logno, old_segment_begin / UndoLogSegmentSize, + log->meta.tablespace, discard_path); + + /* Can we recycle the oldest segment? 
*/ + if (end < xlrec->end) + { + char recycle_path[MAXPGPATH]; + + UndoLogSegmentPath(xlrec->logno, end / UndoLogSegmentSize, + log->meta.tablespace, recycle_path); + if (rename(discard_path, recycle_path) == 0) + { + elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */ + end += UndoLogSegmentSize; + } + else + { + elog(LOG, "could not rename \"%s\" to \"%s\": %m", + discard_path, recycle_path); + } + } + else + { + if (unlink(discard_path) == 0) + elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */ + else + elog(LOG, "could not unlink \"%s\": %m", discard_path); + } + old_segment_begin += UndoLogSegmentSize; + } + + /* Create any further new segments that are needed the slow way. */ + while (end < xlrec->end) + { + allocate_empty_undo_segment(xlrec->logno, log->meta.tablespace, end); + end += UndoLogSegmentSize; + } + + /* Flush the directory entries. */ + UndoLogDirectory(log->meta.tablespace, dir); + fsync_fname(dir, true); + + /* Update shmem. */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.discard = xlrec->discard; + log->meta.end = end; + LWLockRelease(&log->mutex); +} + +/* + * replay the rewind of a undo log + */ +static void +undolog_xlog_rewind(XLogReaderState *record) +{ + xl_undolog_rewind *xlrec = (xl_undolog_rewind *) XLogRecGetData(record); + UndoLogControl *log; + + log = get_undo_log_by_number(xlrec->logno); + log->meta.insert = xlrec->insert; + log->meta.prevlen = xlrec->prevlen; +} + +void +undolog_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info) + { + case XLOG_UNDOLOG_CREATE: + undolog_xlog_create(record); + break; + case XLOG_UNDOLOG_EXTEND: + undolog_xlog_extend(record); + break; + case XLOG_UNDOLOG_ATTACH: + undolog_xlog_attach(record); + break; + case XLOG_UNDOLOG_DISCARD: + undolog_xlog_discard(record); + break; + case XLOG_UNDOLOG_REWIND: + undolog_xlog_rewind(record); + break; + case XLOG_UNDOLOG_META: + undolog_xlog_meta(record); + break; + default: + elog(PANIC, "undo_redo: unknown op code %u", info); + } +} + +/* + * For assertions only. + */ +bool +AmAttachedToUndoLog(UndoLogControl *log) +{ + int i; + + for (i = 0; i < UndoPersistenceLevels; ++i) + { + if (MyUndoLogState.logs[i] == log) + return true; + } + return false; +} + +/* + * Fetch database id from the undo log state + */ +Oid +UndoLogStateGetDatabaseId() +{ + Assert(InRecovery); + return MyUndoLogState.dbid; +} diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c new file mode 100644 index 0000000000..73076dc5f4 --- /dev/null +++ b/src/backend/access/undo/undorecord.c @@ -0,0 +1,451 @@ +/*------------------------------------------------------------------------- + * + * undorecord.c + * encode and decode undo records + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/backend/access/undo/undorecord.c + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/subtrans.h" +#include "access/undorecord.h" +#include "catalog/pg_tablespace.h" +#include "storage/block.h" + +/* Workspace for InsertUndoRecord and UnpackUndoRecord. 
*/ +static UndoRecordHeader work_hdr; +static UndoRecordRelationDetails work_rd; +static UndoRecordBlock work_blk; +static UndoRecordTransaction work_txn; +static UndoRecordPayload work_payload; + +/* Prototypes for static functions. */ +static bool InsertUndoBytes(char *sourceptr, int sourcelen, + char **writeptr, char *endptr, + int *my_bytes_written, int *total_bytes_written); +static bool ReadUndoBytes(char *destptr, int readlen, + char **readptr, char *endptr, + int *my_bytes_read, int *total_bytes_read, bool nocopy); + +/* + * Compute and return the expected size of an undo record. + */ +Size +UndoRecordExpectedSize(UnpackedUndoRecord *uur) +{ + Size size; + + size = SizeOfUndoRecordHeader; + if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0) + size += SizeOfUndoRecordRelationDetails; + if ((uur->uur_info & UREC_INFO_BLOCK) != 0) + size += SizeOfUndoRecordBlock; + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) + size += SizeOfUndoRecordTransaction; + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) + { + size += SizeOfUndoRecordPayload; + size += uur->uur_payload.len; + size += uur->uur_tuple.len; + } + + return size; +} + +/* + * Insert as much of an undo record as will fit in the given page. + * starting_byte is the byte within the give page at which to begin + * writing, while *already_written is the number of bytes written to + * previous pages. Returns true if the remainder of the record was + * written and false if more bytes remain to be written; in either + * case, *already_written is set to the number of bytes written thus + * far. + * + * This function assumes that if *already_written is non-zero on entry, + * the same UnpackedUndoRecord is passed each time. It also assumes + * that UnpackUndoRecord is not called between successive calls to + * InsertUndoRecord for the same UnpackedUndoRecord. + */ +bool +InsertUndoRecord(UnpackedUndoRecord *uur, Page page, + int starting_byte, int *already_written, bool header_only) +{ + char *writeptr = (char *) page + starting_byte; + char *endptr = (char *) page + BLCKSZ; + int my_bytes_written = *already_written; + + /* The undo record must contain a valid information. */ + Assert(uur->uur_info != 0); + + /* + * If this is the first call, copy the UnpackedUndoRecord into the + * temporary variables of the types that will actually be stored in the + * undo pages. We just initialize everything here, on the assumption that + * it's not worth adding branches to save a handful of assignments. + */ + if (*already_written == 0) + { + work_hdr.urec_type = uur->uur_type; + work_hdr.urec_info = uur->uur_info; + work_hdr.urec_prevlen = uur->uur_prevlen; + work_hdr.urec_reloid = uur->uur_reloid; + work_hdr.urec_prevxid = uur->uur_prevxid; + work_hdr.urec_xid = uur->uur_xid; + work_hdr.urec_cid = uur->uur_cid; + work_rd.urec_fork = uur->uur_fork; + work_blk.urec_blkprev = uur->uur_blkprev; + work_blk.urec_block = uur->uur_block; + work_blk.urec_offset = uur->uur_offset; + work_txn.urec_next = uur->uur_next; + work_txn.urec_xidepoch = uur->uur_xidepoch; + work_txn.urec_progress = uur->uur_progress; + work_txn.urec_dbid = uur->uur_dbid; + work_payload.urec_payload_len = uur->uur_payload.len; + work_payload.urec_tuple_len = uur->uur_tuple.len; + } + else + { + /* + * We should have been passed the same record descriptor as before, or + * caller has messed up. 
+ */ + Assert(work_hdr.urec_type == uur->uur_type); + Assert(work_hdr.urec_info == uur->uur_info); + Assert(work_hdr.urec_prevlen == uur->uur_prevlen); + Assert(work_hdr.urec_reloid == uur->uur_reloid); + Assert(work_hdr.urec_prevxid == uur->uur_prevxid); + Assert(work_hdr.urec_xid == uur->uur_xid); + Assert(work_hdr.urec_cid == uur->uur_cid); + Assert(work_rd.urec_fork == uur->uur_fork); + Assert(work_blk.urec_blkprev == uur->uur_blkprev); + Assert(work_blk.urec_block == uur->uur_block); + Assert(work_blk.urec_offset == uur->uur_offset); + Assert(work_txn.urec_next == uur->uur_next); + Assert(work_txn.urec_progress == uur->uur_progress); + Assert(work_txn.urec_dbid == uur->uur_dbid); + Assert(work_payload.urec_payload_len == uur->uur_payload.len); + Assert(work_payload.urec_tuple_len == uur->uur_tuple.len); + } + + /* Write header (if not already done). */ + if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + /* Write relation details (if needed and not already done). */ + if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 && + !InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + /* Write block information (if needed and not already done). */ + if ((uur->uur_info & UREC_INFO_BLOCK) != 0 && + !InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + /* Write transaction information (if needed and not already done). */ + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 && + !InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + if (header_only) + return true; + + /* Write payload information (if needed and not already done). */ + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) + { + /* Payload header. */ + if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + /* Payload bytes. */ + if (uur->uur_payload.len > 0 && + !InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + /* Tuple bytes. */ + if (uur->uur_tuple.len > 0 && + !InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + } + + /* Hooray! */ + return true; +} + +/* + * Write undo bytes from a particular source, but only to the extent that + * they weren't written previously and will fit. + * + * 'sourceptr' points to the source data, and 'sourcelen' is the length of + * that data in bytes. + * + * 'writeptr' points to the insertion point for these bytes, and is updated + * for whatever we write. The insertion point must not pass 'endptr', which + * represents the end of the buffer into which we are writing. + * + * 'my_bytes_written' is a pointer to the count of previous-written bytes + * from this and following structures in this undo record; that is, any + * bytes that are part of previous structures in the record have already + * been subtracted out. We must update it for the bytes we write. + * + * 'total_bytes_written' points to the count of all previously-written bytes, + * and must likewise be updated for the bytes we write. 
+ * + * The return value is false if we ran out of space before writing all + * the bytes, and otherwise true. + */ +static bool +InsertUndoBytes(char *sourceptr, int sourcelen, + char **writeptr, char *endptr, + int *my_bytes_written, int *total_bytes_written) +{ + int can_write; + int remaining; + + /* + * If we've previously written all of these bytes, there's nothing to do + * except update *my_bytes_written, which we must do to ensure that the + * next call to this function gets the right starting value. + */ + if (*my_bytes_written >= sourcelen) + { + *my_bytes_written -= sourcelen; + return true; + } + + /* Compute number of bytes we can write. */ + remaining = sourcelen - *my_bytes_written; + can_write = Min(remaining, endptr - *writeptr); + + /* Bail out if no bytes can be written. */ + if (can_write == 0) + return false; + + /* Copy the bytes we can write. */ + memcpy(*writeptr, sourceptr + *my_bytes_written, can_write); + + /* Update bookkeeeping infrormation. */ + *writeptr += can_write; + *total_bytes_written += can_write; + *my_bytes_written = 0; + + /* Return true only if we wrote the whole thing. */ + return (can_write == remaining); +} + +/* + * Call UnpackUndoRecord() one or more times to unpack an undo record. For + * the first call, starting_byte should be set to the beginning of the undo + * record within the specified page, and *already_decoded should be set to 0; + * the function will update it based on the number of bytes decoded. The + * return value is true if the entire record was unpacked and false if the + * record continues on the next page. In the latter case, the function + * should be called again with the next page, passing starting_byte as the + * sizeof(PageHeaderData). + */ +bool +UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte, + int *already_decoded, bool header_only) +{ + char *readptr = (char *) page + starting_byte; + char *endptr = (char *) page + BLCKSZ; + int my_bytes_decoded = *already_decoded; + bool is_undo_splited = (my_bytes_decoded > 0) ? true : false; + + /* Decode header (if not already done). */ + if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader, + &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + uur->uur_type = work_hdr.urec_type; + uur->uur_info = work_hdr.urec_info; + uur->uur_prevlen = work_hdr.urec_prevlen; + uur->uur_reloid = work_hdr.urec_reloid; + uur->uur_prevxid = work_hdr.urec_prevxid; + uur->uur_xid = work_hdr.urec_xid; + uur->uur_cid = work_hdr.urec_cid; + + if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0) + { + /* Decode header (if not already done). 
*/ + if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails, + &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + uur->uur_fork = work_rd.urec_fork; + } + + if ((uur->uur_info & UREC_INFO_BLOCK) != 0) + { + if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock, + &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + uur->uur_blkprev = work_blk.urec_blkprev; + uur->uur_block = work_blk.urec_block; + uur->uur_offset = work_blk.urec_offset; + } + + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) + { + if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction, + &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + uur->uur_next = work_txn.urec_next; + uur->uur_xidepoch = work_txn.urec_xidepoch; + uur->uur_progress = work_txn.urec_progress; + uur->uur_dbid = work_txn.urec_dbid; + } + + if (header_only) + return true; + + /* Read payload information (if needed and not already done). */ + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) + { + if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload, + &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + uur->uur_payload.len = work_payload.urec_payload_len; + uur->uur_tuple.len = work_payload.urec_tuple_len; + + /* + * If we can read the complete record from a single page then just + * point payload data and tuple data into the page otherwise allocate + * the memory. + * + * XXX There is possibility of optimization that instead of always + * allocating the memory whenever tuple is split we can check if any + * of the payload or tuple data falling into the same page then don't + * allocate the memory for that. + */ + if (!is_undo_splited && + uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr)) + { + uur->uur_payload.data = readptr; + readptr += uur->uur_payload.len; + + uur->uur_tuple.data = readptr; + } + else + { + if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL) + uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len); + + if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL) + uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len); + + if (!ReadUndoBytes((char *) uur->uur_payload.data, + uur->uur_payload.len, &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + if (!ReadUndoBytes((char *) uur->uur_tuple.data, + uur->uur_tuple.len, &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + } + } + + return true; +} + +/* + * Read undo bytes into a particular destination, + * + * 'destptr' points to the source data, and 'readlen' is the length of + * that data to be read in bytes. + * + * 'readptr' points to the read point for these bytes, and is updated + * for how much we read. The read point must not pass 'endptr', which + * represents the end of the buffer from which we are reading. + * + * 'my_bytes_read' is a pointer to the count of previous-read bytes + * from this and following structures in this undo record; that is, any + * bytes that are part of previous structures in the record have already + * been subtracted out. We must update it for the bytes we read. + * + * 'total_bytes_read' points to the count of all previously-read bytes, + * and must likewise be updated for the bytes we read. + * + * nocopy if this flag is set true then it will just skip the readlen + * size in undo but it will not copy into the buffer. 
+ * + * The return value is false if we ran out of space before read all + * the bytes, and otherwise true. + */ +static bool +ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr, + int *my_bytes_read, int *total_bytes_read, bool nocopy) +{ + int can_read; + int remaining; + + if (*my_bytes_read >= readlen) + { + *my_bytes_read -= readlen; + return true; + } + + /* Compute number of bytes we can read. */ + remaining = readlen - *my_bytes_read; + can_read = Min(remaining, endptr - *readptr); + + /* Bail out if no bytes can be read. */ + if (can_read == 0) + return false; + + /* Copy the bytes we can read. */ + if (!nocopy) + memcpy(destptr + *my_bytes_read, *readptr, can_read); + + /* Update bookkeeping information. */ + *readptr += can_read; + *total_bytes_read += can_read; + *my_bytes_read = 0; + + /* Return true only if we wrote the whole thing. */ + return (can_read == remaining); +} + +/* + * Set uur_info for an UnpackedUndoRecord appropriately based on which + * other fields are set. + */ +void +UndoRecordSetInfo(UnpackedUndoRecord *uur) +{ + if (uur->uur_fork != MAIN_FORKNUM) + uur->uur_info |= UREC_INFO_RELATION_DETAILS; + if (uur->uur_block != InvalidBlockNumber) + uur->uur_info |= UREC_INFO_BLOCK; + if (uur->uur_next != InvalidUndoRecPtr) + uur->uur_info |= UREC_INFO_TRANSACTION; + if (uur->uur_payload.len || uur->uur_tuple.len) + uur->uur_info |= UREC_INFO_PAYLOAD; +} diff --git a/src/backend/access/zheap/Makefile b/src/backend/access/zheap/Makefile new file mode 100644 index 0000000000..c997807d74 --- /dev/null +++ b/src/backend/access/zheap/Makefile @@ -0,0 +1,19 @@ +#------------------------------------------------------------------------- +# +# Makefile-- +# Makefile for access/zheap +# +# IDENTIFICATION +# src/backend/access/zheap/Makefile +# +#------------------------------------------------------------------------- + +subdir = src/backend/access/zheap +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global + +OBJS = prunetpd.o prunezheap.o rewritezheap.o tpd.o tpdxlog.o zheapam.o \ + zheapam_handler.o zheapamutils.o zheapamxlog.o zhio.o zmultilocker.o \ + ztqual.o zvacuumlazy.o ztuptoaster.o + +include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/zheap/README b/src/backend/access/zheap/README new file mode 100644 index 0000000000..0dce27092a --- /dev/null +++ b/src/backend/access/zheap/README @@ -0,0 +1,602 @@ +src/backend/access/zheap/README + +Zheap +===== + +The main purpose of this README is to provide an overview of the current +design of zheap, a new storage format for PostgreSQL. This project has three +major objectives: + +1. Provide better control over bloat. In the existing heap, we always create +a new version of tuple when it is updated. These new versions are later +removed by vacuum or hot-pruning, but this only frees up space for reuse by +future inserts or updates; nothing is returned to the operating system. A +similar problem occurs for tuples that are deleted. zheap will prevent bloat +(a) by allowing in-place updates in common cases and (b) by reusing space as +soon as a transaction that has performed a delete or non-in-place-update has +committed. In short, with this new storage, whenever possible, we’ll avoid +creating bloat in the first place. + +2. Reduce write amplification both by avoiding rewrites of heap pages and by +making it possible to do an update that touches indexed columns without +updating every index. + +3. 
Reduce the tuple size by (a) shrinking the tuple header and
+(b) eliminating most alignment padding.
+
+In-place updates will be supported except when (a) the new tuple is larger
+than the old tuple and the increase in size makes it impossible to fit the
+larger tuple onto the same page or (b) some column is modified which is
+covered by an index that has not been modified to support “delete-marking”.
+We have not begun work on delete-marking support for indexes yet, but intend
+to support it at least for btree indexes.
+
+General idea of zheap with undo
+--------------------------------
+Each backend is attached to a separate undo log to which it writes undo
+records. Each undo record is identified by a 64-bit undo record pointer of
+which the first 24 bits are used for the log number and the remaining 40 bits
+are used for an offset within that undo log. Only one transaction at a time
+can write to any given undo log, so the undo records for any given
+transaction are always consecutive.
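
The 24/40-bit split just described can be made concrete with a pair of helper
definitions. This is only an illustrative reconstruction based on the
preceding paragraph and on the MakeUndoRecPtr() calls visible in undolog.c
earlier in this patch; the authoritative definitions live in the undo log
headers, and every name here other than MakeUndoRecPtr() is an assumption.

    /* Illustrative only -- not the patch's actual definitions. */
    typedef uint64 UndoRecPtr;      /* 24-bit log number + 40-bit offset */
    typedef int    UndoLogNumber;
    typedef uint64 UndoLogOffset;

    #define UndoLogOffsetBits 40

    /* Pack a log number and an offset into one 64-bit pointer. */
    #define MakeUndoRecPtr(logno, offset) \
        (((UndoRecPtr) (logno) << UndoLogOffsetBits) | (offset))

    /* Extract the two components again. */
    #define UndoRecPtrGetLogNo(urp) \
        ((UndoLogNumber) ((urp) >> UndoLogOffsetBits))
    #define UndoRecPtrGetOffset(urp) \
        ((UndoLogOffset) ((urp) & ((UINT64CONST(1) << UndoLogOffsetBits) - 1)))
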
+
+Each zheap page has a fixed set of transaction slots, each of which contains
+the transaction information (transaction id and epoch) and the latest undo
+record pointer for that transaction. As of now, we have four transaction
+slots per page, but this can be changed. Currently, this is a compile-time
+option; we can decide later whether such an option is desirable in general
+for users. Each transaction slot occupies 16 bytes. We allow the transaction
+slots to be reused after the transaction is committed, which allows us to
+operate without needing too many slots. We can allow slots to be reused after
+a transaction abort as well, once undo actions are complete. We have observed
+that smaller tables, say those having very few pages, typically need more
+slots; for larger tables, four slots are enough. In our internal testing, we
+have found that 16 slots give very good performance, but more tests are
+needed to identify the right number of slots. The one known problem with the
+fixed number of slots is that it can lead to deadlock, so we are planning to
+add a mechanism to allow the array of transaction slots to be continued on a
+separate overflow page. We also need such a mechanism to support cases where
+a large number of transactions acquire SHARE or KEY SHARE locks on a single
+page. The overflow pages will be stored in the zheap itself, interleaved with
+regular pages. These overflow pages will be marked in such a way that
+sequential scans will ignore them. We will have a meta page in zheap from
+which all overflow pages will be tracked.
+
+Typically, each zheap operation that modifies a page needs to first allocate
+a transaction slot on that page and then prepare an undo record for the
+operation. Then, in a critical section, it must write the undo record,
+perform the operation on the heap page, update the transaction slot in the
+page, and finally write a WAL record for the operation. What we write as part
+of the undo record and WAL depends on the operation.
+
+Insert: Apart from the generic info, we write the TID (block number and
+offset number) of the tuple in the undo record to identify the record during
+undo replay. In WAL, we write the offset number and the tuple, plus some
+minimal information which will be needed to regenerate the undo record during
+replay.
+
+Delete: We write the complete tuple in the undo record even though we could
+get away with just writing the TID as we do for an insert operation. This
+allows us to reuse the space occupied by the deleted record as soon as the
+transaction that has performed the operation commits. In WAL, we need to
+write the tuple only if full page writes are not enabled. If full page writes
+are enabled, we can rely on the page state to be the same during recovery as
+it is during the actual operation, so we can retrieve the tuple from the page
+to copy it into the undo record.
+
+Update: For in-place updates, we have to write the old tuple in the undo log
+and the new tuple in the zheap. We could optimize and write the diff tuple
+instead of the complete tuple in undo, but as of now, we are writing the
+complete tuple. For non-in-place updates, we write the old tuple and the new
+TID in undo; essentially this is equivalent to DELETE+INSERT. As for DELETE,
+this allows space to be recycled as soon as the updating transaction commits.
+In the WAL, we write a copy of the old tuple only if full page writes are
+off, and we write a diff tuple for the new tuple (irrespective of the value
+of full-page writes) as we do in the current heap. In the case where a
+non-in-place update happens to insert the new tuple on a separate page, we
+write two undo records, one for the old page and another for the new page.
+One can imagine that writing one undo record would be sufficient as we can
+generally reach the new tuple from the old tuple if required, but we want to
+maintain a separate undo chain for each page.
+
+Select .. For [Key] Share/Update
+Tuple locking will work much like a DML operation: reserve a transaction
+slot, update the tuple header with the lock information, write UNDO and WAL
+for the operation. To detect conflicts, we sometimes need to traverse the
+undo chains of all the active transactions on a page. We will always mark the
+tuple with the strongest lock mode that might be present, just as is done in
+the current heap, so that we can cheaply detect whether there is a potential
+conflict. If there is, we must get information about all the locks from undo
+in order to decide whether there is an actual conflict. The tuple will always
+contain either the strongest locker information or, if all the lockers are of
+the same strength, the latest locker information. Whenever there is more than
+one locker operating on a tuple, we set the multi-locker bit on the tuple to
+indicate that the tuple has multiple lockers. Note that we clear the
+multi-locker bit lazily, which means we do so only when we decide to wait for
+all the lockers to go away and no locker remains alive on the tuple. During a
+Rollback operation, we retain the strongest locker information on the tuple
+if there are multiple lockers on a tuple. This is because the conflict
+detection mechanism works based on the strongest locker. Now, even if we
+wanted to remove the strongest locker information, we don't have the
+second-strongest locker information handy.
+
+Copy: Similar to insert, we need to store the corresponding TID (block
+number, offset number) for a tuple in undo to identify it during undo replay.
+But we can minimize the number of undo records written for a page. First, we
+identify the unused offset ranges for a page, then insert one undo record for
+each offset range. For example, if we’re about to insert in offsets
+(2,3,5,9,10,11), we insert three undo records covering offset ranges (2,3),
+(5,5), and (9,11), respectively. For recovery, we insert a single WAL record
+containing the above-mentioned offset ranges along with some minimal
+information to regenerate the undo records and tuples.
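
To make the range batching above concrete, here is a hedged sketch of how a
sorted array of target offsets could be folded into contiguous ranges, one
undo record per range. Only the grouping logic is the point; the function
name is made up, and where the real code would prepare an undo record this
sketch merely logs the range.

    #include "postgres.h"
    #include "storage/off.h"

    /*
     * Group sorted offsets into contiguous ranges, e.g.
     * (2,3,5,9,10,11) -> (2,3), (5,5), (9,11).  Illustrative only.
     */
    static void
    report_copy_undo_ranges(OffsetNumber *offsets, int noffsets)
    {
        int     i = 0;

        while (i < noffsets)
        {
            OffsetNumber start = offsets[i];
            OffsetNumber end = start;

            /* Extend the range while the next offset is consecutive. */
            while (i + 1 < noffsets && offsets[i + 1] == end + 1)
                end = offsets[++i];

            /* The real code would prepare one undo record for [start, end]. */
            elog(LOG, "undo record covers offsets (%u,%u)",
                 (unsigned) start, (unsigned) end);

            i++;
        }
    }
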
+
+Scans: During scans, we need to make a copy of the tuple instead of just
+holding the pin on a page. In the current heap, holding a pin on the buffer
+containing the tuple is sufficient because operations like vacuum which can
+rearrange the page always take a cleanup lock on a buffer. In zheap, however,
+in-place updates work with just an exclusive lock on a buffer, so a tuple to
+which we hold a pointer might be updated under us.
+
+Insert .. On Conflict: The design is similar to the current heap in that we
+use the speculative token to detect conflicts. We store the speculative token
+in undo instead of in the tuple header (CTID) simply because zheap’s tuple
+header doesn’t have CTID. Additionally, we set a bit in the tuple header to
+indicate speculative insertion. The ZheapTupleSatisfiesDirty routine checks
+this bit and fetches the speculative token from undo.
+
+Toast Tables: Toast tables can use zheap, too. Since zheap uses shorter tuple
+headers, this saves space. In the future, someone might want to support
+in-place updates for toast table data instead of doing delete+insert as we do
+today.
+
+SQL Operations: All SQL operations that either need to interact with a heap
+(scans, ALTER TABLE, etc.) or require a HeapTuple (like joins, ORDER BY,
+ANALYZE, COPY, etc.) need to be changed to interact with zheap pages or zheap
+tuples. For now, we have taken the approach of writing converter functions
+for tuples (i.e. zheap_to_heap, heap_to_zheap) to avoid changing the whole
+backend to accept zheap pages and tuples. Operations which need to access
+pages still need to be modified. For all performance-critical operations, we
+operate directly on zheap pages and zheap tuples to avoid the cost of
+conversion. We think that some of this needs to be changed in response to
+whatever conclusions are reached regarding the proposed storage API.
+
+Transaction slot reuse
+-----------------------
+Transaction slots can be freely reused if the transaction is committed and
+all-visible, or if the transaction is aborted and undo actions for that
+transaction, at least relating to that page, have been performed. If the
+transaction is committed but not yet all-visible, we can reuse the slot after
+writing an additional, special undo record that lets us make subsequent tuple
+visibility decisions correctly.
+
+For committed transactions, there are two possibilities. If the transaction
+slot is not referenced by any tuple in the page, we simply clear the xid from
+the transaction slot. The undo record pointer is kept as it is to ensure that
+we don't break the undo chain for that slot. Otherwise, we write an undo
+record for each tuple that points to one of the committed transactions. We
+also mark the tuple indicating that the associated slot has been reused. In
+such a case, it is quite possible that the tuple has not been modified, but
+it is still pointing to a transaction slot that has been reused for a new
+transaction which is not yet all-visible. During the visibility check for
+such a tuple, it might appear that the tuple was modified by a current
+transaction, which is clearly wrong and can lead to wrong results.
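
For reference, the per-page transaction slot being recycled here can be
pictured as roughly the following 16-byte structure, holding the epoch, the
XID, and the latest undo record pointer described earlier. The struct and
field names are an illustrative reconstruction, not the patch's actual
definition.

    /* Illustrative layout of one zheap transaction slot (16 bytes). */
    typedef struct ZHeapPageTransSlot
    {
        uint32          xid_epoch;  /* epoch of the slot's transaction */
        TransactionId   xid;        /* transaction id (4 bytes) */
        UndoRecPtr      urec_ptr;   /* latest undo record pointer (8 bytes) */
    } ZHeapPageTransSlot;
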
+
+Subtransactions
+----------------
+zheap only uses the toplevel transaction ID; subtransactions that modify a
+zheap do not need separate transaction IDs. In the regular heap, when
+subtransactions are present, the subtransaction’s XID is used to make tuple
+visibility decisions correctly. In a zheap, subtransaction abort is instead
+handled by using undo to reverse changes to the zheap pages. This design
+minimizes consumption of transaction slots and pg_xact space, and ensures
+that all undo records for a toplevel transaction remain consecutive in the
+undo log.
+
+Reclaiming space within a page
+-------------------------------
+Space can be reclaimed within a page after (a) a delete, (b) a non-in-place
+update, or (c) an in-place update that reduces the width of the tuple. We can
+reuse the space as soon as the transaction that has performed the operation
+has committed. We can also reclaim space after inserts or non-in-place
+updates have been undone. There is some difference between the way space is
+reclaimed for transactions that are committed and all-visible vs. the
+transactions that are committed but still not all-visible. In the former
+case, we can just indicate in the line pointer that the corresponding item is
+dead, whereas for the latter we need the capability to fetch the prior
+version of a tuple for transactions to which the delete is not visible. To
+allow that, we copy the transaction slot information into the line pointer so
+that we can easily reach the prior version of the tuple. As a net result, the
+space for a deleted tuple can be reclaimed immediately after the delete
+commits, but the space consumed by the line pointer can only be freed once we
+delete the corresponding index tuples. For an aborted transaction, space can
+be reclaimed once undo is complete. We set the prune xid in the page header
+during delete or update operations and during rollback of inserts to permit
+pruning to happen only when there is a possible benefit. When we try to
+prune, we first check if the prune xid is in progress; only if not will we
+attempt to prune the page.
+
+Pruning will be attempted when an update operation lands on a page where
+there is not enough space to accommodate a new tuple. We can also allow
+pruning to occur when we evict the page from shared buffers or read the page
+from disk, as those are I/O intensive operations, so doing some CPU intensive
+work there doesn't cost much.
+
+With the above idea, it is quite possible that sometimes we try to prune the
+page when there is no immediate benefit of doing so. For example, even after
+pruning, the page might still not have enough space to accommodate the new
+tuple. One idea is to track the space at the transaction slot level, so that
+we can know exactly how much space can be freed in the page after pruning,
+but that will lead to an increase in the space used by each transaction slot.
+
+We can also reuse space if a transaction frees up space on the page (e.g. by
+delete) and then tries to use additional space (e.g. by a subsequent insert).
+We can’t in general reuse space freed up by a transaction until it commits,
+because if it aborts we’ll need that space during undo; but an insert or
+update could reuse space freed up by earlier operations in the same
+transaction, since all or none of them will roll back. This is a good
+optimization, but this needs some more thought.
+
+Free Space Map
+---------------
+We can optimistically update the freespace map when we remove the tuples from
+a page in the hope that eventually most of the transactions will commit and
+space will be available. Additionally, we might want to update the FSM during
+aborts when space-consuming actions like inserts are rolled back. When
+requesting free space, we would need to adjust things so that we continue the
+search from the previous block instead of repeatedly returning the same
+block.
+
+I think updating it on every such operation can be costly, so we can perform
+it only after some threshold number of operations; later we might want to add
+a facility to track potentially available free space and merge it into the
+main data structure. We also want to make the FSM crash-safe, since we can’t
+count on VACUUM to recover free space that we neglect to record.
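
The existing free space map API already covers both halves of what this
section asks for; the sketch below only shows where such calls could sit, not
how zheap actually wires them up (the function name and the call site are
assumptions).

    #include "postgres.h"
    #include "storage/freespace.h"

    /*
     * Illustrative only: optimistically advertise space we expect to become
     * free on 'blkno' and pick the next target block.
     */
    static BlockNumber
    zheap_advertise_and_pick_block(Relation rel, BlockNumber blkno,
                                   Size freed, Size needed)
    {
        /*
         * RecordAndGetPageWithFreeSpace() records the (optimistic) free space
         * on 'blkno' and continues the FSM search from that block, which is
         * the "don't keep returning the same block" behaviour asked for
         * above.  When only the optimistic update is wanted (e.g. at abort),
         * RecordPageWithFreeSpace(rel, blkno, freed) would be used instead.
         */
        return RecordAndGetPageWithFreeSpace(rel, blkno, freed, needed);
    }
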
+
+Page format
+------------
+zheap uses a standard page header and stores transaction slots in the special
+space.
+
+Tuple format
+-------------
+The tuple header is reduced from 24 bytes to 5 bytes (8 bytes with
+alignment): 2 bytes each for infomask and infomask2, and one byte for t_hoff.
+I think we might be able to squeeze some space from t_infomask, but for now,
+I have kept it as two bytes. All transactional information is stored in undo,
+so fields that store such information are not needed here.
+
+The idea is that we occupy somewhat more space at the page level, but save
+much more at the tuple level, so we come out ahead overall.
+
+Alignment padding
+------------------
+We omit all alignment padding for pass-by-value types. Even in the current
+heap, we never point directly to such values, so the alignment padding
+doesn’t help much; it lets us fetch the value using a single instruction, but
+that is all. Pass-by-reference types will work as they do in the heap. Many
+pass-by-reference data types will be varlena data types (typlen = -1) with
+short varlena headers, so no alignment padding will be introduced in that
+case anyway, but if we have varlenas with 4-byte headers or if we have
+fixed-length pass-by-reference types (e.g. interval, box) then we'll still
+end up with padding. We can't directly access unaligned values; instead, we
+need to use memcpy. We believe that the space savings will more than pay for
+the additional CPU costs.
+
+We don’t need alignment padding between the tuple header and the tuple data,
+as we always make a copy of the tuple to support in-place updates. Likewise,
+we ideally don't need any alignment padding between tuples. However, there
+are places in the zheap code where we access the tuple header directly from
+the page (e.g. zheap_delete, zheap_update, etc.), for which we need them to
+be aligned at a two-byte boundary.
+
+Undo chain
+-----------
+Each undo record header contains the previous undo record pointer of the
+transaction that is performing the operation. For example, if transaction T1
+has updated the tuple two times, the undo record for the last update will
+have a link to the undo record of the previous update. Thus, the undo records
+for a particular page in a particular transaction form a single, linked
+chain.
+
+Snapshots and visibility
+-------------------------
+Given a TID and a snapshot, there are three possibilities: (a) the tuple
+currently stored at the given TID might be visible; (b) some tuple previously
+stored at the given TID and subsequently written to the undo log might be
+visible; or (c) there might be nothing visible at all. To check the
+visibility of a tuple, we fetch the transaction slot number stored in the
+tuple header, and then get the transaction id and undo record pointer from
+the transaction slot. Next, we check the current tuple’s visibility based on
+the transaction id fetched from the transaction slot and the last operation
+performed on the tuple. For example, if the last operation on the tuple is a
+delete and the xid is visible to our snapshot, then we return NULL,
+indicating no visible tuple. But if the xid that last operated on the tuple
+is not visible to the snapshot, then we use the undo record pointer to fetch
+the prior tuple from undo and similarly check its visibility. The only
+difference in checking the visibility of the undo tuple is that the xid that
+previously operated on the undo tuple is present in the undo record, so we
+can use that instead of relying on the transaction slot. If the tuple from
+undo is also not visible, then we fetch the prior tuple from the undo chain.
+We need to traverse undo chains until we find a visible tuple or reach the
+initially inserted tuple; if that is also not visible, we can return NULL.
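
The traversal just described boils down to the following loop. This is
deliberately simplified pseudocode in C form: the ZHeapTuple type is assumed,
the check_version() and fetch_prior_version_from_undo() helpers are
hypothetical names, and the real code (ztqual.c in this patch) additionally
has to cope with buffer locks, reused slots, and discarded undo.

    /* Hypothetical outcome of checking one tuple version against a snapshot. */
    typedef enum
    {
        ZVERSION_VISIBLE,       /* this version is visible; return it */
        ZVERSION_NONE,          /* e.g. a delete that the snapshot can see */
        ZVERSION_CHECK_PRIOR    /* modifying xid not visible; look in undo */
    } ZVersionResult;

    static ZHeapTuple
    zheap_fetch_visible_version(ZHeapTuple tuple, Snapshot snapshot,
                                TransactionId xid, UndoRecPtr urec_ptr)
    {
        for (;;)
        {
            switch (check_version(tuple, xid, snapshot))    /* hypothetical */
            {
                case ZVERSION_VISIBLE:
                    return tuple;
                case ZVERSION_NONE:
                    return NULL;
                case ZVERSION_CHECK_PRIOR:
                    break;
            }

            /* Chain exhausted: even the initial insert is not visible. */
            if (urec_ptr == InvalidUndoRecPtr)
                return NULL;

            /*
             * The undo record carries the xid (and cid) that created the
             * prior version, so from here on we no longer depend on the
             * page's transaction slot.
             */
            tuple = fetch_prior_version_from_undo(urec_ptr, &xid, &urec_ptr);
        }
    }
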
+
+During visibility checking of a tuple in a zheap page or an undo chain, if we
+find that the tuple’s transaction slot has been reused, we retrieve the
+transaction information (the xid and cid that have modified the tuple) of
+that tuple from undo.
+
+EvalPlanQual mechanism
+-----------------------
+This works in basically the same way as for the existing heap. The only
+special consideration is that the updated tuple could have the same TID as
+the original one if it was updated in place, so we might want to optimize
+such that we need not release the buffer lock and refetch the tuple.
+However, at this stage, we are not sure if there is any big advantage in such
+an optimization.
+
+64-bit transaction ids
+-----------------------
+Transaction slots in zheap pages store both the epoch and the XID; this
+eliminates the confusion between a use of a given XID in the current epoch
+and a use in some previous epoch, which means that we never need to freeze
+tuples. The difference between the oldest running XID and the newest XID is
+still limited to 2 billion because of the way that snapshots work. Moreover,
+the oldest XID that still has undo must have an XID age less than 2 billion:
+among other problems, this is currently the limit for how long commit status
+data can be retained, and it would be bad if we had undo data but didn’t know
+whether or not to apply the undo actions. Currently, this limitation is
+enforced by piggybacking on the existing wraparound machinery.
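
The "epoch plus XID" comparison is essentially a single 64-bit value, which
is also what the MakeEpochXid() call in prunetpd.c later in this patch
suggests. The definition below is only a sketch of that idea, not the patch's
actual implementation.

    /*
     * Illustrative: combine a 32-bit epoch and a 32-bit TransactionId into a
     * single monotonically increasing 64-bit value, so XIDs from different
     * epochs compare correctly and tuples never need freezing.
     */
    static inline uint64
    make_epoch_xid(uint32 epoch, TransactionId xid)
    {
        return ((uint64) epoch << 32) | (uint64) xid;
    }
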
+
+Indexing
+---------
+Current index AMs are not prepared to cope with multiple tuples at the same
+TID with different values stored in the index column. We plan to introduce
+special index AM support for in-place updates; when an index lacks such
+support, any modification to the value stored in a column covered by that
+index will prevent the use of in-place update. Additionally, indexes lacking
+such support will still require routine vacuuming, which we believe can be
+avoided when such support is present.
+
+The basic idea is that we need to delete-mark index entries when they might
+no longer be valid, either because of a delete or because of an update
+affecting the indexed column. An in-place update that does not modify the
+indexed column need not delete-mark the corresponding index entries. Note
+that an entry which is delete-marked might still be valid for some snapshots;
+once no relevant snapshots remain, we can remove the entry. In some cases, we
+may remove a delete-mark from an entry rather than removing the entry, either
+because the transaction which applied the delete-mark has rolled back, or
+because the indexed column was changed from value A to value B and then
+eventually back to value A.
+It is very desirable for performance reasons to be able to distinguish from
+the index page whether or not the corresponding heap tuple is definitely
+all-visible, but the delete-marking approach is not quite sufficient for this
+purpose unless recently-inserted tuples are also delete-marked -- and that is
+undesirable, since the delete-markings would have to be cleared after the
+inserting transaction committed, which might end up dirtying many or all
+index pages. An alternative approach is to write undo for index insertions;
+then, the undo pointers in the index page tell us whether any index entries
+on that page may be recently-inserted, and the presence or absence of a
+delete-mark tells us whether any index entries on that page may no longer be
+valid. We intend to adopt this approach; it should allow index-only scans in
+most cases without the need for a separately-maintained visibility map.
+
+With this approach, an in-place update touches each index whose indexed
+columns are modified twice -- once to delete-mark the old entry (or entries)
+and once to insert the new entries. In some use cases, this will compare
+favorably with the existing approach, which touches every index exactly once.
+Specifically, it should reduce write amplification and index bloat when only
+one or a few indexed columns are updated at a time.
+
+Indexes that don't have delete-marking
+---------------------------------------
+Although indexes which lack delete-marking support still require vacuum, we
+can use undo to reduce the current three-pass approach to just two passes,
+avoiding the final heap scan. When a row is deleted, vacuum will directly
+mark the line pointer as unused, writing an undo record as it does, and then
+mark the corresponding index entries as dead. If vacuum fails midway through,
+the undo can ensure that changes to the heap page are rolled back. If the
+vacuum goes on to commit, we don't need to revisit the heap page after index
+cleanup.
+
+We must be careful about TID reuse: we will only allow a TID to be reused
+when the transaction that has marked it as unused has committed. At that
+point, we can be assured that all the index entries corresponding to dead
+tuples will be marked as dead.
+
+Undo actions
+-------------
+We need to apply undo actions during explicit ROLLBACK or ROLLBACK TO
+SAVEPOINT operations and when an error causes a transaction or subtransaction
+abort. These actions reverse whatever work was done when the operation was
+performed; for example, if an update aborts, we must restore the old version
+of the tuple. During an explicit ROLLBACK or ROLLBACK TO SAVEPOINT, the
+transaction is in a good state and we have relevant locks on objects, so
+applying undo actions is straightforward, but the same is not true in error
+paths. In the case of a subtransaction abort, undo actions are performed
+after rolling back the subtransaction; the parent transaction is still good.
+In the case of a top-level abort, we begin an entirely new transaction to
+perform the undo actions. If this new transaction aborts, it can be retried
+later. For short transactions (say, one which generates only a few kB of undo
+data), it is okay to apply the actions in the foreground, but for longer
+transactions it is advisable to delegate the work to an undo worker running
+in the background. The user is provided with a knob to control this behavior.
+
+Just like the DML operations to which they correspond, undo actions require
+us to write WAL.
+Otherwise, we would be unable to recover after a crash, and standby servers
+would not be properly updated.
+
+Applying undo actions
+----------------------
+In many cases, the same page will be modified multiple times by the same
+transaction. We can save locking and reduce WAL generation by collecting all
+of the undo records for a given page and then applying them all at once.
+However, it’s difficult to collect all of the records that might apply to a
+page from an arbitrarily large undo log in an efficient manner; in
+particular, we want to avoid rereading the same undo pages multiple times.
+Currently, we collect all consecutive records which apply to the same page
+and then apply them in one shot. This will cover the cases where most of the
+changes to heap pages are performed together. This algorithm could be
+improved. For example, we could do something like this (a sketch of the
+per-block application step follows the list):
+
+1. Read the last 32MB of undo for the transaction being undone (or all of the
+undo for the transaction, if there is less than 32MB).
+2. For each block that is touched by at least one record in the 32MB chunk,
+consolidate all records from this chunk that apply to that block.
+3. Sort the blocks by buffertag and apply the changes in ascending
+block-number order within each relation. Do this even for incomplete chains,
+so nothing is saved for later.
+4. Go to step 1.
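
The per-block application step, whether reached through the current
"consecutive records" strategy or the batching proposed above, can be
sketched roughly as follows. The undo_apply_one_record() and
log_zheap_undo_apply() helpers are hypothetical names, LSN stamping of the
page is elided, and the real code must also verify the page's transaction
slot before applying anything.

    #include "postgres.h"
    #include "access/undorecord.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Apply a batch of undo records that all touch the same block.  Sketch. */
    static void
    zheap_undo_one_block(Relation rel, BlockNumber blkno,
                         UnpackedUndoRecord **records, int nrecords)
    {
        Buffer      buf = ReadBuffer(rel, blkno);
        int         i;

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

        START_CRIT_SECTION();

        /* Undo newest-to-oldest so the page steps back through its history. */
        for (i = 0; i < nrecords; i++)
            undo_apply_one_record(buf, records[i]);         /* hypothetical */

        MarkBufferDirty(buf);

        /* One WAL record for the whole batch, mirroring the DO path. */
        if (RelationNeedsWAL(rel))
            log_zheap_undo_apply(buf, records, nrecords);   /* hypothetical */

        END_CRIT_SECTION();

        UnlockReleaseBuffer(buf);
    }
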
+
+After applying undo actions for a page, we clear the transaction slot on the
+page if the oldest undo record we applied is the oldest undo record for that
+block generated by that transaction. Otherwise, we rewind the undo pointer in
+the page slot to the last record for that block that precedes the last undo
+record we applied. Because applying undo also always updates the transaction
+slot on the page, either rewinding it or clearing it completely, we can
+always skip applying undo if we find that it’s already been applied
+previously. This could happen if the application of undo for a given
+transaction is interrupted by a crash, or if it fails for some reason and is
+retried later.
+
+This also prevents us from getting confused when the relation is (a) dropped,
+(b) rewritten using a new relfilenode, or (c) truncated to a shorter length
+(and perhaps subsequently re-extended). We apply the undo action only if the
+page contains the effect of the transaction for which we are applying undo
+actions, which can always be determined by examining the undo pointer in the
+transaction slot. If there is no transaction slot for the current transaction
+or if it is present but the undo record pointer in the slot is less than the
+undo record pointer of the undo record under consideration, the undo record
+can be ignored; it has already been applied or is no longer relevant. After a
+toplevel transaction abort, undo space is not recycled. However, after a
+subtransaction abort, we rewind the insert pointer to wherever it was at the
+start of the subtransaction, so that the undo for the toplevel transaction
+remains contiguous. We can’t do the same for toplevel aborts, as the undo
+might contain special undo records related to transaction slots that were
+reused, and we can’t afford to lose those. We write these special undo
+records only for a toplevel transaction when it doesn’t find any free
+transaction slot or there is no transaction slot that contains an all-visible
+transaction. In such cases, we reuse the committed transaction slots and
+write an undo record that contains the transaction information for them, as
+we might need that information for a transaction which still can’t see the
+committed transaction. We mark all such slots (that belong to committed
+transactions) as available for reuse in one shot, as doing it one slot at a
+time is quite costly. Since we might still need the special undo records for
+the transaction slots other than the current transaction, we can’t simply
+rewind the insert pointer. Note that we do this only for toplevel
+transactions; if we need a new slot while in a subtransaction, we reclaim
+only a single transaction slot.
+
+WAL consideration
+------------------
+Undo records are critical data and must be protected via WAL. Because an undo
+record must be written if and only if a page modification occurs, the undo
+record and the record for the page modification must be one and the same.
+Moreover, it is very important not to duplicate any information or store any
+unnecessary information, since WAL volume has a significant impact on overall
+system performance. In particular, there is no need to log the undo record
+pointer. We only need to ensure that after crash recovery the undo record
+pointer is set correctly for each of the undo logs. To ensure that, we log a
+WAL record after an XID change or at the first operation on an undo log after
+a checkpoint. The WAL record contains the insert point, the log number, and
+the XID. This is enough to form an XID->(log number + log insertion point)
+map which will be used to calculate the location of undo insertion during
+recovery.
+
+Another important consideration is that we don't need to have full page
+images for data in undo logs. Because the undo logs are always written
+serially, torn pages are not an issue. Suppose that some block in one of the
+undo logs is half filled and synced properly to disk; now, a checkpoint
+occurs. Next, we add some more data to the block. During the following
+checkpoint, the system crashes while flushing the block. The block could be
+in a condition such that the first few bytes of it, say 512 bytes, are
+flushed properly and the rest are old, but this won't cause a problem because
+the old bytes will be intact and we can always start inserting new records at
+the insert location reconstructed during recovery.
+
+Undo Worker
+------------
+Currently, we have one background undo worker which performs undo actions as
+required and discards undo logs when they are no longer needed. Typically, it
+performs undo actions in response to a notification from a backend that has
+just aborted a transaction, but it will eventually detect and perform undo
+actions for any aborted transaction that does not otherwise get cleaned up.
+
+We allow the undo worker to hibernate when there is no activity in the
+system. It hibernates for a minimum of 100ms and a maximum of 10s, based on
+the time the system has remained idle. The undo worker mechanism will be
+extended to multiple undo workers to perform various jobs related to undo
+logs. For example, if there are many pending rollback requests, then we can
+spawn a new undo worker which can help in processing the requests.
+
+The UndoDiscard routine will be called by the undo worker for discarding the
+old undo records. UndoDiscard will process all the active undo logs.
+It reads each undo log and checks whether the undo for the first transaction
+in the log can be discarded (committed and all-visible, or aborted and undo
+already applied). If so, it moves to the next transaction in that undo log
+and continues in the same way. When it finds the first transaction whose undo
+can't be discarded yet, it first discards the undo log prior to that point
+and then remembers the transaction ID and undo location in shared memory. We
+consider undo for a transaction to be discardable once its XID is smaller
+than oldestXmin.
+
+Ideally, for an aborted transaction we should be able to discard its undo
+once the undo actions are replayed; however, it might contain the undo
+records for reused transaction slots, so we can’t discard them until its XID
+becomes smaller than oldestXmin. Also, we can’t discard the undo for the
+aborted transaction if there is a preceding transaction which is committed
+and not all-visible. We can allow undo for aborted transactions to be
+discarded immediately if we remember in the first undo record of the
+transaction whether it contains undo for a reused transaction slot. This will
+help the cases where the aborted transaction is the last transaction in the
+undo log which is smaller than oldestXmin.
+
+In Hot Standby mode, undo is discarded via WAL replay. Before discarding
+undo, we ensure that there are no queries running which need to get tuples
+from the discarded undo. If there are any, a recovery conflict will occur,
+similar to what happens in other cases where a resource held by a particular
+backend prevents replay from advancing.
+
+For each undo log, the undo discard module maintains an in-memory array to
+hold the latest undiscarded xid and its start undo record pointer. The first
+XID in the undo log will be compared against GlobalXmin; if the xid is
+greater than GlobalXmin, then nothing can be discarded, otherwise we scan the
+undo log starting with the oldest transaction it contains. To avoid
+processing every record in the undo log, we maintain a transaction start
+header in the first undo record written by any given transaction, with space
+to store a pointer to the next transaction start undo record in that same
+undo log. This allows us to read an undo log transaction by transaction. When
+discarding undo, the background worker will read all active undo logs
+transaction by transaction until it finds a transaction with an XID greater
+than or equal to GlobalXmin. Once it finds such a transaction, it will
+discard all earlier undo records in that undo log, without even writing
+unflushed buffers to disk.
+
+Avoid fetching discarded undo record
+-------------------------------------
+The system must never attempt to fetch undo records which have already been
+discarded. Undo is generally discarded in the background by the undo worker,
+so we must account for the possibility that undo could be discarded at any
+time. We do maintain the oldest xid that has undo (oldestXidHavingUndo). The
+undo worker updates the value of oldestXidHavingUndo after discarding all the
+undo. Backends consider all transactions that precede oldestXidHavingUndo as
+all-visible, so they normally don’t try to fetch undo that has already been
+discarded. However, there is a race condition where a backend decides that
+the transaction is greater than oldestXidHavingUndo and needs to fetch the
+undo record, and in the meantime the undo worker discards the corresponding
+undo record. To handle such race conditions, we need to maintain some
+synchronization between backends and the undo worker so that backends don’t
+try to access already discarded undo. So whenever an undo fetch is trying to
+read an undo record from an undo log, it first needs to acquire that log's
+discard_lock in SHARED mode and check that the undo record pointer is not
+less than log->oldest_data; if it is, we don't fetch that undo record and
+return NULL (which means the previous version is all-visible). The undo
+worker takes log->discard_lock in EXCLUSIVE mode to update log->oldest_data.
+We hold this lock just to update the value in shared memory; the actual
+discard happens outside this lock.
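
The reader side of this protocol follows directly from the description above:
take the per-log discard_lock in shared mode, compare the target pointer
against oldest_data, and only then fetch. The discard_lock and oldest_data
members are named in the text; the fetch helper below is a hypothetical
stand-in.

    /* Sketch of the backend-side check before fetching an undo record. */
    static UnpackedUndoRecord *
    fetch_undo_record_guarded(UndoLogControl *log, UndoRecPtr urp)
    {
        UnpackedUndoRecord *uur;

        LWLockAcquire(&log->discard_lock, LW_SHARED);

        if (urp < log->oldest_data)
        {
            /* Already discarded: the previous version is all-visible. */
            LWLockRelease(&log->discard_lock);
            return NULL;
        }

        /* Hold the lock across the fetch so the discard worker waits. */
        uur = fetch_undo_record(urp);       /* hypothetical helper */

        LWLockRelease(&log->discard_lock);

        return uur;
    }
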
+To handle such race conditions, we need to maintain some synchronization +between backends and undo worker so that backends don’t try to access already +discarded undo. So whenever undo fetch is trying to read a undo record from +an undo log, first it needs to acquire a log->discard_lock in SHARED mode for +the undo log and check that the undo record pointer is not less than +log->oldest_data, if so, then don't fetch that undo record and return +NULL (that means the previous version is all visible). And undo worker will +take log->discard_lock in EXCLUSIVE mode for updating the +log->oldest_data. We hold this lock just to update the value in shared +memory, the actual discard happens outside this lock. + +Undo Log Storage +----------------- +This subsystem is responsible for lifecycle management of undo logs and +backing files, associating undo logs with backends, allocating and managing +space within undo logs. It provides access to undo log contents via shared +buffers. The list of available undo logs is maintained in shared memory. +Whenever a backend request for undo log allocation, it attaches a first free +undo log to a backend, and if all existing undo logs are busy, it will create +a new one. A set of APIs is provided by this subsystem to efficiently allocate +and discard undo logs. + +During a checkpoint, all the undo segment files and undo metadata files will +be flushed to the disk. diff --git a/src/backend/access/zheap/prunetpd.c b/src/backend/access/zheap/prunetpd.c new file mode 100644 index 0000000000..ce7c8576a5 --- /dev/null +++ b/src/backend/access/zheap/prunetpd.c @@ -0,0 +1,507 @@ +/*------------------------------------------------------------------------- + * + * prunetpd.c + * TPD page pruning + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/zheap/prunetpd.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/tpd_xlog.h" +#include "miscadmin.h" +#include "storage/bufpage.h" +#include "storage/freespace.h" +#include "storage/proc.h" + +typedef struct TPDPruneState +{ + int nunused; + OffsetNumber nowunused[MaxTPDTuplesPerPage]; +} TPDPruneState; + +static void TPDEntryPrune(Buffer buf, OffsetNumber offnum, TPDPruneState *prstate, + Size *space_freed); +static XLogRecPtr LogTPDClean(Relation rel, Buffer tpdbuf, + OffsetNumber *nowunused, int nunused, + OffsetNumber target_offnum, Size space_required); +static int TPDPruneEntirePage(Relation rel, Buffer tpdbuf); + +/* + * TPDPagePrune - Prune the TPD page. + * + * Process all the TPD entries in the page and remove the old entries which + * are all-visible. We first collect all such entries and then process them + * in one-shot. + * + * We expect caller must have an exclusive lock on the page. + * + * Returns the number of entries pruned. + */ +int +TPDPagePrune(Relation rel, Buffer tpdbuf, BufferAccessStrategy strategy, + OffsetNumber target_offnum, Size space_required, bool can_free, + bool *update_tpd_inplace, bool *tpd_e_pruned) +{ + Page tpdpage, tmppage = NULL; + TPDPageOpaque tpdopaque; + TPDPruneState prstate; + OffsetNumber offnum, maxoff; + ItemId itemId; + uint64 epoch_xid; + uint64 epoch; + Size space_freed; + + prstate.nunused = 0; + tpdpage = BufferGetPage(tpdbuf); + + /* Initialise the out variables. 
*/ + if (update_tpd_inplace) + *update_tpd_inplace = false; + if (tpd_e_pruned) + *tpd_e_pruned = false; + + /* Can we prune the entire page? */ + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + epoch = tpdopaque->tpd_latest_xid_epoch; + epoch_xid = MakeEpochXid(epoch, tpdopaque->tpd_latest_xid); + if (epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + { + prstate.nunused = TPDPruneEntirePage(rel, tpdbuf); + goto free_tpd_page; + } + + /* initialize the space_free with already existing free space in page */ + space_freed = PageGetExactFreeSpace(tpdpage); + + /* Scan the page */ + maxoff = PageGetMaxOffsetNumber(tpdpage); + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + itemId = PageGetItemId(tpdpage, offnum); + + /* Nothing to do if slot is empty. */ + if (!ItemIdIsUsed(itemId)) + continue; + + TPDEntryPrune(tpdbuf, offnum, &prstate, &space_freed); + } + + /* + * There is not much advantage in continuing, if we can't free the space + * required by the caller or we are not asked to forcefully prune the + * page. + * + * XXX - In theory, we can still continue and perform pruning in the hope + * that some future update in this page will be able to use that space. + * However, it will lead to additional writes without any guaranteed + * benefit, so we skip the pruning for now. + */ + if (space_freed < space_required) + return 0; + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + */ + if (prstate.nunused > 0 || OffsetNumberIsValid(target_offnum)) + tmppage = PageGetTempPageCopy(tpdpage); + + /* Any error while applying the changes is critical */ + START_CRIT_SECTION(); + + /* + * Have we found any prunable items or caller has asked us to make space + * next to target_offnum? + */ + if (prstate.nunused > 0 || OffsetNumberIsValid(target_offnum)) + { + /* + * Apply the planned item changes, then repair page fragmentation, and + * update the page's hint bit about whether it has free line pointers. + */ + TPDPagePruneExecute(tpdbuf, prstate.nowunused, prstate.nunused); + + /* + * Finally, repair any fragmentation, and update the page's hint bit about + * whether it has free pointers. It is quite possible that there are no + * prunable items on the page in which case it will rearrange the page to + * make the space at the required offset. + */ + TPDPageRepairFragmentation(tpdpage, tmppage, target_offnum, + space_required); + + MarkBufferDirty(tpdbuf); + + /* + * Emit a WAL TPD_CLEAN record showing what we did. + * + * XXX Unlike heap pruning, we don't need to remember latestRemovedXid + * for the purpose of generating conflicts on standby. We use + * oldestXidHavingUndo as the horizon to prune the TPD entries which + * means all the prior undo must have discarded and during undo discard + * we already generate such xid (see undolog_xlog_discard) which should + * serve our purpose as this WAL must reach after that. + */ + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + + recptr = LogTPDClean(rel, tpdbuf, prstate.nowunused, + prstate.nunused, target_offnum, + space_required); + + PageSetLSN(tpdpage, recptr); + } + + if (update_tpd_inplace) + *update_tpd_inplace = true; + } + + END_CRIT_SECTION(); + + /* be tidy. */ + if (tmppage) + pfree(tmppage); + +free_tpd_page: + if (can_free && PageIsEmpty(tpdpage)) + { + Size freespace; + + /* + * If the page is empty, we have certainly pruned all the tpd + * entries. 
+ */ + if (tpd_e_pruned) + *tpd_e_pruned = true; + /* + * We can reuse empty page as either a heap page or a TPD + * page, so no need to consider opaque space. + */ + freespace = BLCKSZ - SizeOfPageHeaderData; + + /* + * TPD page is empty, remove it from TPD used page list and + * record it in FSM. + */ + if (TPDFreePage(rel, tpdbuf, strategy)) + RecordPageWithFreeSpace(rel, BufferGetBlockNumber(tpdbuf), + freespace); + } + + return prstate.nunused; +} + +/* + * TPDEntryPrune - Check whether the TPD entry is prunable. + * + * Process all the transaction slots of a TPD entry present at a given offset. + * TPD entry will be considered prunable, if all the transaction slots either + * contains transaction that is older than oldestXidHavingUndo or + * doesn't have a valid transaction. + */ +static void +TPDEntryPrune(Buffer tpdbuf, OffsetNumber offnum, TPDPruneState *prstate, + Size *space_freed) +{ + Page tpdpage; + TPDEntryHeaderData tpd_e_hdr; + TransInfo *trans_slots; + ItemId itemId; + Size size_tpd_e_slots, size_tpd_e_map; + Size size_tpd_entry; + int num_trans_slots, slot_no; + int loc_trans_slots; + uint16 tpd_e_offset; + bool prune_entry = true; + + tpdpage = BufferGetPage(tpdbuf); + itemId = PageGetItemId(tpdpage, offnum); + tpd_e_offset = ItemIdGetOffset(itemId); + size_tpd_entry = ItemIdGetLength(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* + * We can prune the deleted entries as no one will be referring to such + * entries. + */ + if (TPDEntryIsDeleted(tpd_e_hdr)) + goto prune_tpd_entry; + + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint8); + else + { + Assert(tpd_e_hdr.tpe_flags & TPE_FOUR_BYTE); + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint32); + } + + num_trans_slots = tpd_e_hdr.tpe_num_slots; + size_tpd_e_slots = num_trans_slots * sizeof(TransInfo); + loc_trans_slots = tpd_e_offset + SizeofTPDEntryHeader + size_tpd_e_map; + + trans_slots = (TransInfo *) palloc(size_tpd_e_slots); + memcpy((char *) trans_slots, tpdpage + loc_trans_slots, size_tpd_e_slots); + + for (slot_no = 0; slot_no < num_trans_slots; slot_no++) + { + uint64 epoch_xid; + TransactionId xid; + uint64 epoch; + UndoRecPtr urec_ptr = trans_slots[slot_no].urec_ptr; + + epoch = trans_slots[slot_no].xid_epoch; + xid = trans_slots[slot_no].xid; + epoch_xid = MakeEpochXid(epoch, xid); + /* + * Check whether transaction slot can be considered frozen? + * If both transaction id and undo record pointer are invalid or + * xid is invalid and its undo has been discarded or xid is older than + * the oldest xid with undo. + */ + if ((!TransactionIdIsValid(xid) && + (!UndoRecPtrIsValid(urec_ptr) || UndoLogIsDiscarded(urec_ptr))) || + (TransactionIdIsValid(xid) && + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo))) + continue; + else + { + prune_entry = false; + break; + } + } + + pfree(trans_slots); + +prune_tpd_entry: + if (prune_entry) + { + Assert (prstate->nunused < MaxTPDTuplesPerPage); + prstate->nowunused[prstate->nunused] = offnum; + prstate->nunused++; + + *space_freed += size_tpd_entry; + } +} + +/* + * TPDPagePruneExecute - Guts of the TPD page pruning. + * + * Here, we mark all the entries that can be pruned as unused and then call page + * repair fragmentation to compact the page. 
+ */ +void +TPDPagePruneExecute(Buffer tpdbuf, OffsetNumber *nowunused, int nunused) +{ + Page tpdpage; + OffsetNumber *offnum; + int i; + + tpdpage = BufferGetPage(tpdbuf); + + /* Update all now-unused line pointers */ + offnum = nowunused; + for (i = 0; i < nunused; i++) + { + OffsetNumber off = *offnum++; + ItemId lp = PageGetItemId(tpdpage, off); + + ItemIdSetUnused(lp); + } +} + +/* + * TPDPageRepairFragmentation - Frees fragmented space on a tpd page. + * + * It doesn't remove unused line pointers because some heappage might + * still point to the line pointer. If we remove the line pointer, then + * the same space could be occupied by actual TPD entry in which case somebody + * trying to access that line pointer will get unpredictable behavior. + */ +void +TPDPageRepairFragmentation(Page page, Page tmppage, OffsetNumber target_offnum, + Size space_required) +{ + Offset pd_lower = ((PageHeader) page)->pd_lower; + Offset pd_upper = ((PageHeader) page)->pd_upper; + Offset pd_special = ((PageHeader) page)->pd_special; + itemIdSortData itemidbase[MaxTPDTuplesPerPage]; + itemIdSort itemidptr; + ItemId lp; + int nline, + nstorage, + nunused; + int i; + Size totallen; + + /* + * It's worth the trouble to be more paranoid here than in most places, + * because we are about to reshuffle data in (what is usually) a shared + * disk buffer. If we aren't careful then corrupted pointers, lengths, + * etc could cause us to clobber adjacent disk buffers, spreading the data + * loss further. So, check everything. + */ + if (pd_lower < SizeOfPageHeaderData || + pd_lower > pd_upper || + pd_upper > pd_special || + pd_special > BLCKSZ) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u", + pd_lower, pd_upper, pd_special))); + + /* + * Run through the line pointer array and collect data about live items. + */ + nline = PageGetMaxOffsetNumber(page); + itemidptr = itemidbase; + nunused = totallen = 0; + for (i = FirstOffsetNumber; i <= nline; i++) + { + lp = PageGetItemId(page, i); + if (ItemIdIsUsed(lp)) + { + if (ItemIdHasStorage(lp)) + { + itemidptr->offsetindex = i - 1; + itemidptr->itemoff = ItemIdGetOffset(lp); + if (unlikely(itemidptr->itemoff < (int) pd_upper || + itemidptr->itemoff >= (int) pd_special)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted item pointer: %u", + itemidptr->itemoff))); + if (i == target_offnum) + itemidptr->alignedlen = ItemIdGetLength(lp) + + space_required; + else + itemidptr->alignedlen = ItemIdGetLength(lp); + totallen += itemidptr->alignedlen; + itemidptr++; + } + } + else + { + /* Unused entries should have lp_len = 0, but make sure */ + ItemIdSetUnused(lp); + nunused++; + } + } + + nstorage = itemidptr - itemidbase; + if (nstorage == 0) + { + /* Page is completely empty, so just reset it quickly */ + ((PageHeader) page)->pd_lower = SizeOfPageHeaderData; + ((PageHeader) page)->pd_upper = pd_special; + } + else + { + /* Need to compact the page the hard way */ + if (totallen > (Size) (pd_special - pd_lower)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted item lengths: total %u, available space %u", + (unsigned int) totallen, pd_special - pd_lower))); + + compactify_ztuples(itemidbase, nstorage, page, tmppage); + } + + /* Set hint bit for TPDPageAddEntry */ + if (nunused > 0) + PageSetHasFreeLinePointers(page); + else + PageClearHasFreeLinePointers(page); +} + +/* + * LogTPDClean - Write WAL for TPD entries that can be pruned. 
+ */ +XLogRecPtr +LogTPDClean(Relation rel, Buffer tpdbuf, + OffsetNumber *nowunused, int nunused, + OffsetNumber target_offnum, Size space_required) +{ + XLogRecPtr recptr; + xl_tpd_clean xl_rec; + + /* Caller should not call me on a non-WAL-logged relation */ + Assert(RelationNeedsWAL(rel)); + + xl_rec.flags = 0; + XLogBeginInsert(); + + if (target_offnum != InvalidOffsetNumber) + xl_rec.flags |= XL_TPD_CONTAINS_OFFSET; + XLogRegisterData((char *) &xl_rec, SizeOfTPDClean); + + /* Register the offset information. */ + if (target_offnum != InvalidOffsetNumber) + { + XLogRegisterData((char *) &target_offnum, sizeof(OffsetNumber)); + XLogRegisterData((char *) &space_required, sizeof(space_required)); + } + + XLogRegisterBuffer(0, tpdbuf, REGBUF_STANDARD); + + /* + * The OffsetNumber array is not actually in the buffer, but we pretend + * it is. When XLogInsert stores the whole buffer, the offset array need + * not be stored too. Note that even if the array is empty, we want to + * expose the buffer as a candidate for whole-page storage, since this + * record type implies a defragmentation operation even if no item pointers + * changed state. + */ + if (nunused > 0) + XLogRegisterBufData(0, (char *) nowunused, + nunused * sizeof(OffsetNumber)); + + recptr = XLogInsert(RM_TPD_ID, XLOG_TPD_CLEAN); + + return recptr; +} + +/* + * TPDPruneEntirePage + */ +static int +TPDPruneEntirePage(Relation rel, Buffer tpdbuf) +{ + Page page = BufferGetPage(tpdbuf); + int entries_removed = PageGetMaxOffsetNumber(page); + + START_CRIT_SECTION(); + + /* Page is completely empty, so just reset it quickly */ + ((PageHeader) page)->pd_lower = SizeOfPageHeaderData; + ((PageHeader) page)->pd_upper = ((PageHeader) page)->pd_special; + + MarkBufferDirty(tpdbuf); + + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + + XLogBeginInsert(); + + XLogRegisterBuffer(0, tpdbuf, REGBUF_STANDARD); + + recptr = XLogInsert(RM_TPD_ID, XLOG_TPD_CLEAN_ALL_ENTRIES); + + PageSetLSN(BufferGetPage(tpdbuf), recptr); + } + + END_CRIT_SECTION(); + + return entries_removed; +} diff --git a/src/backend/access/zheap/prunezheap.c b/src/backend/access/zheap/prunezheap.c new file mode 100644 index 0000000000..cff20551e8 --- /dev/null +++ b/src/backend/access/zheap/prunezheap.c @@ -0,0 +1,943 @@ +/*------------------------------------------------------------------------- + * + * prunezheap.c + * zheap page pruning + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/heap/prunezheap.c + * + * In Zheap, we can reclaim space on following operations + * a. non-inplace updates, when committed or rolled back. + * b. inplace updates that reduces the tuple length, when commited. + * c. deletes, when committed. + * d. inserts, when rolled back. + * + * Since we only store xid which changed the page in pd_prune_xid, to prune + * the page, we can check if pd_prune_xid is in progress. This can sometimes + * lead to unwanted page pruning calls as a side effect, example in case of + * rolled back deletes. If there is nothing to prune, then the call to prune + * is cheap, so we don't want to optimize it at this stage. 
+ *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/zheap.h" +#include "access/zheapam_xlog.h" +#include "access/zheaputils.h" +#include "utils/ztqual.h" +#include "catalog/catalog.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "storage/bufmgr.h" +#include "storage/procarray.h" + +/* Working data for zheap_page_prune and subroutines */ +typedef struct +{ + TransactionId new_prune_xid; /* new prune hint value for page */ + TransactionId latestRemovedXid; /* latest xid to be removed by this prune */ + int ndeleted; /* numbers of entries in arrays below */ + int ndead; + int nunused; + /* arrays that accumulate indexes of items to be changed */ + + /* + * Fixme - arrays must use MaxZHeapTuplesPerPage, once we have constant + * value for the same. + */ + OffsetNumber nowdeleted[MaxZHeapTuplesPerPage]; + OffsetNumber nowdead[MaxZHeapTuplesPerPage]; + OffsetNumber nowunused[MaxZHeapTuplesPerPage]; + /* marked[i] is TRUE if item i is entered in one of the above arrays */ + bool marked[MaxZHeapTuplesPerPage + 1]; +} ZPruneState; + +static int zheap_prune_item(Relation relation, Buffer buffer, + OffsetNumber rootoffnum, TransactionId OldestXmin, + ZPruneState *prstate, int *space_freed); +static void zheap_prune_record_prunable(ZPruneState * prstate, + TransactionId xid); +static void zheap_prune_record_dead(ZPruneState * prstate, OffsetNumber offnum); +static void zheap_prune_record_deleted(ZPruneState * prstate, + OffsetNumber offnum); + +/* + * Optionally prune and repair fragmentation in the specified page. + * + * Caller must have exclusive lock on the page. + * + * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD + * or RECENTLY_DEAD (see ZHeapTupleSatisfiesOldestXmin). + * + * This is an opportunistic function. It will perform housekeeping only if + * the page has effect of transaction thas has modified data which can be + * pruned. + * + * Note: This is called only when we need some space in page to perform the + * action which otherwise would need a different page. It is called when an + * update statement has to update the existing tuple such that new tuple is + * bigger than old tuple and the same can't fit on page. + * + * Returns true, if we are able to free up the space such that the new tuple + * can fit into same page, otherwise, false. + */ +bool +zheap_page_prune_opt(Relation relation, Buffer buffer, + OffsetNumber offnum, Size space_required) +{ + Page page; + TransactionId OldestXmin; + TransactionId ignore = InvalidTransactionId; + Size pagefree; + bool force_prune = false; + bool pruned; + + page = BufferGetPage(buffer); + + /* + * We can't write WAL in recovery mode, so there's no point trying to + * clean the page. The master will likely issue a cleaning WAL record soon + * anyway, so this is no particular loss. + */ + if (RecoveryInProgress()) + return false; + + /* + * Use the appropriate xmin horizon for this relation. If it's a proper + * catalog relation or a user defined, additional, catalog relation, we + * need to use the horizon that includes slots, otherwise the data-only + * horizon can be used. Note that the toast relation of user defined + * relations are *not* considered catalog relations. 
+ * + * It is OK to apply the old snapshot limit before acquiring the cleanup + * lock because the worst that can happen is that we are not quite as + * aggressive about the cleanup (by however many transaction IDs are + * consumed between this point and acquiring the lock). This allows us to + * save significant overhead in the case where the page is found not to be + * prunable. + */ + if (IsCatalogRelation(relation) || + RelationIsAccessibleInLogicalDecoding(relation)) + OldestXmin = RecentGlobalXmin; + else + OldestXmin = RecentGlobalDataXmin; + + Assert(TransactionIdIsValid(OldestXmin)); + + if (OffsetNumberIsValid(offnum)) + { + pagefree = PageGetExactFreeSpace(page); + + /* + * We want to forcefully prune the page if we are sure that the + * required space is available. This will help in rearranging the + * page such that we will be able to make space adjacent to required + * offset number. + */ + if (space_required < pagefree) + force_prune = true; + } + + + /* + * Let's see if we really need pruning. + * + * Forget it if page is not hinted to contain something prunable that's + * committed and we don't want to forcefully prune the page. + */ + if (!ZPageIsPrunable(page) && !force_prune) + return false; + + zheap_page_prune_guts(relation, buffer, OldestXmin, offnum, + space_required, true, force_prune, + &ignore, &pruned); + if (pruned) + return true; + + return false; +} + +/* + * Prune and repair fragmentation in the specified page. + * + * Caller must have pin and buffer cleanup lock on the page. + * + * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD + * or RECENTLY_DEAD (see ZHeapTupleSatisfiesVacuum). + * + * To perform pruning, we make the copy of the page. We don't scribble on + * that copy, rather it is only used during repair fragmentation to copy + * the tuples. So, we need to ensure that after making the copy, we operate + * on tuples, otherwise, the temporary copy will become useless. It is okay + * scribble on itemid's or special space of page. + * + * If report_stats is true then we send the number of reclaimed tuples to + * pgstats. (This must be false during vacuum, since vacuum will send its own + * own new total to pgstats, and we don't want this delta applied on top of + * that.) + * + * Returns the number of tuples deleted from the page and sets + * latestRemovedXid. It returns 0, when removed the dead tuples can't free up + * the space required. + */ +int +zheap_page_prune_guts(Relation relation, Buffer buffer, + TransactionId OldestXmin, OffsetNumber target_offnum, + Size space_required, bool report_stats, + bool force_prune, TransactionId *latestRemovedXid, + bool *pruned) +{ + int ndeleted = 0; + int space_freed = 0; + Page page = BufferGetPage(buffer); + Page tmppage = NULL; + OffsetNumber offnum, + maxoff; + ZPruneState prstate; + bool execute_pruning = false; + + if (pruned) + *pruned = false; + + /* initialize the space_free with already existing free space in page */ + space_freed = PageGetExactFreeSpace(page); + + /* + * Our strategy is to scan the page and make lists of items to change, + * then apply the changes within a critical section. This keeps as much + * logic as possible out of the critical section, and also ensures that + * WAL replay will work the same as the normal case. + * + * First, initialize the new pd_prune_xid value to zero (indicating no + * prunable tuples). If we find any tuples which may soon become + * prunable, we will save the lowest relevant XID in new_prune_xid. 
Also + * initialize the rest of our working state. + */ + prstate.new_prune_xid = InvalidTransactionId; + prstate.latestRemovedXid = *latestRemovedXid; + prstate.ndeleted = prstate.ndead = prstate.nunused = 0; + memset(prstate.marked, 0, sizeof(prstate.marked)); + + /* + * If caller has asked to rearrange the page and page is not marked for + * pruning, then skip scanning the page. + * + * XXX We might want to remove this check once we have some optimal + * strategy to rearrange the page where we anyway need to traverse all + * rows. + */ + if (force_prune && !ZPageIsPrunable(page)) + { + ; /* no need to scan */ + } + else + { + /* Scan the page */ + maxoff = PageGetMaxOffsetNumber(page); + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemid; + + /* Ignore items already processed as part of an earlier chain */ + if (prstate.marked[offnum]) + continue; + + /* Nothing to do if slot is empty, already dead or marked as deleted */ + itemid = PageGetItemId(page, offnum); + if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) || + ItemIdIsDeleted(itemid)) + continue; + + /* Process this item */ + ndeleted += zheap_prune_item(relation, buffer, offnum, + OldestXmin, + &prstate, + &space_freed); + } + } + + /* + * There is not much advantage in continuing, if we can't free the space + * required by the caller or we are not asked to forcefully prune the + * page. + * + * XXX - In theory, we can still continue and perform pruning in the hope + * that some future update in this page will be able to use that space. + * However, it will lead to additional writes without any guaranteed + * benefit, so we skip the pruning for now. + */ + if (space_freed < space_required) + return 0; + + /* Do we want to prune? */ + if (prstate.ndeleted > 0 || prstate.ndead > 0 || + prstate.nunused > 0 || force_prune) + { + PageHeader phdr; + + execute_pruning = true; + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + */ + tmppage = PageGetTempPageCopy(page); + + /* + * Lock the TPD page before starting critical section. We might need + * to access it during page repair fragmentation. + */ + phdr = (PageHeader) page; + if (ZHeapPageHasTPDSlot(phdr)) + TPDPageLock(relation, buffer); + } + + /* Any error while applying the changes is critical */ + START_CRIT_SECTION(); + + if (execute_pruning) + { + bool has_pruned = false; + + /* + * Apply the planned item changes, then repair page fragmentation, and + * update the page's hint bit about whether it has free line pointers. + */ + zheap_page_prune_execute(buffer, target_offnum, + prstate.nowdeleted, prstate.ndeleted, + prstate.nowdead, prstate.ndead, + prstate.nowunused, prstate.nunused); + + /* + * Finally, repair any fragmentation, and update the page's hint bit about + * whether it has free pointers. + */ + ZPageRepairFragmentation(buffer, tmppage, target_offnum, + space_required, false, &has_pruned); + + /* + * Update the page's pd_prune_xid field to either zero, or the lowest + * XID of any soon-prunable tuple. + */ + ((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid; + + /* + * Also clear the "page is full" flag, since there's no point in + * repeating the prune/defrag process until something else happens to + * the page. 
+ */ + PageClearFull(page); + + MarkBufferDirty(buffer); + + /* + * Emit a WAL ZHEAP_CLEAN record showing what we did + */ + if (RelationNeedsWAL(relation)) + { + XLogRecPtr recptr; + + recptr = log_zheap_clean(relation, buffer, target_offnum, + space_required, prstate.nowdeleted, + prstate.ndeleted, prstate.nowdead, + prstate.ndead, prstate.nowunused, + prstate.nunused, + prstate.latestRemovedXid, has_pruned); + + PageSetLSN(BufferGetPage(buffer), recptr); + } + + if (pruned) + *pruned = has_pruned; + } + else + { + /* + * If we didn't prune anything, but have found a new value for the + * pd_prune_xid field, update it and mark the buffer dirty. This is + * treated as a non-WAL-logged hint. + * + * Also clear the "page is full" flag if it is set, since there's no + * point in repeating the prune/defrag process until something else + * happens to the page. + */ + if (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid || + PageIsFull(page)) + { + ((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid; + PageClearFull(page); + MarkBufferDirtyHint(buffer, true); + } + } + + END_CRIT_SECTION(); + + /* + * Report the number of tuples reclaimed to pgstats. This is ndeleted + * minus ndead, because we don't want to count a now-DEAD item or a + * now-DELETED item as a deletion for this purpose. + */ + if (report_stats && ndeleted > (prstate.ndead + prstate.ndeleted)) + pgstat_update_heap_dead_tuples(relation, ndeleted - (prstate.ndead + prstate.ndeleted)); + + *latestRemovedXid = prstate.latestRemovedXid; + + /* be tidy. */ + if (tmppage) + pfree(tmppage); + UnlockReleaseTPDBuffers(); + + /* + * XXX Should we update FSM information for this? Not doing so will + * increase the chances of in-place updates. See heap_page_prune for a + * detailed reason. + */ + + return ndeleted; +} + +/* + * Perform the actual page changes needed by zheap_page_prune_guts. + * It is expected that the caller has suitable pin and lock on the + * buffer, and is inside a critical section. + */ +void +zheap_page_prune_execute(Buffer buffer, OffsetNumber target_offnum, + OffsetNumber *deleted, int ndeleted, + OffsetNumber *nowdead, int ndead, + OffsetNumber *nowunused, int nunused) +{ + Page page = (Page) BufferGetPage(buffer); + OffsetNumber *offnum; + int i; + + /* Update all deleted line pointers */ + offnum = deleted; + for (i = 0; i < ndeleted; i++) + { + ZHeapTupleHeader tup; + int trans_slot; + uint8 vis_info = 0; + OffsetNumber off = *offnum++; + ItemId lp; + + /* The target offset must not be deleted. */ + Assert(target_offnum != off); + + lp = PageGetItemId(page, off); + + tup = (ZHeapTupleHeader) PageGetItem(page, lp); + trans_slot = ZHeapTupleHeaderGetXactSlot(tup); + + /* + * The frozen slot indicates tuple is dead, so we must not see them in + * the array of tuples to be marked as deleted. + */ + Assert(trans_slot != ZHTUP_SLOT_FROZEN); + + if (ZHeapTupleDeleted(tup)) + vis_info = ITEMID_DELETED; + if (ZHeapTupleHasInvalidXact(tup->t_infomask)) + vis_info |= ITEMID_XACT_INVALID; + + /* + * Mark the Item as deleted and copy the visibility info and + * transaction slot information from tuple to ItemId. + */ + ItemIdSetDeleted(lp, trans_slot, vis_info); + } + + /* Update all now-dead line pointers */ + offnum = nowdead; + for (i = 0; i < ndead; i++) + { + OffsetNumber off = *offnum++; + ItemId lp; + + /* The target offset must not be dead. 
*/ + Assert(target_offnum != off); + + lp = PageGetItemId(page, off); + + ItemIdSetDead(lp); + } + + /* Update all now-unused line pointers */ + offnum = nowunused; + for (i = 0; i < nunused; i++) + { + OffsetNumber off = *offnum++; + ItemId lp; + + /* The target offset must not be unused. */ + Assert(target_offnum != off); + + lp = PageGetItemId(page, off); + + ItemIdSetUnused(lp); + } +} + +/* + * Prune specified item pointer. + * + * OldestXmin is the cutoff XID used to identify dead tuples. + * + * We don't actually change the page here. We just add entries to the arrays in + * prstate showing the changes to be made. Items to be set to LP_DEAD state are + * added to nowdead[]; items to be set to LP_DELETED are added to nowdeleted[]; + * and items to be set to LP_UNUSED state are added to nowunused[]. + * + * Returns the number of tuples (to be) deleted from the page. + */ +static int +zheap_prune_item(Relation relation, Buffer buffer, OffsetNumber offnum, + TransactionId OldestXmin, ZPruneState *prstate, + int *space_freed) +{ + ZHeapTupleData tup; + ItemId lp; + Page dp = (Page) BufferGetPage(buffer); + int ndeleted = 0; + TransactionId xid; + bool tupdead, + recent_dead; + + lp = PageGetItemId(dp, offnum); + + Assert(ItemIdIsNormal(lp)); + + tup.t_data = (ZHeapTupleHeader) PageGetItem(dp, lp); + tup.t_len = ItemIdGetLength(lp); + ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum); + tup.t_tableOid = RelationGetRelid(relation); + + /* + * Check tuple's visibility status. + */ + tupdead = recent_dead = false; + + switch (ZHeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer, &xid)) + { + case ZHEAPTUPLE_DEAD: + tupdead = true; + break; + + case ZHEAPTUPLE_RECENTLY_DEAD: + recent_dead = true; + break; + + case ZHEAPTUPLE_DELETE_IN_PROGRESS: + + /* + * This tuple may soon become DEAD. Update the hint field so that + * the page is reconsidered for pruning in future. + */ + zheap_prune_record_prunable(prstate, xid); + break; + + case ZHEAPTUPLE_LIVE: + case ZHEAPTUPLE_INSERT_IN_PROGRESS: + + /* + * If we wanted to optimize for aborts, we might consider marking + * the page prunable when we see INSERT_IN_PROGRESS. But we don't. + * See related decisions about when to mark the page prunable in + * heapam.c. + */ + break; + + case ZHEAPTUPLE_ABORT_IN_PROGRESS: + /* + * We can simply skip the tuple if it has inserted/operated by + * some aborted transaction and its rollback is still pending. It'll + * be taken care of by future prune calls. + */ + break; + default: + elog(ERROR, "unexpected ZHeapTupleSatisfiesVacuum result"); + break; + } + + if (tupdead) + ZHeapTupleHeaderAdvanceLatestRemovedXid(tup.t_data, xid, &prstate->latestRemovedXid); + + if (tupdead || recent_dead) + { + /* + * Count dead or recently dead tuple in result and update the space + * that can be freed. + */ + ndeleted++; + + /* short aligned */ + *space_freed += SHORTALIGN(tup.t_len); + } + + /* Record dead item */ + if (tupdead) + zheap_prune_record_dead(prstate, offnum); + + /* Record deleted item */ + if (recent_dead) + zheap_prune_record_deleted(prstate, offnum); + + return ndeleted; +} + +/* Record lowest soon-prunable XID */ +static void +zheap_prune_record_prunable(ZPruneState * prstate, TransactionId xid) +{ + /* + * This should exactly match the PageSetPrunable macro. We can't store + * directly into the page header yet, so we update working state. 
+ */ + Assert(TransactionIdIsNormal(xid)); + if (!TransactionIdIsValid(prstate->new_prune_xid) || + TransactionIdPrecedes(xid, prstate->new_prune_xid)) + prstate->new_prune_xid = xid; +} + +/* Record item pointer to be marked dead */ +static void +zheap_prune_record_dead(ZPruneState * prstate, OffsetNumber offnum) +{ + Assert(prstate->ndead < MaxZHeapTuplesPerPage); + prstate->nowdead[prstate->ndead] = offnum; + prstate->ndead++; + Assert(!prstate->marked[offnum]); + prstate->marked[offnum] = true; +} + +/* Record item pointer to be deleted */ +static void +zheap_prune_record_deleted(ZPruneState * prstate, OffsetNumber offnum) +{ + Assert(prstate->ndead < MaxZHeapTuplesPerPage); + prstate->nowdeleted[prstate->ndeleted] = offnum; + prstate->ndeleted++; + Assert(!prstate->marked[offnum]); + prstate->marked[offnum] = true; +} + +/* + * log_zheap_clean - Perform XLogInsert for a zheap-clean operation. + * + * Caller must already have modified the buffer and marked it dirty. + * + * We also include latestRemovedXid, which is the greatest XID present in + * the removed tuples. That allows recovery processing to cancel or wait + * for long standby queries that can still see these tuples. + */ +XLogRecPtr +log_zheap_clean(Relation reln, Buffer buffer, OffsetNumber target_offnum, + Size space_required, OffsetNumber *nowdeleted, int ndeleted, + OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, + int nunused, TransactionId latestRemovedXid, bool pruned) +{ + XLogRecPtr recptr; + xl_zheap_clean xl_rec; + + /* Caller should not call me on a non-WAL-logged relation */ + Assert(RelationNeedsWAL(reln)); + + xl_rec.latestRemovedXid = latestRemovedXid; + xl_rec.ndeleted = ndeleted; + xl_rec.ndead = ndead; + xl_rec.flags = 0; + XLogBeginInsert(); + + if (pruned) + xl_rec.flags |= XLZ_CLEAN_ALLOW_PRUNING; + XLogRegisterData((char *) &xl_rec, SizeOfZHeapClean); + + /* Register the offset information. */ + if (target_offnum != InvalidOffsetNumber) + { + xl_rec.flags |= XLZ_CLEAN_CONTAINS_OFFSET; + XLogRegisterData((char *) &target_offnum, sizeof(OffsetNumber)); + XLogRegisterData((char *) &space_required, sizeof(space_required)); + } + + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + /* + * The OffsetNumber arrays are not actually in the buffer, but we pretend + * that they are. When XLogInsert stores the whole buffer, the offset + * arrays need not be stored too. Note that even if all three arrays are + * empty, we want to expose the buffer as a candidate for whole-page + * storage, since this record type implies a defragmentation operation + * even if no item pointers changed state. + */ + if (ndeleted > 0) + XLogRegisterBufData(0, (char *) nowdeleted, + ndeleted * sizeof(OffsetNumber) * 2); + + if (ndead > 0) + XLogRegisterBufData(0, (char *) nowdead, + ndead * sizeof(OffsetNumber)); + + if (nunused > 0) + XLogRegisterBufData(0, (char *) nowunused, + nunused * sizeof(OffsetNumber)); + + recptr = XLogInsert(RM_ZHEAP_ID, XLOG_ZHEAP_CLEAN); + + return recptr; +} + +/* + * After removing or marking some line pointers unused, move the tuples to + * remove the gaps caused by the removed items. Here, we are rearranging + * the page such that tuples will be placed in itemid order. It will help + * in the speedup of future sequential scans. + * + * Note that we use the temporary copy of the page to copy the tuples as + * writing in itemid order will overwrite some tuples. 
+ */ +void +compactify_ztuples(itemIdSort itemidbase, int nitems, Page page, Page tmppage) +{ + PageHeader phdr = (PageHeader) page; + Offset upper; + int i; + + Assert(PageIsValid(tmppage)); + upper = phdr->pd_special; + for (i = nitems - 1; i >= 0; i--) + { + itemIdSort itemidptr = &itemidbase[i]; + ItemId lp; + + lp = PageGetItemId(page, itemidptr->offsetindex + 1); + upper -= itemidptr->alignedlen; + memcpy((char *) page + upper, + (char *) tmppage + itemidptr->itemoff, + lp->lp_len); + lp->lp_off = upper; + } + + phdr->pd_upper = upper; +} + +/* + * ZPageRepairFragmentation + * + * Frees fragmented space on a page. + * + * The basic idea is same as PageRepairFragmentation, but here we additionally + * deal with unused items that can't be immediately reclaimed. We don't allow + * page to be pruned, if there is an inplace update from an open transaction. + * The reason is that we don't know the size of previous row in undo which + * could be bigger in which case we might not be able to perform rollback once + * the page is repaired. Now, we can always traverse the undo chain to find + * the size of largest tuple in the chain, but we don't do that for now as it + * can take time especially if there are many such tuples on the page. + */ +void +ZPageRepairFragmentation(Buffer buffer, Page tmppage, + OffsetNumber target_offnum, Size space_required, + bool NoTPDBufLock, bool *pruned) +{ + Page page = BufferGetPage(buffer); + Offset pd_lower = ((PageHeader)page)->pd_lower; + Offset pd_upper = ((PageHeader)page)->pd_upper; + Offset pd_special = ((PageHeader)page)->pd_special; + itemIdSortData itemidbase[MaxZHeapTuplesPerPage]; + itemIdSort itemidptr; + ItemId lp; + TransactionId xid; + uint32 epoch; + UndoRecPtr urec_ptr; + int nline, + nstorage, + nunused; + int i; + Size totallen; + + /* + * It's worth the trouble to be more paranoid here than in most places, + * because we are about to reshuffle data in (what is usually) a shared + * disk buffer. If we aren't careful then corrupted pointers, lengths, + * etc could cause us to clobber adjacent disk buffers, spreading the data + * loss further. So, check everything. + */ + if (pd_lower < SizeOfPageHeaderData || + pd_lower > pd_upper || + pd_upper > pd_special || + pd_special > BLCKSZ || + pd_special != MAXALIGN(pd_special)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u", + pd_lower, pd_upper, pd_special))); + + nline = PageGetMaxOffsetNumber(page); + + /* + * If there are any tuples which are inplace updated by any open + * transactions we shall not compactify the page contents, otherwise, + * rollback of those transactions will not be possible. There could be + * a case, where within a transaction tuple is first inplace updated + * and then, either updated or deleted. So for now avoid compaction if + * there are any tuples which are marked inplace updated, updated or + * deleted by an open transaction. 
+ */ + for (i = FirstOffsetNumber; i <= nline; i++) + { + lp = PageGetItemId(page, i); + if (ItemIdIsUsed(lp) && ItemIdHasStorage(lp)) + { + ZHeapTupleHeader tup; + + tup = (ZHeapTupleHeader) PageGetItem(page, lp); + + if (!(tup->t_infomask & (ZHEAP_INPLACE_UPDATED | + ZHEAP_UPDATED | ZHEAP_DELETED))) + continue; + + if (!ZHeapTupleHasInvalidXact(tup->t_infomask)) + { + int trans_slot; + + trans_slot = ZHeapTupleHeaderGetXactSlot(tup); + if (trans_slot == ZHTUP_SLOT_FROZEN) + continue; + + /* + * XXX There is possibility that the updater's slot got reused by a + * locker in such a case the INVALID_XACT will be moved to lockers + * undo. Now, we will find that the tuple has in-place update flag + * but it doesn't have INVALID_XACT flag and the slot transaction is + * also running, in such case we will not prune this page. Ideally + * if the multi-locker is set we can get the actual transaction and + * check the status of the transaction. + */ + trans_slot = GetTransactionSlotInfo(buffer, i, trans_slot, + &epoch, &xid, &urec_ptr, + NoTPDBufLock, false); + /* + * It is quite possible that the item is showing some + * valid transaction slot, but actual slot has been frozen. + * This can happen when the slot belongs to TPD entry and + * the corresponding TPD entry is pruned. + */ + if (trans_slot == ZHTUP_SLOT_FROZEN) + continue; + + if (!TransactionIdDidCommit(xid)) + return; + } + } + } + + /* + * Run through the line pointer array and collect data about live items. + */ + itemidptr = itemidbase; + nunused = totallen = 0; + for (i = FirstOffsetNumber; i <= nline; i++) + { + lp = PageGetItemId(page, i); + if (ItemIdIsUsed(lp)) + { + if (ItemIdHasStorage(lp)) + { + itemidptr->offsetindex = i - 1; + itemidptr->itemoff = ItemIdGetOffset(lp); + if (unlikely(itemidptr->itemoff < (int)pd_upper || + itemidptr->itemoff >= (int)pd_special)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted item pointer: %u", + itemidptr->itemoff))); + /* + * We need to save additional space for the target offset, so + * that we can save the space for new tuple. + */ + if (i == target_offnum) + itemidptr->alignedlen = SHORTALIGN(ItemIdGetLength(lp) + space_required); + else + itemidptr->alignedlen = SHORTALIGN(ItemIdGetLength(lp)); + totallen += itemidptr->alignedlen; + itemidptr++; + } + } + else + { + nunused++; + + /* + * We allow Unused entries to be reused only if there is no + * transaction information for the entry or the transaction + * is committed. + */ + if (ItemIdHasPendingXact(lp)) + { + int trans_slot = ItemIdGetTransactionSlot(lp); + + /* + * Here, we are relying on the transaction information in + * slot as if the corresponding slot has been reused, then + * transaction information from the entry would have been + * cleared. See PageFreezeTransSlots. + */ + if (trans_slot != ZHTUP_SLOT_FROZEN) + { + trans_slot = GetTransactionSlotInfo(buffer, i, trans_slot, + &epoch, &xid, + &urec_ptr, NoTPDBufLock, + false); + /* + * It is quite possible that the item is showing some + * valid transaction slot, but actual slot has been frozen. + * This can happen when the slot belongs to TPD entry and + * the corresponding TPD entry is pruned. 
+ */ + if (trans_slot != ZHTUP_SLOT_FROZEN && + !TransactionIdDidCommit(xid)) + continue; + } + } + + /* Unused entries should have lp_len = 0, but make sure */ + ItemIdSetUnused(lp); + } + } + + nstorage = itemidptr - itemidbase; + if (nstorage == 0) + { + /* Page is completely empty, so just reset it quickly */ + ((PageHeader)page)->pd_upper = pd_special; + } + else + { + /* Need to compact the page the hard way */ + if (totallen > (Size)(pd_special - pd_lower)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted item lengths: total %u, available space %u", + (unsigned int)totallen, pd_special - pd_lower))); + + compactify_ztuples(itemidbase, nstorage, page, tmppage); + } + + /* Set hint bit for PageAddItem */ + if (nunused > 0) + PageSetHasFreeLinePointers(page); + else + PageClearHasFreeLinePointers(page); + + /* indicate that the page has been pruned */ + if (pruned) + *pruned = true; +} diff --git a/src/backend/access/zheap/rewritezheap.c b/src/backend/access/zheap/rewritezheap.c new file mode 100644 index 0000000000..4f3dac46c5 --- /dev/null +++ b/src/backend/access/zheap/rewritezheap.c @@ -0,0 +1,373 @@ +/*------------------------------------------------------------------------- + * + * rewritezheap.c + * Support functions to rewrite zheap tables. + * + * These functions provide a facility to completely rewrite a heap. + * + * INTERFACE + * + * The caller is responsible for creating the new heap, all catalog + * changes, supplying the tuples to be written to the new heap, and + * rebuilding indexes. The caller must hold AccessExclusiveLock on the + * target table, because we assume no one else is writing into it. + * + * To use the facility: + * + * begin_heap_rewrite + * while (fetch next tuple) + * { + * if (tuple is dead) + * rewrite_heap_dead_tuple + * else + * { + * // do any transformations here if required + * rewrite_heap_tuple + * } + * } + * end_zheap_rewrite + * + * The contents of the new relation shouldn't be relied on until after + * end_zheap_rewrite is called. + * + * + * IMPLEMENTATION + * + * As of now, this layer gets only LIVE tuples and we freeze them before + * storing in new heap. This is not a good idea as we lose all the + * visibility information of tuples, but OTOH, the same can't be copied + * from the original tuple as that is maintained in undo and we don't have + * facility to modify undorecords. + * + * One idea to capture the visibility information is that we should write a + * special undo record such that it stores previous version's visibility + * information and later if the current version is not visible as per latest + * xid (which is of cluster/vacuum full command), then we should get previous + * xid information from undo. It seems along with previous versions xid, we + * need to write previous version tuples as well and somehow need to fix the + * ctid information in the undo records. + * + * We can't use the normal zheap_insert function to insert into the new + * heap, because heap_insert overwrites the visibility information and + * it uses buffer management layer to process the tuples which is bit + * slower. We use a special-purpose raw_zheap_insert function instead, which + * is optimized for bulk inserting a lot of tuples, knowing that we have + * exclusive access to the heap. raw_zheap_insert builds new pages in + * local storage. When a page is full, or at the end of the process, + * we insert it to WAL as a single record and then write it to disk + * directly through smgr. 
Note, however, that any data sent to the new + * heap's TOAST table will go through the normal bufmgr. + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994-5, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/zheap/rewritezheap.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include +#include + +#include "access/rewritezheap.h" +#include "access/tuptoaster.h" +#include "miscadmin.h" +#include "storage/bufmgr.h" +#include "storage/smgr.h" +#include "storage/procarray.h" +#include "utils/memutils.h" + + +/* + * State associated with a rewrite operation. This is opaque to the user + * of the rewrite facility. + */ +typedef struct RewriteZheapStateData +{ + Relation rs_new_rel; /* destination heap */ + Page rs_buffer; /* page currently being built */ + BlockNumber rs_blockno; /* block where page will go */ + bool rs_buffer_valid; /* T if any tuples in buffer */ + bool rs_use_wal; /* must we WAL-log inserts? */ + MemoryContext rs_cxt; /* for hash tables and entries and tuples in + * them */ +} RewriteZheapStateData; + + +/* prototypes for internal functions */ +static void raw_zheap_insert(RewriteZheapState state, ZHeapTuple tup); + +/* + * Begin a rewrite of a table + * + * old_heap old, locked heap relation tuples will be read from + * new_heap new, locked heap relation to insert tuples to + * oldest_xmin xid used by the caller to determine which tuples are dead + * freeze_xid this is kept for API compatability with heap, it's value will + * be InvalidTransactionId. + * min_multi this is kept for API compatability with heap, it's value will + * will be InvalidMultiXactId + * use_wal should the inserts to the new heap be WAL-logged? + * + * Returns an opaque RewriteState, allocated in current memory context, + * to be used in subsequent calls to the other functions. + */ +RewriteZheapState +begin_zheap_rewrite(Relation old_heap, Relation new_heap, + TransactionId oldest_xmin, TransactionId freeze_xid, + MultiXactId cutoff_multi, bool use_wal) +{ + RewriteZheapState state; + MemoryContext rw_cxt; + MemoryContext old_cxt; + + /* + * To ease cleanup, make a separate context that will contain the + * RewriteState struct itself plus all subsidiary data. + */ + rw_cxt = AllocSetContextCreate(CurrentMemoryContext, + "Table rewrite", + ALLOCSET_DEFAULT_SIZES); + old_cxt = MemoryContextSwitchTo(rw_cxt); + + /* Create and fill in the state struct */ + state = palloc0(sizeof(RewriteZheapStateData)); + + state->rs_new_rel = new_heap; + state->rs_buffer = (Page) palloc(BLCKSZ); + /* new_heap needn't be empty, just locked */ + state->rs_blockno = RelationGetNumberOfBlocks(new_heap); + state->rs_buffer_valid = false; + state->rs_use_wal = use_wal; + state->rs_cxt = rw_cxt; + + MemoryContextSwitchTo(old_cxt); + + return state; +} + +/* + * End a rewrite. + * + * state and any other resources are freed. + */ +void +end_zheap_rewrite(RewriteZheapState state) +{ + /* Write the last page, if any */ + if (state->rs_buffer_valid) + { + if (state->rs_use_wal) + log_newpage(&state->rs_new_rel->rd_node, + MAIN_FORKNUM, + state->rs_blockno, + state->rs_buffer, + true); + RelationOpenSmgr(state->rs_new_rel); + + PageSetChecksumInplace(state->rs_buffer, state->rs_blockno); + + smgrextend(state->rs_new_rel->rd_smgr, MAIN_FORKNUM, state->rs_blockno, + (char *) state->rs_buffer, true); + } + + /* + * If the rel is WAL-logged, must fsync before commit. 
We use heap_sync + * to ensure that the toast table gets fsync'd too. + * + * It's obvious that we must do this when not WAL-logging. It's less + * obvious that we have to do it even if we did WAL-log the pages. The + * reason is the same as in tablecmds.c's copy_relation_data(): we're + * writing data that's not in shared buffers, and so a CHECKPOINT + * occurring during the rewriteheap operation won't have fsync'd data we + * wrote before the checkpoint. + */ + if (RelationNeedsWAL(state->rs_new_rel)) + heap_sync(state->rs_new_rel); + + /* Deleting the context frees everything */ + MemoryContextDelete(state->rs_cxt); +} + +/* + * Reconstruct and rewrite the given tuple + * + * We cannot simply copy the tuple as-is, see reform_and_rewrite_tuple for + * reasons. + */ +void +reform_and_rewrite_ztuple(ZHeapTuple tuple, TupleDesc oldTupDesc, + TupleDesc newTupDesc, Datum *values, bool *isnull, + RewriteZheapState rwstate) +{ + ZHeapTuple copiedTuple; + int i; + + zheap_deform_tuple(tuple, oldTupDesc, values, isnull); + + /* Be sure to null out any dropped columns */ + for (i = 0; i < newTupDesc->natts; i++) + { + if (TupleDescAttr(newTupDesc, i)->attisdropped) + isnull[i] = true; + } + + copiedTuple = zheap_form_tuple(newTupDesc, values, isnull); + + rewrite_zheap_tuple(rwstate, tuple, copiedTuple); + + zheap_freetuple(copiedTuple); +} + +/* + * Add a tuple to the new heap. + * + * Maintaining previous version's visibility information needs much more work + * (see atop of this file), so for now, we freeze all the tuples. We only get + * LIVE versions of the tuple as input. + * + * Note that since we scribble on new_tuple, it had better be temp storage + * not a pointer to the original tuple. + * + * state opaque state as returned by begin_heap_rewrite + * old_tuple original tuple in the old heap + * new_tuple new, rewritten tuple to be inserted to new heap + */ +void +rewrite_zheap_tuple(RewriteZheapState state, ZHeapTuple old_tuple, + ZHeapTuple new_tuple) +{ + MemoryContext old_cxt; + + old_cxt = MemoryContextSwitchTo(state->rs_cxt); + + /* + * As of now, we copy only LIVE tuples in zheap, so we can mark them as + * frozen. + */ + new_tuple->t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + new_tuple->t_data->t_infomask2 &= ~ZHEAP_XACT_SLOT; + ZHeapTupleHeaderSetXactSlot(new_tuple->t_data, ZHTUP_SLOT_FROZEN); + + raw_zheap_insert(state, new_tuple); + + MemoryContextSwitchTo(old_cxt); +} + +/* + * Insert a tuple to the new relation. This has to track zheap_insert + * and its subsidiary functions! + * + * t_self of the tuple is set to the new TID of the tuple. + */ +static void +raw_zheap_insert(RewriteZheapState state, ZHeapTuple tup) +{ + Page page = state->rs_buffer; + Size pageFreeSpace, + saveFreeSpace; + Size len; + OffsetNumber newoff; + ZHeapTuple heaptup; + + /* + * If the new tuple is too big for storage or contains already toasted + * out-of-line attributes from some other relation, invoke the toaster. + * + * Note: below this point, heaptup is the data we actually intend to store + * into the relation; tup is the caller's original untoasted data. + */ + if (state->rs_new_rel->rd_rel->relkind == RELKIND_TOASTVALUE) + { + /* toast table entries should never be recursively toasted */ + Assert(!ZHeapTupleHasExternal(tup)); + heaptup = tup; + } + else if (ZHeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD) + { + /* + * As of now, we copy only LIVE tuples in zheap, so we can mark them as + * frozen. 
+ */ + heaptup = ztoast_insert_or_update(state->rs_new_rel, tup, NULL, + HEAP_INSERT_FROZEN | + HEAP_INSERT_SKIP_FSM | + (state->rs_use_wal ? + 0 : HEAP_INSERT_SKIP_WAL)); + } + else + heaptup = tup; + + len = SHORTALIGN(heaptup->t_len); + + /* + * If we're gonna fail for oversize tuple, do it right away + */ + if (len > MaxZHeapTupleSize) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("row is too big: size %zu, maximum size %zu", + len, MaxZHeapTupleSize))); + + /* Compute desired extra freespace due to fillfactor option */ + saveFreeSpace = RelationGetTargetPageFreeSpace(state->rs_new_rel, + HEAP_DEFAULT_FILLFACTOR); + + /* Now we can check to see if there's enough free space already. */ + if (state->rs_buffer_valid) + { + pageFreeSpace = PageGetHeapFreeSpace(page); + + if (len + saveFreeSpace > pageFreeSpace) + { + /* Doesn't fit, so write out the existing page */ + + /* XLOG stuff */ + if (state->rs_use_wal) + log_newpage(&state->rs_new_rel->rd_node, + MAIN_FORKNUM, + state->rs_blockno, + page, + true); + + /* + * Now write the page. We say isTemp = true even if it's not a + * temp table, because there's no need for smgr to schedule an + * fsync for this write; we'll do it ourselves in + * end_zheap_rewrite. + */ + RelationOpenSmgr(state->rs_new_rel); + + PageSetChecksumInplace(page, state->rs_blockno); + + smgrextend(state->rs_new_rel->rd_smgr, MAIN_FORKNUM, + state->rs_blockno, (char *) page, true); + + state->rs_blockno++; + state->rs_buffer_valid = false; + } + } + + if (!state->rs_buffer_valid) + { + /* Initialize a new empty page */ + ZheapInitPage(page, BLCKSZ); + state->rs_buffer_valid = true; + } + + /* And now we can insert the tuple into the page */ + newoff = ZPageAddItem(InvalidBuffer, page, (Item) heaptup->t_data, + heaptup->t_len, InvalidOffsetNumber, false, true, + true); + if (newoff == InvalidOffsetNumber) + elog(ERROR, "failed to add tuple"); + + /* Update caller's t_self to the actual position where it was stored */ + ItemPointerSet(&(tup->t_self), state->rs_blockno, newoff); + + /* If heaptup is a private copy, release it. */ + if (heaptup != tup) + zheap_freetuple(heaptup); +} diff --git a/src/backend/access/zheap/tpd.c b/src/backend/access/zheap/tpd.c new file mode 100644 index 0000000000..99fb59dc37 --- /dev/null +++ b/src/backend/access/zheap/tpd.c @@ -0,0 +1,3148 @@ +/*------------------------------------------------------------------------- + * + * tpd.c + * zheap transaction overflow pages code + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * TPD is nothing but temporary data page consisting of extended transaction + * slots from heap pages. There are two primary reasons for having TPD (a) In + * the heap page, we have fixed number of transaction slots which can lead to + * deadlock, (b) To support cases where a large number of transactions acquire + * SHARE or KEY SHARE locks on a single page. + * + * The TPD overflow pages will be stored in the zheap itself, interleaved with + * regular pages. We have a meta page in zheap from which all overflow pages + * are tracked. + * + * TPD Entry acts like an extension of the transaction slot array in heap + * page. Tuple headers normally point to the transaction slot responsible for + * the last modification, but since there aren't enough bits available to do + * this in the case where a TPD is used, an offset -> slot mapping is stored + * in the TPD entry itself. 
This array can be used to get the slot for tuples + * in heap page, but for undo tuples we can't use it because we can't track + * multiple slots that have updated the same tuple. So for undo records, we + * record the TPD transaction slot number along with the undo record. + * + * IDENTIFICATION + * src/backend/access/zheap/tpd.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/tpd_xlog.h" +#include "access/zheap.h" +#include "access/zheapam_xlog.h" +#include "miscadmin.h" +#include "storage/bufmgr.h" +#include "storage/buf_internals.h" +#include "storage/lmgr.h" +#include "storage/proc.h" +#include "utils/lsyscache.h" +#include "utils/relfilenodemap.h" + +/* + * We never need more than two TPD buffers per zheap page, so the maximum + * number of TPD buffers required will be four. This can happen for + * non-inplace updates that insert new record to a different zheap page. In + * general, we require one tpd page for zheap page, but for the cases when + * we need to extend the tpd entry to a different page, we will operate on + * two tpd buffers. + */ +#define MAX_TPD_BUFFERS 4 + +/* Undo block number to buffer mapping. */ +typedef struct TPDBuffers +{ + BlockNumber blk; /* block number */ + Buffer buf; /* buffer allocated for the block */ +} TPDBuffers; + +/* + * GetTPDBuffer operations + * + * TPD_BUF_FIND - Find the buffer in existing array of tpd buffers. + * TPD_BUF_FIND_OR_ENTER - Like previous, but if not found then allocate a new + * buffer and add it to tpd buffers array for future use. + * TPD_BUF_FIND_OR_KNOWN_ENTER - Like TPD_BUF_FIND, but if not found, then add + * the already known buffer to tpd buffers array for future use. + * TPD_BUF_ENTER - Allocate a new TPD buffer and add it to tpd buffers array + * for future use. + */ +typedef enum +{ + TPD_BUF_FIND, + TPD_BUF_FIND_OR_ENTER, + TPD_BUF_FIND_OR_KNOWN_ENTER, + TPD_BUF_ENTER +} TPDACTION; + +static Buffer registered_tpd_buffers[MAX_TPD_BUFFERS]; +static TPDBuffers tpd_buffers[MAX_TPD_BUFFERS]; +static int tpd_buf_idx; +static int registered_tpd_buf_idx; +static int GetTPDBuffer(Relation rel, BlockNumber blk, Buffer tpd_buf, + TPDACTION tpd_action, bool *already_exists); +static void TPDEntryUpdate(Relation relation, Buffer tpd_buf, + uint16 tpd_e_offset, OffsetNumber tpd_item_off, + char *tpd_entry, Size size_tpd_entry); +static void TPDAllocatePageAndAddEntry(Relation relation, Buffer metabuf, + Buffer pagebuf, Buffer old_tpd_buf, + OffsetNumber old_off_num, char *tpd_entry, + Size size_tpd_entry, bool add_new_tpd_page, + bool delete_old_entry); +static bool TPDBufferAlreadyRegistered(Buffer tpd_buf); +static void ReleaseLastTPDBuffer(Buffer buf); +static void LogAndClearTPDLocation(Relation relation, Buffer heapbuf, + bool *tpd_e_pruned); +void +ResetRegisteredTPDBuffers() +{ + registered_tpd_buf_idx = 0; +} + +/* + * GetTPDBuffer - Get the tpd buffer corresponding to give block number. + * + * Returns -1, if the tpd_action is TPD_BUF_FIND and buffer for the required + * block is not present in tpd buffers array, otherwise returns the index of + * buffer in the array. + * + * rel can be NULL, if user intends to just search for existing buffer. + */ +static int +GetTPDBuffer(Relation rel, BlockNumber blk, Buffer tpd_buf, + TPDACTION tpd_action, bool *already_exists) +{ + int i; + Buffer buf; + + /* The number of active TPD buffers must be less than MAX_TPD_BUFFERS. 
*/ + Assert(tpd_buf_idx <= MAX_TPD_BUFFERS); + *already_exists = false; + + /* + * If new block needs to be allocated, then we don't need to search + * existing set of buffers. + */ + if (tpd_action != TPD_BUF_ENTER) + { + /* + * Don't do anything, if we already have a buffer pinned for the required + * block. + */ + for (i = 0; i < tpd_buf_idx; i++) + { + if (blk == tpd_buffers[i].blk) + { + *already_exists = true; + return i; + } + } + } + else + i = tpd_buf_idx; + + /* + * If the buffer doesn't exist and caller doesn't intend to allocate new + * buffer, then we are done. + */ + if (tpd_action == TPD_BUF_FIND && !(*already_exists)) + return -1; + + if (tpd_action == TPD_BUF_FIND_OR_KNOWN_ENTER) + { + Assert (i == tpd_buf_idx); + Assert (BufferIsValid(tpd_buf)); + + tpd_buffers[tpd_buf_idx].blk = BufferGetBlockNumber(tpd_buf); + tpd_buffers[tpd_buf_idx].buf = tpd_buf; + tpd_buf_idx++; + + return i; + } + + /* + * Caller must have passed relation, if it intends to read a block that is + * not already read. + */ + Assert(rel != NULL); + + /* + * We don't have the required buffer, so read it and remember in the TPD + * buffer array. + */ + if (i == tpd_buf_idx) + { + buf = ReadBuffer(rel, blk); + tpd_buffers[tpd_buf_idx].blk = BufferGetBlockNumber(buf); + tpd_buffers[tpd_buf_idx].buf = buf; + tpd_buf_idx++; + } + + return i; +} + +/* + * TPDBufferAlreadyRegistered - Check whether the buffer is already registered. + * + * Returns true if the buffer is already registered, otherwise add it to the + * registered buffer array and return false. + */ +static bool +TPDBufferAlreadyRegistered(Buffer tpd_buf) +{ + int i; + + for (i = 0; i < registered_tpd_buf_idx; i++) + { + if (tpd_buf == registered_tpd_buffers[i]) + return true; + } + + registered_tpd_buffers[registered_tpd_buf_idx++] = tpd_buf; + + return false; +} + +/* + * ReleaseLastTPDBuffer - Release last tpd buffer + */ +static void +ReleaseLastTPDBuffer(Buffer buf) +{ + Buffer last_tpd_buf PG_USED_FOR_ASSERTS_ONLY; + + last_tpd_buf = tpd_buffers[tpd_buf_idx - 1].buf; + Assert(buf == last_tpd_buf); + UnlockReleaseBuffer(buf); + tpd_buffers[tpd_buf_idx - 1].buf = InvalidBuffer; + tpd_buffers[tpd_buf_idx - 1].blk = InvalidBlockNumber; + tpd_buf_idx--; +} + +/* + * AllocateAndFormTPDEntry - Allocate and form the new TPD entry. + * + * We initialize the TPD entry and also move the last transaction slot + * information from heap page to first slot in TPD entry. + * + * reserved_slot - returns the first available slot. 
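+ * + * The returned entry is a palloc'd chunk laid out as a TPDEntryHeaderData, followed by the one-byte offset map, followed by the transaction slot array; the caller is responsible for freeing it.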
+ */ +static char * +AllocateAndFormTPDEntry(Buffer buf, OffsetNumber offset, + Size *size_tpd_entry, int *reserved_slot) +{ + Size size_tpd_e_map; + Size size_tpd_e_slots; + int i; + OffsetNumber offnum, max_required_offset; + char *tpd_entry; + char *tpd_entry_data; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + TransInfo *tpd_e_trans_slots; + Page page; + TPDEntryHeaderData tpe_header; + uint16 num_map_entries; + + page = BufferGetPage(buf); + if (OffsetNumberIsValid(offset)) + max_required_offset = offset; + else + max_required_offset = PageGetMaxOffsetNumber(page); + + num_map_entries = max_required_offset + ADDITIONAL_MAP_ELEM_IN_TPD_ENTRY; + + /* form tpd entry header */ + tpe_header.blkno = BufferGetBlockNumber(buf); + tpe_header.tpe_num_map_entries = num_map_entries; + tpe_header.tpe_num_slots = INITIAL_TRANS_SLOTS_IN_TPD_ENTRY; + tpe_header.tpe_flags = TPE_ONE_BYTE; + + size_tpd_e_map = num_map_entries * sizeof(uint8); + size_tpd_e_slots = INITIAL_TRANS_SLOTS_IN_TPD_ENTRY * sizeof(TransInfo); + + /* form transaction slots for tpd entry */ + tpd_e_trans_slots = (TransInfo *) palloc(size_tpd_e_slots); + + for (i = 0; i < INITIAL_TRANS_SLOTS_IN_TPD_ENTRY; i++) + { + tpd_e_trans_slots[i].xid_epoch = 0; + tpd_e_trans_slots[i].xid = InvalidTransactionId; + tpd_e_trans_slots[i].urec_ptr = InvalidUndoRecPtr; + } + + /* + * Move the last transaction slot information from heap page to first + * transaction slot in TPD entry. + */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpd_e_trans_slots[0].xid_epoch = last_trans_slot_info.xid_epoch; + tpd_e_trans_slots[0].xid = last_trans_slot_info.xid; + tpd_e_trans_slots[0].urec_ptr = last_trans_slot_info.urec_ptr; + + /* form tpd entry */ + *size_tpd_entry = SizeofTPDEntryHeader + size_tpd_e_map + + size_tpd_e_slots; + + tpd_entry = (char *) palloc0(*size_tpd_entry); + + memcpy(tpd_entry, (char *) &tpe_header, SizeofTPDEntryHeader); + + tpd_entry_data = tpd_entry + SizeofTPDEntryHeader; + + /* + * Update the itemid to slot map for all the itemid's that point to last + * transaction slot in the heap page. + */ + for (offnum = FirstOffsetNumber; + offnum <= PageGetMaxOffsetNumber(page); + offnum = OffsetNumberNext(offnum)) + { + ZHeapTupleHeader tup_hdr; + ItemId itemid; + int trans_slot; + + itemid = PageGetItemId(page, offnum); + + if (ItemIdIsDead(itemid)) + continue; + + if (!ItemIdIsUsed(itemid)) + { + if (!ItemIdHasPendingXact(itemid)) + continue; + trans_slot = ItemIdGetTransactionSlot(itemid); + } + else if (ItemIdIsDeleted(itemid)) + { + trans_slot = ItemIdGetTransactionSlot(itemid); + } + else + { + tup_hdr = (ZHeapTupleHeader) PageGetItem(page, itemid); + trans_slot = ZHeapTupleHeaderGetXactSlot(tup_hdr); + } + + /* + * Update the itemid to slot map in tpd entry such that all of the + * offsets corresponding to tuples that were pointing to last slot in + * heap page will now point to first slot in TPD entry. + */ + if (trans_slot == ZHEAP_PAGE_TRANS_SLOTS) + { + uint8 offset_tpd_e_loc; + + offset_tpd_e_loc = ZHEAP_PAGE_TRANS_SLOTS + 1; + + /* + * One byte access shouldn't cause unaligned access, but using memcpy + * for the sake of consistency. 
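+			 * For example, a tuple at offset number 1 that was using the page's last slot gets map index 0 set to ZHEAP_PAGE_TRANS_SLOTS + 1, the number by which callers refer to the first TPD transaction slot.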
+ */ + memcpy(tpd_entry_data + (offnum - 1), (char *) &offset_tpd_e_loc, + sizeof(uint8)); + } + } + + memcpy(tpd_entry + SizeofTPDEntryHeader + size_tpd_e_map, + (char *) tpd_e_trans_slots, size_tpd_e_slots); + + /* + * The first slot location has been already assigned to last slot moved + * from heap page. We can safely reserve the second slot location in new + * TPD entry. + */ + *reserved_slot = ZHEAP_PAGE_TRANS_SLOTS + 2; + + /* be tidy */ + pfree(tpd_e_trans_slots); + + return tpd_entry; +} + +/* + * ExtendTPDEntry - Allocate bigger TPD entry and copy the contents of old TPD + * entry to new TPD entry. + * + * We are quite conservative in extending the TPD entry because the bigger the + * entry more is the chance of space wastage. OTOH, it might have some + * performance impact because smaller the entry more is the chance of getting + * a request for extension. However, we feel that as we have a mechanism to + * reuse the transaction slots, we shouldn't get the frequent requests for + * extending the entry, at the very least not in performance critical paths. + */ +static void +ExtendTPDEntry(Relation relation, Buffer heapbuf, TransInfo *trans_slots, + OffsetNumber offnum, int buf_idx, int old_num_map_entries, + int old_num_slots, int *reserved_slot_no, UndoRecPtr *urecptr, + bool *tpd_e_pruned) +{ + TPDEntryHeaderData old_tpd_e_header, tpd_e_header; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + Page old_tpd_page; + Page heappage; + Buffer old_tpd_buf; + Buffer metabuf = InvalidBuffer; + BlockNumber tpdblk; + OffsetNumber max_page_offnum; + Size tpdpageFreeSpace; + Size new_size_tpd_entry, + old_size_tpd_entry, + new_size_tpd_e_map, + new_size_tpd_e_slots, + old_size_tpd_e_map, + old_size_tpd_e_slots; + ItemId itemId; + OffsetNumber tpdItemOff; + int old_loc_tpd_e_map, + old_loc_trans_slots; + int max_reqd_map_entries; + int max_reqd_slots = 0; + int num_free_slots = 0; + int slot_no; + int entries_removed; + uint16 tpd_e_offset; + char *tpd_entry; + bool already_exists; + bool allocate_new_tpd_page = false; + bool update_tpd_inplace, + tpd_pruned; + + heappage = BufferGetPage(heapbuf); + max_page_offnum = PageGetMaxOffsetNumber(heappage); + + /* + * Select the maximum among required offset num, current map + * entries, and highest page offset as the number of offset-map + * entries for a new TPD entry. We do allocate few additional map + * entries so that we don't need to allocate new TPD entry soon. + * Also, we ensure that we don't try to allocate more than + * MaxZHeapTuplesPerPage offset-map entries. + */ + max_reqd_map_entries = Max(offnum, + Max(old_num_map_entries, max_page_offnum)); + max_reqd_map_entries += ADDITIONAL_MAP_ELEM_IN_TPD_ENTRY; + max_reqd_map_entries = Min(max_reqd_map_entries, + MaxZHeapTuplesPerPage); + + /* + * If there are more than fifty percent of empty slots available, + * then we don't extend the number of transaction slots in new TPD + * entry. Otherwise also, we extend the slots quite conservately + * to avoid space wastage. + */ + if (*reserved_slot_no != InvalidXactSlotId) + { + for (slot_no = 0; slot_no < old_num_slots; slot_no++) + { + /* + * Check for the number of unreserved transaction slots in + * the TPD entry. 
+ */ + if (trans_slots[slot_no].xid == InvalidTransactionId) + num_free_slots++; + } + + if (num_free_slots >= old_num_slots / 2) + max_reqd_slots = old_num_slots; + } + + if (max_reqd_slots <= 0) + max_reqd_slots = old_num_slots + INITIAL_TRANS_SLOTS_IN_TPD_ENTRY; + + /* + * The transaction slots in TPD entry are in addition to the + * maximum slots in the heap page. The one-byte offset-map can + * store maximum upto 255 transaction slot number. + */ + if (max_reqd_slots + ZHEAP_PAGE_TRANS_SLOTS < 256) + new_size_tpd_e_map = max_reqd_map_entries * sizeof(uint8); + else + new_size_tpd_e_map = max_reqd_map_entries * sizeof(uint32); + new_size_tpd_e_slots = max_reqd_slots * sizeof(TransInfo); + new_size_tpd_entry = SizeofTPDEntryHeader + new_size_tpd_e_map + + new_size_tpd_e_slots; + + /* TPD entries can't span in multiple blocks. */ + if (new_size_tpd_entry > MaxTPDEntrySize) + { + /* + * FIXME: what we should do if TPD entry can not fit in one page? + * currently we are forcing it to retry. + */ + elog(LOG, "TPD entry size (%lu) cannot be greater than \ + MaxTPDEntrySize (%u)", new_size_tpd_entry, MaxTPDEntrySize); + + *reserved_slot_no = InvalidXactSlotId; + return; + } + + if (buf_idx != -1) + old_tpd_buf = tpd_buffers[buf_idx].buf; + else + { + /* + * The last slot in page has the address of the required TPD + * entry. + */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + buf_idx = GetTPDBuffer(relation, tpdblk, InvalidBuffer, + TPD_BUF_FIND_OR_ENTER, &already_exists); + old_tpd_buf = tpd_buffers[buf_idx].buf; + + /* + * The tpd buffer must already exists as before reaching here + * we must have called TPDPageGetTransactionSlots which would + * have read the required buffer. + */ + Assert(already_exists); + } + + /* The last slot in page has the address of the required TPD entry. */ + old_tpd_page = BufferGetPage(old_tpd_buf); + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(BufferGetPage(heapbuf)); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + itemId = PageGetItemId(old_tpd_page, tpdItemOff); + old_size_tpd_entry = ItemIdGetLength(itemId); + + /* We have a lock on tpd page, so nobody can prune our tpd entry. */ + Assert(ItemIdIsUsed(itemId)); + + tpdpageFreeSpace = PageGetTPDFreeSpace(old_tpd_page); + + /* + * Call TPDPagePrune to ensure that it will create a space adjacent to + * current offset for the new (bigger) TPD entry, if possible. + */ + entries_removed = TPDPagePrune(relation, old_tpd_buf, NULL, tpdItemOff, + (new_size_tpd_entry - old_size_tpd_entry), + true, &update_tpd_inplace, &tpd_pruned); + /* + * If the item got pruned, then clear the TPD slot from the page and + * return. The entry can be pruned by ourselves or by anyone else + * as we release the lock during pruning if the page is empty. 
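+	 * The caller then sees *tpd_e_pruned set to true along with InvalidXactSlotId; TPDPageReserveTransSlot uses that to fall back to the now-free last transaction slot on the heap page.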
+ */ + if (PageIsEmpty(old_tpd_page) || + !ItemIdIsUsed(itemId) || + tpd_pruned) + { + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + *reserved_slot_no = InvalidXactSlotId; + *tpd_e_pruned = true; + if (metabuf != InvalidBuffer) + ReleaseBuffer(metabuf); + return; + } + + if (!update_tpd_inplace) + { + if (entries_removed > 0) + tpdpageFreeSpace = PageGetTPDFreeSpace(old_tpd_page); + + if (tpdpageFreeSpace < new_size_tpd_entry) + { + /* + * XXX Here, we can have an optimization such that instead of + * allocating a new page, we can search other TPD pages starting + * from the first_used_tpd_page till we reach last_used_tpd_page. + * It is not clear whether such an optimization can help because + * checking all the TPD pages isn't free either. + */ + metabuf = ReadBuffer(relation, ZHEAP_METAPAGE); + allocate_new_tpd_page = true; + } + else + { + /* + * We must not reach here because if the new tpd entry can fit on the same + * page, then update_tpd_inplace would have been set by TPDPagePrune. + */ + Assert(false); + } + } + + /* form tpd entry header */ + tpd_e_header.blkno = BufferGetBlockNumber(heapbuf); + tpd_e_header.tpe_num_map_entries = max_reqd_map_entries; + tpd_e_header.tpe_num_slots = max_reqd_slots; + + /* + * The transaction slots in TPD entry are in addition to the + * maximum slots in the heap page. The one-byte offset-map can + * store maximum upto 255 transaction slot number. + */ + if (max_reqd_slots + ZHEAP_PAGE_TRANS_SLOTS < 256) + tpd_e_header.tpe_flags = TPE_ONE_BYTE; + else + tpd_e_header.tpe_flags = TPE_FOUR_BYTE; + + /* + * If we reach here, then the page must be a TPD page. + */ + Assert(PageGetSpecialSize(old_tpd_page) == MAXALIGN(sizeof(TPDPageOpaqueData))); + + /* TPD entry isn't pruned */ + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &old_tpd_e_header, old_tpd_page + tpd_e_offset, SizeofTPDEntryHeader); + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(old_tpd_e_header)); + + /* This TPD entry can't be for some other block. */ + Assert(old_tpd_e_header.blkno == BufferGetBlockNumber(heapbuf)); + + if (old_tpd_e_header.tpe_flags & TPE_ONE_BYTE) + old_size_tpd_e_map = old_tpd_e_header.tpe_num_map_entries * sizeof(uint8); + else + { + Assert(old_tpd_e_header.tpe_flags & TPE_FOUR_BYTE); + old_size_tpd_e_map = old_tpd_e_header.tpe_num_map_entries * sizeof(uint32); + } + + old_size_tpd_e_slots = old_tpd_e_header.tpe_num_slots * sizeof(TransInfo); + old_loc_tpd_e_map = tpd_e_offset + SizeofTPDEntryHeader; + old_loc_trans_slots = tpd_e_offset + SizeofTPDEntryHeader + old_size_tpd_e_map; + + /* Form new TPD entry. Whatever be the case, header will remain same. */ + tpd_entry = (char *) palloc0(new_size_tpd_entry); + memcpy(tpd_entry, (char *) &tpd_e_header, SizeofTPDEntryHeader); + + if (tpd_e_header.tpe_flags & TPE_ONE_BYTE || + (tpd_e_header.tpe_flags & TPE_FOUR_BYTE && + old_tpd_e_header.tpe_flags & TPE_FOUR_BYTE)) + { + /* + * Caller must try to extend the TPD entry iff either there is a + * need of more offset-map entries or transaction slots. + */ + Assert(tpd_e_header.tpe_num_map_entries >= old_num_map_entries); + Assert(tpd_e_header.tpe_num_slots >= old_num_slots); + + /* + * In this case we can copy the contents of old offset-map and + * old transaction slots as it is. 
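+		 * The transaction slot array is copied to its position after the (possibly larger) new offset map, and any additional map entries remain zero from the palloc0 above.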
+ */ + memcpy(tpd_entry + SizeofTPDEntryHeader, + old_tpd_page + old_loc_tpd_e_map, + old_size_tpd_e_map); + memcpy(tpd_entry + SizeofTPDEntryHeader + new_size_tpd_e_map, + old_tpd_page + old_loc_trans_slots, + old_size_tpd_e_slots); + } + else if (tpd_e_header.tpe_flags & TPE_FOUR_BYTE && + old_tpd_e_header.tpe_flags & TPE_ONE_BYTE) + { + int i; + char *new_start_loc, + *old_start_loc; + + /* + * Here, we can't directly copy the offset-map because we are + * expanding it from one byte to four-bytes. We need to perform + * byte-by-byte copy for the offset-map. However, transaction + * slots can be directly copied as the size for each slot still + * remains same. + */ + Assert(old_tpd_e_header.tpe_num_map_entries == old_num_map_entries); + + new_start_loc = tpd_entry + SizeofTPDEntryHeader; + old_start_loc = old_tpd_page + old_loc_tpd_e_map; + + for (i = 0; i < old_num_map_entries; i++) + { + memcpy(new_start_loc, old_start_loc, sizeof(uint8)); + old_start_loc += sizeof(uint8); + new_start_loc += sizeof(uint32); + } + + memcpy(tpd_entry + SizeofTPDEntryHeader + new_size_tpd_e_map, + old_tpd_page + old_loc_trans_slots, + old_size_tpd_e_slots); + } + else + { + /* All the valid cases should have been dealt above. */ + Assert(false); + } + + + if (update_tpd_inplace) + { + TPDEntryUpdate(relation, old_tpd_buf, tpd_e_offset, tpdItemOff, + tpd_entry, new_size_tpd_entry); + } + else + { + /* + * Note that if we have to allocate a new page, we must delete the + * old tpd entry in old tpd buffer. + */ + TPDAllocatePageAndAddEntry(relation, metabuf, heapbuf, old_tpd_buf, + tpdItemOff, tpd_entry, new_size_tpd_entry, + allocate_new_tpd_page, + allocate_new_tpd_page); + } + + /* Release the meta buffer. */ + if (metabuf != InvalidBuffer) + ReleaseBuffer(metabuf); + + if (*reserved_slot_no == InvalidXactSlotId) + { + int slot_no; + + trans_slots = (TransInfo *) (tpd_entry + SizeofTPDEntryHeader + new_size_tpd_e_map); + + for (slot_no = 0; slot_no < tpd_e_header.tpe_num_slots; slot_no++) + { + /* Check for an unreserved transaction slot in the TPD entry */ + if (trans_slots[slot_no].xid == InvalidTransactionId) + { + *reserved_slot_no = slot_no; + break; + } + } + } + + if (*reserved_slot_no != InvalidXactSlotId) + *urecptr = trans_slots[*reserved_slot_no].urec_ptr; + + pfree(tpd_entry); + + return; +} + +/* + * TPDPageAddEntry - Add the given to TPD entry on the page and + * move the upper to point to the next free location. + * + * Return value is the offset at which it was inserted, or InvalidOffsetNumber + * if the item is not inserted for any reason. A WARNING is issued indicating + * the reason for the refusal. + * + * This function is same as PageAddItemExtended, but has different + * alignment requirements. We might want to deal with that by passing + * additional argument to PageAddItemExtended, but for now we have kept + * it as a separate function. 
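+ * + * In particular, the entry size is not MAXALIGNed here; TPD entries are stored unaligned and are read back with memcpy (see TPDPageGetTransactionSlots).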
+ */ +OffsetNumber +TPDPageAddEntry(Page tpdpage, char *tpd_entry, Size size, + OffsetNumber offnum) +{ + PageHeader phdr = (PageHeader) tpdpage; + OffsetNumber limit; + ItemId itemId; + uint16 lower; + uint16 upper; + + /* + * Be wary about corrupted page pointers + */ + if (phdr->pd_lower < SizeOfPageHeaderData || + phdr->pd_lower > phdr->pd_upper || + phdr->pd_upper > phdr->pd_special || + phdr->pd_special > BLCKSZ) + ereport(PANIC, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u", + phdr->pd_lower, phdr->pd_upper, phdr->pd_special))); + + /* + * Select offsetNumber to place the new item at + */ + limit = OffsetNumberNext(PageGetMaxOffsetNumber(tpdpage)); + + lower = phdr->pd_lower + sizeof(ItemIdData); + + if (OffsetNumberIsValid(offnum)) + { + /* + * In TPD, we send valid offset number only during recovery. Hence, + * we don't need to shuffle the offsets as well. + */ + Assert(InRecovery); + if (offnum < limit) + { + itemId = PageGetItemId(phdr, offnum); + if (ItemIdIsUsed(itemId) || ItemIdHasStorage(itemId)) + { + elog(WARNING, "will not overwrite a used ItemId"); + return InvalidOffsetNumber; + } + } + } + else + { + /* offsetNumber was not passed in, so find a free slot */ + /* if no free slot, we'll put it at limit (1st open slot) */ + if (PageHasFreeLinePointers(phdr)) + { + /* + * Look for "recyclable" (unused) ItemId. We check for no storage + * as well, just to be paranoid --- unused items should never have + * storage. + */ + for (offnum = 1; offnum < limit; offnum++) + { + itemId = PageGetItemId(phdr, offnum); + if (!ItemIdIsUsed(itemId) && !ItemIdHasStorage(itemId)) + break; + } + if (offnum >= limit) + { + /* the hint is wrong, so reset it */ + PageClearHasFreeLinePointers(phdr); + } + } + else + { + offnum = limit; + } + } + + /* Reject placing items beyond the first unused line pointer */ + if (offnum > limit) + { + elog(WARNING, "specified item offset is too large"); + return InvalidOffsetNumber; + } + + /* Reject placing items beyond tpd boundary */ + if (offnum > MaxTPDTuplesPerPage) + { + elog(WARNING, "can't put more than MaxTPDTuplesPerPage items in a tpd page"); + return InvalidOffsetNumber; + } + + /* + * Compute new lower and upper pointers for page, see if it'll fit. + * + * Note: do arithmetic as signed ints, to avoid mistakes if, say, + * alignedSize > pd_upper. + */ + if (offnum == limit) + lower = phdr->pd_lower + sizeof(ItemIdData); + else + lower = phdr->pd_lower; + + upper = (int) phdr->pd_upper - (int) size; + + if (lower > upper) + return InvalidOffsetNumber; + + /* OK to insert the item. */ + itemId = PageGetItemId(phdr, offnum); + + /* set the item pointer */ + ItemIdSetNormal(itemId, upper, size); + + /* copy the item's data onto the page */ + memcpy((char *) tpdpage + upper, tpd_entry, size); + + phdr->pd_lower = (LocationIndex) lower; + phdr->pd_upper = (LocationIndex) upper; + + return offnum; +} + +/* + * SetTPDLocation - Set TPD entry location in the last transaction slot of + * heap page and indicate the same in page. 
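+ * + * The TPD block number is stored in the slot's xid_epoch field and the entry's offset number in the OFFSET_MASK bits of its xid field; PD_PAGE_HAS_TPD_SLOT is set in the page header so that readers know to interpret the last slot this way.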
+ */ +void +SetTPDLocation(Buffer heapbuffer, Buffer tpdbuffer, OffsetNumber offset) +{ + Page heappage; + PageHeader phdr; + ZHeapPageOpaque opaque; + + heappage = BufferGetPage(heapbuffer); + phdr = (PageHeader) heappage; + + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + + /* clear the last transaction slot info */ + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid_epoch = 0; + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid = + InvalidTransactionId; + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].urec_ptr = + InvalidUndoRecPtr; + /* set TPD location in last transaction slot */ + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid_epoch = + BufferGetBlockNumber(tpdbuffer); + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid = + (opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid & ~OFFSET_MASK) | offset; + + phdr->pd_flags |= PD_PAGE_HAS_TPD_SLOT; +} + +/* + * ClearTPDLocation - Clear TPD entry location in the last transaction slot of + * heap page and indicate the same in page. + */ +void +ClearTPDLocation(Buffer heapbuf) +{ + PageHeader phdr; + ZHeapPageOpaque opaque; + Page heappage; + int frozen_slots = ZHEAP_PAGE_TRANS_SLOTS - 1; + + heappage = BufferGetPage(heapbuf); + phdr = (PageHeader) heappage; + + /* + * Before clearing the TPD slot, mark all the tuples pointing to TPD slot + * as frozen. + */ + zheap_freeze_or_invalidate_tuples(heapbuf, 1, &frozen_slots, + true, false); + + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + + /* clear the last transaction slot info */ + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid_epoch = 0; + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid = + InvalidTransactionId; + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].urec_ptr = + InvalidUndoRecPtr; + + phdr->pd_flags &= ~PD_PAGE_HAS_TPD_SLOT; +} + +/* + * LogClearTPDLocation - Write a WAL record for clearing TPD location. + */ +static void +LogClearTPDLocation(Buffer buffer) +{ + XLogRecPtr recptr; + + XLogBeginInsert(); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + recptr = XLogInsert(RM_TPD_ID, XLOG_TPD_CLEAR_LOCATION); + + PageSetLSN(BufferGetPage(buffer), recptr); +} + +/* + * LogAndClearTPDLocation - Clear the TPD location from heap page and WAL log + * it. + */ +static void +LogAndClearTPDLocation(Relation relation, Buffer heapbuf, bool *tpd_e_pruned) +{ + START_CRIT_SECTION(); + + ClearTPDLocation(heapbuf); + MarkBufferDirty(heapbuf); + if (RelationNeedsWAL(relation)) + LogClearTPDLocation(heapbuf); + + END_CRIT_SECTION(); + + if (tpd_e_pruned) + *tpd_e_pruned = true; +} + +/* + * TPDInitPage - Initialize the TPD page. + */ +void +TPDInitPage(Page page, Size pageSize) +{ + TPDPageOpaque tpdopaque; + + PageInit(page, pageSize, sizeof(TPDPageOpaqueData)); + + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(page); + tpdopaque->tpd_prevblkno = InvalidBlockNumber; + tpdopaque->tpd_nextblkno = InvalidBlockNumber; + tpdopaque->tpd_latest_xid_epoch = 0; + tpdopaque->tpd_latest_xid = InvalidTransactionId; +} + +/* + * TPDFreePage - Remove the TPD page from the chain. + * + * Initialize the empty page and remove it from the chain. This function + * ensures that the buffers are locked such that the block that exists prior + * in chain gets locked first and meta page is locked at end after which no + * existing page is locked. This is to avoid deadlocks, see comments atop + * function TPDAllocatePageAndAddEntry. + * + * We expect that the caller must have acquired EXCLUSIVE lock on the current + * buffer (buf) and will be responsible for releasing the same. 
+ * + * Returns true, if we are able to successfully remove the page from chain, + * false, otherwise. + */ +bool +TPDFreePage(Relation rel, Buffer buf, BufferAccessStrategy bstrategy) +{ + TPDPageOpaque tpdopaque, + prevtpdopaque, + nexttpdopaque; + ZHeapMetaPage metapage; + Page page = NULL, + prevpage = NULL, + nextpage = NULL; + BlockNumber curblkno PG_USED_FOR_ASSERTS_ONLY = InvalidBlockNumber; + BlockNumber prevblkno = InvalidBlockNumber; + BlockNumber nextblkno = InvalidBlockNumber; + Buffer prevbuf = InvalidBuffer; + Buffer nextbuf = InvalidBuffer; + Buffer metabuf = InvalidBuffer; + bool update_meta = false; + + page = BufferGetPage(buf); + curblkno = BufferGetBlockNumber(buf); + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(page); + + prevblkno = tpdopaque->tpd_prevblkno; + + if (BlockNumberIsValid(prevblkno)) + { + /* + * Before taking the lock on previous block, we need to release the + * lock on the current buffer. This is to ensure that we always lock + * the buffers in the order in which they are present in list. This + * avoids the deadlock risks. See atop TPDAllocatePageAndAddEntry. + */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM, prevblkno, RBM_NORMAL, + bstrategy); + LockBuffer(prevbuf, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + /* + * After reaquiring the lock, check whether page is still empty, if + * not, then we don't need to do anything. As of now, there is no + * possiblity that the empty page in the chain can be reused, however, + * in future, we can use it. + */ + page = BufferGetPage(buf); + if (!PageIsEmpty(page)) + { + UnlockReleaseBuffer(prevbuf); + return false; + } + tpdopaque = (TPDPageOpaque)PageGetSpecialPointer(page); + } + + nextblkno = tpdopaque->tpd_nextblkno; + + if (BlockNumberIsValid(nextblkno)) + { + nextbuf = ReadBufferExtended(rel, MAIN_FORKNUM, nextblkno, RBM_NORMAL, + bstrategy); + LockBuffer(nextbuf, BUFFER_LOCK_EXCLUSIVE); + } + + metabuf = ReadBufferExtended(rel, MAIN_FORKNUM, ZHEAP_METAPAGE, + RBM_NORMAL, bstrategy); + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + metapage = ZHeapPageGetMeta(BufferGetPage(metabuf)); + Assert(metapage->zhm_magic == ZHEAP_MAGIC); + + START_CRIT_SECTION(); + + /* Update the current page. */ + tpdopaque->tpd_prevblkno = InvalidBlockNumber; + tpdopaque->tpd_nextblkno = InvalidBlockNumber; + tpdopaque->tpd_latest_xid_epoch = 0; + tpdopaque->tpd_latest_xid = InvalidTransactionId; + + MarkBufferDirty(buf); + + /* Update the previous page. */ + if (BufferIsValid(prevbuf)) + { + prevpage = BufferGetPage(prevbuf); + prevtpdopaque = (TPDPageOpaque) PageGetSpecialPointer(prevpage); + + prevtpdopaque->tpd_nextblkno = nextblkno; + MarkBufferDirty(prevbuf); + } + /* Update the next page. */ + if (BufferIsValid(nextbuf)) + { + nextpage = BufferGetPage(nextbuf); + nexttpdopaque = (TPDPageOpaque) PageGetSpecialPointer(nextpage); + + nexttpdopaque->tpd_prevblkno = prevblkno; + MarkBufferDirty(nextbuf); + } + + /* + * Update the metapage. If the previous or next block is invalid, the + * page to be removed could be first or last page in the chain in which + * case we need to update the metapage accordingly. + */ + if (!BlockNumberIsValid(prevblkno) || + !BlockNumberIsValid(nextblkno)) + { + if (!BlockNumberIsValid(prevblkno) && !BlockNumberIsValid(nextblkno)) + { + /* + * If there is no prevblock and nextblock, then the current page + * must be the first and the last page. 
+ */ + Assert(metapage->zhm_first_used_tpd_page == curblkno); + Assert(metapage->zhm_last_used_tpd_page == curblkno); + metapage->zhm_first_used_tpd_page = InvalidBlockNumber; + metapage->zhm_last_used_tpd_page = InvalidBlockNumber; + } + else if (!BlockNumberIsValid(prevblkno)) + { + /* + * If there is no prevblock, then the current block must be first + * used page. + */ + Assert(BlockNumberIsValid(nextblkno)); + metapage->zhm_first_used_tpd_page = nextblkno; + } + else if (!BlockNumberIsValid(nextblkno)) + { + /* + * If next block is invalid, then the current block must be last + * used page. + */ + Assert(metapage->zhm_last_used_tpd_page == curblkno); + metapage->zhm_last_used_tpd_page = prevblkno; + } + else + { + /* one of the above two conditions must be satisfied. */ + Assert(false); + } + + MarkBufferDirty(metabuf); + update_meta = true; + } + else + { + /* + * If next block is a valid block then the last used page can't be the + * current page being removed. + */ + Assert(metapage->zhm_last_used_tpd_page != curblkno); + } + + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + xl_tpd_free_page xlrec; + uint8 info = XLOG_TPD_FREE_PAGE; + + xlrec.prevblkno = prevblkno; + xlrec.nextblkno = nextblkno; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfTPDFreePage); + if (BufferIsValid(prevbuf)) + XLogRegisterBuffer(0, prevbuf, REGBUF_STANDARD); + XLogRegisterBuffer(1, buf, REGBUF_STANDARD); + if (BufferIsValid(nextbuf)) + XLogRegisterBuffer(2, nextbuf, REGBUF_STANDARD); + if (update_meta) + { + xl_zheap_metadata xl_meta; + + info |= XLOG_TPD_INIT_PAGE; + xl_meta.first_used_tpd_page = metapage->zhm_first_used_tpd_page; + xl_meta.last_used_tpd_page = metapage->zhm_last_used_tpd_page; + XLogRegisterBuffer(3, metabuf, REGBUF_STANDARD | REGBUF_WILL_INIT); + XLogRegisterBufData(3, (char *) &xl_meta, SizeOfMetaData); + } + + recptr = XLogInsert(RM_TPD_ID, info); + + if (BufferIsValid(prevbuf)) + PageSetLSN(prevpage, recptr); + PageSetLSN(page, recptr); + if (BufferIsValid(nextbuf)) + PageSetLSN(nextpage, recptr); + if (update_meta) + PageSetLSN(BufferGetPage(metabuf), recptr); + } + + END_CRIT_SECTION(); + + if (BufferIsValid(prevbuf)) + UnlockReleaseBuffer(prevbuf); + if (BufferIsValid(nextbuf)) + UnlockReleaseBuffer(nextbuf); + UnlockReleaseBuffer(metabuf); + + return true; +} + +/* + * TPDEntryUpdate - Update the TPD entry inplace and write a WAL record for + * the same. + */ +static void +TPDEntryUpdate(Relation relation, Buffer tpd_buf, uint16 tpd_e_offset, + OffsetNumber tpd_item_off, char *tpd_entry, + Size size_tpd_entry) +{ + Page tpd_page = BufferGetPage(tpd_buf); + ItemId itemId = PageGetItemId(tpd_page, tpd_item_off); + + START_CRIT_SECTION(); + + memcpy((char *) (tpd_page + tpd_e_offset), + tpd_entry, + size_tpd_entry); + ItemIdChangeLen(itemId, size_tpd_entry); + + MarkBufferDirty(tpd_buf); + + if (RelationNeedsWAL(relation)) + { + XLogRecPtr recptr; + + XLogBeginInsert(); + XLogRegisterBuffer(0, tpd_buf, REGBUF_STANDARD); + XLogRegisterBufData(0, (char *) &tpd_item_off, sizeof(OffsetNumber)); + XLogRegisterBufData(0, (char *) tpd_entry, size_tpd_entry); + + recptr = XLogInsert(RM_TPD_ID, XLOG_INPLACE_UPDATE_TPD_ENTRY); + + PageSetLSN(tpd_page, recptr); + } + + END_CRIT_SECTION(); +} + +/* + * TPDAllocatePageAndAddEntry - Allocates a new tpd page if required and adds + * tpd entry. + * + * This function takes care of inserting the new tpd entry to a page and + * allows to mark old entry as deleted when requested. 
The typical actions + * performed in this function are (a) add a TPD entry in the newly allocated + * or an existing TPD page, (b) update the metapage to indicate the addion of + * a new page (if allocated) and for updating zhm_last_used_tpd_page, (c) mark + * the old TPD entry as prunable, (c) update the new offset number of TPD + * entry in heap page. Finally write a WAL entry and corresponding replay + * routine to cover all these operations and release all the buffers. + * + * The other aspect this function needs to ensure is the buffer locking order + * to avoid deadlocks. We operate on four buffers: metapage buffer, old tpd + * page buffer, last used tpd page buffer and new tpd page buffer. The old + * buffer is always locked by the caller and we ensure that this function first + * locks the last used tpd page buffer, then locks the metapage buffer and then + * the newly allocated page buffer. This locking can never lead to deadlock as + * old buffer block will always be lesser (or equal) than last buffer block. + * However, if anytime, we change our startegy such that after acquiring + * metapage lock, we try to acquire lock on any existing page, then we might + * need to reconsider our locking order. + */ +static void +TPDAllocatePageAndAddEntry(Relation relation, Buffer metabuf, Buffer pagebuf, + Buffer old_tpd_buf, OffsetNumber old_off_num, + char *tpd_entry, Size size_tpd_entry, + bool add_new_tpd_page, bool delete_old_entry) +{ + ZHeapMetaPage metapage = NULL; + TPDPageOpaque tpdopaque, last_tpdopaque; + TPDEntryHeader old_tpd_entry; + Buffer last_used_tpd_buf = InvalidBuffer; + Buffer tpd_buf; + Page tpdpage; + BlockNumber prevblk = InvalidBlockNumber; + BlockNumber nextblk = InvalidBlockNumber; + BlockNumber last_used_tpd_page; + OffsetNumber offset_num; + bool free_last_used_tpd_buf = false; + + if (add_new_tpd_page) + { + BlockNumber targetBlock = InvalidBlockNumber; + Size len = MaxTPDEntrySize; + int buf_idx; + bool needLock; + bool already_exists; + + /* + * While adding a new page, if we've to delete the old entry, + * the old buffer must be valid. Else, it should be invalid. + */ + Assert(!delete_old_entry || BufferIsValid(old_tpd_buf)); + Assert(delete_old_entry || !BufferIsValid(old_tpd_buf)); + + /* Before extending the relation, check the FSM for free page. */ + targetBlock = GetPageWithFreeSpace(relation, len); + + while (targetBlock != InvalidBlockNumber) + { + Page page; + Size pageFreeSpace; + + tpd_buf = ReadBuffer(relation, targetBlock); + + /* + * We need to take the lock on meta page before new page to avoid + * deadlocks. See comments atop of function. + */ + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + /* + * It's possible that FSM returns a zheap page on which the current + * backend already holds a lock in exclusive mode. Hence, try using + * conditional lock. If it can't get the lock immediately, extend + * the relation and allocate a new TPD block. + */ + if (ConditionalLockBuffer(tpd_buf)) + { + page = BufferGetPage(tpd_buf); + + if (PageIsEmpty(page)) + { + GetTPDBuffer(relation, targetBlock, tpd_buf, + TPD_BUF_FIND_OR_KNOWN_ENTER, &already_exists); + break; + } + + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + + if (PageGetSpecialSize(page) == MAXALIGN(sizeof(TPDPageOpaqueData))) + pageFreeSpace = PageGetTPDFreeSpace(page); + else + pageFreeSpace = PageGetZHeapFreeSpace(page); + + /* + * Update FSM as to condition of this page, and ask for another page + * to try. 
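+				 * Note that the FSM may have handed us a regular zheap page rather than a TPD page, which is why the free space above is computed with either PageGetTPDFreeSpace or PageGetZHeapFreeSpace depending on the page's special-space size.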
+ */ + targetBlock = RecordAndGetPageWithFreeSpace(relation, + targetBlock, + pageFreeSpace, + len); + UnlockReleaseBuffer(tpd_buf); + } + else + { + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + ReleaseBuffer(tpd_buf); + targetBlock = InvalidBlockNumber; + } + } + + /* Extend the relation, if required? */ + if (targetBlock == InvalidBlockNumber) + { + /* Acquire the extension lock, if extension is required. */ + needLock = !RELATION_IS_LOCAL(relation); + if (needLock) + LockRelationForExtension(relation, ExclusiveLock); + + buf_idx = GetTPDBuffer(relation, P_NEW, InvalidBuffer, + TPD_BUF_ENTER, &already_exists); + /* This must be a new buffer. */ + Assert(!already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(tpd_buf, BUFFER_LOCK_EXCLUSIVE); + + if (needLock) + UnlockRelationForExtension(relation, ExclusiveLock); + } + + /* + * Lock the last tpd page in list, so that we can append new page to + * it. + */ + metapage = ZHeapPageGetMeta(BufferGetPage(metabuf)); + Assert(metapage->zhm_magic == ZHEAP_MAGIC); + +recheck_meta: + last_used_tpd_page = metapage->zhm_last_used_tpd_page; + if (metapage->zhm_last_used_tpd_page != InvalidBlockNumber) + { + last_used_tpd_page = metapage->zhm_last_used_tpd_page; + buf_idx = GetTPDBuffer(relation, last_used_tpd_page, InvalidBuffer, + TPD_BUF_FIND, &already_exists); + + if (buf_idx == -1) + { + last_used_tpd_buf = ReadBuffer(relation, + metapage->zhm_last_used_tpd_page); + /* + * To avoid deadlock, ensure that we never acquire lock on any existing + * block after acquiring meta page lock. See comments atop function. + */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + LockBuffer(last_used_tpd_buf, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + if (metapage->zhm_last_used_tpd_page != last_used_tpd_page) + { + UnlockReleaseBuffer(last_used_tpd_buf); + goto recheck_meta; + } + + free_last_used_tpd_buf = true; + } + else + { + /* We don't need to lock the buffer, if it is already locked */ + last_used_tpd_buf = tpd_buffers[buf_idx].buf; + } + } + } + else + { + /* old buffer must be valid */ + Assert(BufferIsValid(old_tpd_buf)); + tpd_buf = old_tpd_buf; + } + + /* NO EREPORT(ERROR) from here till changes are logged */ + START_CRIT_SECTION(); + + tpdpage = BufferGetPage(tpd_buf); + + /* Update metapage and add the new TPD page in the TPD page list. */ + if (add_new_tpd_page) + { + BlockNumber tpdblkno; + + /* Page must be new or empty. */ + Assert(PageIsEmpty(tpdpage) || PageIsNew(tpdpage)); + + TPDInitPage(tpdpage, BufferGetPageSize(tpd_buf)); + tpdblkno = BufferGetBlockNumber(tpd_buf); + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + + if (metapage->zhm_first_used_tpd_page == InvalidBlockNumber) + metapage->zhm_first_used_tpd_page = tpdblkno; + else + { + Assert(BufferIsValid(last_used_tpd_buf)); + + /* Add the new TPD page at the end of the TPD page list. */ + last_tpdopaque = (TPDPageOpaque) + PageGetSpecialPointer(BufferGetPage(last_used_tpd_buf)); + prevblk = tpdopaque->tpd_prevblkno = metapage->zhm_last_used_tpd_page; + nextblk = last_tpdopaque->tpd_nextblkno = tpdblkno; + + MarkBufferDirty(last_used_tpd_buf); + } + + metapage->zhm_last_used_tpd_page = tpdblkno; + + MarkBufferDirty(metabuf); + } + else + { + /* + * TPD chain should remain unchanged. + */ + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + prevblk = tpdopaque->tpd_prevblkno; + nextblk = tpdopaque->tpd_nextblkno; + } + + /* Mark the old tpd entry as dead before adding new entry. 
*/ + if (delete_old_entry) + { + Page otpdpage; + ItemId old_item_id; + + /* We must be adding new TPD entry into a new page. */ + Assert(add_new_tpd_page); + Assert(old_tpd_buf != tpd_buf); + + otpdpage = BufferGetPage(old_tpd_buf); + old_item_id = PageGetItemId(otpdpage, old_off_num); + old_tpd_entry = (TPDEntryHeader) PageGetItem(otpdpage, old_item_id); + old_tpd_entry->tpe_flags |= TPE_DELETED; + MarkBufferDirty(old_tpd_buf); + } + + /* Add tpd entry to page */ + offset_num = TPDPageAddEntry(tpdpage, tpd_entry, size_tpd_entry, + InvalidOffsetNumber); + if (offset_num == InvalidOffsetNumber) + elog(PANIC, "failed to add TPD entry"); + + MarkBufferDirty(tpd_buf); + + /* + * Now that the last transaction slot from heap page has moved to TPD, + * we need to assign TPD location in the last transaction slot of heap. + */ + SetTPDLocation(pagebuf, tpd_buf, offset_num); + MarkBufferDirty(pagebuf); + + /* XLOG stuff */ + if (RelationNeedsWAL(relation)) + { + XLogRecPtr recptr; + xl_tpd_allocate_entry xlrec; + xl_zheap_metadata metadata; + int bufflags = 0; + uint8 info = XLOG_ALLOCATE_TPD_ENTRY; + + xlrec.offnum = offset_num; + xlrec.prevblk = prevblk; + xlrec.nextblk = nextblk; + xlrec.flags = 0; + + /* + * If we are adding TPD entry to a new page, we will reinit the page + * during replay. + */ + if (add_new_tpd_page) + { + info |= XLOG_TPD_INIT_PAGE; + bufflags |= REGBUF_WILL_INIT; + } + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfTPDAllocateEntry); + XLogRegisterBuffer(0, tpd_buf, REGBUF_STANDARD | bufflags); + XLogRegisterBufData(0, (char *) tpd_entry, size_tpd_entry); + XLogRegisterBuffer(1, pagebuf, REGBUF_STANDARD); + if (add_new_tpd_page) + { + XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD); + metadata.first_used_tpd_page = metapage->zhm_first_used_tpd_page; + metadata.last_used_tpd_page = metapage->zhm_last_used_tpd_page; + XLogRegisterBufData(2, (char *) &metadata, SizeOfMetaData); + + if (BufferIsValid(last_used_tpd_buf)) + XLogRegisterBuffer(3, last_used_tpd_buf, REGBUF_STANDARD); + + /* The old entry is deleted only when new page is allocated. */ + if (delete_old_entry) + { + /* + * If the last tpd buffer and the old tpd buffer are same, we + * don't need to register old_tpd_buf. + */ + if (last_used_tpd_buf == old_tpd_buf) + { + xlrec.flags = XLOG_OLD_TPD_BUF_EQ_LAST_TPD_BUF; + XLogRegisterBufData(3, (char *) &old_off_num, sizeof(OffsetNumber)); + } + else + { + XLogRegisterBuffer(4, old_tpd_buf, REGBUF_STANDARD); + XLogRegisterBufData(4, (char *) &old_off_num, sizeof(OffsetNumber)); + } + } + } + + recptr = XLogInsert(RM_TPD_ID, info); + + PageSetLSN(tpdpage, recptr); + PageSetLSN(BufferGetPage(pagebuf), recptr); + if (add_new_tpd_page) + { + PageSetLSN(BufferGetPage(metabuf), recptr); + if (BufferIsValid(last_used_tpd_buf)) + PageSetLSN(BufferGetPage(last_used_tpd_buf), recptr); + if (delete_old_entry) + PageSetLSN(BufferGetPage(old_tpd_buf), recptr); + } + } + + END_CRIT_SECTION(); + + if (add_new_tpd_page) + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + if (free_last_used_tpd_buf) + { + Assert (last_used_tpd_buf != tpd_buf); + UnlockReleaseBuffer(last_used_tpd_buf); + } +} + +/* + * TPDAllocateAndReserveTransSlot - Allocates a new TPD entry and reserve a + * transaction slot in that entry. 
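+ * The first slot of the new TPD entry takes over the transaction information moved from the heap page's last slot, so the slot reserved for the caller is the second one, returned as ZHEAP_PAGE_TRANS_SLOTS + 2 (TPD slots are numbered after the heap page's own slots).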
+ * + * To allocate a new TPD entry, we first check if there is a space in any + * existing TPD page starting from the last used TPD page and incase we + * don't find any such page, then allocate a new TPD page and add it to the + * existing list of TPD pages. + * + * We intentionally don't release the TPD buffer here as that will be + * released once we have updated the transaction slot with required + * information. Caller must call UnlockReleaseTPDBuffers after doing + * necessary updates. + * + * pagebuf - Caller must have an exclusive lock on this buffer. + */ +int +TPDAllocateAndReserveTransSlot(Relation relation, Buffer pagebuf, + OffsetNumber offnum, UndoRecPtr *urec_ptr) +{ + ZHeapMetaPage metapage; + Buffer metabuf; + Buffer tpd_buf = InvalidBuffer; + Page heappage; + uint32 first_used_tpd_page; + uint32 last_used_tpd_page; + char *tpd_entry; + Size size_tpd_entry; + int reserved_slot = InvalidXactSlotId; + int buf_idx; + bool allocate_new_tpd_page = false; + bool update_meta = false; + bool already_exists; + + metabuf = ReadBuffer(relation, ZHEAP_METAPAGE); + LockBuffer(metabuf, BUFFER_LOCK_SHARE); + metapage = ZHeapPageGetMeta(BufferGetPage(metabuf)); + Assert(metapage->zhm_magic == ZHEAP_MAGIC); + + first_used_tpd_page = metapage->zhm_first_used_tpd_page; + last_used_tpd_page = metapage->zhm_last_used_tpd_page; + + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + + heappage = BufferGetPage(pagebuf); + + if (last_used_tpd_page != InvalidBlockNumber) + { + Size tpdpageFreeSpace; + Size size_tpd_e_map, size_tpd_entry, size_tpd_e_slots; + uint16 num_map_entries; + OffsetNumber max_required_offset; + + if (OffsetNumberIsValid(offnum)) + max_required_offset = offnum; + else + max_required_offset = PageGetMaxOffsetNumber(heappage); + num_map_entries = max_required_offset + + ADDITIONAL_MAP_ELEM_IN_TPD_ENTRY; + + size_tpd_e_map = num_map_entries * sizeof(uint8); + size_tpd_e_slots = INITIAL_TRANS_SLOTS_IN_TPD_ENTRY * sizeof(TransInfo); + size_tpd_entry = SizeofTPDEntryHeader + size_tpd_e_map + + size_tpd_e_slots; + + buf_idx = GetTPDBuffer(relation, last_used_tpd_page, InvalidBuffer, + TPD_BUF_FIND_OR_ENTER, &already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + /* We don't need to lock the buffer, if it is already locked */ + if (!already_exists) + LockBuffer(tpd_buf, BUFFER_LOCK_EXCLUSIVE); + tpdpageFreeSpace = PageGetTPDFreeSpace(BufferGetPage(tpd_buf)); + + if (tpdpageFreeSpace < size_tpd_entry) + { + int entries_removed; + + /* + * Prune the TPD page to make space for new TPD entries. After + * pruning, check again to see if the TPD entry can be accomodated + * on the page. We can't afford to free the page while pruning as + * we need to use it to insert the TPD entry. + */ + entries_removed = TPDPagePrune(relation, tpd_buf, NULL, + InvalidOffsetNumber, 0, false, NULL, + NULL); + + if (entries_removed > 0) + tpdpageFreeSpace = PageGetTPDFreeSpace(BufferGetPage(tpd_buf)); + + if (tpdpageFreeSpace < size_tpd_entry) + { + /* + * XXX Here, we can have an optimization such that instead of + * allocating a new page, we can search other TPD pages starting + * from the first_used_tpd_page till we reach last_used_tpd_page. + * It is not clear whether such an optimization can help because + * checking all the TPD pages isn't free either. 
+ */ + if (!already_exists) + ReleaseLastTPDBuffer(tpd_buf); + allocate_new_tpd_page = true; + } + } + } + + if (allocate_new_tpd_page || + (last_used_tpd_page == InvalidBlockNumber && + first_used_tpd_page == InvalidBlockNumber)) + { + tpd_buf = InvalidBuffer; + update_meta = true; + } + + /* Allocate a new TPD entry */ + tpd_entry = AllocateAndFormTPDEntry(pagebuf, offnum, &size_tpd_entry, + &reserved_slot); + Assert (tpd_entry != NULL); + + TPDAllocatePageAndAddEntry(relation, metabuf, pagebuf, tpd_buf, + InvalidOffsetNumber, tpd_entry, size_tpd_entry, + update_meta, false); + + ReleaseBuffer(metabuf); + + /* + * Here, we don't release the tpdbuffer in which we have added the newly + * allocated TPD entry as that will be relased once we update the required + * trasaction slot info in it. The caller will later call TPDPageSetUndo + * to update the required information. + */ + + pfree(tpd_entry); + + /* + * As this is always a fresh transaction slot, so we can assume that + * there is no preexisting undo record pointer. + */ + *urec_ptr = InvalidUndoRecPtr; + + return reserved_slot; +} + +/* + * TPDPageGetTransactionSlots - Get the transaction slots array stored in TPD + * entry. This is a helper routine for TPDPageReserveTransSlot and + * TPDPageGetSlotIfExists. + * + * The tpd entries are stored unaligned, so we need to be careful to read + * them. We use memcpy to avoid unaligned reads. + * + * It is quite possible that the TPD entry containing required transaction slot + * information got pruned away (as all the transaction entries are all-visible) + * by the time caller tries to enquire about it. See atop + * TPDPageGetTransactionSlotInfo for more details on how we deal with pruned + * TPD entries. + * + * This function returns a pointer to an array of transaction slots, it is the + * responsibility of the caller to free it. + */ +TransInfo * +TPDPageGetTransactionSlots(Relation relation, Buffer heapbuf, + OffsetNumber offnum, bool keepTPDBufLock, + bool checkOffset, int *num_map_entries, + int *num_trans_slots, int *tpd_buf_id, + bool *tpd_e_pruned, bool *alloc_bigger_map) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + Page heappage = BufferGetPage(heapbuf); + ZHeapPageOpaque zopaque; + TransInfo *trans_slots = NULL; + TransInfo last_trans_slot_info; + Buffer tpd_buf; + Page tpdpage; + BlockNumber tpdblk; + BlockNumber lastblock; + TPDEntryHeaderData tpd_e_hdr; + Size size_tpd_e_map; + Size size_tpd_e_slots; + int loc_trans_slots; + int buf_idx; + OffsetNumber tpdItemOff; + ItemId itemId; + uint16 tpd_e_offset; + bool already_exists; + + phdr = (PageHeader) heappage; + + if (tpd_buf_id) + *tpd_buf_id = -1; + if (num_map_entries) + *num_map_entries = 0; + if (num_trans_slots) + *num_trans_slots = 0; + if (tpd_e_pruned) + *tpd_e_pruned = false; + if (alloc_bigger_map) + *alloc_bigger_map = false; + + /* Heap page must have TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + /* The last slot in page has the address of the required TPD entry. */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + + if (!InRecovery) + { + lastblock = RelationGetNumberOfBlocks(relation); + + if (lastblock <= tpdblk) + { + /* + * The required TPD block has been pruned and then truncated away + * which means all transaction slots on that page are older than + * oldestXidHavingUndo. 
So, we can assume the transaction slot is + * frozen aka transaction is all-visible and can clear the slot from + * heap tuples. + */ + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + goto failed_and_buf_not_locked; + } + } + + /* + * Fetch the required TPD entry. We need to lock the buffer in exclusive + * mode as we later want to set the values in one of the transaction slot. + */ + buf_idx = GetTPDBuffer(relation, tpdblk, InvalidBuffer, + TPD_BUF_FIND_OR_ENTER, &already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + /* We don't need to lock the buffer, if it is already locked */ + if (!already_exists) + { + LockBuffer(tpd_buf, BUFFER_LOCK_EXCLUSIVE); + if (tpd_buf_id) + *tpd_buf_id = buf_idx; + } + + tpdpage = BufferGetPage(tpd_buf); + + /* Check whether TPD entry can exist on page? */ + if (PageIsEmpty(tpdpage)) + { + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + goto failed; + } + if (PageGetSpecialSize(tpdpage) != MAXALIGN(sizeof(TPDPageOpaqueData))) + { + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + goto failed; + } + + itemId = PageGetItemId(tpdpage, tpdItemOff); + + /* TPD entry has been pruned */ + if (!ItemIdIsUsed(itemId)) + { + BufferDesc *bufHdr = GetBufferDescriptor(heapbuf - 1); + + if (BufferIsLocal(heapbuf) || + LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), + LW_EXCLUSIVE)) + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + goto failed; + } + + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* + * This TPD entry is for some other block, so we can't continue. This + * indicates that the TPD entry corresponding to heap block has been + * pruned and some other TPD entry has been moved at its location. + */ + if (tpd_e_hdr.blkno != BufferGetBlockNumber(heapbuf)) + { + BufferDesc *bufHdr = GetBufferDescriptor(heapbuf - 1); + + if (BufferIsLocal(heapbuf) || + LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), + LW_EXCLUSIVE)) + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + goto failed; + } + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(tpd_e_hdr)); + + /* Allow caller to allocate a bigger TPD entry instead. */ + if (checkOffset && offnum > tpd_e_hdr.tpe_num_map_entries) + { + /* + * If the caller has requested to check offset, it must be prepared to + * allocate a TPD entry. + */ + Assert(alloc_bigger_map); + *alloc_bigger_map = true; + } + + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint8); + else + { + Assert(tpd_e_hdr.tpe_flags & TPE_FOUR_BYTE); + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint32); + } + + if (num_map_entries) + *num_map_entries = tpd_e_hdr.tpe_num_map_entries; + if (num_trans_slots) + *num_trans_slots = tpd_e_hdr.tpe_num_slots; + size_tpd_e_slots = tpd_e_hdr.tpe_num_slots * sizeof(TransInfo); + loc_trans_slots = tpd_e_offset + SizeofTPDEntryHeader + size_tpd_e_map; + + trans_slots = (TransInfo *) palloc(size_tpd_e_slots); + memcpy((char *) trans_slots, tpdpage + loc_trans_slots, size_tpd_e_slots); + +failed: + if (!keepTPDBufLock) + { + /* + * If we don't want to retain the buffer lock, it must have been taken + * now. We can't release the already existing lock taken. 
+ */ + Assert(!already_exists); + ReleaseLastTPDBuffer(tpd_buf); + + if (tpd_buf_id) + *tpd_buf_id = -1; + } + +failed_and_buf_not_locked: + return trans_slots; +} + +/* + * TPDPageReserveTransSlot - Reserve the available transaction in current TPD + * entry if any, otherwise, return InvalidXactSlotId. + * + * We intentionally don't release the TPD buffer here as that will be + * released once we have updated the transaction slot with required + * information. However, if no free slot is available, then we release the + * buffer. Caller must call UnlockReleaseTPDBuffers after doing necessary + * updates if it is able to reserve a slot. + */ +int +TPDPageReserveTransSlot(Relation relation, Buffer buf, OffsetNumber offnum, + UndoRecPtr *urec_ptr, bool *lock_reacquired) +{ + TransInfo *trans_slots; + int slot_no; + int num_map_entries; + int num_slots; + int result_slot_no = InvalidXactSlotId; + int buf_idx; + bool tpd_e_pruned; + bool alloc_bigger_map; + + trans_slots = TPDPageGetTransactionSlots(relation, buf, offnum, + true, true, &num_map_entries, + &num_slots, &buf_idx, + &tpd_e_pruned, &alloc_bigger_map); + if (tpd_e_pruned) + { + Assert(trans_slots == NULL); + Assert(num_slots == 0); + } + + for (slot_no = 0; slot_no < num_slots; slot_no++) + { + /* Check for an unreserved transaction slot in the TPD entry */ + if (trans_slots[slot_no].xid == InvalidTransactionId) + { + result_slot_no = slot_no; + *urec_ptr = trans_slots[slot_no].urec_ptr; + goto extend_entry_if_required; + } + } + + /* no transaction slot available, try to reuse some existing slot */ + if (num_slots > 0 && + PageFreezeTransSlots(relation, buf, lock_reacquired, trans_slots, num_slots)) + { + pfree(trans_slots); + + /* + * If the lock is re-acquired inside, then the callers must recheck + * that whether they can still perform the required operation. + */ + if (*lock_reacquired) + return InvalidXactSlotId; + + trans_slots = TPDPageGetTransactionSlots(relation, buf, offnum, true, + true, &num_map_entries, + &num_slots, &buf_idx, + &tpd_e_pruned, &alloc_bigger_map); + /* + * We are already holding TPD buffer lock so the TPD entry can not be + * pruned away. + */ + Assert(!tpd_e_pruned); + + for (slot_no = 0; slot_no < num_slots; slot_no++) + { + if (trans_slots[slot_no].xid == InvalidTransactionId) + { + *urec_ptr = trans_slots[slot_no].urec_ptr; + result_slot_no = slot_no; + goto extend_entry_if_required; + } + } + + /* + * After freezing transaction slots, we should get at least one free + * slot. + */ + Assert(result_slot_no != InvalidXactSlotId); + } + +extend_entry_if_required: + + /* + * Allocate a bigger TPD entry if either we need a bigger offset-map + * or there is no unreserved slot available provided TPD entry is not + * pruned in which case we can use last slot on the heap page. + */ + if (!tpd_e_pruned && + (alloc_bigger_map || result_slot_no == InvalidXactSlotId)) + { + ExtendTPDEntry(relation, buf, trans_slots, offnum, buf_idx, + num_map_entries, num_slots, &result_slot_no, urec_ptr, + &tpd_e_pruned); + } + + /* be tidy */ + if (trans_slots != NULL) + pfree(trans_slots); + + /* + * The transaction slots in TPD entry are in addition to the maximum slots + * in the heap page. + */ + if (result_slot_no != InvalidXactSlotId) + result_slot_no += (ZHEAP_PAGE_TRANS_SLOTS + 1); + else if (buf_idx != -1) + ReleaseLastTPDBuffer(tpd_buffers[buf_idx].buf); + + /* + * As TPD entry is pruned, so last transaction slot must be free on the + * heap page. 
+ */ + if (tpd_e_pruned) + { + Assert(result_slot_no == InvalidXactSlotId); + result_slot_no = ZHEAP_PAGE_TRANS_SLOTS; + *urec_ptr = InvalidUndoRecPtr; + } + + return result_slot_no; +} + +/* + * TPDPageGetSlotIfExists - Get the existing slot for the required transaction + * if exists, otherwise, return InvalidXactSlotId. + * + * This is similar to the TPDPageReserveTransSlot except that here we find the + * exisiting transaction slot instead of reserving a new one. + * + * keepTPDBufLock - This indicates whether we need to retain the lock on TPD + * buffer if we are able to reserve a transaction slot. + */ +int +TPDPageGetSlotIfExists(Relation relation, Buffer heapbuf, OffsetNumber offnum, + uint32 epoch, TransactionId xid, UndoRecPtr *urec_ptr, + bool keepTPDBufLock, bool checkOffset) +{ + TransInfo *trans_slots; + int slot_no; + int num_map_entries; + int num_slots; + int result_slot_no = InvalidXactSlotId; + int buf_idx; + bool tpd_e_pruned; + bool alloc_bigger_map; + + trans_slots = TPDPageGetTransactionSlots(relation, + heapbuf, + offnum, + keepTPDBufLock, + checkOffset, + &num_map_entries, + &num_slots, + &buf_idx, + &tpd_e_pruned, + &alloc_bigger_map); + if (tpd_e_pruned) + { + Assert(trans_slots == NULL); + Assert(num_slots == 0); + } + + for (slot_no = 0; slot_no < num_slots; slot_no++) + { + /* Check if already have a slot in the TPD entry */ + if (trans_slots[slot_no].xid_epoch == epoch && + trans_slots[slot_no].xid == xid) + { + result_slot_no = slot_no; + *urec_ptr = trans_slots[slot_no].urec_ptr; + break; + } + } + + /* + * Allocate a bigger TPD entry if we get the required slot in TPD entry, + * but it requires a bigger offset-map. + */ + if (result_slot_no != InvalidXactSlotId && alloc_bigger_map) + { + ExtendTPDEntry(relation, heapbuf, trans_slots, offnum, buf_idx, + num_map_entries, num_slots, &result_slot_no, urec_ptr, + &tpd_e_pruned); + } + + /* be tidy */ + if (trans_slots) + pfree(trans_slots); + + /* + * The transaction slots in TPD entry are in addition to the maximum slots + * in the heap page. + */ + if (result_slot_no != InvalidXactSlotId) + result_slot_no += (ZHEAP_PAGE_TRANS_SLOTS + 1); + else if (buf_idx != -1) + ReleaseLastTPDBuffer(tpd_buffers[buf_idx].buf); + + return result_slot_no; +} + +/* + * TPDPageGetTransactionSlotInfo - Get the required transaction information from + * heap page's TPD entry. + * + * It is quite possible that the TPD entry containing required transaction slot + * information got pruned away (as all the transaction entries are all-visible) + * by the time caller tries to enquire about it. One might expect that if the + * TPD entry is pruned, the corresponding affected tuples should be updated to + * reflect the same, however, we don't do that due to multiple reasons (a) we + * don't access heap pages from TPD layer, it can lead to deadlock, (b) it + * might lead to dirtying a lot of pages and random I/O. However, the first + * time we detect it and we have exclusive lock on page, we update the + * corresponding heap page. + * + * We can consider TPD entry to be pruned under following conditions: (a) the + * tpd block doesn't exist (pruned and truncated by vacuum), (b) the tpd block + * is empty which means all the entries in it are pruned, (c) the tpd block + * has been reused as a heap page, (d) the corresponding TPD entry has been + * pruned away and either the itemid is unused or is reused for some other + * block's TPD entry. 
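+ *
+ * A rough caller-side sketch of how the pruned case surfaces to callers
+ * (variable names here are only illustrative):
+ *
+ * slot_id = TPDPageGetTransactionSlotInfo(heapbuf, trans_slot, offnum,
+ *                                         &epoch, &xid, &urec_ptr,
+ *                                         true, false);
+ * if (slot_id == ZHTUP_SLOT_FROZEN)
+ *     ... treat the tuple's transaction as all-visible ...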
+ * + * NoTPDBufLock - This indicates that caller doesn't have lock on required tpd + * buffer in which case we need to read and lock the required buffer. + */ +int +TPDPageGetTransactionSlotInfo(Buffer heapbuf, int trans_slot, + OffsetNumber offset, uint32 *epoch, + TransactionId *xid, UndoRecPtr *urec_ptr, + bool NoTPDBufLock, bool keepTPDBufLock) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + ZHeapPageOpaque zopaque; + TransInfo trans_slot_info, last_trans_slot_info; + RelFileNode rnode; + Buffer tpdbuffer; + Page tpdpage; + Page heappage; + BlockNumber tpdblk, heapblk; + ForkNumber forknum; + TPDEntryHeaderData tpd_e_hdr; + Size size_tpd_e_map; + uint32 tpd_e_num_map_entries; + int trans_slot_loc; + int trans_slot_id = trans_slot; + char *tpd_entry_data; + OffsetNumber tpdItemOff; + ItemId itemId; + uint16 tpd_e_offset; + char relpersistence; + + heappage = BufferGetPage(heapbuf); + phdr = (PageHeader) heappage; + + /* Heap page must have a TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + + if (NoTPDBufLock) + { + SMgrRelation smgr; + BlockNumber lastblock; + + BufferGetTag(heapbuf, &rnode, &forknum, &heapblk); + + if (InRecovery) + relpersistence = RELPERSISTENCE_PERMANENT; + else + { + Oid reloid; + + reloid = RelidByRelfilenode(rnode.spcNode, rnode.relNode); + relpersistence = get_rel_persistence(reloid); + } + + smgr = smgropen(rnode, + relpersistence == RELPERSISTENCE_TEMP ? + MyBackendId : InvalidBackendId); + + lastblock = smgrnblocks(smgr, forknum); + + /* required block exists? */ + if (tpdblk < lastblock) + { + tpdbuffer = ReadBufferWithoutRelcache(rnode, forknum, tpdblk, RBM_NORMAL, + NULL, relpersistence); + if (keepTPDBufLock) + LockBuffer(tpdbuffer, BUFFER_LOCK_EXCLUSIVE); + else + LockBuffer(tpdbuffer, BUFFER_LOCK_SHARE); + } + else + { + /* + * The required TPD block has been pruned and then truncated away + * which means all transaction slots on that page are older than + * oldestXidHavingUndo. So, we can assume the transaction slot is + * frozen aka transaction is all-visible. + */ + goto slot_is_frozen_and_buf_not_locked; + } + } + else + { + int buf_idx; + bool already_exists PG_USED_FOR_ASSERTS_ONLY; + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpdbuffer = tpd_buffers[buf_idx].buf; + } + + tpdpage = BufferGetPage(tpdbuffer); + + /* Check whether TPD entry can exist on page? */ + if (PageIsEmpty(tpdpage)) + goto slot_is_frozen; + if (PageGetSpecialSize(tpdpage) != MAXALIGN(sizeof(TPDPageOpaqueData))) + goto slot_is_frozen; + + itemId = PageGetItemId(tpdpage, tpdItemOff); + + /* TPD entry has been pruned */ + if (!ItemIdIsUsed(itemId)) + { + /* + * Ideally, we can clear the TPD location from heap page, but for that + * we need to have an exclusive lock on the heap page. As this API + * can be called with shared lock on a heap page, we can't perform + * that action. + * + * XXX If it ever turns out to be a performance problem, we can + * release the current lock and acuire the exclusive lock on heap + * page. 
Also we need to ensure that the lock on TPD page also needs + * to be released and reacquired as we always follow the protocol of + * acquiring the lock on heap page first and then on TPD page, doing + * it otherway can lead to undetected deadlock. + */ + goto slot_is_frozen; + } + + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* + * This TPD entry is for some other block, so we can't continue. This + * indicates that the TPD entry corresponding to heap block has been + * pruned and some other TPD entry has been moved at its location. + */ + if (tpd_e_hdr.blkno != BufferGetBlockNumber(heapbuf)) + goto slot_is_frozen; + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(tpd_e_hdr)); + + tpd_e_num_map_entries = tpd_e_hdr.tpe_num_map_entries; + tpd_entry_data = tpdpage + tpd_e_offset + SizeofTPDEntryHeader; + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + size_tpd_e_map = tpd_e_num_map_entries * sizeof(uint8); + else + { + Assert(tpd_e_hdr.tpe_flags & TPE_FOUR_BYTE); + size_tpd_e_map = tpd_e_num_map_entries * sizeof(uint32); + } + + /* + * If the caller has passed transaction slot number that belongs to TPD + * entry, then we directly go and fetch the required info from the slot. + */ + if (offset != InvalidOffsetNumber) + { + /* + * The item for which we want to get the transaction slot information + * must be present in this TPD entry. + */ + Assert (offset <= tpd_e_num_map_entries); + + /* Get TPD entry map */ + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + { + uint8 offset_tpd_e_loc; + + /* + * One byte access shouldn't cause unaligned access, but using memcpy + * for the sake of consistency. + */ + memcpy((char *) &offset_tpd_e_loc, tpd_entry_data + (offset - 1), + sizeof(uint8)); + trans_slot_id = offset_tpd_e_loc; + } + else + { + uint32 offset_tpd_e_loc; + + memcpy((char *) &offset_tpd_e_loc, + tpd_entry_data + (sizeof(uint32) * (offset - 1)), + sizeof(uint32)); + trans_slot_id = offset_tpd_e_loc; + } + } + + /* Transaction must belong to TPD entry. */ + Assert(trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS); + + /* Get the required transaction slot information. */ + trans_slot_loc = (trans_slot_id - ZHEAP_PAGE_TRANS_SLOTS - 1) * + sizeof(TransInfo); + memcpy((char *) &trans_slot_info, + tpd_entry_data + size_tpd_e_map + trans_slot_loc, + sizeof(TransInfo)); + + /* Update the required output */ + if (epoch) + *epoch = trans_slot_info.xid_epoch; + if (xid) + *xid = trans_slot_info.xid; + if (urec_ptr) + *urec_ptr = trans_slot_info.urec_ptr; + + if (NoTPDBufLock && !keepTPDBufLock) + UnlockReleaseBuffer(tpdbuffer); + + return trans_slot_id; + +slot_is_frozen: + if (NoTPDBufLock && !keepTPDBufLock) + UnlockReleaseBuffer(tpdbuffer); + +slot_is_frozen_and_buf_not_locked: + trans_slot_id = ZHTUP_SLOT_FROZEN; + if (epoch) + *epoch = 0; + if (xid) + *xid = InvalidTransactionId; + if (urec_ptr) + *urec_ptr = InvalidUndoRecPtr; + + return trans_slot_id; +} + +/* + * TPDPageSetTransactionSlotInfo - Set the transaction information for a given + * transaction slot in the TPD entry. + * + * Caller must ensure that it has required lock on tpd buffer which is going to + * be updated here. We can't lock the buffer here as this API is supposed to + * be called from critical section and lock acquisition can fail. 
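+ *
+ * The "latest xid" bookkeeping below compares 64-bit (epoch, xid) values; a
+ * minimal sketch of the intended ordering, assuming MakeEpochXid packs the
+ * epoch into the high 32 bits:
+ *
+ * full_xid = ((uint64) epoch << 32) | xid;
+ *
+ * so tpd_latest_xid only advances when the new transaction is logically
+ * newer, even across xid wraparound.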
+ */ +void +TPDPageSetTransactionSlotInfo(Buffer heapbuf, int trans_slot_id, + uint32 epoch, TransactionId xid, + UndoRecPtr urec_ptr) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + ZHeapPageOpaque zopaque; + TransInfo trans_slot_info, last_trans_slot_info; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + Buffer tpd_buf; + Page tpdpage; + Page heappage; + BlockNumber tpdblk; + TPDEntryHeaderData tpd_e_hdr; + TPDPageOpaque tpdopaque; + uint64 tpd_latest_xid_epoch, current_xid_epoch; + Size size_tpd_e_map; + int trans_slot_loc; + int buf_idx; + char *tpd_entry_data; + OffsetNumber tpdItemOff; + ItemId itemId; + uint16 tpd_e_offset; + bool already_exists PG_USED_FOR_ASSERTS_ONLY; + + heappage = BufferGetPage(heapbuf); + phdr = (PageHeader) heappage; + + /* Heap page must have a TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + Assert(BufferIsValid(tpd_buf)); + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + Assert(BufferGetBlockNumber(tpd_buf) == tpdblk); + + tpdpage = BufferGetPage(tpd_buf); + itemId = PageGetItemId(tpdpage, tpdItemOff); + + /* + * TPD entry can't go away as we acquire the lock while reserving the slot + * from TPD entry and keep it till we set the required transaction + * information in the slot. + */ + Assert(ItemIdIsUsed(itemId)); + + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* TPD entry can't be pruned. */ + Assert(tpd_e_hdr.blkno == BufferGetBlockNumber(heapbuf)); + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(tpd_e_hdr)); + + tpd_entry_data = tpdpage + tpd_e_offset + SizeofTPDEntryHeader; + + /* Get TPD entry map */ + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint8); + else + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint32); + + /* Set the required transaction slot information. */ + trans_slot_loc = (trans_slot_id - ZHEAP_PAGE_TRANS_SLOTS - 1) * + sizeof(TransInfo); + trans_slot_info.xid_epoch = epoch; + trans_slot_info.xid = xid; + trans_slot_info.urec_ptr = urec_ptr; + + memcpy(tpd_entry_data + size_tpd_e_map + trans_slot_loc, + (char *) &trans_slot_info, + sizeof(TransInfo)); + + /* Update latest transaction information on the page. */ + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + tpd_latest_xid_epoch = (uint64) tpdopaque->tpd_latest_xid_epoch; + tpd_latest_xid_epoch = MakeEpochXid(tpd_latest_xid_epoch, + tpdopaque->tpd_latest_xid); + current_xid_epoch = (uint64) epoch; + current_xid_epoch = MakeEpochXid(current_xid_epoch, xid); + if (tpd_latest_xid_epoch < current_xid_epoch) + { + tpdopaque->tpd_latest_xid_epoch = epoch; + tpdopaque->tpd_latest_xid = xid; + } + + MarkBufferDirty(tpd_buf); +} + +/* + * GetTPDEntryData - Helper function for TPDPageGetOffsetMap and + * TPDPageSetOffsetMap. + * + * Caller must ensure that it has acquired lock on the TPD buffer. 
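+ *
+ * For reference, the on-page layout of a TPD entry decoded here (and by the
+ * routines above) is:
+ *
+ * [TPDEntryHeaderData]
+ * [offset map: tpe_num_map_entries entries, 1 byte each for TPE_ONE_BYTE
+ *  entries or 4 bytes each for TPE_FOUR_BYTE entries]
+ * [transaction slots: tpe_num_slots * sizeof(TransInfo)]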
+ */ +static char * +GetTPDEntryData(Buffer heapbuf, int *num_entries, int *entry_size, + Buffer *tpd_buffer) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + Buffer tpd_buf; + Page tpdpage; + Page heappage; + BlockNumber tpdblk; + TPDEntryHeaderData tpd_e_hdr; + int buf_idx; + char *tpd_entry_data; + OffsetNumber tpdItemOff; + ItemId itemId; + uint16 tpd_e_offset; + bool already_exists PG_USED_FOR_ASSERTS_ONLY; + + heappage = BufferGetPage(heapbuf); + phdr = (PageHeader) heappage; + + /* Heap page must have a TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + + /* + * Here we don't need to check if the tpd block is pruned and truncated + * away because the tpd buffer must be locked before. + */ + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + Assert(BufferIsValid(tpd_buf)); + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + Assert(BufferGetBlockNumber(tpd_buf) == tpdblk); + + tpdpage = BufferGetPage(tpd_buf); + + /* Check whether TPD entry can exist on page? */ + if (PageIsEmpty(tpdpage)) + return NULL; + if (PageGetSpecialSize(tpdpage) != MAXALIGN(sizeof(TPDPageOpaqueData))) + return NULL; + + itemId = PageGetItemId(tpdpage, tpdItemOff); + + /* TPD entry is already pruned away. */ + if (!ItemIdIsUsed(itemId)) + return NULL; + + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* TPD entry is pruned away. */ + if (tpd_e_hdr.blkno != BufferGetBlockNumber(heapbuf)) + return NULL; + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(tpd_e_hdr)); + + tpd_entry_data = tpdpage + tpd_e_offset + SizeofTPDEntryHeader; + *num_entries = tpd_e_hdr.tpe_num_map_entries; + + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + *entry_size = sizeof(uint8); + else + *entry_size = sizeof(uint32); + + if (tpd_buffer) + *tpd_buffer = tpd_buf; + + return tpd_entry_data; +} + +/* + * TPDPageSetOffsetMapSlot - Set the transaction slot for given offset in TPD + * offset map. + * + * Caller must ensure that it has required lock on tpd buffer which is going to + * be updated here. We can't lock the buffer here as this API is supposed to + * be called from critical section and lock acquisition can fail. + */ +void +TPDPageSetOffsetMapSlot(Buffer heapbuf, int trans_slot_id, + OffsetNumber offset) +{ + char *tpd_entry_data; + int num_entries = 0, + entry_size = 0; + Buffer tpd_buf = InvalidBuffer; + + tpd_entry_data = GetTPDEntryData(heapbuf, &num_entries, &entry_size, + &tpd_buf); + + /* + * Caller would have checked that the entry is not pruned after taking + * lock on the tpd page. + */ + Assert(tpd_entry_data); + + Assert (offset <= num_entries); + + if (entry_size == sizeof(uint8)) + { + uint8 offset_tpd_e_loc = trans_slot_id; + + /* + * One byte access shouldn't cause unaligned access, but using memcpy + * for the sake of consistency. 
+ */ + memcpy(tpd_entry_data + (offset - 1), + (char *) &offset_tpd_e_loc, + sizeof(uint8)); + } + else + { + uint32 offset_tpd_e_loc; + + offset_tpd_e_loc = trans_slot_id; + memcpy(tpd_entry_data + (sizeof(uint32) * (offset - 1)), + (char *) &offset_tpd_e_loc, + sizeof(uint32)); + } + + MarkBufferDirty(tpd_buf); +} + +/* + * TPDPageGetOffsetMap - Get the Offset map array of the TPD entry. + * + * This function copy the offset map into tpd_offset_map array allocated by the + * caller. + */ +void +TPDPageGetOffsetMap(Buffer heapbuf, char *tpd_offset_map, int map_size) +{ + char *tpd_entry_data; + int num_entries, entry_size; + + tpd_entry_data = GetTPDEntryData(heapbuf, &num_entries, &entry_size, NULL); + + /* + * Caller would have checked that the entry is not pruned after taking + * lock on the tpd page. + */ + Assert(tpd_entry_data); + + Assert(map_size == num_entries * entry_size); + + memcpy(tpd_offset_map, tpd_entry_data, map_size); +} + +/* + * TPDPageGetOffsetMapSize - Get the Offset map size of the TPD entry. + * + * Caller must ensure that it has acquired lock on tpd buffer corresponding to + * passed heap buffer. + * + * Returns 0, if the tpd entry gets pruned away, otherwise, return the size of + * TPD offset-map. + */ +int +TPDPageGetOffsetMapSize(Buffer heapbuf) +{ + int num_entries, entry_size; + + if (GetTPDEntryData(heapbuf, &num_entries, &entry_size, NULL) == NULL) + return 0; + + return (num_entries * entry_size); +} + +/* + * TPDPageSetOffsetMap - Overwrite TPD offset map array with input offset map + * array. + * + * This function returns a pointer to an array of offset map, it is the + * responsibility of the caller to free it. + * + * Caller must ensure that it has acquired lock on the TPD buffer which is + * going to be updated here. + */ +void +TPDPageSetOffsetMap(Buffer heapbuf, char *tpd_offset_map) +{ + char *tpd_entry_data; + int num_entries = 0, + entry_size = 0; + Buffer tpd_buf = InvalidBuffer; + + /* This function should only be called during recovery. */ + Assert(InRecovery); + + tpd_entry_data = GetTPDEntryData(heapbuf, &num_entries, &entry_size, + &tpd_buf); + + /* Entry can't be pruned during recovery. */ + Assert(tpd_entry_data); + + memcpy(tpd_entry_data, tpd_offset_map, num_entries * entry_size); + + MarkBufferDirty(tpd_buf); +} + +/* + * TPDPageSetUndo - Set the transaction information for a given transaction + * slot in the TPD entry. The difference between this function and + * TPDPageSetTransactionSlotInfo is that here along with transaction + * info, we update the offset to transaction slot map in the TPD entry as + * well. + * + * Caller is responsible for WAL logging this operation and release the TPD + * buffers. We have thought of WAL logging this as a separate operation, but + * that won't work as the undorecord pointer can be bogus during WAL replay; + * that is because we regenerate the undo during WAL replay and it is quite + * possible that the system crashes after flushing this WAL record but before + * flushing WAL of actual heap operation. Similarly, doing it after heap + * operation is not feasible as in that case the tuple's transaction + * information can get lost. 
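+ *
+ * A minimal sketch of the expected calling sequence (the real callers live
+ * in the zheap layer and may differ in detail):
+ *
+ * START_CRIT_SECTION();
+ * TPDPageSetUndo(buf, trans_slot_id, ...);
+ * ... register the TPD buffer via RegisterTPDBuffer(), insert the WAL
+ *     record for the heap operation, stamp the TPD page via
+ *     TPDPageSetLSN() ...
+ * END_CRIT_SECTION();
+ * UnlockReleaseTPDBuffers();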
+ */ +void +TPDPageSetUndo(Buffer heapbuf, int trans_slot_id, bool set_tpd_map_slot, + uint32 epoch, TransactionId xid, UndoRecPtr urec_ptr, + OffsetNumber *usedoff, int ucnt) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + Page heappage = BufferGetPage(heapbuf); + ZHeapPageOpaque zopaque; + TransInfo trans_slot_info, last_trans_slot_info; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + Buffer tpd_buf; + Page tpdpage; + BlockNumber tpdblk; + TPDEntryHeaderData tpd_e_hdr; + TPDPageOpaque tpdopaque; + uint64 tpd_latest_xid_epoch, current_xid_epoch; + Size size_tpd_e_map; + uint32 tpd_e_num_map_entries; + int trans_slot_loc; + int buf_idx; + int i; + char *tpd_entry_data; + OffsetNumber tpdItemOff; + ItemId itemId; + uint16 tpd_e_offset; + bool already_exists; + + phdr = (PageHeader) heappage; + + /* Heap page must have TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + + /* + * Fetch the required TPD entry. Ensure that we are operating on the + * right buffer. + */ + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + Assert(BufferIsValid(tpd_buf)); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + Assert(BufferGetBlockNumber(tpd_buf) == tpdblk); + + tpdpage = BufferGetPage(tpd_buf); + itemId = PageGetItemId(tpdpage, tpdItemOff); + + /* + * TPD entry can't go away as we acquire the lock while reserving the slot + * from TPD entry and keep it till we set the required transaction + * information in the slot. + */ + Assert(ItemIdIsUsed(itemId)); + + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* TPD entry can't be pruned. */ + Assert(tpd_e_hdr.blkno == BufferGetBlockNumber(heapbuf)); + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(tpd_e_hdr)); + + tpd_e_num_map_entries = tpd_e_hdr.tpe_num_map_entries; + tpd_entry_data = tpdpage + tpd_e_offset + SizeofTPDEntryHeader; + + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + size_tpd_e_map = tpd_e_num_map_entries * sizeof(uint8); + else + size_tpd_e_map = tpd_e_num_map_entries * sizeof(uint32); + + /* + * Update TPD entry map for all the modified offsets if we + * have asked to do so. + */ + if (set_tpd_map_slot) + { + /* */ + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + { + uint8 offset_tpd_e_loc; + + offset_tpd_e_loc = (uint8) trans_slot_id; + + for (i = 0; i < ucnt; i++) + { + /* + * The item for which we want to update the transaction slot information + * must be present in this TPD entry. + */ + Assert (usedoff[i] <= tpd_e_num_map_entries); + /* + * One byte access shouldn't cause unaligned access, but using memcpy + * for the sake of consistency. + */ + memcpy(tpd_entry_data + (usedoff[i] - 1), + (char *) &offset_tpd_e_loc, + sizeof(uint8)); + } + } + else + { + uint32 offset_tpd_e_loc; + + Assert(tpd_e_hdr.tpe_flags & TPE_FOUR_BYTE); + + offset_tpd_e_loc = trans_slot_id; + for (i = 0; i < ucnt; i++) + { + /* + * The item for which we want to update the transaction slot + * information must be present in this TPD entry. 
+ */ + Assert (usedoff[i] <= tpd_e_num_map_entries); + memcpy(tpd_entry_data + (sizeof(uint32) * (usedoff[i] - 1)), + (char *) &offset_tpd_e_loc, + sizeof(uint32)); + } + } + } + + /* Update the required transaction slot information. */ + trans_slot_loc = (trans_slot_id - ZHEAP_PAGE_TRANS_SLOTS - 1) * + sizeof(TransInfo); + trans_slot_info.xid_epoch = epoch; + trans_slot_info.xid = xid; + trans_slot_info.urec_ptr = urec_ptr; + memcpy(tpd_entry_data + size_tpd_e_map + trans_slot_loc, + (char *) &trans_slot_info, + sizeof(TransInfo)); + /* Update latest transaction information on the page. */ + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + tpd_latest_xid_epoch = (uint64) tpdopaque->tpd_latest_xid_epoch; + tpd_latest_xid_epoch = MakeEpochXid(tpd_latest_xid_epoch, + tpdopaque->tpd_latest_xid); + current_xid_epoch = (uint64) epoch; + current_xid_epoch = MakeEpochXid(current_xid_epoch, xid); + if (tpd_latest_xid_epoch < current_xid_epoch) + { + tpdopaque->tpd_latest_xid_epoch = epoch; + tpdopaque->tpd_latest_xid = xid; + } + + MarkBufferDirty(tpd_buf); +} + +/* + * TPDPageLock - Routine to lock the TPD page corresponding to heap page + * + * Caller should not already hold the lock. + * + * Returns false, if couldn't acquire lock because the page is pruned, + * otherwise, true. + */ +bool +TPDPageLock(Relation relation, Buffer heapbuf) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + Page heappage = BufferGetPage(heapbuf); + Page tpdpage; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + Buffer tpd_buf; + BlockNumber tpdblk, + lastblock; + int buf_idx; + bool already_exists; + + phdr = (PageHeader) heappage; + + /* Heap page must have TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + /* The last in page has the address of the required TPD entry. */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + + lastblock = RelationGetNumberOfBlocks(relation); + + if (lastblock <= tpdblk) + { + /* + * The required TPD block has been pruned and then truncated away + * which means all transaction slots on that page are older than + * oldestXidHavingUndo. So, we can't lock the page. + */ + goto failed; + } + + /* + * Fetch the required TPD entry. We need to lock the buffer in exclusive + * mode as we later want to set the values in one of the transaction slot. + */ + buf_idx = GetTPDBuffer(relation, tpdblk, InvalidBuffer, + TPD_BUF_FIND_OR_ENTER, &already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + tpdpage = BufferGetPage(tpd_buf); + + Assert(!already_exists); + LockBuffer(tpd_buf, BUFFER_LOCK_EXCLUSIVE); + + /* Check whether TPD entry can exist on page? */ + if (PageIsEmpty(tpdpage)) + { + ReleaseLastTPDBuffer(tpd_buf); + goto failed; + } + else if (PageGetSpecialSize(tpdpage) != MAXALIGN(sizeof(TPDPageOpaqueData))) + { + ReleaseLastTPDBuffer(tpd_buf); + goto failed; + } + + return true; + +failed: + /* + * The required TPD block has been pruned which means all transaction slots + * on that page are older than oldestXidHavingUndo. So, we can assume the + * TPD transaction slots are frozen aka transactions are all-visible and + * can clear the TPD slots from heap tuples. + */ + LogAndClearTPDLocation(relation, heapbuf, NULL); + return false; +} + +/* + * XLogReadTPDBuffer - Read the TPD buffer. 
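+ *
+ * A rough redo-side usage sketch (the block id is hypothetical):
+ *
+ * if (XLogReadTPDBuffer(record, 2) == BLK_NEEDS_REDO)
+ * {
+ *     ... apply the logged change to the TPD page ...
+ * }
+ * ...
+ * UnlockReleaseTPDBuffers();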
+ */ +XLogRedoAction +XLogReadTPDBuffer(XLogReaderState *record, uint8 block_id) +{ + Buffer tpd_buf; + XLogRedoAction action; + bool already_exists; + + action = XLogReadBufferForRedo(record, block_id, &tpd_buf); + + /* + * Remember the buffer, so that it can be release later via + * UnlockReleaseTPDBuffers. + */ + GetTPDBuffer(NULL, BufferGetBlockNumber(tpd_buf), tpd_buf, + TPD_BUF_FIND_OR_KNOWN_ENTER, &already_exists); + + return action; +} + +/* + * RegisterTPDBuffer - Register the TPD buffer + * + * returns the block_id that can be used to register additional buffers in the + * caller. + */ +uint8 +RegisterTPDBuffer(Page heappage, uint8 block_id) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + Buffer tpd_buf; + BlockNumber tpdblk; + int buf_idx; + bool already_exists; + + phdr = (PageHeader) heappage; + + /* Heap page must have TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + /* Get the tpd block number from last transaction slot in heap page. */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + tpdblk = last_trans_slot_info.xid_epoch; + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + + /* Return same block id if this buffer is already registered. */ + if (TPDBufferAlreadyRegistered(tpd_buf)) + return block_id; + + /* We must be in critical section to perform this action. */ + Assert(CritSectionCount > 0); + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + /* The TPD buffer must be valid and locked by me. */ + Assert(BufferIsValid(tpd_buf)); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + + XLogRegisterBuffer(block_id++, tpd_buf, REGBUF_STANDARD); + + return block_id; +} + +/* + * TPDPageSetLSN - Set LSN on TPD pages. + */ +void +TPDPageSetLSN(Page heappage, XLogRecPtr recptr) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + Buffer tpd_buf; + BlockNumber tpdblk; + int buf_idx; + bool already_exists; + + phdr = (PageHeader) heappage; + + /* Heap page must have TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + /* Get the tpd block number from last transaction slot in heap page. */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + tpdblk = last_trans_slot_info.xid_epoch; + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + + /* Reset the registered buffer index. */ + registered_tpd_buf_idx = 0; + + /* + * Before recording the LSN, ensure that the TPD buffer must be valid and + * locked by me. + */ + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + Assert(BufferIsValid(tpd_buf)); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + Assert(BufferGetBlockNumber(tpd_buf) == tpdblk); + + PageSetLSN(BufferGetPage(tpd_buf), recptr); +} + +/* + * ResetTPDBuffers - Reset TPD buffer index. Required at the time of + * transaction abort or release TPD buffers. 
+ */ +void +ResetTPDBuffers(void) +{ + int i; + + for (i = 0; i < tpd_buf_idx; i++) + { + tpd_buffers[i].buf = InvalidBuffer; + tpd_buffers[i].blk = InvalidBlockNumber; + } + + tpd_buf_idx = 0; +} +/* + * UnlockReleaseTPDBuffers - Release all the TPD buffers locked by me. + */ +void +UnlockReleaseTPDBuffers(void) +{ + Buffer tpd_buf; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + int i; + + for (i = 0; i < tpd_buf_idx; i++) + { + tpd_buf = tpd_buffers[i].buf; + Assert(BufferIsValid(tpd_buf)); + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + UnlockReleaseBuffer(tpd_buf); + } + + ResetTPDBuffers(); +} + +/* + * PageGetTPDFreeSpace + * Returns the size of the free (allocatable) space on a page. + * + * As of now, this is just a wrapper over PageGetFreeSpace, however in future, + * the space management in TPD pages could be different. + */ +Size +PageGetTPDFreeSpace(Page page) +{ + int space; + + /* + * Use signed arithmetic here so that we behave sensibly if pd_lower > + * pd_upper. + */ + space = PageGetFreeSpace(page); + + return (Size) space; +} diff --git a/src/backend/access/zheap/tpdxlog.c b/src/backend/access/zheap/tpdxlog.c new file mode 100644 index 0000000000..7b69ac47d3 --- /dev/null +++ b/src/backend/access/zheap/tpdxlog.c @@ -0,0 +1,522 @@ +/*------------------------------------------------------------------------- + * + * tpdxlog.c + * WAL replay logic for tpd. + * + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/zheap/tpdxlog.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/tpd_xlog.h" +#include "access/xlogutils.h" +#include "access/zheapam_xlog.h" + +/* + * replay of tpd entry allocation + */ +static void +tpd_xlog_allocate_entry(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_tpd_allocate_entry *xlrec; + Buffer tpdbuffer; + Buffer heap_page_buffer; + Buffer metabuf = InvalidBuffer; + Buffer last_used_buf = InvalidBuffer; + Buffer old_tpd_buf = InvalidBuffer; + Page tpdpage; + TPDPageOpaque tpdopaque; + XLogRedoAction action; + + xlrec = (xl_tpd_allocate_entry *) XLogRecGetData(record); + + /* + * If we inserted the first and only tpd entry on the page, re-initialize + * the page from scratch. + */ + if (XLogRecGetInfo(record) & XLOG_TPD_INIT_PAGE) + { + tpdbuffer = XLogInitBufferForRedo(record, 0); + tpdpage = BufferGetPage(tpdbuffer); + TPDInitPage(tpdpage, BufferGetPageSize(tpdbuffer)); + action = BLK_NEEDS_REDO; + } + else + action = XLogReadBufferForRedo(record, 0, &tpdbuffer); + if (action == BLK_NEEDS_REDO) + { + char *tpd_entry; + Size size_tpd_entry; + OffsetNumber offnum; + + tpd_entry = XLogRecGetBlockData(record, 0, &size_tpd_entry); + tpdpage = BufferGetPage(tpdbuffer); + offnum = TPDPageAddEntry(tpdpage, tpd_entry, size_tpd_entry, + xlrec->offnum); + if (offnum == InvalidOffsetNumber) + elog(PANIC, "failed to add TPD entry"); + MarkBufferDirty(tpdbuffer); + PageSetLSN(tpdpage, lsn); + + /* The TPD entry must be added at the provided offset. 
*/ + Assert(offnum == xlrec->offnum); + + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + tpdopaque->tpd_prevblkno = xlrec->prevblk; + + MarkBufferDirty(tpdbuffer); + PageSetLSN(tpdpage, lsn); + } + else if (action == BLK_RESTORED) + { + /* + * Note that we still update the page even if it was restored from a full + * page image, because the special space is not included in the image. + */ + tpdpage = BufferGetPage(tpdbuffer); + + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + tpdopaque->tpd_prevblkno = xlrec->prevblk; + + MarkBufferDirty(tpdbuffer); + PageSetLSN(tpdpage, lsn); + } + + if (XLogReadBufferForRedo(record, 1, &heap_page_buffer) == BLK_NEEDS_REDO) + { + /* Set the TPD location in last transaction slot of heap page. */ + SetTPDLocation(heap_page_buffer, tpdbuffer, xlrec->offnum); + MarkBufferDirty(heap_page_buffer); + + PageSetLSN(BufferGetPage(heap_page_buffer), lsn); + } + + /* replay the record for meta page */ + if (XLogRecHasBlockRef(record, 2)) + { + xl_zheap_metadata *xlrecmeta; + char *ptr; + Size len; + + metabuf = XLogInitBufferForRedo(record, 2); + ptr = XLogRecGetBlockData(record, 2, &len); + + Assert(len == SizeOfMetaData); + Assert(BufferGetBlockNumber(metabuf) == ZHEAP_METAPAGE); + xlrecmeta = (xl_zheap_metadata *) ptr; + + zheap_init_meta_page(metabuf, xlrecmeta->first_used_tpd_page, + xlrecmeta->last_used_tpd_page); + MarkBufferDirty(metabuf); + PageSetLSN(BufferGetPage(metabuf), lsn); + + /* + * We can have reference of block 3, iff we have reference for block + * 2. + */ + if (XLogRecHasBlockRef(record, 3)) + { + action = XLogReadBufferForRedo(record, 3, &last_used_buf); + /* + * Note that we still update the page even if it was restored from a full + * page image, because the special space is not included in the image. + */ + if (action == BLK_NEEDS_REDO || action == BLK_RESTORED) + { + Page last_used_page; + TPDPageOpaque last_tpdopaque; + + last_used_page = BufferGetPage(last_used_buf); + last_tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(last_used_page); + last_tpdopaque->tpd_nextblkno = xlrec->nextblk; + + /* old and last tpd buffer are same. */ + if (xlrec->flags & XLOG_OLD_TPD_BUF_EQ_LAST_TPD_BUF) + { + TPDEntryHeader old_tpd_entry; + Page otpdpage; + char *data; + OffsetNumber *off_num; + Size datalen PG_USED_FOR_ASSERTS_ONLY; + ItemId old_item_id; + + if (action == BLK_NEEDS_REDO) + { + data = XLogRecGetBlockData(record, 3, &datalen); + + off_num = (OffsetNumber *)data; + Assert(datalen == sizeof(OffsetNumber)); + + otpdpage = BufferGetPage(last_used_buf); + old_item_id = PageGetItemId(otpdpage, *off_num); + old_tpd_entry = (TPDEntryHeader)PageGetItem(otpdpage, old_item_id); + old_tpd_entry->tpe_flags |= TPE_DELETED; + } + + /* We can't have a separate reference for old tpd buffer. */ + Assert(!XLogRecHasBlockRef(record, 4)); + } + + MarkBufferDirty(last_used_buf); + PageSetLSN(last_used_page, lsn); + } + } + + /* + * We can have reference of block 4, iff we have reference for block + * 2. 
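+ *
+ * For reference, the block references used by this record are: 0 = the
+ * TPD page receiving the new entry, 1 = the heap page whose last
+ * transaction slot points at it, 2 = the zheap meta page, 3 = the
+ * previously last-used TPD page, and 4 = the old TPD page whose entry
+ * gets marked as deleted.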
+ */ + if (XLogRecHasBlockRef(record, 4)) + { + TPDEntryHeader old_tpd_entry; + Page otpdpage; + char *data; + OffsetNumber *off_num; + Size datalen PG_USED_FOR_ASSERTS_ONLY; + ItemId old_item_id; + + action = XLogReadBufferForRedo(record, 4, &old_tpd_buf); + + if (action == BLK_NEEDS_REDO) + { + data = XLogRecGetBlockData(record, 4, &datalen); + + off_num = (OffsetNumber *) data; + Assert(datalen == sizeof(OffsetNumber)); + + otpdpage = BufferGetPage(old_tpd_buf); + old_item_id = PageGetItemId(otpdpage, *off_num); + old_tpd_entry = (TPDEntryHeader) PageGetItem(otpdpage, old_item_id); + old_tpd_entry->tpe_flags |= TPE_DELETED; + + MarkBufferDirty(old_tpd_buf); + PageSetLSN(BufferGetPage(old_tpd_buf), lsn); + } + } + } + + if (BufferIsValid(tpdbuffer)) + UnlockReleaseBuffer(tpdbuffer); + if (BufferIsValid(heap_page_buffer)) + UnlockReleaseBuffer(heap_page_buffer); + if (BufferIsValid(metabuf)) + UnlockReleaseBuffer(metabuf); + if (BufferIsValid(last_used_buf)) + UnlockReleaseBuffer(last_used_buf); + if (BufferIsValid(old_tpd_buf)) + UnlockReleaseBuffer(old_tpd_buf); +} + +/* + * replay inplace update of TPD entry + */ +static void +tpd_xlog_inplace_update_entry(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer tpdbuf; + XLogRedoAction action; + + /* + * If we have a full-page image, restore it (using a cleanup lock) and + * we're done. + */ + action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, + &tpdbuf); + if (action == BLK_NEEDS_REDO) + { + Page tpdpage = (Page) BufferGetPage(tpdbuf); + ItemId item_id; + OffsetNumber *off_num; + char *data; + char *new_tpd_entry; + Size datalen, + size_new_tpd_entry; + uint16 tpd_e_offset; + + data = XLogRecGetBlockData(record, 0, &datalen); + off_num = (OffsetNumber *) data; + new_tpd_entry = (char *) ((char *) data + sizeof(OffsetNumber)); + size_new_tpd_entry = datalen - sizeof(OffsetNumber); + + item_id = PageGetItemId(tpdpage, *off_num); + tpd_e_offset = ItemIdGetOffset(item_id); + memcpy((char *) (tpdpage + tpd_e_offset), + new_tpd_entry, + size_new_tpd_entry); + ItemIdChangeLen(item_id, size_new_tpd_entry); + + MarkBufferDirty(tpdbuf); + PageSetLSN(tpdpage, lsn); + } + if (BufferIsValid(tpdbuf)) + UnlockReleaseBuffer(tpdbuf); +} + +/* + * replay of pruning tpd page + */ +static void +tpd_xlog_clean(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_tpd_clean *xlrec = (xl_tpd_clean *) XLogRecGetData(record); + Buffer tpdbuf; + XLogRedoAction action; + + /* + * If we have a full-page image, restore it (using a cleanup lock) and + * we're done. 
+ */ + action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, + &tpdbuf); + if (action == BLK_NEEDS_REDO) + { + Page tpdpage = (Page) BufferGetPage(tpdbuf); + Page tmppage; + OffsetNumber *end; + OffsetNumber *nowunused; + OffsetNumber *target_offnum; + OffsetNumber tmp_target_off; + Size *space_required; + Size tmp_spc_rqd; + Size datalen; + int nunused; + + if (xlrec->flags & XLZ_CLEAN_CONTAINS_OFFSET) + { + target_offnum = (OffsetNumber *) ((char *) xlrec + SizeOfTPDClean); + space_required = (Size *) ((char *) target_offnum + sizeof(OffsetNumber)); + } + else + { + target_offnum = &tmp_target_off; + *target_offnum = InvalidOffsetNumber; + space_required = &tmp_spc_rqd; + *space_required = 0; + } + + nowunused = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen); + end = (OffsetNumber *) ((char *) nowunused + datalen); + nunused = (end - nowunused); + + if (nunused >= 0) + { + /* Update all item pointers per the record, and repair fragmentation */ + TPDPagePruneExecute(tpdbuf, nowunused, nunused); + } + + tmppage = PageGetTempPageCopy(tpdpage); + TPDPageRepairFragmentation(tpdpage, tmppage, *target_offnum, + *space_required); + + /* + * Note: we don't worry about updating the page's prunability hints. + * At worst this will cause an extra prune cycle to occur soon. + */ + + MarkBufferDirty(tpdbuf); + PageSetLSN(tpdpage, lsn); + + pfree(tmppage); + } + if (BufferIsValid(tpdbuf)) + UnlockReleaseBuffer(tpdbuf); +} + +/* + * replay for clearing tpd location from heap page. + */ +static void +tpd_xlog_clear_location(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer buffer; + + if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO) + { + Page page = (Page) BufferGetPage(buffer); + + ClearTPDLocation(buffer); + MarkBufferDirty(buffer); + PageSetLSN(page, lsn); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); +} + +/* + * replay for freeing tpd page. + */ +static void +tpd_xlog_free_page(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + RelFileNode rnode; + xl_tpd_free_page *xlrec = (xl_tpd_free_page *) XLogRecGetData(record); + Buffer buffer = InvalidBuffer, + prevbuf = InvalidBuffer, + nextbuf = InvalidBuffer, + metabuf = InvalidBuffer; + BlockNumber blkno; + Page page; + XLogRedoAction action; + Size freespace; + + if (XLogRecHasBlockRef(record, 0)) + { + action = XLogReadBufferForRedo(record, 0, &prevbuf); + + /* + * Note that we still update the page even if it was restored from a full + * page image, because the special space is not included in the image. + */ + if (action == BLK_NEEDS_REDO || action == BLK_RESTORED) + { + TPDPageOpaque prevtpdopaque; + Page prevpage = (Page) BufferGetPage(prevbuf); + + prevtpdopaque = (TPDPageOpaque) PageGetSpecialPointer(prevpage); + prevtpdopaque->tpd_nextblkno = xlrec->nextblkno; + + MarkBufferDirty(prevbuf); + PageSetLSN(prevpage, lsn); + } + } + + XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno); + action = XLogReadBufferForRedo(record, 1, &buffer); + page = (Page) BufferGetPage(buffer); + + /* + * Note that we still update the page even if it was restored from a full + * page image, because the special space is not included in the image. 
+ */ + if (action == BLK_NEEDS_REDO || action == BLK_RESTORED) + { + TPDPageOpaque tpdopaque; + + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(page); + + tpdopaque->tpd_prevblkno = InvalidBlockNumber; + tpdopaque->tpd_nextblkno = InvalidBlockNumber; + tpdopaque->tpd_latest_xid_epoch = 0; + tpdopaque->tpd_latest_xid = InvalidTransactionId; + + MarkBufferDirty(buffer); + PageSetLSN(page, lsn); + } + + Assert(PageIsEmpty(page)); + Assert(blkno == BufferGetBlockNumber(buffer)); + freespace = PageGetTPDFreeSpace(page); + + if (XLogRecHasBlockRef(record, 2)) + { + action = XLogReadBufferForRedo(record, 2, &nextbuf); + + if (action == BLK_NEEDS_REDO || action == BLK_RESTORED) + { + TPDPageOpaque nexttpdopaque; + Page nextpage = (Page) BufferGetPage(nextbuf); + + nexttpdopaque = (TPDPageOpaque) PageGetSpecialPointer(nextpage); + nexttpdopaque->tpd_prevblkno = xlrec->prevblkno; + + MarkBufferDirty(nextbuf); + PageSetLSN(nextpage, lsn); + } + } + + if (XLogRecHasBlockRef(record, 3)) + { + xl_zheap_metadata *xlrecmeta; + char *ptr; + Size len; + + metabuf = XLogInitBufferForRedo(record, 3); + ptr = XLogRecGetBlockData(record, 3, &len); + + Assert(len == SizeOfMetaData); + Assert(BufferGetBlockNumber(metabuf) == ZHEAP_METAPAGE); + xlrecmeta = (xl_zheap_metadata *) ptr; + + zheap_init_meta_page(metabuf, xlrecmeta->first_used_tpd_page, + xlrecmeta->last_used_tpd_page); + MarkBufferDirty(metabuf); + PageSetLSN(BufferGetPage(metabuf), lsn); + } + + if (BufferIsValid(prevbuf)) + UnlockReleaseBuffer(prevbuf); + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + if (BufferIsValid(nextbuf)) + UnlockReleaseBuffer(nextbuf); + if (BufferIsValid(metabuf)) + UnlockReleaseBuffer(metabuf); + + /* Record the empty page in FSM. */ + XLogRecordPageWithFreeSpace(rnode, blkno, freespace); +} + +/* + * replay of pruning all the entries in tpd page. 
+ */ +static void +tpd_xlog_clean_all_entries(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer buffer; + + if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO) + { + Page page = (Page) BufferGetPage(buffer); + + ((PageHeader) page)->pd_lower = SizeOfPageHeaderData; + ((PageHeader) page)->pd_upper = ((PageHeader) page)->pd_special; + + MarkBufferDirty(buffer); + PageSetLSN(page, lsn); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); +} + +void +tpd_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info & XLOG_TPD_OPMASK) + { + case XLOG_ALLOCATE_TPD_ENTRY: + tpd_xlog_allocate_entry(record); + break; + case XLOG_INPLACE_UPDATE_TPD_ENTRY: + tpd_xlog_inplace_update_entry(record); + break; + case XLOG_TPD_CLEAN: + tpd_xlog_clean(record); + break; + case XLOG_TPD_CLEAR_LOCATION: + tpd_xlog_clear_location(record); + break; + case XLOG_TPD_FREE_PAGE: + tpd_xlog_free_page(record); + break; + case XLOG_TPD_CLEAN_ALL_ENTRIES: + tpd_xlog_clean_all_entries(record); + break; + default: + elog(PANIC, "tpd_redo: unknown op code %u", info); + } +} diff --git a/src/backend/access/zheap/zheapam.c b/src/backend/access/zheap/zheapam.c new file mode 100644 index 0000000000..c916bde3fb --- /dev/null +++ b/src/backend/access/zheap/zheapam.c @@ -0,0 +1,11877 @@ +/*------------------------------------------------------------------------- + * + * zheapam.c + * zheap access method code + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/heap/zheapam.c + * + * + * INTERFACE ROUTINES + * zheap_insert - insert zheap tuple into a relation + * + * NOTES + * This file contains the zheap_ routines which implement + * the POSTGRES zheap access method used for relations backed + * by undo storage. + * + * In zheap, we never generate subtransaction id and rather always use top + * transaction id. The sub-transaction id is mainly required to detect the + * visibility of tuple when the sub-transaction state is different from + * main transaction state, say due to Rollback To SavePoint. In zheap, we + * always perform undo actions to make sure that the tuple state reaches to + * the state where it is at the start of subtransaction in such a case. + * This will also help in avoiding the transaction slots to grow inside a + * page and will have lesser clog entries. Another advantage is that it + * will help us retaining the undo records for one transaction together + * in undo log instead of those being interleaved which will avoid having + * more undo records that have UREC_INFO_TRANSACTION. 
+ * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/bufmask.h" +#include "access/htup_details.h" +#include "access/parallel.h" +#include "access/relscan.h" +#include "access/sysattr.h" +#include "access/xact.h" +#include "access/relscan.h" +#include "access/tableam.h" +#include "access/tpd.h" +#include "access/tuptoaster.h" +#include "access/undoinsert.h" +#include "access/undolog.h" +#include "access/undolog_xlog.h" +#include "access/undorecord.h" +#include "access/visibilitymap.h" +#include "access/zheap.h" +#include "access/zhio.h" +#include "access/zhtup.h" +#include "access/zheapam_xlog.h" +#include "access/zheap.h" +#include "access/zheapscan.h" +#include "access/zmultilocker.h" +#include "catalog/catalog.h" +#include "executor/tuptable.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "postmaster/undoloop.h" +#include "storage/bufmgr.h" +#include "storage/lmgr.h" +#include "storage/predicate.h" +#include "storage/procarray.h" +#include "utils/datum.h" +#include "utils/expandeddatum.h" +#include "utils/inval.h" +#include "utils/memdebug.h" +#include "utils/rel.h" +#include "utils/tqual.h" + + /* + * Possible lock modes for a tuple. + */ +typedef enum LockOper +{ + /* SELECT FOR 'KEY SHARE/SHARE/NO KEY UPDATE/UPDATE' */ + LockOnly, + /* Via EvalPlanQual where after locking we will update it */ + LockForUpdate, + /* Update/Delete */ + ForUpdate +} LockOper; + +extern bool synchronize_seqscans; + +static ZHeapTuple zheap_prepare_insert(Relation relation, ZHeapTuple tup, + int options); +static Bitmapset * +ZHeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols, + ZHeapTuple oldtup, ZHeapTuple newtup); + +static void RelationPutZHeapTuple(Relation relation, Buffer buffer, + ZHeapTuple tuple); +static void log_zheap_update(Relation reln, UnpackedUndoRecord undorecord, + UnpackedUndoRecord newundorecord, UndoRecPtr urecptr, + UndoRecPtr newurecptr, Buffer oldbuf, Buffer newbuf, + ZHeapTuple oldtup, ZHeapTuple newtup, + int old_tup_trans_slot_id, int trans_slot_id, + int new_trans_slot_id, bool inplace_update, + bool all_visible_cleared, bool new_all_visible_cleared, + xl_undolog_meta *undometa); +static HTSU_Result +zheap_lock_updated_tuple(Relation rel, ZHeapTuple tuple, ItemPointer ctid, + TransactionId xid, LockTupleMode mode, LockOper lockopr, + CommandId cid, bool *rollback_and_relocked); +static void zheap_lock_tuple_guts(Relation rel, Buffer buf, ZHeapTuple zhtup, + TransactionId tup_xid, TransactionId xid, + LockTupleMode mode, LockOper lockopr, uint32 epoch, + int tup_trans_slot_id, int trans_slot_id, + TransactionId single_locker_xid, int single_locker_trans_slot, + UndoRecPtr prev_urecptr, CommandId cid, + bool any_multi_locker_member_alive); +static void compute_new_xid_infomask(ZHeapTuple zhtup, Buffer buf, + TransactionId tup_xid, int tup_trans_slot, + uint16 old_infomask, TransactionId add_to_xid, + int trans_slot, TransactionId single_locker_xid, + LockTupleMode mode, LockOper lockoper, + uint16 *result_infomask, int *result_trans_slot); +static ZHeapFreeOffsetRanges * +ZHeapGetUsableOffsetRanges(Buffer buffer, ZHeapTuple *tuples, int ntuples, + Size saveFreeSpace); +static inline void CheckAndLockTPDPage(Relation relation, int new_trans_slot_id, + int old_trans_slot_id, Buffer newbuf, + Buffer oldbuf); + +/* + * zheap_compute_data_size + * Determine size of the data area of a tuple to be constructed. 
+ * + * We can't start with zero offset for first attribute as that has a + * hidden assumption that tuple header is MAXALIGNED which is not true + * for zheap. For example, if the first attribute requires alignment + * (say it is four-byte varlena), then the code would assume the offset + * is aligned incase we start with zero offset for first attribute. So, + * always start with the actual byte from where the first attribute starts. + */ +Size +zheap_compute_data_size(TupleDesc tupleDesc, Datum *values, bool *isnull, + int t_hoff) +{ + Size data_length = t_hoff; + int i; + int numberOfAttributes = tupleDesc->natts; + + for (i = 0; i < numberOfAttributes; i++) + { + Datum val; + Form_pg_attribute atti; + + if (isnull[i]) + continue; + + val = values[i]; + atti = TupleDescAttr(tupleDesc, i); + + if (atti->attbyval) + { + /* byval attributes are stored unaligned in zheap. */ + data_length += atti->attlen; + } + else if (ATT_IS_PACKABLE(atti) && + VARATT_CAN_MAKE_SHORT(DatumGetPointer(val))) + { + /* + * we're anticipating converting to a short varlena header, so + * adjust length and don't count any alignment + */ + data_length += VARATT_CONVERTED_SHORT_SIZE(DatumGetPointer(val)); + } + else if (atti->attlen == -1 && + VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val))) + { + /* + * we want to flatten the expanded value so that the constructed + * tuple doesn't depend on it + */ + data_length = att_align_nominal(data_length, atti->attalign); + data_length += EOH_get_flat_size(DatumGetEOHP(val)); + } + else + { + data_length = att_align_datum(data_length, atti->attalign, + atti->attlen, val); + data_length = att_addlength_datum(data_length, atti->attlen, + val); + } + } + + return data_length - t_hoff; +} + +/* + * zheap_fill_tuple + * Load data portion of a tuple from values/isnull arrays + * + * We also fill the null bitmap (if any) and set the infomask bits + * that reflect the tuple's data contents. + * + * This function is same as heap_fill_tuple except for datatype of infomask + * parameter. + * + * NOTE: it is now REQUIRED that the caller have pre-zeroed the data area. + */ +void +zheap_fill_tuple(TupleDesc tupleDesc, + Datum *values, bool *isnull, + char *data, Size data_size, + uint16 *infomask, bits8 *bit) +{ + bits8 *bitP; + int bitmask; + int i; + int numberOfAttributes = tupleDesc->natts; + +#ifdef USE_ASSERT_CHECKING + char *start = data; +#endif + + if (bit != NULL) + { + bitP = &bit[-1]; + bitmask = HIGHBIT; + } + else + { + /* just to keep compiler quiet */ + bitP = NULL; + bitmask = 0; + } + + *infomask &= ~(ZHEAP_HASNULL | ZHEAP_HASVARWIDTH | ZHEAP_HASEXTERNAL); + + for (i = 0; i < numberOfAttributes; i++) + { + Form_pg_attribute att = TupleDescAttr(tupleDesc, i); + Size data_length; + + if (bit != NULL) + { + if (bitmask != HIGHBIT) + bitmask <<= 1; + else + { + bitP += 1; + *bitP = 0x0; + bitmask = 1; + } + + if (isnull[i]) + { + *infomask |= ZHEAP_HASNULL; + continue; + } + + *bitP |= bitmask; + } + + /* + * XXX we use the att_align macros on the pointer value itself, not on + * an offset. This is a bit of a hack. 
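+ *
+ * As a worked example of the resulting layout (attribute types chosen only
+ * for illustration): a tuple of (int4, int8) is filled as 4 + 8 = 12
+ * contiguous payload bytes with no padding in between, whereas the heap
+ * format would align the int8 column to an 8-byte boundary.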
+ */ + + if (att->attbyval) + { + /* pass-by-value */ + //data = (char *) att_align_nominal(data, att->attalign); + //store_att_byval(data, values[i], att->attlen); + memcpy(data, (char *) &values[i], att->attlen); + data_length = att->attlen; + } + else if (att->attlen == -1) + { + /* varlena */ + Pointer val = DatumGetPointer(values[i]); + + *infomask |= ZHEAP_HASVARWIDTH; + if (VARATT_IS_EXTERNAL(val)) + { + if (VARATT_IS_EXTERNAL_EXPANDED(val)) + { + /* + * we want to flatten the expanded value so that the + * constructed tuple doesn't depend on it + */ + ExpandedObjectHeader *eoh = DatumGetEOHP(values[i]); + + data = (char *) att_align_nominal(data, + att->attalign); + data_length = EOH_get_flat_size(eoh); + EOH_flatten_into(eoh, data, data_length); + } + else + { + *infomask |= ZHEAP_HASEXTERNAL; + /* no alignment, since it's short by definition */ + data_length = VARSIZE_EXTERNAL(val); + memcpy(data, val, data_length); + } + } + else if (VARATT_IS_SHORT(val)) + { + /* no alignment for short varlenas */ + data_length = VARSIZE_SHORT(val); + memcpy(data, val, data_length); + } + else if (VARLENA_ATT_IS_PACKABLE(att) && + VARATT_CAN_MAKE_SHORT(val)) + { + /* convert to short varlena -- no alignment */ + data_length = VARATT_CONVERTED_SHORT_SIZE(val); + SET_VARSIZE_SHORT(data, data_length); + memcpy(data + 1, VARDATA(val), data_length - 1); + } + else + { + /* full 4-byte header varlena */ + data = (char *) att_align_nominal(data, + att->attalign); + data_length = VARSIZE(val); + memcpy(data, val, data_length); + } + } + else if (att->attlen == -2) + { + /* cstring ... never needs alignment */ + *infomask |= ZHEAP_HASVARWIDTH; + Assert(att->attalign == 'c'); + data_length = strlen(DatumGetCString(values[i])) + 1; + memcpy(data, DatumGetPointer(values[i]), data_length); + } + else + { + /* fixed-length pass-by-reference */ + data = (char *) att_align_nominal(data, att->attalign); + Assert(att->attlen > 0); + data_length = att->attlen; + memcpy(data, DatumGetPointer(values[i]), data_length); + } + + data += data_length; + } + + Assert((data - start) == data_size); +} + +/* + * zheap_form_tuple + * construct a tuple from the given values[] and isnull[] arrays. + * + * This is similar to heap_form_tuple except for tuple header. Currently, + * we don't do anything special for Datum tuples, but eventually we need + * to do something about it. + */ +ZHeapTuple +zheap_form_tuple(TupleDesc tupleDescriptor, + Datum *values, + bool *isnull) +{ + ZHeapTuple tuple; /* return tuple */ + ZHeapTupleHeader td; /* tuple data */ + Size len, + data_len; + int hoff; + bool hasnull = false; + int numberOfAttributes = tupleDescriptor->natts; + int i; + + if (numberOfAttributes > MaxTupleAttributeNumber) + ereport(ERROR, + (errcode(ERRCODE_TOO_MANY_COLUMNS), + errmsg("number of columns (%d) exceeds limit (%d)", + numberOfAttributes, MaxTupleAttributeNumber))); + + /* + * Check for nulls + */ + for (i = 0; i < numberOfAttributes; i++) + { + if (isnull[i]) + { + hasnull = true; + break; + } + } + + /* + * Determine total space needed + */ + len = offsetof(ZHeapTupleHeaderData, t_bits); + + if (hasnull) + len += BITMAPLEN(numberOfAttributes); + + /* + * We don't MAXALIGN the tuple headers as we always make the copy of tuple + * to support in-place updates. + */ + hoff = len; + + data_len = zheap_compute_data_size(tupleDescriptor, values, isnull, hoff); + + len += data_len; + + /* + * Allocate and zero the space needed. 
Note that the tuple body and + * ZHeapTupleData management structure are allocated in one chunk. + */ + tuple = MemoryContextAllocExtended(CurrentMemoryContext, + ZHEAPTUPLESIZE + len, + MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO); + tuple->t_data = td = (ZHeapTupleHeader) ((char *) tuple + ZHEAPTUPLESIZE); + + /* + * And fill in the information. Note we fill the Datum fields even though + * this tuple may never become a Datum. This lets HeapTupleHeaderGetDatum + * identify the tuple type if needed. + */ + tuple->t_len = len; + ItemPointerSetInvalid(&(tuple->t_self)); + tuple->t_tableOid = InvalidOid; + + ZHeapTupleHeaderSetNatts(td, numberOfAttributes); + td->t_hoff = hoff; + + zheap_fill_tuple(tupleDescriptor, + values, + isnull, + (char *) td + hoff, + data_len, + &td->t_infomask, + (hasnull ? td->t_bits : NULL)); + + return tuple; +} + +/* + * zheap_deform_tuple - similar to heap_deform_tuple, but for zheap tuples. + * + * Note that for zheap, cached offsets are not used and we always start + * deforming with the actual byte from where the first attribute starts. See + * atop zheap_compute_data_size. + */ +void +zheap_deform_tuple(ZHeapTuple tuple, TupleDesc tupleDesc, + Datum *values, bool *isnull) +{ + ZHeapTupleHeader tup = tuple->t_data; + bool hasnulls = ZHeapTupleHasNulls(tuple); + int tdesc_natts = tupleDesc->natts; + int natts; /* number of atts to extract */ + int attnum; + char *tp; /* ptr to tuple data */ + long off; /* offset in tuple data */ + bits8 *bp = tup->t_bits; /* ptr to null bitmap in tuple */ + + natts = ZHeapTupleHeaderGetNatts(tup); + + /* + * In inheritance situations, it is possible that the given tuple actually + * has more fields than the caller is expecting. Don't run off the end of + * the caller's arrays. + */ + natts = Min(natts, tdesc_natts); + + tp = (char *) tup; + + off = tup->t_hoff; + + for (attnum = 0; attnum < natts; attnum++) + { + Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum); + + if (hasnulls && att_isnull(attnum, bp)) + { + values[attnum] = (Datum) 0; + isnull[attnum] = true; + continue; + } + + isnull[attnum] = false; + + if (thisatt->attlen == -1) + { + off = att_align_pointer(off, thisatt->attalign, -1, + tp + off); + } + else if (!thisatt->attbyval) + { + /* not varlena, so safe to use att_align_nominal */ + off = att_align_nominal(off, thisatt->attalign); + } + + /* + * Support fetching attributes for zheap. The main difference as + * compare to heap tuples is that we don't align passbyval attributes. + * To compensate that we use memcpy to fetch passbyval attributes. + */ + if (thisatt->attbyval) + memcpy(&values[attnum], tp + off, thisatt->attlen); + else + values[attnum] = PointerGetDatum((char *) (tp + off)); + + off = att_addlength_pointer(off, thisatt->attlen, tp + off); + } + + /* + * If tuple doesn't have all the atts indicated by tupleDesc, read the + * rest as nulls or missing values as appropriate. 
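+ *
+ * As in heap_deform_tuple, the trailing attributes are filled from the
+ * tuple descriptor's missing-attribute data (e.g. columns added later
+ * with ALTER TABLE ... ADD COLUMN ... DEFAULT), or with NULLs when no
+ * missing value is recorded.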
+ */ + for (; attnum < tdesc_natts; attnum++) + values[attnum] = getmissingattr(tupleDesc, attnum + 1, &isnull[attnum]); +} + +void +slot_deform_ztuple(TupleTableSlot *slot, ZHeapTuple tuple, uint32 *offp, int natts) +{ + TupleDesc tupleDesc = slot->tts_tupleDescriptor; + Datum *values = slot->tts_values; + bool *isnull = slot->tts_isnull; + ZHeapTupleHeader tup = tuple->t_data; + bool hasnulls = ZHeapTupleHasNulls(tuple); + int attnum; + char *tp; /* ptr to tuple data */ + uint32 off; /* offset in tuple data */ + bits8 *bp = tup->t_bits; /* ptr to null bitmap in tuple */ + + /* We can only fetch as many attributes as the tuple has. */ + natts = Min(HeapTupleHeaderGetNatts(tuple->t_data), natts); + + /* + * Check whether the first call for this tuple, and initialize or restore + * loop state. + */ + attnum = slot->tts_nvalid; + if (attnum == 0) + off = 0; /* Start from the first attribute */ + else + off = *offp; /* Restore state from previous execution */ + + tp = (char *) tup + tup->t_hoff + off; + + for (; attnum < natts; attnum++) + { + Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum); + + if (hasnulls && att_isnull(attnum, bp)) + { + values[attnum] = (Datum) 0; + isnull[attnum] = true; + continue; + } + + isnull[attnum] = false; + + if (thisatt->attlen == -1) + { + tp = (char *) att_align_pointer(tp, thisatt->attalign, -1, + tp); + } + else if (!thisatt->attbyval) + { + /* not varlena, so safe to use att_align_nominal */ + tp = (char *) att_align_nominal(tp, thisatt->attalign); + } + /* XXX: We don't align for byval attributes in zheap. */ + + /* + * Support fetching attributes for zheap. The main difference as + * compare to heap tuples is that we don't align passbyval attributes. + * To compensate that we use memcpy to fetch the source of passbyval + * attributes. + */ + if (thisatt->attbyval) + { + Datum datum; + + memcpy(&datum, tp, thisatt->attlen); + values[attnum] = fetch_att(&datum, true, thisatt->attlen); + } + else + values[attnum] = PointerGetDatum(tp); + + tp = att_addlength_pointer(tp, thisatt->attlen, tp); + } + + /* + * Save state for next execution + */ + slot->tts_nvalid = attnum; + *offp = tp - ((char *) tup + tup->t_hoff); + /* For zheap, cached offsets are not used. */ + /* ZBORKED: should just stop setting this */ + slot->tts_flags |= TTS_FLAG_SLOW; +} + +/* + * Subroutine for zheap_insert(). Prepares a tuple for insertion. + * + * This is similar to heap_prepare_insert except that we don't set + * information in tuple header as that needs to be either set in + * TPD entry or undorecord for this tuple. + */ +static ZHeapTuple +zheap_prepare_insert(Relation relation, ZHeapTuple tup, int options) +{ + + /* + * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL. + * If we skip writing/using WAL, we must force the relation down to disk + * (using heap_sync) before it's safe to commit the transaction. This + * requires writing out any dirty buffers of that relation and then doing + * a forced fsync. For zheap, we've to fsync the corresponding undo buffers + * as well. It is difficult to keep track of dirty undo buffers and fsync + * them at end of the operation in some function similar to heap_sync. + * But, if we're freezing the tuple during insertion, we can use the + * HEAP_INSERT_SKIP_WAL optimization since we don't write undo for the same. + */ + Assert(!(options & HEAP_INSERT_SKIP_WAL) || (options & HEAP_INSERT_FROZEN)); + + /* + * Parallel operations are required to be strictly read-only in a parallel + * worker. 
Parallel inserts are not safe even in the leader in the + * general case, because group locking means that heavyweight locks for + * relation extension or GIN page locks will not conflict between members + * of a lock group, but we don't prohibit that case here because there are + * useful special cases that we can safely allow, such as CREATE TABLE AS. + */ + if (IsParallelWorker()) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TRANSACTION_STATE), + errmsg("cannot insert tuples in a parallel worker"))); + + tup->t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + tup->t_data->t_infomask2 &= ~ZHEAP_XACT_SLOT; + + if (options & HEAP_INSERT_FROZEN) + ZHeapTupleHeaderSetXactSlot(tup->t_data, ZHTUP_SLOT_FROZEN); + tup->t_tableOid = RelationGetRelid(relation); + + /* + * If the new tuple is too big for storage or contains already toasted + * out-of-line attributes from some other relation, invoke the toaster. + */ + if (relation->rd_rel->relkind != RELKIND_RELATION && + relation->rd_rel->relkind != RELKIND_MATVIEW) + { + /* toast table entries should never be recursively toasted */ + Assert(!ZHeapTupleHasExternal(tup)); + return tup; + } + else if (ZHeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD) + return ztoast_insert_or_update(relation, tup, NULL, options); + else + return tup; +} + +/* + * xid_infomask_changed - It checks whether the relevant status for a tuple + * xid has changed. + * + * Note the Xid field itself must be compared separately. + */ +static inline bool +xid_infomask_changed(uint16 new_infomask, uint16 old_infomask) +{ + const uint16 interesting = ZHEAP_XID_LOCK_ONLY; + + if ((new_infomask & interesting) != (old_infomask & interesting)) + return true; + + return false; +} + +/* + * zheap_exec_pending_rollback - Execute pending rollback actions for the + * given buffer (page). + * + * This function expects that the input buffer is locked. We will release and + * reacquire the buffer lock in this function, the same can be done in all the + * callers of this function, but that is just a code duplication, so we instead + * do it here. + */ +bool +zheap_exec_pending_rollback(Relation rel, Buffer buffer, int slot_no, + TransactionId xwait) +{ + UndoRecPtr urec_ptr; + TransactionId xid; + uint32 epoch; + int out_slot_no PG_USED_FOR_ASSERTS_ONLY; + + out_slot_no = GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + slot_no, + &epoch, + &xid, + &urec_ptr, + true, + true); + + /* As the rollback is pending, the slot can't be frozen. */ + Assert(out_slot_no != ZHTUP_SLOT_FROZEN); + + if (xwait != xid) + return false; + + /* + * Release buffer lock before applying undo actions. + */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + process_and_execute_undo_actions_page(urec_ptr, rel, buffer, epoch, xid, slot_no); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + return true; +} + +/* + * zbuffer_exec_pending_rollback - apply any pending rollback on the input buffer + * + * This method traverses all the transaction slots of the current page including + * tpd slots and applies any pending aborts on the page. + * + * It expects the caller has an exclusive lock on the relation. It also returns + * the corresponding TPD block number in case it has rolled back any transactions + * from the corresponding TPD page, if any. 
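+ *
+ * An illustrative caller pattern (exact call sites may differ), assuming
+ * the caller already holds the required exclusive lock on the relation:
+ *
+ *     BlockNumber tpd_blkno;
+ *
+ *     zbuffer_exec_pending_rollback(rel, buf, &tpd_blkno);
+ *     if (BlockNumberIsValid(tpd_blkno))
+ *         ... pending aborts from a TPD slot on tpd_blkno were applied ...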
+ */ +void +zbuffer_exec_pending_rollback(Relation rel, Buffer buf, BlockNumber *tpd_blkno) +{ + int slot_no; + int total_trans_slots = 0; + uint64 epoch; + TransactionId xid; + UndoRecPtr urec_ptr; + TransInfo *trans_slots = NULL; + bool any_tpd_slot_rolled_back = false; + + Assert(tpd_blkno != NULL); + + /* + * Fetch all the transaction information from the page and its corresponding + * TPD page. + */ + trans_slots = GetTransactionsSlotsForPage(rel, buf, &total_trans_slots, tpd_blkno); + + for (slot_no = 0; slot_no < total_trans_slots; slot_no++) + { + epoch = trans_slots[slot_no].xid_epoch; + xid = trans_slots[slot_no].xid; + urec_ptr = trans_slots[slot_no].urec_ptr; + + /* + * There shouldn't be any other in-progress transaction as we hold an + * exclusive lock on the relation. + */ + Assert(TransactionIdIsCurrentTransactionId(xid) || + !TransactionIdIsInProgress(xid)); + + /* If the transaction is aborted, apply undo actions */ + if (TransactionIdIsValid(xid) && TransactionIdDidAbort(xid)) + { + /* Remember if we've rolled back a transactio from a TPD-slot. */ + if ((slot_no >= ZHEAP_PAGE_TRANS_SLOTS - 1) && + BlockNumberIsValid(*tpd_blkno)) + any_tpd_slot_rolled_back = true; + process_and_execute_undo_actions_page(urec_ptr, rel, buf, epoch, + xid, slot_no); + } + } + + /* + * If we've not rolled back anything from TPD slot, there is no + * need set the TPD buffer. + */ + if (!any_tpd_slot_rolled_back) + *tpd_blkno = InvalidBlockNumber; + + /* be tidy */ + pfree(trans_slots); +} + +/* + * zheap_insert - insert tuple into a zheap + * + * The functionality related to heap is quite similar to heap_insert, + * additionaly this function inserts an undo record and updates the undo + * pointer in page header or in TPD entry for this page. + * + * XXX - Visibility map and page is all visible checks are required to support + * index-only scans on zheap. + */ +void +zheap_insert(Relation relation, ZHeapTuple tup, CommandId cid, + int options, BulkInsertState bistate) +{ + TransactionId xid = InvalidTransactionId; + uint32 epoch = 0; + ZHeapTuple zheaptup; + UnpackedUndoRecord undorecord; + Buffer buffer; + Buffer vmbuffer = InvalidBuffer; + bool all_visible_cleared = false; + int trans_slot_id = InvalidXactSlotId; + Page page; + UndoRecPtr urecptr = InvalidUndoRecPtr, + prev_urecptr = InvalidUndoRecPtr; + xl_undolog_meta undometa; + uint8 vm_status = 0; + bool lock_reacquired; + bool skip_undo; + + /* + * We can skip inserting undo records if the tuples are to be marked + * as frozen. + */ + skip_undo = (options & HEAP_INSERT_FROZEN); + + if (!skip_undo) + { + /* We don't need a transaction id if we are skipping undo */ + xid = GetTopTransactionId(); + epoch = GetEpochForXid(xid); + } + + /* + * Assign an OID, and toast the tuple if necessary. + * + * Note: below this point, heaptup is the data we actually intend to store + * into the relation; tup is the caller's original untoasted data. + */ + zheaptup = zheap_prepare_insert(relation, tup, options); + +reacquire_buffer: + /* + * Find buffer to insert this tuple into. If the page is all visible, + * this will also pin the requisite visibility map page. 
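+ *
+ * We can come back here (via the reacquire_buffer label) whenever
+ * reserving a transaction slot below forces us to release the buffer
+ * lock, so any visibility map page pinned on an earlier attempt is
+ * released before retrying.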
+ */ + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + + buffer = RelationGetBufferForZTuple(relation, zheaptup->t_len, + InvalidBuffer, options, bistate, + &vmbuffer, NULL); + page = BufferGetPage(buffer); + + if (!skip_undo) + { + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. + * + * We don't yet know the offset number of the inserting tuple so just pass + * the max offset number + 1 so that if it need to get slot from the TPD + * it can ensure that the TPD has sufficient map entries. + */ + trans_slot_id = PageReserveTransactionSlot(relation, + buffer, + PageGetMaxOffsetNumber(page) + 1, + epoch, + xid, + &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + { + UnlockReleaseBuffer(buffer); + goto reacquire_buffer; + } + + if (trans_slot_id == InvalidXactSlotId) + { + UnlockReleaseBuffer(buffer); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + goto reacquire_buffer; + } + + /* transaction slot must be reserved before adding tuple to page */ + Assert(trans_slot_id != InvalidXactSlotId); + } + + if (options & HEAP_INSERT_SPECULATIVE) + { + /* + * We can't skip writing undo speculative insertions as we have to + * write the token in undo. + */ + Assert(!skip_undo); + + /* Mark the tuple as speculatively inserted tuple. */ + zheaptup->t_data->t_infomask |= ZHEAP_SPECULATIVE_INSERT; + } + + /* + * See heap_insert to know why checking conflicts is important + * before actually inserting the tuple. + */ + CheckForSerializableConflictIn(relation, NULL, InvalidBuffer); + + if (!skip_undo) + { + /* + * Prepare an undo record. Unlike other operations, insert operation + * doesn't have a prior version to store in undo, so ideally, we don't + * need to store any additional information like + * UREC_INFO_PAYLOAD_CONTAINS_SLOT for TPD entries. However, for the sake + * of consistency with inserts via non-inplace updates, we keep the + * additional information in this operation. Also, we need such an + * information in future where we need to know more information for undo + * tuples and it would be good for forensic purpose as well. + */ + undorecord.uur_type = UNDO_INSERT; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = relation->rd_id; + undorecord.uur_prevxid = FrozenTransactionId; + undorecord.uur_xid = xid; + undorecord.uur_cid = cid; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = BufferGetBlockNumber(buffer); + undorecord.uur_tuple.len = 0; + + /* + * Store the speculative insertion token in undo, so that we can retrieve + * it during visibility check of the speculatively inserted tuples. + * + * Note that we don't need to WAL log this value as this is a temporary + * information required only on master node to detect conflicts for + * Insert .. On Conflict. 
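+ *
+ * The token obtained from GetSpeculativeInsertionToken() is appended
+ * below as a uint32 undo payload, which is what the visibility code
+ * reads back when checking for speculative-insertion conflicts.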
+ */ + if (options & HEAP_INSERT_SPECULATIVE) + { + uint32 specToken; + + undorecord.uur_payload.len = sizeof(uint32); + specToken = GetSpeculativeInsertionToken(); + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *)&specToken, + sizeof(uint32)); + } + else + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + &undometa); + } + + /* + * If there is a valid vmbuffer get its status. The vmbuffer will not + * be valid if operated page is newly extended, see + * RelationGetBufferForZTuple. Also, anyway by default vm status + * bits are clear for those pages hence no need to clear it again! + */ + if (BufferIsValid(vmbuffer)) + vm_status = visibilitymap_get_status(relation, + BufferGetBlockNumber(buffer), + &vmbuffer); + + /* + * Lock the TPD page before starting critical section. We might need + * to access it in ZPageAddItemExtended. Note that if the transaction + * slot belongs to TPD entry, then the TPD page must be locked during + * slot reservation. + * + * XXX We can optimize this by avoid taking TPD page lock unless the page + * has some unused item which requires us to fetch the transaction + * information from TPD. + */ + if (trans_slot_id <= ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) page) && + PageHasFreeLinePointers((PageHeader) page)) + TPDPageLock(relation, buffer); + + /* NO EREPORT(ERROR) from here till changes are logged */ + START_CRIT_SECTION(); + + if (!(options & HEAP_INSERT_FROZEN)) + ZHeapTupleHeaderSetXactSlot(zheaptup->t_data, trans_slot_id); + + RelationPutZHeapTuple(relation, buffer, zheaptup); + + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) || + (vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + all_visible_cleared = true; + visibilitymap_clear(relation, + ItemPointerGetBlockNumber(&(zheaptup->t_self)), + vmbuffer, VISIBILITYMAP_VALID_BITS); + } + + if (!skip_undo) + { + Assert(undorecord.uur_block == ItemPointerGetBlockNumber(&(zheaptup->t_self))); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&(zheaptup->t_self)); + InsertPreparedUndo(); + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, xid, + urecptr, NULL, 0); + } + + MarkBufferDirty(buffer); + + /* XLOG stuff */ + if (RelationNeedsWAL(relation)) + { + xl_undo_header xlundohdr; + xl_zheap_insert xlrec; + xl_zheap_header xlhdr; + XLogRecPtr recptr; + Page page = BufferGetPage(buffer); + uint8 info = XLOG_ZHEAP_INSERT; + int bufflags = 0; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + + /* + * If this is a catalog, we need to transmit combocids to properly + * decode, so log that as well. + */ + if (RelationIsAccessibleInLogicalDecoding(relation)) + { + /* + * Fixme: This won't work as it needs to access cmin/cmax which + * we probably needs to retrieve from TPD or UNDO. + */ + /*log_heap_new_cid(relation, zheaptup);*/ + } + + /* + * If this is the single and first tuple on page, we can reinit the + * page instead of restoring the whole thing. Set flag, and hide + * buffer references from XLogInsert. + */ + if (ItemPointerGetOffsetNumber(&(zheaptup->t_self)) == FirstOffsetNumber && + PageGetMaxOffsetNumber(page) == FirstOffsetNumber) + { + info |= XLOG_ZHEAP_INIT_PAGE; + bufflags |= REGBUF_WILL_INIT; + } + + /* + * Store the information required to generate undo record during + * replay. + */ + xlundohdr.reloid = relation->rd_id; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = prev_urecptr; + + /* Heap related part. 
*/ + xlrec.offnum = ItemPointerGetOffsetNumber(&zheaptup->t_self); + xlrec.flags = 0; + + if (all_visible_cleared) + xlrec.flags |= XLZ_INSERT_ALL_VISIBLE_CLEARED; + if (options & HEAP_INSERT_SPECULATIVE) + xlrec.flags |= XLZ_INSERT_IS_SPECULATIVE; + if (skip_undo) + xlrec.flags |= XLZ_INSERT_IS_FROZEN; + Assert(ItemPointerGetBlockNumber(&zheaptup->t_self) == BufferGetBlockNumber(buffer)); + + /* + * For logical decoding, we need the tuple even if we're doing a full + * page write, so make sure it's included even if we take a full-page + * image. (XXX We could alternatively store a pointer into the FPW). + * + * Fixme - Current zheap doesn't support logical decoding, once it is + * supported, we need to test and remove this Fixme. + */ + if (RelationIsLogicallyLogged(relation)) + { + xlrec.flags |= XLZ_INSERT_CONTAINS_NEW_TUPLE; + bufflags |= REGBUF_KEEP_DATA; + } + +prepare_xlog: + if (!skip_undo) + { + /* + * LOG undolog meta if this is the first WAL after the checkpoint. + */ + LogUndoMetaData(&undometa); + } + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + + XLogBeginInsert(); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xlrec, SizeOfZHeapInsert); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + /* + * We can't have a valid transaction slot when we are skipping + * undo. + */ + Assert(!skip_undo); + xlrec.flags |= XLZ_INSERT_CONTAINS_TPD_SLOT; + XLogRegisterData((char *) &trans_slot_id, sizeof(trans_slot_id)); + } + + xlhdr.t_infomask2 = zheaptup->t_data->t_infomask2; + xlhdr.t_infomask = zheaptup->t_data->t_infomask; + xlhdr.t_hoff = zheaptup->t_data->t_hoff; + + /* + * note we mark xlhdr as belonging to buffer; if XLogInsert decides to + * write the whole page to the xlog, we don't need to store + * xl_heap_header in the xlog. + */ + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags); + XLogRegisterBufData(0, (char *) &xlhdr, SizeOfZHeapHeader); + /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */ + XLogRegisterBufData(0, + (char *) zheaptup->t_data + SizeofZHeapTupleHeader, + zheaptup->t_len - SizeofZHeapTupleHeader); + if (xlrec.flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + (void) RegisterTPDBuffer(page, 1); + + /* filtering by origin on a row level is much more efficient */ + XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN); + + recptr = XLogInsertExtended(RM_ZHEAP_ID, info, RedoRecPtr, + doPageWrites); + if (recptr == InvalidXLogRecPtr) + { + ResetRegisteredTPDBuffers(); + goto prepare_xlog; + } + + PageSetLSN(page, recptr); + if (xlrec.flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + UnlockReleaseBuffer(buffer); + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + if (!skip_undo) + { + /* be tidy */ + if (undorecord.uur_payload.len > 0) + pfree(undorecord.uur_payload.data); + UnlockReleaseUndoBuffers(); + } + UnlockReleaseTPDBuffers(); + + /* + * If tuple is cachable, mark it for invalidation from the caches in case + * we abort. Note it is OK to do this after releasing the buffer, because + * the zheaptup data structure is all in local memory, not in the shared + * buffer. + * + * Fixme - Cache invalidation API expects HeapTup, so either we need an + * eqvivalent API for ZHeapTup or need to teach cache invalidation API's + * to work with both the formats. 
+ */ + /* CacheInvalidateHeapTuple(relation, zheaptup, NULL); */ + + /* Note: speculative insertions are counted too, even if aborted later */ + pgstat_count_heap_insert(relation, 1); + + /* + * If zheaptup is a private copy, release it. Don't forget to copy t_self + * back to the caller's image, too. + */ + if (zheaptup != tup) + { + tup->t_self = zheaptup->t_self; + + /* + * Since, in ZHeap we have speculative flag in the tuple header only, + * copy the speculative flag to the new tuple if required. + */ + if (ZHeapTupleHeaderIsSpeculative(zheaptup->t_data)) + tup->t_data->t_infomask |= ZHEAP_SPECULATIVE_INSERT; + + zheap_freetuple(zheaptup); + } +} + +/* + * simple_zheap_delete - delete a zheap tuple + * + * This routine may be used to delete a tuple when concurrent updates of + * the target tuple are not expected (for example, because we have a lock + * on the relation associated with the tuple). Any failure is reported + * via ereport(). + */ +void +simple_zheap_delete(Relation relation, ItemPointer tid, Snapshot snapshot) +{ + HTSU_Result result; + HeapUpdateFailureData hufd; + + result = zheap_delete(relation, tid, + GetCurrentCommandId(true), InvalidSnapshot, snapshot, + true, /* wait for commit */ + &hufd, false /* changingPart */); + switch (result) + { + case HeapTupleSelfUpdated: + /* Tuple was already updated in current command? */ + elog(ERROR, "tuple already updated by self"); + break; + + case HeapTupleMayBeUpdated: + /* done successfully */ + break; + + case HeapTupleUpdated: + elog(ERROR, "tuple concurrently updated"); + break; + + default: + elog(ERROR, "unrecognized zheap_delete status: %u", result); + break; + } +} + +/* + * zheap_delete - delete a tuple + * + * The functionality related to heap is quite similar to heap_delete, + * additionaly this function inserts an undo record and updates the undo + * pointer in page header or in TPD entry for this page. + * + * XXX - Visibility map and page is all visible checks are required to support + * index-only scans on zheap. + */ +HTSU_Result +zheap_delete(Relation relation, ItemPointer tid, + CommandId cid, Snapshot crosscheck, Snapshot snapshot, bool wait, + HeapUpdateFailureData *hufd, bool changingPart) +{ + HTSU_Result result; + TransactionId xid = GetTopTransactionId(); + TransactionId tup_xid, + oldestXidHavingUndo, + single_locker_xid; + SubTransactionId tup_subxid = InvalidSubTransactionId; + CommandId tup_cid; + ItemId lp; + ZHeapTupleData zheaptup; + UnpackedUndoRecord undorecord; + Page page; + BlockNumber blkno; + OffsetNumber offnum; + Buffer buffer; + Buffer vmbuffer = InvalidBuffer; + UndoRecPtr urecptr, prev_urecptr; + ItemPointerData ctid; + uint32 epoch = GetEpochForXid(xid); + int tup_trans_slot_id, + trans_slot_id, + new_trans_slot_id, + single_locker_trans_slot; + uint16 new_infomask, temp_infomask; + bool have_tuple_lock = false; + bool in_place_updated_or_locked = false; + bool all_visible_cleared = false; + bool any_multi_locker_member_alive = false; + bool lock_reacquired; + bool hasSubXactLock = false; + bool hasPayload = false; + xl_undolog_meta undometa; + uint8 vm_status; + + Assert(ItemPointerIsValid(tid)); + + /* + * Forbid this during a parallel operation, lest it allocate a combocid. + * Other workers might need that combocid for visibility checks, and we + * have no provision for broadcasting it to them. 
+ */ + if (IsInParallelMode()) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TRANSACTION_STATE), + errmsg("cannot delete tuples during a parallel operation"))); + + blkno = ItemPointerGetBlockNumber(tid); + buffer = ReadBuffer(relation, blkno); + page = BufferGetPage(buffer); + + /* + * Before locking the buffer, pin the visibility map page mainly to avoid + * doing I/O after locking the buffer. + */ + visibilitymap_pin(relation, blkno, &vmbuffer); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + offnum = ItemPointerGetOffsetNumber(tid); + lp = PageGetItemId(page, offnum); + Assert(ItemIdIsNormal(lp) || ItemIdIsDeleted(lp)); + + /* + * If TID is already delete marked due to pruning, then get new ctid, so + * that we can delete the new tuple. We will get new ctid if the tuple + * was non-inplace-updated otherwise we will get same TID. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + zheaptup.t_tableOid = RelationGetRelid(relation); + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + zheaptup.t_self = *tid; + + ctid = *tid; + +check_tup_satisfies_update: + any_multi_locker_member_alive = true; + result = ZHeapTupleSatisfiesUpdate(relation, &zheaptup, cid, buffer, &ctid, + &tup_trans_slot_id, &tup_xid, &tup_subxid, + &tup_cid, &single_locker_xid, + &single_locker_trans_slot, false, false, + snapshot, &in_place_updated_or_locked); + + if (result == HeapTupleInvisible) + { + UnlockReleaseBuffer(buffer); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("attempted to delete invisible tuple"))); + } + else if ((result == HeapTupleBeingUpdated || + ((result == HeapTupleMayBeUpdated) && + ZHeapTupleHasMultiLockers(zheaptup.t_data->t_infomask))) && + wait) + { + List *mlmembers = NIL; + TransactionId xwait; + SubTransactionId xwait_subxid; + int xwait_trans_slot; + uint16 infomask; + bool isCommitted; + bool can_continue = false; + + lock_reacquired = false; + xwait_subxid = tup_subxid; + + if (TransactionIdIsValid(single_locker_xid)) + { + xwait = single_locker_xid; + xwait_trans_slot = single_locker_trans_slot; + } + else + { + xwait = tup_xid; + xwait_trans_slot = tup_trans_slot_id; + } + + infomask = zheaptup.t_data->t_infomask; + + /* + * Sleep until concurrent transaction ends -- except when there's a + * single locker and it's our own transaction. Note we don't care + * which lock mode the locker has, because we need the strongest one. + * + * Before sleeping, we need to acquire tuple lock to establish our + * priority for the tuple (see zheap_lock_tuple). LockTuple will + * release us when we are next-in-line for the tuple. + * + * If we are forced to "start over" below, we keep the tuple lock; + * this arranges that we stay at the head of the line while rechecking + * tuple state. + */ + if (ZHeapTupleHasMultiLockers(infomask)) + { + LockTupleMode old_lock_mode; + TransactionId update_xact; + bool upd_xact_aborted = false; + + /* + * In ZHeapTupleSatisfiesUpdate, it's not possible to know if current + * transaction has already locked the tuple for update because of + * multilocker flag. In that case, we've to check whether the current + * transaction has already locked the tuple for update. + */ + + /* + * Get the transaction slot and undo record pointer if we are already in a + * transaction. 
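+ *
+ * If this backend already holds a slot on the page, we walk our own
+ * multi-lock members below; finding a member of our own transaction
+ * holding LockTupleExclusive or stronger means the desired lock is
+ * effectively held already, and the delete proceeds without taking
+ * the tuple lock.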
+ */ + trans_slot_id = PageGetTransactionSlotId(relation, buffer, epoch, xid, + &prev_urecptr, false, false, + NULL); + + if (trans_slot_id != InvalidXactSlotId) + { + List *mlmembers; + ListCell *lc; + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. + */ + mlmembers = ZGetMultiLockMembersForCurrentXact(&zheaptup, + trans_slot_id, prev_urecptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. + */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= LockTupleExclusive) + { + result = HeapTupleMayBeUpdated; + /* + * There is no other active locker on the tuple except + * current transaction id, so we can delete the tuple. + */ + goto zheap_tuple_updated; + } + } + + list_free_deep(mlmembers); + } + + old_lock_mode = get_old_lock_mode(infomask); + + /* + * For aborted updates, we must allow to reverify the tuple in + * case it's values got changed. See the similar handling in + * zheap_update. + */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(zheaptup.t_data->t_infomask)) + ZHeapTupleGetTransInfo(&zheaptup, buffer, NULL, NULL, &update_xact, + NULL, NULL, false); + else + update_xact = InvalidTransactionId; + + if (DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_lock_mode), + HWLOCKMODE_from_locktupmode(LockTupleExclusive))) + { + /* + * There is a potential conflict. It is quite possible + * that by this time the locker has already been committed. + * So we need to check for conflict with all the possible + * lockers and wait for each of them after releasing a + * buffer lock and acquiring a lock on a tuple. + */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + mlmembers = ZGetMultiLockMembers(relation, &zheaptup, buffer, + true); + + /* + * If there is no multi-lock members apart from the current transaction + * then no need for tuplock, just go ahead. + */ + if (mlmembers != NIL) + { + heap_acquire_tuplock(relation, &(zheaptup.t_self), LockTupleExclusive, + LockWaitBlock, &have_tuple_lock); + ZMultiLockMembersWait(relation, mlmembers, &zheaptup, buffer, + update_xact, LockTupleExclusive, false, + XLTW_Delete, NULL, &upd_xact_aborted); + } + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * If the aborted xact is for update, then we need to reverify + * the tuple. + */ + if (upd_xact_aborted) + goto check_tup_satisfies_update; + lock_reacquired = true; + + /* + * There was no UPDATE in the Multilockers. No + * TransactionIdIsInProgress() call needed here, since we called + * ZMultiLockMembersWait() above. + */ + if (!TransactionIdIsValid(update_xact)) + can_continue = true; + } + } + else if (!TransactionIdIsCurrentTransactionId(xwait)) + { + /* + * Wait for regular transaction to end; but first, acquire tuple + * lock. 
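+ *
+ * When a valid subtransaction id was recorded for the tuple (fetched
+ * from undo), we wait via SubXactLockTableWait() on that subtransaction
+ * rather than on the top-level transaction.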
+ */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + heap_acquire_tuplock(relation, &(zheaptup.t_self), LockTupleExclusive, + LockWaitBlock, &have_tuple_lock); + if (xwait_subxid != InvalidSubTransactionId) + SubXactLockTableWait(xwait, xwait_subxid, relation, + &zheaptup.t_self, XLTW_Delete); + else + XactLockTableWait(xwait, relation, &zheaptup.t_self, + XLTW_Delete); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + lock_reacquired = true; + } + + if (lock_reacquired) + { + TransactionId current_tup_xid; + + /* + * By the time, we require the lock on buffer, some other xact + * could have updated this tuple. We need take care of the cases + * when page is pruned after we release the buffer lock. For this, + * we check if ItemId is not deleted and refresh the tuple offset + * position in page. If TID is already delete marked due to + * pruning, then get new ctid, so that we can update the new + * tuple. + * + * We also need to ensure that no new lockers have been added in + * the meantime, if there is any new locker, then start again. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + + if (ZHeapTupleHasMultiLockers(infomask)) + { + List *new_mlmembers; + new_mlmembers = ZGetMultiLockMembers(relation, &zheaptup, + buffer, false); + + /* + * Ensure, no new lockers have been added, if so, then start + * again. + */ + if (!ZMultiLockMembersSame(mlmembers, new_mlmembers)) + { + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + goto check_tup_satisfies_update; + } + + any_multi_locker_member_alive = + ZIsAnyMultiLockMemberRunning(new_mlmembers, &zheaptup, + buffer); + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + } + + /* + * xwait is done, but if xwait had just locked the tuple then some + * other xact could update/lock this tuple before we get to this + * point. Check for xid change, and start over if so. We need to + * do some special handling for lockers because their xid is never + * stored on the tuples. If there was a single locker on the + * tuple and that locker is gone and some new locker has locked + * the tuple, we won't be able to identify that by infomask/xid on + * the tuple, rather we need to fetch the locker xid. + */ + ZHeapTupleGetTransInfo(&zheaptup, buffer, NULL, NULL, + ¤t_tup_xid, NULL, NULL, false); + if (xid_infomask_changed(zheaptup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + { + if (ZHEAP_XID_IS_LOCKED_ONLY(zheaptup.t_data->t_infomask) && + !ZHeapTupleHasMultiLockers(zheaptup.t_data->t_infomask) && + TransactionIdIsValid(single_locker_xid)) + { + TransactionId current_single_locker_xid = InvalidTransactionId; + + (void) GetLockerTransInfo(relation, &zheaptup, buffer, NULL, + NULL, ¤t_single_locker_xid, + NULL, NULL); + if (!TransactionIdEquals(single_locker_xid, + current_single_locker_xid)) + goto check_tup_satisfies_update; + + } + else + goto check_tup_satisfies_update; + } + + /* Aborts of multi-lockers are already dealt above. */ + if(!ZHeapTupleHasMultiLockers(infomask)) + { + bool has_update = false; + + if (!ZHEAP_XID_IS_LOCKED_ONLY(zheaptup.t_data->t_infomask)) + has_update = true; + + isCommitted = TransactionIdDidCommit(xwait); + + /* + * For aborted transaction, if the undo actions are not applied + * yet, then apply them before modifying the page. 
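+ *
+ * Note that zheap_exec_pending_rollback() releases and reacquires the
+ * buffer lock while applying the undo actions (see its header comment);
+ * if the aborted transaction was an updater, the rollback restores the
+ * previous tuple contents, which is why we recheck the tuple below.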
+ */ + if (!isCommitted) + zheap_exec_pending_rollback(relation, + buffer, + xwait_trans_slot, + xwait); + + /* + * For aborted updates, we must allow to reverify the tuple in + * case it's values got changed. + */ + if (!isCommitted && has_update) + goto check_tup_satisfies_update; + + if (!has_update) + can_continue = true; + } + } + else + { + /* + * We can proceed with the delete, when there's a single locker + * and it's our own transaction. + */ + if (ZHEAP_XID_IS_LOCKED_ONLY(zheaptup.t_data->t_infomask)) + can_continue = true; + } + + /* + * We may overwrite if previous xid is aborted or committed, but only + * locked the tuple without updating it. + */ + if (result != HeapTupleMayBeUpdated) + result = can_continue ? HeapTupleMayBeUpdated : HeapTupleUpdated; + } + else if (result == HeapTupleUpdated + && ZHeapTupleHasMultiLockers(zheaptup.t_data->t_infomask)) + { + /* + * Get the transaction slot and undo record pointer if we are already in a + * transaction. + */ + trans_slot_id = PageGetTransactionSlotId(relation, buffer, epoch, xid, + &prev_urecptr, false, false, + NULL); + + if (trans_slot_id != InvalidXactSlotId) + { + List *mlmembers; + ListCell *lc; + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. + */ + mlmembers = ZGetMultiLockMembersForCurrentXact(&zheaptup, + trans_slot_id, prev_urecptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. + */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= LockTupleExclusive) + { + result = HeapTupleMayBeUpdated; + /* + * There is no other active locker on the tuple except + * current transaction id, so we can delete the tuple. + */ + break; + } + } + + list_free_deep(mlmembers); + } + + } + + if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated) + { + /* Perform additional check for transaction-snapshot mode RI updates */ + if (!ZHeapTupleSatisfies(&zheaptup, crosscheck, buffer, NULL)) + result = HeapTupleUpdated; + } + +zheap_tuple_updated: + if (result != HeapTupleMayBeUpdated) + { + Assert(result == HeapTupleSelfUpdated || + result == HeapTupleUpdated || + result == HeapTupleBeingUpdated); + Assert(ItemIdIsDeleted(lp) || + IsZHeapTupleModified(zheaptup.t_data->t_infomask)); + + /* If item id is deleted, tuple can't be marked as moved. */ + if (!ItemIdIsDeleted(lp) && + ZHeapTupleIsMoved(zheaptup.t_data->t_infomask)) + ItemPointerSetMovedPartitions(&hufd->ctid); + else + hufd->ctid = ctid; + hufd->xmax = tup_xid; + if (result == HeapTupleSelfUpdated) + hufd->cmax = tup_cid; + else + hufd->cmax = InvalidCommandId; + UnlockReleaseBuffer(buffer); + hufd->in_place_updated_or_locked = in_place_updated_or_locked; + if (have_tuple_lock) + UnlockTupleTuplock(relation, &(zheaptup.t_self), LockTupleExclusive); + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + return result; + } + + /* + * Acquire subtransaction lock, if current transaction is a + * subtransaction. 
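+ *
+ * The subtransaction id is also recorded in the undo payload further
+ * below (UREC_INFO_PAYLOAD_CONTAINS_SUBXACT), so that concurrent
+ * backends can wait on it via SubXactLockTableWait.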
+ */ + if (IsSubTransaction()) + { + SubXactLockTableInsert(GetCurrentSubTransactionId()); + hasSubXactLock = true; + } + + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. + */ + trans_slot_id = PageReserveTransactionSlot(relation, buffer, + PageGetMaxOffsetNumber(page), + epoch, xid, &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + goto check_tup_satisfies_update; + + if (trans_slot_id == InvalidXactSlotId) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release the + * buffer lock. For this we check if ItemId is not deleted and refresh + * the tuple offset position in page. If TID is already delete marked + * due to pruning, then get new ctid, so that we can delete the new + * tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + goto check_tup_satisfies_update; + } + + /* transaction slot must be reserved before adding tuple to page */ + Assert(trans_slot_id != InvalidXactSlotId); + + /* + * It's possible that tuple slot is now marked as frozen. Hence, we refetch + * the tuple here. + */ + Assert(!ItemIdIsDeleted(lp)); + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + + /* + * If the slot is marked as frozen, the latest modifier of the tuple must be + * frozen. + */ + if (ZHeapTupleHeaderGetXactSlot((ZHeapTupleHeader) (zheaptup.t_data)) == ZHTUP_SLOT_FROZEN) + { + tup_trans_slot_id = ZHTUP_SLOT_FROZEN; + tup_xid = InvalidTransactionId; + } + + temp_infomask = zheaptup.t_data->t_infomask; + + /* Compute the new xid and infomask to store into the tuple. */ + compute_new_xid_infomask(&zheaptup, buffer, tup_xid, tup_trans_slot_id, + temp_infomask, xid, trans_slot_id, + single_locker_xid, LockTupleExclusive, ForUpdate, + &new_infomask, &new_trans_slot_id); + /* + * There must not be any stronger locker than the current operation, + * otherwise it would have waited for it to finish. + */ + Assert(new_trans_slot_id == trans_slot_id); + + /* + * If the last transaction that has updated the tuple is already too + * old, then consider it as frozen which means it is all-visible. This + * ensures that we don't need to store epoch in the undo record to check + * if the undo tuple belongs to previous epoch and hence all-visible. See + * comments atop of file ztqual.c. + */ + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + if (TransactionIdPrecedes(tup_xid, oldestXidHavingUndo)) + tup_xid = FrozenTransactionId; + + CheckForSerializableConflictIn(relation, &(zheaptup.t_self), buffer); + + /* + * Prepare an undo record. We need to separately store the latest + * transaction id that has changed the tuple to ensure that we don't + * try to process the tuple in undo chain that is already discarded. + * See GetTupleFromUndo. 
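+ *
+ * The UNDO_DELETE record built below carries the complete old tuple in
+ * uur_tuple, laid out as:
+ *
+ *     uint32 t_len | ItemPointerData t_self | Oid t_tableOid | tuple data
+ *
+ * and, when needed, a payload holding the TPD transaction slot number
+ * and/or the current subtransaction id.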
+ */ + undorecord.uur_type = UNDO_DELETE; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = relation->rd_id; + undorecord.uur_prevxid = tup_xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = cid; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = blkno; + undorecord.uur_offset = offnum; + + initStringInfo(&undorecord.uur_tuple); + + /* + * Copy the entire old tuple including it's header in the undo record. + * We need this to reconstruct the tuple if current tuple is not + * visible to some other transaction. We choose to write the complete + * tuple in undo record for delete operation so that we can reuse the + * space after the transaction performing the operation commits. + */ + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_len, + sizeof(uint32)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_self, + sizeof(ItemPointerData)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_tableOid, + sizeof(Oid)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) zheaptup.t_data, + zheaptup.t_len); + /* + * Store the transaction slot number for undo tuple in undo record, if + * the slot belongs to TPD entry. We can always get the current tuple's + * transaction slot number by referring offset->slot map in TPD entry, + * however that won't be true for tuple in undo. + */ + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + hasPayload = true; + } + + /* + * Store subtransaction id in undo record. See SubXactLockTableWait + * to know why we need to store subtransaction id in undo. + */ + if (hasSubXactLock) + { + SubTransactionId subxid = GetCurrentSubTransactionId(); + + if (!hasPayload) + { + initStringInfo(&undorecord.uur_payload); + hasPayload = true; + } + + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SUBXACT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &subxid, + sizeof(subxid)); + } + + if (!hasPayload) + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + &undometa); + /* We must have a valid vmbuffer. */ + Assert(BufferIsValid(vmbuffer)); + vm_status = visibilitymap_get_status(relation, + BufferGetBlockNumber(buffer), &vmbuffer); + + START_CRIT_SECTION(); + + /* + * If all the members were lockers and are all gone, we can do away + * with the MULTI_LOCKERS bit. + */ + + if (ZHeapTupleHasMultiLockers(zheaptup.t_data->t_infomask) && + !any_multi_locker_member_alive) + zheaptup.t_data->t_infomask &= ~ZHEAP_MULTI_LOCKERS; + + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) || + (vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + all_visible_cleared = true; + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), + vmbuffer, VISIBILITYMAP_VALID_BITS); + } + + InsertPreparedUndo(); + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, xid, + urecptr, NULL, 0); + + /* + * If this transaction commits, the tuple will become DEAD sooner or + * later. If the transaction finally aborts, the subsequent page pruning + * will be a no-op and the hint will be cleared. 
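+ *
+ * ZPageSetPrunable() records the deleting xid as the page's prunable
+ * hint, analogous to PageSetPrunable() in the heap code.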
+ */ + ZPageSetPrunable(page, xid); + + ZHeapTupleHeaderSetXactSlot(zheaptup.t_data, new_trans_slot_id); + zheaptup.t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + zheaptup.t_data->t_infomask |= ZHEAP_DELETED | new_infomask; + + /* Signal that this is actually a move into another partition */ + if (changingPart) + ZHeapTupleHeaderSetMovedPartitions(zheaptup.t_data); + + MarkBufferDirty(buffer); + + /* + * Do xlog stuff + */ + if (RelationNeedsWAL(relation)) + { + ZHeapTupleHeader zhtuphdr = NULL; + xl_undo_header xlundohdr; + xl_zheap_delete xlrec; + xl_zheap_header xlhdr; + XLogRecPtr recptr; + XLogRecPtr RedoRecPtr; + uint32 totalundotuplen = 0; + Size dataoff; + bool doPageWrites; + + /* + * Store the information required to generate undo record during + * replay. + */ + xlundohdr.reloid = undorecord.uur_reloid; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = prev_urecptr; + + xlrec.prevxid = tup_xid; + xlrec.offnum = ItemPointerGetOffsetNumber(&zheaptup.t_self); + xlrec.infomask = zheaptup.t_data->t_infomask; + xlrec.trans_slot_id = trans_slot_id; + xlrec.flags = all_visible_cleared ? XLZ_DELETE_ALL_VISIBLE_CLEARED : 0; + + if (changingPart) + xlrec.flags |= XLZ_DELETE_IS_PARTITION_MOVE; + if (hasSubXactLock) + xlrec.flags |= XLZ_DELETE_CONTAINS_SUBXACT; + + /* + * If full_page_writes is enabled, and the buffer image is not + * included in the WAL then we can rely on the tuple in the page to + * regenerate the undo tuple during recovery as the tuple state must + * be same as now, otherwise we need to store it explicitly. + * + * Since we don't yet have the insert lock, including the page + * image decision could change later and in that case we need prepare + * the WAL record again. + */ +prepare_xlog: + /* LOG undolog meta if this is the first WAL after the checkpoint. 
*/ + LogUndoMetaData(&undometa); + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + if (!doPageWrites || XLogCheckBufferNeedsBackup(buffer)) + { + xlrec.flags |= XLZ_HAS_DELETE_UNDOTUPLE; + + totalundotuplen = *((uint32 *) &undorecord.uur_tuple.data[0]); + dataoff = sizeof(uint32) + sizeof(ItemPointerData) + sizeof(Oid); + zhtuphdr = (ZHeapTupleHeader) &undorecord.uur_tuple.data[dataoff]; + + xlhdr.t_infomask2 = zhtuphdr->t_infomask2; + xlhdr.t_infomask = zhtuphdr->t_infomask; + xlhdr.t_hoff = zhtuphdr->t_hoff; + } + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + xlrec.flags |= XLZ_DELETE_CONTAINS_TPD_SLOT; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xlrec, SizeOfZHeapDelete); + if (xlrec.flags & XLZ_DELETE_CONTAINS_TPD_SLOT) + XLogRegisterData((char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + if (xlrec.flags & XLZ_HAS_DELETE_UNDOTUPLE) + { + XLogRegisterData((char *) &xlhdr, SizeOfZHeapHeader); + /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */ + XLogRegisterData((char *) zhtuphdr + SizeofZHeapTupleHeader, + totalundotuplen - SizeofZHeapTupleHeader); + } + + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + (void) RegisterTPDBuffer(page, 1); + + /* filtering by origin on a row level is much more efficient */ + XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN); + + recptr = XLogInsertExtended(RM_ZHEAP_ID, XLOG_ZHEAP_DELETE, + RedoRecPtr, doPageWrites); + if (recptr == InvalidXLogRecPtr) + { + ResetRegisteredTPDBuffers(); + goto prepare_xlog; + } + PageSetLSN(page, recptr); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + /* be tidy */ + pfree(undorecord.uur_tuple.data); + if (undorecord.uur_payload.len > 0) + pfree(undorecord.uur_payload.data); + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + + UnlockReleaseUndoBuffers(); + /* + * If the tuple has toasted out-of-line attributes, we need to delete + * those items too. We have to do this before releasing the buffer + * because we need to look at the contents of the tuple, but it's OK to + * release the content lock on the buffer first. + */ + if (relation->rd_rel->relkind != RELKIND_RELATION && + relation->rd_rel->relkind != RELKIND_MATVIEW) + { + /* toast table entries should never be recursively toasted */ + Assert(!ZHeapTupleHasExternal(&zheaptup)); + } + else if (ZHeapTupleHasExternal(&zheaptup)) + ztoast_delete(relation, &zheaptup, false); + + /* Now we can release the buffer */ + ReleaseBuffer(buffer); + UnlockReleaseTPDBuffers(); + + /* + * Release the lmgr tuple lock, if we had it. + */ + if (have_tuple_lock) + UnlockTupleTuplock(relation, &(zheaptup.t_self), LockTupleExclusive); + + pgstat_count_heap_delete(relation); + + return HeapTupleMayBeUpdated; +} + +/* + * zheap_update - update a tuple + * + * This function either updates the tuple in-place or it deletes the old + * tuple and new tuple for non-in-place updates. Additionaly this function + * inserts an undo record and updates the undo pointer in page header or in + * TPD entry for this page. + * + * XXX - Visibility map and page is all visible needs to be maintained for + * index-only scans on zheap. + * + * For input and output values, see heap_update. 
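+ *
+ * Roughly, the update is done in place when the SHORTALIGN'd new tuple
+ * length does not exceed the old one, no index column is modified, and
+ * no TOAST work is needed; otherwise the old version is delete-marked
+ * and the new version is inserted, possibly on a different page.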
+ */ +HTSU_Result +zheap_update(Relation relation, ItemPointer otid, ZHeapTuple newtup, + CommandId cid, Snapshot crosscheck, Snapshot snapshot, bool wait, + HeapUpdateFailureData *hufd, LockTupleMode *lockmode) +{ + HTSU_Result result; + TransactionId xid = GetTopTransactionId(); + TransactionId tup_xid, + save_tup_xid, + oldestXidHavingUndo, + single_locker_xid; + SubTransactionId tup_subxid = InvalidSubTransactionId; + CommandId tup_cid; + Bitmapset *inplace_upd_attrs = NULL; + Bitmapset *key_attrs = NULL; + Bitmapset *interesting_attrs = NULL; + Bitmapset *modified_attrs = NULL; + ItemId lp; + ZHeapTupleData oldtup; + ZHeapTuple zheaptup; + UndoRecPtr urecptr, prev_urecptr, new_prev_urecptr; + UndoRecPtr new_urecptr = InvalidUndoRecPtr; + UnpackedUndoRecord undorecord, new_undorecord; + Page page; + BlockNumber block; + ItemPointerData ctid; + Buffer buffer, + newbuf, + vmbuffer = InvalidBuffer, + vmbuffer_new = InvalidBuffer; + Size newtupsize, + oldtupsize, + pagefree; + uint32 epoch = GetEpochForXid(xid); + int tup_trans_slot_id, + trans_slot_id, + new_trans_slot_id, + result_trans_slot_id, + single_locker_trans_slot; + uint16 old_infomask; + uint16 new_infomask, temp_infomask; + uint16 infomask_old_tuple = 0; + uint16 infomask_new_tuple = 0; + OffsetNumber old_offnum, max_offset; + bool all_visible_cleared = false; + bool new_all_visible_cleared = false; + bool have_tuple_lock = false; + bool is_index_updated = false; + bool use_inplace_update = false; + bool in_place_updated_or_locked = false; + bool key_intact = false; + bool checked_lockers = false; + bool locker_remains = false; + bool any_multi_locker_member_alive = false; + bool lock_reacquired; + bool need_toast; + bool hasSubXactLock = false; + xl_undolog_meta undometa; + uint8 vm_status; + uint8 vm_status_new = 0; + + Assert(ItemPointerIsValid(otid)); + + /* + * Forbid this during a parallel operation, lest it allocate a combocid. + * Other workers might need that combocid for visibility checks, and we + * have no provision for broadcasting it to them. + */ + if (IsInParallelMode()) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TRANSACTION_STATE), + errmsg("cannot update tuples during a parallel operation"))); + + /* + * Fetch the list of attributes to be checked for various operations. + * + * For in-place update considerations, this is wasted effort if we fail to + * update or have to put the new tuple on a different page. But we must + * compute the list before obtaining buffer lock --- in the worst case, if + * we are doing an update on one of the relevant system catalogs, we could + * deadlock if we try to fetch the list later. Note, that as of now + * system catalogs are always stored in heap, so we might not hit the + * deadlock case, but it can be supported in future. In any case, the + * relcache caches the data so this is usually pretty cheap. + * + * Note that we get a copy here, so we need not worry about relcache flush + * happening midway through. + */ + inplace_upd_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_HOT); + key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY); + + block = ItemPointerGetBlockNumber(otid); + buffer = ReadBuffer(relation, block); + page = BufferGetPage(buffer); + + interesting_attrs = NULL; + + /* + * Before locking the buffer, pin the visibility map page mainly to avoid + * doing I/O after locking the buffer. 
+ */ + visibilitymap_pin(relation, block, &vmbuffer); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + old_offnum = ItemPointerGetOffsetNumber(otid); + lp = PageGetItemId(page, old_offnum); + Assert(ItemIdIsNormal(lp) || ItemIdIsDeleted(lp)); + + /* + * If TID is already delete marked due to pruning, then get new ctid, so + * that we can update the new tuple. We will get new ctid if the tuple + * was non-inplace-updated otherwise we will get same TID. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *otid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + + /* + * Since tuple data is gone let's be conservative about lock mode. + * + * XXX We could optimize here by checking whether the key column is + * not updated and if so, then use lower lock level, but this case + * should be rare enough that it won't matter. + */ + *lockmode = LockTupleExclusive; + goto zheap_tuple_updated; + } + + /* + * Fill in enough data in oldtup for ZHeapDetermineModifiedColumns to work + * properly. + */ + oldtup.t_tableOid = RelationGetRelid(relation); + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + oldtup.t_self = *otid; + + /* the new tuple is ready, except for this: */ + newtup->t_tableOid = RelationGetRelid(relation); + + interesting_attrs = bms_add_members(interesting_attrs, inplace_upd_attrs); + interesting_attrs = bms_add_members(interesting_attrs, key_attrs); + + /* Determine columns modified by the update. */ + modified_attrs = ZHeapDetermineModifiedColumns(relation, interesting_attrs, + &oldtup, newtup); + + is_index_updated = bms_overlap(modified_attrs, inplace_upd_attrs); + + if (relation->rd_rel->relkind != RELKIND_RELATION && + relation->rd_rel->relkind != RELKIND_MATVIEW) + { + /* toast table entries should never be recursively toasted */ + Assert(!ZHeapTupleHasExternal(&oldtup)); + Assert(!ZHeapTupleHasExternal(newtup)); + need_toast = false; + } + else + need_toast = (newtup->t_len >= TOAST_TUPLE_THRESHOLD || + ZHeapTupleHasExternal(&oldtup) || + ZHeapTupleHasExternal(newtup)); + + oldtupsize = SHORTALIGN(oldtup.t_len); + newtupsize = SHORTALIGN(newtup->t_len); + + /* + * inplace updates can be done only if the length of new tuple is lesser + * than or equal to old tuple and there are no index column updates and + * the tuple does not require TOAST-ing. + */ + if ((newtupsize <= oldtupsize) && !is_index_updated && !need_toast) + use_inplace_update = true; + else + use_inplace_update = false; + + /* + * Similar to heap, if we're not updating any "key" column, we can grab a + * weaker lock type. See heap_update. + */ + if (!bms_overlap(modified_attrs, key_attrs)) + { + *lockmode = LockTupleNoKeyExclusive; + key_intact = true; + } + else + { + *lockmode = LockTupleExclusive; + key_intact = false; + } + + /* + * ctid needs to be fetched from undo chain. You might think that it will + * be always same as the passed in ctid as the old tuple is already visible + * out snapshot. However, it is quite possible that after checking the + * visibility of old tuple, some concurrent session would have performed + * non in-place update and in such a case we need can only get it via + * undo. 
+ */ + ctid = *otid; + +check_tup_satisfies_update: + checked_lockers = false; + locker_remains = false; + any_multi_locker_member_alive = true; + result = ZHeapTupleSatisfiesUpdate(relation, &oldtup, cid, buffer, &ctid, + &tup_trans_slot_id, &tup_xid, &tup_subxid, + &tup_cid, &single_locker_xid, + &single_locker_trans_slot, false, false, + snapshot, &in_place_updated_or_locked); + + if (result == HeapTupleInvisible) + { + UnlockReleaseBuffer(buffer); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("attempted to update invisible tuple"))); + } + else if ((result == HeapTupleBeingUpdated || + ((result == HeapTupleMayBeUpdated) && + ZHeapTupleHasMultiLockers(oldtup.t_data->t_infomask))) && + wait) + { + List *mlmembers; + TransactionId xwait; + SubTransactionId xwait_subxid; + int xwait_trans_slot; + uint16 infomask; + bool can_continue = false; + + xwait_subxid = tup_subxid; + + if (TransactionIdIsValid(single_locker_xid)) + { + xwait = single_locker_xid; + xwait_trans_slot = single_locker_trans_slot; + } + else + { + xwait = tup_xid; + xwait_trans_slot = tup_trans_slot_id; + } + + /* must copy state data before unlocking buffer */ + infomask = oldtup.t_data->t_infomask; + + if (ZHeapTupleHasMultiLockers(infomask)) + { + TransactionId update_xact; + LockTupleMode old_lock_mode; + int remain = 0; + bool isAborted; + bool upd_xact_aborted = false; + + /* + * In ZHeapTupleSatisfiesUpdate, it's not possible to know if current + * transaction has already locked the tuple for update because of + * multilocker flag. In that case, we've to check whether the current + * transaction has already locked the tuple for update. + */ + + /* + * Get the transaction slot and undo record pointer if we are already in a + * transaction. + */ + trans_slot_id = PageGetTransactionSlotId(relation, buffer, epoch, xid, + &prev_urecptr, false, false, + NULL); + + if (trans_slot_id != InvalidXactSlotId) + { + List *mlmembers; + ListCell *lc; + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. + */ + mlmembers = ZGetMultiLockMembersForCurrentXact(&oldtup, + trans_slot_id, prev_urecptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. + */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= *lockmode) + { + result = HeapTupleMayBeUpdated; + + /* + * There is no other active locker on the tuple except + * current transaction id, so we can update the tuple. + * However, we need to propagate lockers information. + */ + checked_lockers = true; + locker_remains = true; + goto zheap_tuple_updated; + } + } + + list_free_deep(mlmembers); + } + + old_lock_mode = get_old_lock_mode(infomask); + + /* + * For the conflicting lockers, we need to be careful about + * applying pending undo actions for aborted transactions; if we + * leave any transaction whether locker or updater, it can lead to + * inconsistency. 
Basically, in such a case after waiting for all + * the conflicting transactions we might clear the multilocker + * flag and proceed with update and it is quite possible that after + * the update, undo worker rollbacks some of the previous locker + * which can overwrite the tuple (Note, till multilocker bit is set, + * the rollback actions won't overwrite the tuple). + * + * OTOH for non-conflicting lockers, as we don't clear the + * multi-locker flag, there is no urgency to perform undo actions + * for aborts of lockers. The work involved in finding and + * aborting lockers is non-trivial (w.r.t performance), so it is + * better to avoid it. + * + * After abort, if it is only a locker, then it will be completely + * gone; but if it is an update, then after applying pending + * actions, the tuple might get changed and we must allow to + * reverify the tuple in case it's values got changed. + */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(oldtup.t_data->t_infomask)) + ZHeapTupleGetTransInfo(&oldtup, buffer, NULL, NULL, &update_xact, + NULL, NULL, false); + else + update_xact = InvalidTransactionId; + + if (DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_lock_mode), + HWLOCKMODE_from_locktupmode(*lockmode))) + { + TransactionId current_tup_xid; + + /* + * There is a potential conflict. It is quite possible + * that by this time the locker has already been committed. + * So we need to check for conflict with all the possible + * lockers and wait for each of them after releasing a + * buffer lock and acquiring a lock on a tuple. + */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + mlmembers = ZGetMultiLockMembers(relation, &oldtup, buffer, + true); + + /* + * If there is no multi-lock members apart from the current transaction + * then no need for tuplock, just go ahead. + */ + if (mlmembers != NIL) + { + heap_acquire_tuplock(relation, &(oldtup.t_self), *lockmode, + LockWaitBlock, &have_tuple_lock); + ZMultiLockMembersWait(relation, mlmembers, &oldtup, buffer, + update_xact, *lockmode, false, + XLTW_Update, &remain, + &upd_xact_aborted); + } + checked_lockers = true; + locker_remains = remain != 0; + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * If the aborted xact is for update, then we need to reverify + * the tuple. + */ + if (upd_xact_aborted) + goto check_tup_satisfies_update; + + /* + * Also take care of cases when page is pruned after we + * release the buffer lock. For this we check if ItemId is not + * deleted and refresh the tuple offset position in page. If + * TID is already delete marked due to pruning, then get new + * ctid, so that we can update the new tuple. + * + * We also need to ensure that no new lockers have been added + * in the meantime, if there is any new locker, then start + * again. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *otid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + + if (ZHeapTupleHasMultiLockers(infomask)) + { + List *new_mlmembers; + new_mlmembers = ZGetMultiLockMembers(relation, &oldtup, + buffer, false); + + /* + * Ensure, no new lockers have been added, if so, then start + * again. 
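+ *
+ * We do that by comparing the member list taken before the wait with
+ * the one taken after reacquiring the buffer lock; any difference
+ * means a new locker sneaked in while the lock was released, and we
+ * restart from check_tup_satisfies_update.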
+ */ + if (!ZMultiLockMembersSame(mlmembers, new_mlmembers)) + { + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + goto check_tup_satisfies_update; + } + + any_multi_locker_member_alive = + ZIsAnyMultiLockMemberRunning(new_mlmembers, &oldtup, + buffer); + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + } + + /* + * xwait is done, but if xwait had just locked the tuple then some + * other xact could update this tuple before we get to this point. + * Check for xid change, and start over if so. + */ + ZHeapTupleGetTransInfo(&oldtup, buffer, NULL, NULL, ¤t_tup_xid, + NULL, NULL, false); + if (xid_infomask_changed(oldtup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + goto check_tup_satisfies_update; + } + else if (TransactionIdIsValid(update_xact)) + { + isAborted = TransactionIdDidAbort(update_xact); + + /* + * For aborted transaction, if the undo actions are not applied + * yet, then apply them before modifying the page. + */ + if (isAborted && + zheap_exec_pending_rollback(relation, buffer, + xwait_trans_slot, xwait)) + goto check_tup_satisfies_update; + } + + /* + * There was no UPDATE in the Multilockers. No + * TransactionIdIsInProgress() call needed here, since we called + * ZMultiLockMembersWait() above. + */ + if (!TransactionIdIsValid(update_xact)) + can_continue = true; + } + else if (TransactionIdIsCurrentTransactionId(xwait)) + { + /* + * The only locker is ourselves; we can avoid grabbing the tuple + * lock here, but must preserve our locking information. + */ + checked_lockers = true; + locker_remains = true; + can_continue = true; + } + else if (ZHEAP_XID_IS_KEYSHR_LOCKED(infomask) && key_intact) + { + /* + * If it's just a key-share locker, and we're not changing the key + * columns, we don't need to wait for it to end; but we need to + * preserve it as locker. + */ + checked_lockers = true; + locker_remains = true; + can_continue = true; + } + else + { + bool isCommitted; + bool has_update = false; + TransactionId current_tup_xid; + + /* + * Wait for regular transaction to end; but first, acquire tuple + * lock. + */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + heap_acquire_tuplock(relation, &(oldtup.t_self), *lockmode, + LockWaitBlock, &have_tuple_lock); + if (xwait_subxid != InvalidSubTransactionId) + SubXactLockTableWait(xwait, xwait_subxid, relation, + &oldtup.t_self, XLTW_Update); + else + XactLockTableWait(xwait, relation, &oldtup.t_self, + XLTW_Update); + checked_lockers = true; + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release the + * buffer lock. For this we check if ItemId is not deleted and refresh + * the tuple offset position in page. If TID is already delete marked + * due to pruning, then get new ctid, so that we can update the new + * tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *otid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + + /* + * xwait is done, but if xwait had just locked the tuple then some + * other xact could update/lock this tuple before we get to this + * point. Check for xid change, and start over if so. We need to + * do some special handling for lockers because their xid is never + * stored on the tuples. 
If there was a single locker on the + * tuple and that locker is gone and some new locker has locked + * the tuple, we won't be able to identify that by infomask/xid on + * the tuple, rather we need to fetch the locker xid. + */ + ZHeapTupleGetTransInfo(&oldtup, buffer, NULL, NULL, + ¤t_tup_xid, NULL, NULL, false); + if (xid_infomask_changed(oldtup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + { + if (ZHEAP_XID_IS_LOCKED_ONLY(oldtup.t_data->t_infomask) && + !ZHeapTupleHasMultiLockers(oldtup.t_data->t_infomask) && + TransactionIdIsValid(single_locker_xid)) + { + TransactionId current_single_locker_xid = InvalidTransactionId; + + (void) GetLockerTransInfo(relation, &oldtup, buffer, NULL, + NULL, ¤t_single_locker_xid, + NULL, NULL); + if (!TransactionIdEquals(single_locker_xid, + current_single_locker_xid)) + goto check_tup_satisfies_update; + + } + else + goto check_tup_satisfies_update; + } + + if (!ZHEAP_XID_IS_LOCKED_ONLY(oldtup.t_data->t_infomask)) + has_update = true; + + /* + * We may overwrite if previous xid is aborted, or if it is committed + * but only locked the tuple without updating it. + */ + isCommitted = TransactionIdDidCommit(xwait); + + /* + * For aborted transaction, if the undo actions are not applied + * yet, then apply them before modifying the page. + */ + if (!isCommitted) + zheap_exec_pending_rollback(relation, buffer, + xwait_trans_slot, xwait); + + /* + * For aborted updates, we must allow to reverify the tuple in + * case it's values got changed. + */ + if (!isCommitted && has_update) + goto check_tup_satisfies_update; + + if (!has_update) + can_continue = true; + } + + /* + * We may overwrite if previous xid is aborted or committed, but only + * locked the tuple without updating it. + */ + if (result != HeapTupleMayBeUpdated) + result = can_continue ? HeapTupleMayBeUpdated : HeapTupleUpdated; + } + else if (result == HeapTupleMayBeUpdated) + { + /* + * There is no active locker on the tuple, so we avoid grabbing + * the lock on new tuple. + */ + checked_lockers = true; + locker_remains = false; + } + else if (result == HeapTupleUpdated && + ZHeapTupleHasMultiLockers(oldtup.t_data->t_infomask)) + { + /* + * If a tuple is updated and is visible to our snapshot, we allow to update + * it; Else, we return HeapTupleUpdated and visit EvalPlanQual path to + * check whether the quals still match. In that path, we also lock the + * tuple so that nobody can update it before us. + * + * In ZHeapTupleSatisfiesUpdate, it's not possible to know if current + * transaction has already locked the tuple for update because of + * multilocker flag. In that case, we've to check whether the current + * transaction has already locked the tuple for update. + */ + + /* + * Get the transaction slot and undo record pointer if we are already in a + * transaction. + */ + trans_slot_id = PageGetTransactionSlotId(relation, buffer, epoch, xid, + &prev_urecptr, false, false, + NULL); + + if (trans_slot_id != InvalidXactSlotId) + { + List *mlmembers; + ListCell *lc; + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. 
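+ *
+ * For example, if a subtransaction of this transaction already holds
+ * LockTupleExclusive and we now need LockTupleNoKeyExclusive, the
+ * check below (mlmember->mode >= *lockmode) succeeds and we proceed
+ * without touching the heavyweight tuple lock.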
+ */ + mlmembers = ZGetMultiLockMembersForCurrentXact(&oldtup, + trans_slot_id, prev_urecptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. + */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= *lockmode) + { + result = HeapTupleMayBeUpdated; + + /* + * There is no other active locker on the tuple except + * current transaction id, so we can update the tuple. + */ + checked_lockers = true; + locker_remains = false; + break; + } + } + + list_free_deep(mlmembers); + } + + } + + if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated) + { + /* Perform additional check for transaction-snapshot mode RI updates */ + if (!ZHeapTupleSatisfies(&oldtup, crosscheck, buffer, NULL)) + result = HeapTupleUpdated; + } + +zheap_tuple_updated: + if (result != HeapTupleMayBeUpdated) + { + Assert(result == HeapTupleSelfUpdated || + result == HeapTupleUpdated || + result == HeapTupleBeingUpdated); + Assert(ItemIdIsDeleted(lp) || + IsZHeapTupleModified(oldtup.t_data->t_infomask)); + + /* If item id is deleted, tuple can't be marked as moved. */ + if (!ItemIdIsDeleted(lp) && + ZHeapTupleIsMoved(oldtup.t_data->t_infomask)) + ItemPointerSetMovedPartitions(&hufd->ctid); + else + hufd->ctid = ctid; + hufd->xmax = tup_xid; + if (result == HeapTupleSelfUpdated) + hufd->cmax = tup_cid; + else + hufd->cmax = InvalidCommandId; + UnlockReleaseBuffer(buffer); + hufd->in_place_updated_or_locked = in_place_updated_or_locked; + if (have_tuple_lock) + UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode); + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + bms_free(inplace_upd_attrs); + bms_free(key_attrs); + return result; + } + + /* Acquire subtransaction lock, if current transaction is a subtransaction. */ + if (IsSubTransaction()) + { + SubXactLockTableInsert(GetCurrentSubTransactionId()); + hasSubXactLock = true; + } + + /* + * If it is a non inplace update then check we have sufficient free space + * to insert in same page. If not try defragmentation and recheck the + * freespace again. + */ + if (!use_inplace_update && !is_index_updated && !need_toast) + { + bool pruned; + + /* Here, we pass delta space required to accomodate the new tuple. */ + pruned = zheap_page_prune_opt(relation, buffer, old_offnum, + (newtupsize - oldtupsize)); + + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + + /* + * Check if the non-inplace update is due to non-index update and we + * are able to perform pruning, then we must be able to perform + * inplace update. + */ + if (pruned) + use_inplace_update = true; + } + + max_offset = PageGetMaxOffsetNumber(BufferGetPage(buffer)); + pagefree = PageGetZHeapFreeSpace(page); + + /* + * Incase of the non in-place update we also need to + * reserve a map for the new tuple. + */ + if (!use_inplace_update) + max_offset += 1; + + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. 
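+ *
+ * If no slot is free right now, the code below releases the buffer
+ * lock, sleeps for 10 ms and revalidates the tuple from
+ * check_tup_satisfies_update before trying again.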
+ */ + trans_slot_id = PageReserveTransactionSlot(relation, buffer, max_offset, + epoch, xid, &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + goto check_tup_satisfies_update; + + if (trans_slot_id == InvalidXactSlotId) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release the + * buffer lock. For this we check if ItemId is not deleted and refresh + * the tuple offset position in page. If TID is already delete marked + * due to pruning, then get new ctid, so that we can update the new + * tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *otid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + + goto check_tup_satisfies_update; + } + + /* transaction slot must be reserved before adding tuple to page */ + Assert(trans_slot_id != InvalidXactSlotId); + + /* + * It's possible that tuple slot is now marked as frozen. Hence, we refetch + * the tuple here. + */ + Assert(!ItemIdIsDeleted(lp)); + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + + /* + * If the slot is marked as frozen, the latest modifier of the tuple must be + * frozen. + */ + if (ZHeapTupleHeaderGetXactSlot((ZHeapTupleHeader) (oldtup.t_data)) == ZHTUP_SLOT_FROZEN) + { + tup_trans_slot_id = ZHTUP_SLOT_FROZEN; + tup_xid = InvalidTransactionId; + } + + /* + * Save the xid that has updated the tuple to compute infomask for + * tuple. + */ + save_tup_xid = tup_xid; + + /* + * If the last transaction that has updated the tuple is already too + * old, then consider it as frozen which means it is all-visible. This + * ensures that we don't need to store epoch in the undo record to check + * if the undo tuple belongs to previous epoch and hence all-visible. See + * comments atop of file ztqual.c. + */ + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + if (TransactionIdPrecedes(tup_xid, oldestXidHavingUndo)) + { + tup_xid = FrozenTransactionId; + } + + /* + * updated tuple doesn't fit on current page or the toaster needs + * to be activated + */ + if ((!use_inplace_update && newtupsize > pagefree) || need_toast) + { + uint16 lock_old_infomask; + BlockNumber oldblk, newblk; + int slot_id; + + /* + * To prevent concurrent sessions from updating the tuple, we have to + * temporarily mark it locked, while we release the lock. + */ + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = relation->rd_id; + undorecord.uur_prevxid = tup_xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = cid; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = ItemPointerGetBlockNumber(&(oldtup.t_self)); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&(oldtup.t_self)); + + initStringInfo(&undorecord.uur_tuple); + initStringInfo(&undorecord.uur_payload); + + /* + * Here, we are storing old tuple header which is required to + * reconstruct the old copy of tuple. 
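+ *
+ * Note that only SizeofZHeapTupleHeader bytes go into this record: a
+ * pure lock never changes the tuple data, so only the header needs to
+ * be preserved (contrast with the full-tuple copy made for the update
+ * undo record further below).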
+ */ + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) oldtup.t_data, + SizeofZHeapTupleHeader); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) (lockmode), + sizeof(LockTupleMode)); + /* + * Store the transaction slot number for undo tuple in undo record, if + * the slot belongs to TPD entry. We can always get the current tuple's + * transaction slot number by referring offset->slot map in TPD entry, + * however that won't be true for tuple in undo. + */ + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + } + + /* + * Store subtransaction id in undo record. See SubXactLockTableWait + * to know why we need to store subtransaction id in undo. + */ + if (hasSubXactLock) + { + SubTransactionId subxid = GetCurrentSubTransactionId(); + + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SUBXACT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &subxid, + sizeof(subxid)); + } + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + &undometa); + + temp_infomask = oldtup.t_data->t_infomask; + + /* Compute the new xid and infomask to store into the tuple. */ + compute_new_xid_infomask(&oldtup, buffer, save_tup_xid, + tup_trans_slot_id, temp_infomask, + xid, trans_slot_id, single_locker_xid, + *lockmode, LockForUpdate, &lock_old_infomask, + &result_trans_slot_id); + + if (ZHeapTupleHasMultiLockers(lock_old_infomask)) + undorecord.uur_type = UNDO_XID_MULTI_LOCK_ONLY; + else + undorecord.uur_type = UNDO_XID_LOCK_FOR_UPDATE; + + START_CRIT_SECTION(); + + /* + * If all the members were lockers and are all gone, we can do away + * with the MULTI_LOCKERS bit. + */ + + if (ZHeapTupleHasMultiLockers(oldtup.t_data->t_infomask) && + !any_multi_locker_member_alive) + oldtup.t_data->t_infomask &= ~ZHEAP_MULTI_LOCKERS; + + InsertPreparedUndo(); + + /* + * We never set the locker slot on the tuple, so pass set_tpd_map_slot + * flag as false from the locker. From all other places it should + * always be passed as true so that the proper slot get set in the TPD + * offset map if its a TPD slot. + */ + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, + xid, urecptr, NULL, 0); + + ZHeapTupleHeaderSetXactSlot(oldtup.t_data, result_trans_slot_id); + + oldtup.t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + oldtup.t_data->t_infomask |= lock_old_infomask; + + /* Set prev_urecptr to the latest undo record in the slot. */ + prev_urecptr = urecptr; + + MarkBufferDirty(buffer); + + /* + * Do xlog stuff + */ + if (RelationNeedsWAL(relation)) + { + xl_zheap_lock xlrec; + xl_undo_header xlundohdr; + XLogRecPtr recptr; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + + /* + * Store the information required to generate undo record during + * replay. 
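+ *
+ * That is the reloid, the location of the undo record just prepared
+ * and the previous undo pointer of this block; redo uses these to
+ * regenerate an equivalent undo record.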
+ */ + xlundohdr.reloid = undorecord.uur_reloid; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = undorecord.uur_blkprev; + + xlrec.prev_xid = tup_xid; + xlrec.offnum = ItemPointerGetOffsetNumber(&(oldtup.t_self)); + xlrec.infomask = oldtup.t_data->t_infomask; + xlrec.trans_slot_id = result_trans_slot_id; + xlrec.flags = 0; + + if (result_trans_slot_id != trans_slot_id) + { + Assert(result_trans_slot_id == tup_trans_slot_id); + xlrec.flags |= XLZ_LOCK_TRANS_SLOT_FOR_UREC; + } + else if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + xlrec.flags |= XLZ_LOCK_CONTAINS_TPD_SLOT; + + if (hasSubXactLock) + xlrec.flags |= XLZ_LOCK_CONTAINS_SUBXACT; + if (undorecord.uur_type == UNDO_XID_LOCK_FOR_UPDATE) + xlrec.flags |= XLZ_LOCK_FOR_UPDATE; + +prepare_xlog: + /* LOG undolog meta if this is the first WAL after the checkpoint. */ + LogUndoMetaData(&undometa); + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + + XLogBeginInsert(); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + (void) RegisterTPDBuffer(page, 1); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xlrec, SizeOfZHeapLock); + + /* + * We always include old tuple header for undo in WAL record + * irrespective of full page image is taken or not. This is done + * since savings for not including a zheap tuple header are less + * compared to code complexity. However in future, if required we + * can do it similar to what we have done in zheap_update or + * zheap_delete. + */ + XLogRegisterData((char *) undorecord.uur_tuple.data, + SizeofZHeapTupleHeader); + XLogRegisterData((char *) (lockmode), sizeof(LockTupleMode)); + if (xlrec.flags & XLZ_LOCK_TRANS_SLOT_FOR_UREC) + XLogRegisterData((char *) &trans_slot_id, sizeof(trans_slot_id)); + else if (xlrec.flags & XLZ_LOCK_CONTAINS_TPD_SLOT) + XLogRegisterData((char *) &tup_trans_slot_id, sizeof(tup_trans_slot_id)); + + recptr = XLogInsertExtended(RM_ZHEAP_ID, XLOG_ZHEAP_LOCK, RedoRecPtr, + doPageWrites); + if (recptr == InvalidXLogRecPtr) + { + ResetRegisteredTPDBuffers(); + goto prepare_xlog; + } + + PageSetLSN(page, recptr); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + TPDPageSetLSN(page, recptr); + } + END_CRIT_SECTION(); + + pfree(undorecord.uur_tuple.data); + pfree(undorecord.uur_payload.data); + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + + /* + * Let the toaster do its thing, if needed. + * + * Note: below this point, zheaptup is the data we actually intend to + * store into the relation; newtup is the caller's original untoasted + * data. + */ + if (need_toast) + { + zheaptup = ztoast_insert_or_update(relation, newtup, &oldtup, 0); + newtupsize = SHORTALIGN(zheaptup->t_len); /* short aligned */ + } + else + zheaptup = newtup; +reacquire_buffer: + /* + * Get a new page for inserting tuple. We will need to acquire buffer + * locks on both old and new pages. See heap_update. + */ + if (BufferIsValid(vmbuffer_new)) + { + ReleaseBuffer(vmbuffer_new); + vmbuffer_new = InvalidBuffer; + } + + if (newtupsize > pagefree) + { + newbuf = RelationGetBufferForZTuple(relation, zheaptup->t_len, + buffer, 0, NULL, + &vmbuffer_new, &vmbuffer); + } + else + { + /* Re-acquire the lock on the old tuple's page. */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + /* Re-check using the up-to-date free space */ + pagefree = PageGetZHeapFreeSpace(page); + if (newtupsize > pagefree) + { + /* + * Rats, it doesn't fit anymore. 
We must now unlock and + * relock to avoid deadlock. Fortunately, this path should + * seldom be taken. + */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + newbuf = RelationGetBufferForZTuple(relation, zheaptup->t_len, + buffer, 0, NULL, + &vmbuffer_new, &vmbuffer); + } + else + { + /* OK, it fits here, so we're done. */ + newbuf = buffer; + } + } + + max_offset = PageGetMaxOffsetNumber(BufferGetPage(newbuf)); + oldblk = BufferGetBlockNumber(buffer); + newblk = BufferGetBlockNumber(newbuf); + + /* + * If we have got the new block than reserve the slot in same order in + * which buffers are locked (ascending). + */ + if (oldblk == newblk) + { + new_trans_slot_id = PageReserveTransactionSlot(relation, + newbuf, + max_offset + 1, + epoch, + xid, + &new_prev_urecptr, + &lock_reacquired); + /* + * We should get the same slot what we reserved previously because + * our transaction information should already be there. But, there + * is possibility that our slot might have moved to the TPD in such + * case we should get previous slot_no + 1. + */ + Assert((new_trans_slot_id == trans_slot_id) || + (ZHeapPageHasTPDSlot((PageHeader)page) && + new_trans_slot_id == trans_slot_id + 1)); + + trans_slot_id = new_trans_slot_id; + } + else if (oldblk < newblk) + { + slot_id = PageReserveTransactionSlot(relation, + buffer, + old_offnum, + epoch, + xid, + &prev_urecptr, + &lock_reacquired); + Assert((slot_id == trans_slot_id) || + (ZHeapPageHasTPDSlot((PageHeader)page) && + slot_id == trans_slot_id + 1)); + + trans_slot_id = slot_id; + + /* reserve the transaction slot on a new page */ + new_trans_slot_id = PageReserveTransactionSlot(relation, + newbuf, + max_offset + 1, + epoch, + xid, + &new_prev_urecptr, + &lock_reacquired); + } + else + { + /* reserve the transaction slot on a new page */ + new_trans_slot_id = PageReserveTransactionSlot(relation, + newbuf, + max_offset + 1, + epoch, + xid, + &new_prev_urecptr, + &lock_reacquired); + + /* reserve the transaction slot on a old page */ + slot_id = PageReserveTransactionSlot(relation, + buffer, + old_offnum, + epoch, + xid, + &prev_urecptr, + &lock_reacquired); + Assert((slot_id == trans_slot_id) || + (ZHeapPageHasTPDSlot((PageHeader)page) && + slot_id == trans_slot_id + 1)); + trans_slot_id = slot_id; + } + + if (lock_reacquired) + goto reacquire_buffer; + + if (new_trans_slot_id == InvalidXactSlotId) + { + /* release the new buffer and lock on old buffer */ + UnlockReleaseBuffer(newbuf); + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + UnlockReleaseTPDBuffers(); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + goto reacquire_buffer; + } + + /* + * After we release the lock on page, it could be pruned. As we have + * lock on the tuple, it couldn't be removed underneath us, but its + * position could be changes, so need to refresh the tuple position. + * + * XXX Though the length of the tuple wouldn't have changed, but there + * is no harm in refrehsing it for the sake of consistency of code. + */ + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + tup_trans_slot_id = trans_slot_id; + tup_xid = xid; + } + else + { + /* No TOAST work needed, and it'll fit on same page */ + newbuf = buffer; + new_trans_slot_id = trans_slot_id; + zheaptup = newtup; + } + + CheckForSerializableConflictIn(relation, &(oldtup.t_self), buffer); + + /* + * Prepare an undo record for old tuple. 
We need to separately store the + * latest transaction id that has changed the tuple to ensure that we + * don't try to process the tuple in undo chain that is already discarded. + * See GetTupleFromUndo. + */ + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = relation->rd_id; + undorecord.uur_prevxid = tup_xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = cid; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = ItemPointerGetBlockNumber(&(oldtup.t_self)); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&(oldtup.t_self)); + undorecord.uur_payload.len = 0; + + initStringInfo(&undorecord.uur_tuple); + + /* + * Copy the entire old tuple including it's header in the undo record. + * We need this to reconstruct the old tuple if current tuple is not + * visible to some other transaction. We choose to write the complete + * tuple in undo record for update operation so that we can reuse the + * space of old tuples for non-inplace-updates after the transaction + * performing the operation commits. + */ + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &oldtup.t_len, + sizeof(uint32)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &oldtup.t_self, + sizeof(ItemPointerData)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &oldtup.t_tableOid, + sizeof(Oid)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) oldtup.t_data, + oldtup.t_len); + + if (use_inplace_update) + { + bool hasPayload = false; + + undorecord.uur_type = UNDO_INPLACE_UPDATE; + + /* + * Store the transaction slot number for undo tuple in undo record, if + * the slot belongs to TPD entry. We can always get the current tuple's + * transaction slot number by referring offset->slot map in TPD entry, + * however that won't be true for tuple in undo. + */ + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + hasPayload = true; + } + + /* + * Store subtransaction id in undo record. See SubXactLockTableWait + * to know why we need to store subtransaction id in undo. + */ + if (hasSubXactLock) + { + SubTransactionId subxid = GetCurrentSubTransactionId(); + + if (!hasPayload) + { + initStringInfo(&undorecord.uur_payload); + hasPayload = true; + } + + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SUBXACT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &subxid, + sizeof(subxid)); + } + + if (!hasPayload) + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + &undometa); + } + else + { + Size payload_len; + UnpackedUndoRecord undorec[2]; + + undorecord.uur_type = UNDO_UPDATE; + + /* + * we need to initialize the length of payload before actually knowing + * the value to ensure that the required space is reserved in undo. + */ + payload_len = sizeof(ItemPointerData); + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + payload_len += sizeof(tup_trans_slot_id); + } + + /* + * Store subtransaction id in undo record. See SubXactLockTableWait + * to know why we need to store subtransaction id in undo. 
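+ *
+ * In short, a waiter that finds a subtransaction id in the undo can
+ * wait on that subtransaction's lock (SubXactLockTableWait) instead
+ * of the whole top-level transaction, as is done earlier in this
+ * function when xwait_subxid is valid.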
+ */ + if (hasSubXactLock) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SUBXACT; + payload_len += sizeof(SubTransactionId); + } + + undorecord.uur_payload.len = payload_len; + + /* prepare an undo record for new tuple */ + new_undorecord.uur_type = UNDO_INSERT; + new_undorecord.uur_info = 0; + new_undorecord.uur_prevlen = 0; + new_undorecord.uur_reloid = relation->rd_id; + new_undorecord.uur_prevxid = xid; + new_undorecord.uur_xid = xid; + new_undorecord.uur_cid = cid; + new_undorecord.uur_fork = MAIN_FORKNUM; + new_undorecord.uur_block = BufferGetBlockNumber(newbuf); + new_undorecord.uur_payload.len = 0; + new_undorecord.uur_tuple.len = 0; + + if (new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + new_undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + initStringInfo(&new_undorecord.uur_payload); + appendBinaryStringInfo(&new_undorecord.uur_payload, + (char *) &new_trans_slot_id, + sizeof(new_trans_slot_id)); + } + else + new_undorecord.uur_payload.len = 0; + + undorec[0] = undorecord; + undorec[1] = new_undorecord; + UndoSetPrepareSize(undorec, 2, InvalidTransactionId, + UndoPersistenceForRelation(relation), &undometa); + + /* copy updated record (uur_info might got updated )*/ + undorecord = undorec[0]; + new_undorecord = undorec[1]; + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + NULL); + + initStringInfo(&undorecord.uur_payload); + + /* Make more room for tuple location if needed */ + enlargeStringInfo(&undorecord.uur_payload, payload_len); + + if (buffer == newbuf) + new_undorecord.uur_blkprev = urecptr; + else + new_undorecord.uur_blkprev = new_prev_urecptr; + + new_urecptr = PrepareUndoInsert(&new_undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + NULL); + + /* Check and lock the TPD page before starting critical section. */ + CheckAndLockTPDPage(relation, new_trans_slot_id, trans_slot_id, + newbuf, buffer); + + } + + /* + * We can't rely on any_multi_locker_member_alive to clear the multi locker + * bit, if the the lock on the buffer is released inbetween. + */ + temp_infomask = oldtup.t_data->t_infomask; + + /* Compute the new xid and infomask to store into the tuple. */ + compute_new_xid_infomask(&oldtup, buffer, save_tup_xid, tup_trans_slot_id, + temp_infomask, xid, trans_slot_id, + single_locker_xid, *lockmode, ForUpdate, + &old_infomask, &result_trans_slot_id); + + /* + * There must not be any stronger locker than the current operation, + * otherwise it would have waited for it to finish. + */ + Assert(result_trans_slot_id == trans_slot_id); + + /* + * Propagate the lockers information to the new tuple. Since we're doing + * an update, the only possibility is that the lockers had FOR KEY SHARE + * lock. For in-place updates, we are not creating any new version, so + * we don't need to propagate anything. + */ + if ((checked_lockers && !locker_remains) || use_inplace_update) + new_infomask = 0; + else + { + /* + * We should also set the multilocker flag if it was there previously, + * else, we set the tuple as locked-only. 
+ */ + new_infomask = ZHEAP_XID_KEYSHR_LOCK; + if (ZHeapTupleHasMultiLockers(old_infomask)) + new_infomask |= ZHEAP_MULTI_LOCKERS | ZHEAP_XID_LOCK_ONLY; + else + new_infomask |= ZHEAP_XID_LOCK_ONLY; + } + + if (use_inplace_update) + { + infomask_old_tuple = infomask_new_tuple = + old_infomask | new_infomask | ZHEAP_INPLACE_UPDATED; + } + else + { + infomask_old_tuple = old_infomask | ZHEAP_UPDATED; + infomask_new_tuple = new_infomask; + } + + /* We must have a valid buffer. */ + Assert(BufferIsValid(vmbuffer)); + vm_status = visibilitymap_get_status(relation, + BufferGetBlockNumber(buffer), &vmbuffer); + + /* + * If the page is new, then there will no valid vmbuffer_new and the + * visisbilitymap is reset already, hence, need not to clear anything. + */ + if (newbuf != buffer && BufferIsValid(vmbuffer_new)) + vm_status_new = visibilitymap_get_status(relation, + BufferGetBlockNumber(newbuf), &vmbuffer_new); + + START_CRIT_SECTION(); + + if (buffer == newbuf) + { + /* + * If all the members were lockers and are all gone, we can do away + * with the MULTI_LOCKERS bit. + */ + if (ZHeapTupleHasMultiLockers(oldtup.t_data->t_infomask) && + !any_multi_locker_member_alive) + oldtup.t_data->t_infomask &= ~ZHEAP_MULTI_LOCKERS; + } + + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) || + (vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + all_visible_cleared = true; + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), + vmbuffer, VISIBILITYMAP_VALID_BITS); + } + + if (newbuf != buffer) + { + if ((vm_status_new & VISIBILITYMAP_ALL_VISIBLE) || + (vm_status_new & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + new_all_visible_cleared = true; + visibilitymap_clear(relation, BufferGetBlockNumber(newbuf), + vmbuffer_new, VISIBILITYMAP_VALID_BITS); + } + } + + /* + * A page can be pruned for non-inplace updates or inplace updates that + * results in shorter tuples. If this transaction commits, the tuple will + * become DEAD sooner or later. If the transaction finally aborts, the + * subsequent page pruning will be a no-op and the hint will be cleared. + */ + if (!use_inplace_update || (zheaptup->t_len < oldtup.t_len)) + ZPageSetPrunable(page, xid); + + /* oldtup should be pointing to right place in page */ + Assert(oldtup.t_data == (ZHeapTupleHeader) PageGetItem(page, lp)); + + ZHeapTupleHeaderSetXactSlot(oldtup.t_data, result_trans_slot_id); + oldtup.t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + oldtup.t_data->t_infomask |= infomask_old_tuple; + + /* keep the new tuple copy updated for the caller */ + ZHeapTupleHeaderSetXactSlot(zheaptup->t_data, new_trans_slot_id); + zheaptup->t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + zheaptup->t_data->t_infomask |= infomask_new_tuple; + + if (use_inplace_update) + { + /* + * For inplace updates, we copy the entire data portion including null + * bitmap of new tuple. + * + * For the special case where we are doing inplace updates even when + * the new tuple is bigger, we need to adjust the old tuple's location + * so that new tuple can be copied at that location as it is. + */ + ItemIdChangeLen(lp, zheaptup->t_len); + memcpy((char *) oldtup.t_data + SizeofZHeapTupleHeader, + (char *) zheaptup->t_data + SizeofZHeapTupleHeader, + zheaptup->t_len - SizeofZHeapTupleHeader); + + /* + * Copy everything from new tuple in infomask apart from visibility + * flags. 
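+ *
+ * The visibility bits (ZHEAP_VIS_STATUS_MASK) were computed above as
+ * infomask_old_tuple and must be kept as-is; only the remaining,
+ * data-describing bits are taken from the new tuple here.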
+ */ + oldtup.t_data->t_infomask = oldtup.t_data->t_infomask & + ZHEAP_VIS_STATUS_MASK; + oldtup.t_data->t_infomask |= (zheaptup->t_data->t_infomask & + ~ZHEAP_VIS_STATUS_MASK); + /* Copy number of attributes in infomask2 of new tuple. */ + oldtup.t_data->t_infomask2 &= ~ZHEAP_NATTS_MASK; + oldtup.t_data->t_infomask2 |= + newtup->t_data->t_infomask2 & ZHEAP_NATTS_MASK; + /* also update the tuple length and self pointer */ + oldtup.t_len = zheaptup->t_len; + oldtup.t_data->t_hoff = zheaptup->t_data->t_hoff; + ItemPointerCopy(&oldtup.t_self, &zheaptup->t_self); + } + else + { + /* insert tuple at new location */ + RelationPutZHeapTuple(relation, newbuf, zheaptup); + + /* update new tuple location in undo record */ + appendBinaryStringInfoNoExtend(&undorecord.uur_payload, + (char *) &zheaptup->t_self, + sizeof(ItemPointerData)); + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + appendBinaryStringInfoNoExtend(&undorecord.uur_payload, + (char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + if (hasSubXactLock) + { + SubTransactionId subxid = GetCurrentSubTransactionId(); + + appendBinaryStringInfoNoExtend(&undorecord.uur_payload, + (char *) &subxid, + sizeof(subxid)); + } + + new_undorecord.uur_offset = ItemPointerGetOffsetNumber(&(zheaptup->t_self)); + } + + InsertPreparedUndo(); + if (use_inplace_update) + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, + xid, urecptr, NULL, 0); + else + { + if (newbuf == buffer) + { + OffsetNumber usedoff[2]; + + usedoff[0] = undorecord.uur_offset; + usedoff[1] = new_undorecord.uur_offset; + + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, + xid, new_urecptr, usedoff, 2); + } + else + { + /* set transaction slot information for old page */ + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, + xid, urecptr, NULL, 0); + /* set transaction slot information for new page */ + PageSetUNDO(new_undorecord, + newbuf, + new_trans_slot_id, + true, + epoch, + xid, + new_urecptr, + NULL, + 0); + + MarkBufferDirty(newbuf); + } + } + + MarkBufferDirty(buffer); + + /* XLOG stuff */ + if (RelationNeedsWAL(relation)) + { + /* + * For logical decoding we need combocids to properly decode the + * catalog. + */ + if (RelationIsAccessibleInLogicalDecoding(relation)) + { + /* + * Fixme: This won't work as it needs to access cmin/cmax which + * we probably needs to retrieve from UNDO. + */ + /*log_heap_new_cid(relation, &oldtup); + log_heap_new_cid(relation, heaptup);*/ + } + + log_zheap_update(relation, undorecord, new_undorecord, + urecptr, new_urecptr, buffer, newbuf, + &oldtup, zheaptup, tup_trans_slot_id, + trans_slot_id, new_trans_slot_id, + use_inplace_update, all_visible_cleared, + new_all_visible_cleared, &undometa); + } + + END_CRIT_SECTION(); + + /* be tidy */ + pfree(undorecord.uur_tuple.data); + if (undorecord.uur_payload.len > 0) + pfree(undorecord.uur_payload.data); + + if (!use_inplace_update && new_undorecord.uur_payload.len > 0) + pfree(new_undorecord.uur_payload.data); + + if (newbuf != buffer) + LockBuffer(newbuf, BUFFER_LOCK_UNLOCK); + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + /* + * Fixme - need to support cache invalidation API's for zheaptuples. 
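+ *
+ * The heap equivalent, CacheInvalidateHeapTuple, is shown commented
+ * out just below.  Presumably this only starts to matter once system
+ * catalogs can live in zheap; as noted earlier in this function, they
+ * are always stored in heap for now.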
+ */ + /* CacheInvalidateHeapTuple(relation, &oldtup, heaptup); */ + + if (BufferIsValid(vmbuffer_new)) + ReleaseBuffer(vmbuffer_new); + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + if (newbuf != buffer) + ReleaseBuffer(newbuf); + ReleaseBuffer(buffer); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + + /* + * Release the lmgr tuple lock, if we had it. + */ + if (have_tuple_lock) + UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode); + + /* + * As of now, we only count non-inplace updates as that are required to + * decide whether to trigger autovacuum. + */ + if (!use_inplace_update) + pgstat_count_heap_update(relation, false); + else + pgstat_count_zheap_update(relation); + + /* + * If heaptup is a private copy, release it. Don't forget to copy t_self + * back to the caller's image, too. + */ + if (zheaptup != newtup) + { + newtup->t_self = zheaptup->t_self; + zheap_freetuple(zheaptup); + } + bms_free(inplace_upd_attrs); + bms_free(interesting_attrs); + bms_free(modified_attrs); + + bms_free(key_attrs); + return HeapTupleMayBeUpdated; +} + +/* + * log_zheap_update - Perform XLogInsert for a zheap-update operation. + * + * We need to store enough information in the WAL record so that undo records + * can be regenerated at the WAL replay time. + * + * Caller must already have modified the buffer(s) and marked them dirty. + */ +static void +log_zheap_update(Relation reln, UnpackedUndoRecord undorecord, + UnpackedUndoRecord newundorecord, UndoRecPtr urecptr, + UndoRecPtr newurecptr, Buffer oldbuf, Buffer newbuf, + ZHeapTuple oldtup, ZHeapTuple newtup, + int old_tup_trans_slot_id, int trans_slot_id, + int new_trans_slot_id, bool inplace_update, + bool all_visible_cleared, bool new_all_visible_cleared, + xl_undolog_meta *undometa) +{ + xl_undo_header xlundohdr, + xlnewundohdr; + xl_zheap_header xlundotuphdr, + xlhdr; + xl_zheap_update xlrec; + ZHeapTuple difftup; + ZHeapTupleHeader zhtuphdr; + uint16 prefix_suffix[2]; + uint16 prefixlen = 0, + suffixlen = 0; + XLogRecPtr recptr; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + char *oldp = NULL; + char *newp = NULL; + int oldlen, newlen; + uint32 totalundotuplen; + Size dataoff; + int bufflags = REGBUF_STANDARD; + uint8 info = XLOG_ZHEAP_UPDATE; + + totalundotuplen = *((uint32 *) &undorecord.uur_tuple.data[0]); + dataoff = sizeof(uint32) + sizeof(ItemPointerData) + sizeof(Oid); + zhtuphdr = (ZHeapTupleHeader) &undorecord.uur_tuple.data[dataoff]; + + if (inplace_update) + { + /* + * For inplace updates the old tuple is in undo record and the + * new tuple is replaced in page where old tuple was present. + */ + oldp = (char *) zhtuphdr + zhtuphdr->t_hoff; + oldlen = totalundotuplen - zhtuphdr->t_hoff; + newp = (char *) oldtup->t_data + oldtup->t_data->t_hoff; + newlen = oldtup->t_len - oldtup->t_data->t_hoff; + + difftup = oldtup; + } + else if (oldbuf == newbuf) + { + oldp = (char *) oldtup->t_data + oldtup->t_data->t_hoff; + oldlen = oldtup->t_len - oldtup->t_data->t_hoff; + newp = (char *) newtup->t_data + newtup->t_data->t_hoff; + newlen = newtup->t_len - newtup->t_data->t_hoff; + + difftup = newtup; + } + else + { + difftup = newtup; + } + + /* + * See log_heap_update to know under what some circumstances we can use + * prefix-suffix compression. 
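+ *
+ * The idea is that when the old and new versions live on the same
+ * page and no full-page image will be taken, only the bytes that
+ * actually differ need to be logged.  The loops below compute the
+ * common prefix and suffix and keep them only when they save at
+ * least 3 bytes, since storing each length costs 2 bytes.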
+ */ + if (oldbuf == newbuf && !XLogCheckBufferNeedsBackup(newbuf)) + { + Assert(oldp != NULL && newp != NULL); + + /* Check for common prefix between undo and old tuple */ + for (prefixlen = 0; prefixlen < Min(oldlen, newlen); prefixlen++) + { + if (oldp[prefixlen] != newp[prefixlen]) + break; + } + + /* + * Storing the length of the prefix takes 2 bytes, so we need to save + * at least 3 bytes or there's no point. + */ + if (prefixlen < 3) + prefixlen = 0; + + /* Same for suffix */ + for (suffixlen = 0; suffixlen < Min(oldlen, newlen) - prefixlen; suffixlen++) + { + if (oldp[oldlen - suffixlen - 1] != newp[newlen - suffixlen - 1]) + break; + } + if (suffixlen < 3) + suffixlen = 0; + } + + /* + * Store the information required to generate undo record during + * replay. + */ + xlundohdr.reloid = undorecord.uur_reloid; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = undorecord.uur_blkprev; + + xlrec.prevxid = undorecord.uur_prevxid; + xlrec.old_offnum = ItemPointerGetOffsetNumber(&oldtup->t_self); + xlrec.old_infomask = oldtup->t_data->t_infomask; + xlrec.old_trans_slot_id = trans_slot_id; + xlrec.new_offnum = ItemPointerGetOffsetNumber(&difftup->t_self); + xlrec.flags = 0; + if (all_visible_cleared) + xlrec.flags |= XLZ_UPDATE_OLD_ALL_VISIBLE_CLEARED; + if (new_all_visible_cleared) + xlrec.flags |= XLZ_UPDATE_NEW_ALL_VISIBLE_CLEARED; + if (prefixlen > 0) + xlrec.flags |= XLZ_UPDATE_PREFIX_FROM_OLD; + if (suffixlen > 0) + xlrec.flags |= XLZ_UPDATE_SUFFIX_FROM_OLD; + if (undorecord.uur_info & UREC_INFO_PAYLOAD_CONTAINS_SUBXACT) + xlrec.flags |= XLZ_UPDATE_CONTAINS_SUBXACT; + + if (!inplace_update) + { + Page page = BufferGetPage(newbuf); + + xlrec.flags |= XLZ_NON_INPLACE_UPDATE; + + xlnewundohdr.reloid = newundorecord.uur_reloid; + xlnewundohdr.urec_ptr = newurecptr; + xlnewundohdr.blkprev = newundorecord.uur_blkprev; + + Assert(newtup); + /* If new tuple is the single and first tuple on page... */ + if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber && + PageGetMaxOffsetNumber(page) == FirstOffsetNumber) + { + info |= XLOG_ZHEAP_INIT_PAGE; + bufflags |= REGBUF_WILL_INIT; + } + } + + /* + * If full_page_writes is enabled, and the buffer image is not + * included in the WAL then we can rely on the tuple in the page to + * regenerate the undo tuple during recovery. For detail comments related + * to handling of full_page_writes get changed at run time, refer comments + * in zheap_delete. + */ +prepare_xlog: + /* LOG undolog meta if this is the first WAL after the checkpoint. 
*/ + LogUndoMetaData(undometa); + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + if (!doPageWrites || XLogCheckBufferNeedsBackup(oldbuf)) + { + xlrec.flags |= XLZ_HAS_UPDATE_UNDOTUPLE; + + xlundotuphdr.t_infomask2 = zhtuphdr->t_infomask2; + xlundotuphdr.t_infomask = zhtuphdr->t_infomask; + xlundotuphdr.t_hoff = zhtuphdr->t_hoff; + } + + XLogBeginInsert(); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xlrec, SizeOfZHeapUpdate); + if (old_tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + xlrec.flags |= XLZ_UPDATE_OLD_CONTAINS_TPD_SLOT; + XLogRegisterData((char *) &old_tup_trans_slot_id, + sizeof(old_tup_trans_slot_id)); + } + if (!inplace_update) + { + XLogRegisterData((char *) &xlnewundohdr, SizeOfUndoHeader); + if (new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + xlrec.flags |= XLZ_UPDATE_NEW_CONTAINS_TPD_SLOT; + XLogRegisterData((char *) &new_trans_slot_id, + sizeof(new_trans_slot_id)); + } + } + if (xlrec.flags & XLZ_HAS_UPDATE_UNDOTUPLE) + { + XLogRegisterData((char *) &xlundotuphdr, SizeOfZHeapHeader); + /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */ + XLogRegisterData((char *) zhtuphdr + SizeofZHeapTupleHeader, + totalundotuplen - SizeofZHeapTupleHeader); + } + + XLogRegisterBuffer(0, newbuf, bufflags); + if (oldbuf != newbuf) + { + uint8 block_id; + + XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD); + block_id = 2; + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + block_id = RegisterTPDBuffer(BufferGetPage(oldbuf), block_id); + if (new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + RegisterTPDBuffer(BufferGetPage(newbuf), block_id); + } + else + { + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + /* + * Block id '1' is reserved for oldbuf if that is different from + * newbuf. + */ + RegisterTPDBuffer(BufferGetPage(oldbuf), 2); + } + } + + /* + * Prepare WAL data for the new tuple. + */ + if (prefixlen > 0 || suffixlen > 0) + { + if (prefixlen > 0 && suffixlen > 0) + { + prefix_suffix[0] = prefixlen; + prefix_suffix[1] = suffixlen; + XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2); + } + else if (prefixlen > 0) + { + XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16)); + } + else + { + XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16)); + } + } + + xlhdr.t_infomask2 = difftup->t_data->t_infomask2; + xlhdr.t_infomask = difftup->t_data->t_infomask; + xlhdr.t_hoff = difftup->t_data->t_hoff; + Assert(SizeofZHeapTupleHeader + prefixlen + suffixlen <= difftup->t_len); + + /* + * PG73FORMAT: write bitmap [+ padding] [+ oid] + data + * + * The 'data' doesn't include the common prefix or suffix. + */ + XLogRegisterBufData(0, (char *) &xlhdr, SizeOfZHeapHeader); + if (prefixlen == 0) + { + XLogRegisterBufData(0, + ((char *) difftup->t_data) + SizeofZHeapTupleHeader, + difftup->t_len - SizeofZHeapTupleHeader - suffixlen); + } + else + { + /* + * Have to write the null bitmap and data after the common prefix as + * two separate rdata entries. 
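+ *
+ * The bitmap part (everything between SizeofZHeapTupleHeader and
+ * t_hoff) is registered first, followed by the tuple data that comes
+ * after the common prefix, excluding the common suffix.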
+ */ + /* bitmap [+ padding] [+ oid] */ + if (difftup->t_data->t_hoff - SizeofZHeapTupleHeader > 0) + { + XLogRegisterBufData(0, + ((char *) difftup->t_data) + SizeofZHeapTupleHeader, + difftup->t_data->t_hoff - SizeofZHeapTupleHeader); + } + + /* data after common prefix */ + XLogRegisterBufData(0, + ((char *) difftup->t_data) + difftup->t_data->t_hoff + prefixlen, + difftup->t_len - difftup->t_data->t_hoff - prefixlen - suffixlen); + } + + /* filtering by origin on a row level is much more efficient */ + XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN); + + recptr = XLogInsertExtended(RM_ZHEAP_ID, info, RedoRecPtr, doPageWrites); + if (recptr == InvalidXLogRecPtr) + { + ResetRegisteredTPDBuffers(); + goto prepare_xlog; + } + + if (newbuf != oldbuf) + { + PageSetLSN(BufferGetPage(newbuf), recptr); + if (new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + TPDPageSetLSN(BufferGetPage(newbuf), recptr); + } + PageSetLSN(BufferGetPage(oldbuf), recptr); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + TPDPageSetLSN(BufferGetPage(oldbuf), recptr); +} + +/* + * zheap_lock_tuple - lock a tuple. + * + * The functionality is same as heap_lock_tuple except that here we always + * make a copy of the tuple before returning to the caller. We maintain + * the pin on buffer to keep the specs same as heap_lock_tuple. + * + * eval - indicates whether the tuple will be evaluated to see if it still + * matches the qualification. + * + * XXX - Here, we are purposefully not doing anything for visibility map + * as it is not clear whether we ever need all_frozen kind of concept for + * zheap. + */ +HTSU_Result +zheap_lock_tuple(Relation relation, ItemPointer tid, + CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy, + bool follow_updates, bool eval, Snapshot snapshot, + ZHeapTuple tuple, Buffer *buffer, HeapUpdateFailureData *hufd) +{ + HTSU_Result result; + ZHeapTupleData zhtup; + UndoRecPtr prev_urecptr; + ItemId lp; + Page page; + ItemPointerData ctid; + TransactionId xid, + tup_xid, + single_locker_xid; + SubTransactionId tup_subxid = InvalidSubTransactionId; + CommandId tup_cid; + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + uint32 epoch; + int tup_trans_slot_id, + trans_slot_id, + single_locker_trans_slot; + OffsetNumber offnum; + LockOper lockopr; + bool require_sleep; + bool have_tuple_lock = false; + bool in_place_updated_or_locked = false; + bool any_multi_locker_member_alive = false; + bool lock_reacquired; + bool rollback_and_relocked; + + xid = GetTopTransactionId(); + epoch = GetEpochForXid(xid); + lockopr = eval ? LockForUpdate : LockOnly; + + *buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid)); + + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(*buffer); + offnum = ItemPointerGetOffsetNumber(tid); + lp = PageGetItemId(page, offnum); + Assert(ItemIdIsNormal(lp) || ItemIdIsDeleted(lp)); + + /* + * If TID is already delete marked due to pruning, then get new ctid, so + * that we can lock the new tuple. We will get new ctid if the tuple + * was non-inplace-updated otherwise we will get same TID. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + zhtup.t_tableOid = RelationGetRelid(relation); + zhtup.t_self = *tid; + + /* + * Get the transaction slot and undo record pointer if we are already in a + * transaction. 
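+ *
+ * If this backend already has a slot on the page, urec_ptr should
+ * point at our latest undo record for it; that is what lets the
+ * multi-locker check below recognize locks we already hold.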
+ */ + trans_slot_id = PageGetTransactionSlotId(relation, *buffer, epoch, xid, + &urec_ptr, false, false, NULL); + + /* + * ctid needs to be fetched from undo chain. See zheap_update. + */ + ctid = *tid; + +check_tup_satisfies_update: + any_multi_locker_member_alive = true; + result = ZHeapTupleSatisfiesUpdate(relation, &zhtup, cid, *buffer, &ctid, + &tup_trans_slot_id, &tup_xid, &tup_subxid, + &tup_cid, &single_locker_xid, + &single_locker_trans_slot, false, eval, + snapshot, &in_place_updated_or_locked); + if (result == HeapTupleInvisible) + { + /* ZBORKED? this previously didn't set up a tuple */ + tuple->t_tableOid = RelationGetRelid(relation); + tuple->t_len = zhtup.t_len; + tuple->t_self = zhtup.t_self; + tuple->t_data = palloc0(tuple->t_len); + memcpy(tuple->t_data, zhtup.t_data, zhtup.t_len); + + /* Give caller an opportunity to throw a more specific error. */ + result = HeapTupleInvisible; + goto out_locked; + } + else if (result == HeapTupleBeingUpdated || + result == HeapTupleUpdated || + (result == HeapTupleMayBeUpdated && + ZHeapTupleHasMultiLockers(zhtup.t_data->t_infomask))) + { + TransactionId xwait; + SubTransactionId xwait_subxid; + int xwait_trans_slot; + uint16 infomask; + + xwait_subxid = tup_subxid; + + if (TransactionIdIsValid(single_locker_xid)) + { + xwait = single_locker_xid; + xwait_trans_slot = single_locker_trans_slot; + } + else + { + xwait = tup_xid; + xwait_trans_slot = tup_trans_slot_id; + } + + infomask = zhtup.t_data->t_infomask; + + /* + * make a copy of the tuple before releasing the lock as some other + * backend can perform in-place update this tuple once we release the + * lock. + */ + tuple->t_tableOid = RelationGetRelid(relation); + tuple->t_len = zhtup.t_len; + tuple->t_self = zhtup.t_self; + tuple->t_data = palloc0(tuple->t_len); + memcpy(tuple->t_data, zhtup.t_data, zhtup.t_len); + + LockBuffer(*buffer, BUFFER_LOCK_UNLOCK); + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. + */ + if (ZHeapTupleHasMultiLockers(infomask)) + { + List *mlmembers; + ListCell *lc; + + if (trans_slot_id != InvalidXactSlotId) + { + mlmembers = ZGetMultiLockMembersForCurrentXact(&zhtup, + trans_slot_id, urec_ptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. 
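+ *
+ * ZGetMultiLockMembersForCurrentXact presumably walks only the
+ * current transaction's undo chain starting at urec_ptr, so every
+ * member it returns belongs to us; the Assert below double-checks
+ * that.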
+ */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= mode) + { + list_free_deep(mlmembers); + result = HeapTupleMayBeUpdated; + goto out_unlocked; + } + } + + list_free_deep(mlmembers); + } + } + else if (TransactionIdIsCurrentTransactionId(xwait)) + { + switch (mode) + { + case LockTupleKeyShare: + Assert(ZHEAP_XID_IS_KEYSHR_LOCKED(infomask) || + ZHEAP_XID_IS_SHR_LOCKED(infomask) || + ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(infomask)); + { + result = HeapTupleMayBeUpdated; + goto out_unlocked; + } + break; + case LockTupleShare: + if (ZHEAP_XID_IS_SHR_LOCKED(infomask) || + ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(infomask)) + { + result = HeapTupleMayBeUpdated; + goto out_unlocked; + } + break; + case LockTupleNoKeyExclusive: + if (ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(infomask)) + { + result = HeapTupleMayBeUpdated; + goto out_unlocked; + } + break; + case LockTupleExclusive: + if (ZHEAP_XID_IS_EXCL_LOCKED(infomask)) + { + result = HeapTupleMayBeUpdated; + goto out_unlocked; + } + break; + } + } + + /* + * Initially assume that we will have to wait for the locking + * transaction(s) to finish. We check various cases below in which + * this can be turned off. + */ + require_sleep = true; + if (mode == LockTupleKeyShare) + { + if (!(ZHEAP_XID_IS_EXCL_LOCKED(infomask))) + { + bool updated; + + updated = !ZHEAP_XID_IS_LOCKED_ONLY(infomask); + + /* + * If there are updates, follow the update chain; bail out if + * that cannot be done. + */ + if (follow_updates && updated) + { + if (!ZHeapTupleIsMoved(zhtup.t_data->t_infomask) && + !ItemPointerEquals(&zhtup.t_self, &ctid)) + { + HTSU_Result res; + + res = zheap_lock_updated_tuple(relation, &zhtup, &ctid, + xid, mode, lockopr, cid, + &rollback_and_relocked); + + /* + * If the update was by some aborted transaction and its + * pending undo actions are applied now, then check the + * latest copy of the tuple. + */ + if (rollback_and_relocked) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto check_tup_satisfies_update; + } + else if (res != HeapTupleMayBeUpdated) + { + result = res; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + } + } + + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release + * the buffer lock. For this we check if ItemId is not deleted + * and refresh the tuple offset position in page. If TID is + * already delete marked due to pruning, then get new ctid, so + * that we can lock the new tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + /* + * Make sure it's still an appropriate lock, else start over. + * Also, if it wasn't updated before we released the lock, but + * is updated now, we start over too; the reason is that we + * now need to follow the update chain to lock the new + * versions. + */ + if (!(ZHEAP_XID_IS_LOCKED_ONLY(zhtup.t_data->t_infomask)) && + ((ZHEAP_XID_IS_EXCL_LOCKED(zhtup.t_data->t_infomask)) || + !updated)) + goto check_tup_satisfies_update; + + /* Skip sleeping */ + require_sleep = false; + + /* + * Note we allow Xid to change here; other updaters/lockers + * could have modified it before we grabbed the buffer lock. 
+ * However, this is not a problem, because with the recheck we + * just did we ensure that they still don't conflict with the + * lock we want. + */ + } + } + else if (mode == LockTupleShare) + { + /* + * If we're requesting Share, we can similarly avoid sleeping if + * there's no update and no exclusive lock present. + */ + if (ZHEAP_XID_IS_LOCKED_ONLY(infomask) && + !ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) && + !ZHEAP_XID_IS_EXCL_LOCKED(infomask)) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release + * the buffer lock. For this we check if ItemId is not deleted + * and refresh the tuple offset position in page. If TID is + * already delete marked due to pruning, then get new ctid, so + * that we can lock the new tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + /* + * Make sure it's still an appropriate lock, else start over. + * See above about allowing xid to change. + */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(zhtup.t_data->t_infomask) || + ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(zhtup.t_data->t_infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(zhtup.t_data->t_infomask)) + goto check_tup_satisfies_update; + + /* Skip sleeping */ + require_sleep = false; + } + } + else if (mode == LockTupleNoKeyExclusive) + { + LockTupleMode old_lock_mode; + TransactionId current_tup_xid; + bool buf_lock_reacquired = false; + + old_lock_mode = get_old_lock_mode(infomask); + + /* + * If we're requesting NoKeyExclusive, we might also be able to + * avoid sleeping; just ensure that there is no conflicting lock + * already acquired. + */ + if (ZHeapTupleHasMultiLockers(infomask)) + { + if (!DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_lock_mode), + HWLOCKMODE_from_locktupmode(mode))) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + buf_lock_reacquired = true; + } + } + else if (old_lock_mode == LockTupleKeyShare) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + buf_lock_reacquired = true; + } + + if (buf_lock_reacquired) + { + /* + * Also take care of cases when page is pruned after we release + * the buffer lock. For this we check if ItemId is not deleted + * and refresh the tuple offset position in page. If TID is + * already delete marked due to pruning, then get new ctid, so + * that we can lock the new tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + ZHeapTupleGetTransInfo(&zhtup, *buffer, NULL, NULL, ¤t_tup_xid, + NULL, NULL, false); + + if (xid_infomask_changed(zhtup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + goto check_tup_satisfies_update; + /* Skip sleeping */ + require_sleep = false; + } + } + + /* + * As a check independent from those above, we can also avoid sleeping + * if the current transaction is the sole locker of the tuple. Note + * that the strength of the lock already held is irrelevant; this is + * not about recording the lock (which will be done regardless of this + * optimization, below). Also, note that the cases where we hold a + * lock stronger than we are requesting are already handled above + * by not doing anything. 
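+ * So, if we are the sole locker, all that is left to do below is to
+ * re-acquire the buffer lock, verify that the tuple has not been pruned
+ * or re-stamped in the meantime, and then skip the sleep.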
+ */ + if (require_sleep && + !ZHeapTupleHasMultiLockers(infomask) && + TransactionIdIsCurrentTransactionId(xwait)) + { + TransactionId current_tup_xid; + + /* + * ... but if the xid changed in the meantime, start over + * + * Also take care of cases when page is pruned after we release + * the buffer lock. For this we check if ItemId is not deleted and + * refresh the tuple offset position in page. If TID is already + * delete marked due to pruning, then get new ctid, so that we can + * lock the new tuple. + */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + ZHeapTupleGetTransInfo(&zhtup, *buffer, NULL, NULL, ¤t_tup_xid, + NULL, NULL, false); + if (xid_infomask_changed(zhtup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + goto check_tup_satisfies_update; + require_sleep = false; + } + + if (require_sleep && result == HeapTupleUpdated) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + else if (require_sleep) + { + List *mlmembers = NIL; + bool upd_xact_aborted = false; + TransactionId current_tup_xid; + + /* + * Acquire tuple lock to establish our priority for the tuple, or + * die trying. LockTuple will release us when we are next-in-line + * for the tuple. We must do this even if we are share-locking. + * + * If we are forced to "start over" below, we keep the tuple lock; + * this arranges that we stay at the head of the line while + * rechecking tuple state. + */ + if (!heap_acquire_tuplock(relation, tid, mode, wait_policy, + &have_tuple_lock)) + { + /* + * This can only happen if wait_policy is Skip and the lock + * couldn't be obtained. + */ + result = HeapTupleWouldBlock; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + + if (ZHeapTupleHasMultiLockers(infomask)) + { + LockTupleMode old_lock_mode; + TransactionId update_xact; + + old_lock_mode = get_old_lock_mode(infomask); + + /* + * For aborted updates, we must allow to reverify the tuple in + * case it's values got changed. + */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(infomask)) + ZHeapTupleGetTransInfo(&zhtup, *buffer, NULL, NULL, &update_xact, + NULL, NULL, true); + else + update_xact = InvalidTransactionId; + + if (DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_lock_mode), + HWLOCKMODE_from_locktupmode(mode))) + { + /* + * There is a potential conflict. It is quite possible + * that by this time the locker has already been committed. + * So we need to check for conflict with all the possible + * lockers and wait for each of them. 
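+ * How we wait depends on the wait policy: block until each conflicting
+ * member finishes, return HeapTupleWouldBlock for LockWaitSkip, or
+ * raise an error for LockWaitError.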
+ */ + mlmembers = ZGetMultiLockMembers(relation, &zhtup, + *buffer, true); + + /* wait for multixact to end, or die trying */ + switch (wait_policy) + { + case LockWaitBlock: + ZMultiLockMembersWait(relation, mlmembers, &zhtup, + *buffer, update_xact, mode, + false, XLTW_Lock, NULL, + &upd_xact_aborted); + break; + case LockWaitSkip: + if (!ConditionalZMultiLockMembersWait(relation, + mlmembers, + *buffer, + update_xact, + mode, + NULL, + &upd_xact_aborted)) + { + result = HeapTupleWouldBlock; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + break; + case LockWaitError: + if (!ConditionalZMultiLockMembersWait(relation, + mlmembers, + *buffer, + update_xact, + mode, + NULL, + &upd_xact_aborted)) + ereport(ERROR, + (errcode(ERRCODE_LOCK_NOT_AVAILABLE), + errmsg("could not obtain lock on row in relation \"%s\"", + RelationGetRelationName(relation)))); + + break; + } + } + } + else + { + /* wait for regular transaction to end, or die trying */ + switch (wait_policy) + { + case LockWaitBlock: + { + if (xwait_subxid != InvalidSubTransactionId) + SubXactLockTableWait(xwait, xwait_subxid, relation, + &zhtup.t_self, XLTW_Lock); + else + XactLockTableWait(xwait, relation, &zhtup.t_self, + XLTW_Lock); + } + break; + case LockWaitSkip: + if (xwait_subxid != InvalidSubTransactionId) + { + if (!ConditionalSubXactLockTableWait(xwait, xwait_subxid)) + { + result = HeapTupleWouldBlock; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + } + else if (!ConditionalXactLockTableWait(xwait)) + { + result = HeapTupleWouldBlock; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + break; + case LockWaitError: + if (xwait_subxid != InvalidSubTransactionId) + { + if (!ConditionalSubXactLockTableWait(xwait, xwait_subxid)) + ereport(ERROR, + (errcode(ERRCODE_LOCK_NOT_AVAILABLE), + errmsg("could not obtain lock on row in relation \"%s\"", + RelationGetRelationName(relation)))); + } + else if (!ConditionalXactLockTableWait(xwait)) + ereport(ERROR, + (errcode(ERRCODE_LOCK_NOT_AVAILABLE), + errmsg("could not obtain lock on row in relation \"%s\"", + RelationGetRelationName(relation)))); + break; + } + } + + /* if there are updates, follow the update chain */ + if (follow_updates && !ZHEAP_XID_IS_LOCKED_ONLY(infomask)) + { + HTSU_Result res; + + if (!ZHeapTupleIsMoved(zhtup.t_data->t_infomask) && + !ItemPointerEquals(&zhtup.t_self, &ctid)) + { + res = zheap_lock_updated_tuple(relation, &zhtup, &ctid, + xid, mode, lockopr, cid, + &rollback_and_relocked); + + /* + * If the update was by some aborted transaction and its + * pending undo actions are applied now, then check the + * latest copy of the tuple. + */ + if (rollback_and_relocked) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto check_tup_satisfies_update; + } + else if (res != HeapTupleMayBeUpdated) + { + result = res; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + } + } + + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release + * the buffer lock. For this we check if ItemId is not deleted and + * refresh the tuple offset position in page. If TID is already + * delete marked due to pruning, then get new ctid, so that we can + * lock the new tuple. 
+ */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + if (ZHeapTupleHasMultiLockers(infomask)) + { + List *new_mlmembers; + + /* + * If the aborted xact is for update, then we need to reverify + * the tuple. + */ + if (upd_xact_aborted) + goto check_tup_satisfies_update; + + new_mlmembers = ZGetMultiLockMembers(relation, &zhtup, + *buffer, false); + + /* + * Ensure, no new lockers have been added, if so, then start + * again. + */ + if (!ZMultiLockMembersSame(mlmembers, new_mlmembers)) + { + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + goto check_tup_satisfies_update; + } + + any_multi_locker_member_alive = + ZIsAnyMultiLockMemberRunning(new_mlmembers, &zhtup, + *buffer); + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + } + + /* + * xwait is done, but if xwait had just locked the tuple then some + * other xact could update/lock this tuple before we get to this + * point. Check for xid change, and start over if so. We need to + * do some special handling for lockers because their xid is never + * stored on the tuples. If there was a single locker on the + * tuple and that locker is gone and some new locker has locked + * the tuple, we won't be able to identify that by infomask/xid on + * the tuple, rather we need to fetch the locker xid. + */ + ZHeapTupleGetTransInfo(&zhtup, *buffer, NULL, NULL, + ¤t_tup_xid, NULL, NULL, false); + if (xid_infomask_changed(zhtup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + { + if (ZHEAP_XID_IS_LOCKED_ONLY(zhtup.t_data->t_infomask) && + !ZHeapTupleHasMultiLockers(zhtup.t_data->t_infomask) && + TransactionIdIsValid(single_locker_xid)) + { + TransactionId current_single_locker_xid = InvalidTransactionId; + + (void) GetLockerTransInfo(relation, &zhtup, *buffer, NULL, + NULL, ¤t_single_locker_xid, + NULL, NULL); + if (!TransactionIdEquals(single_locker_xid, + current_single_locker_xid)) + goto check_tup_satisfies_update; + + } + else + goto check_tup_satisfies_update; + } + } + + if (TransactionIdIsValid(xwait) && TransactionIdDidAbort(xwait)) + { + /* + * For aborted transaction, if the undo actions are not applied + * yet, then apply them before modifying the page. + */ + if (!TransactionIdIsCurrentTransactionId(xwait)) + zheap_exec_pending_rollback(relation, *buffer, + xwait_trans_slot, xwait); + + /* + * For aborted updates, we must allow to reverify the tuple in + * case it's values got changed. + */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(zhtup.t_data->t_infomask)) + goto check_tup_satisfies_update; + } + + /* + * We may lock if previous xid committed or aborted but only locked + * the tuple without updating it; or if we didn't have to wait at all + * for whatever reason. + */ + if (!require_sleep || + ZHEAP_XID_IS_LOCKED_ONLY(zhtup.t_data->t_infomask) || + result == HeapTupleMayBeUpdated) + result = HeapTupleMayBeUpdated; + else + result = HeapTupleUpdated; + } + else if (result == HeapTupleMayBeUpdated) + { + TransactionId xwait; + uint16 infomask; + + if (TransactionIdIsValid(single_locker_xid)) + xwait = single_locker_xid; + else + xwait = tup_xid; + + infomask = zhtup.t_data->t_infomask; + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. 
We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. + * + * Note that inplace-updates without key updates are considered + * equivalent to lock mode LockTupleNoKeyExclusive. + */ + if (ZHeapTupleHasMultiLockers(infomask)) + { + List *mlmembers; + ListCell *lc; + + if (trans_slot_id != InvalidXactSlotId) + { + mlmembers = ZGetMultiLockMembersForCurrentXact(&zhtup, + trans_slot_id, urec_ptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. + */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= mode) + { + list_free_deep(mlmembers); + result = HeapTupleMayBeUpdated; + goto out_locked; + } + } + + list_free_deep(mlmembers); + } + } + else if (TransactionIdIsCurrentTransactionId(xwait)) + { + tuple->t_tableOid = RelationGetRelid(relation); + tuple->t_len = zhtup.t_len; + tuple->t_self = zhtup.t_self; + tuple->t_data = palloc0(tuple->t_len); + memcpy(tuple->t_data, zhtup.t_data, zhtup.t_len); + + switch (mode) + { + case LockTupleKeyShare: + if (ZHEAP_XID_IS_KEYSHR_LOCKED(infomask) || + ZHEAP_XID_IS_SHR_LOCKED(infomask) || + ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(infomask) || + ZHeapTupleIsInPlaceUpdated(infomask)) + { + goto out_locked; + } + break; + case LockTupleShare: + if (ZHEAP_XID_IS_SHR_LOCKED(infomask) || + ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(infomask) || + ZHeapTupleIsInPlaceUpdated(infomask)) + { + goto out_locked; + } + break; + case LockTupleNoKeyExclusive: + if (ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHeapTupleIsInPlaceUpdated(infomask)) + { + goto out_locked; + } + break; + case LockTupleExclusive: + if (ZHeapTupleIsInPlaceUpdated(infomask) && + ZHEAP_XID_IS_EXCL_LOCKED(infomask)) + { + goto out_locked; + } + break; + } + } + } + +failed: + if (result != HeapTupleMayBeUpdated) + { + Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated || + result == HeapTupleWouldBlock); + Assert(ItemIdIsDeleted(lp) || + IsZHeapTupleModified(zhtup.t_data->t_infomask)); + + /* If item id is deleted, tuple can't be marked as moved. */ + if (!ItemIdIsDeleted(lp) && + ZHeapTupleIsMoved(zhtup.t_data->t_infomask)) + ItemPointerSetMovedPartitions(&hufd->ctid); + else + hufd->ctid = ctid; + hufd->xmax = tup_xid; + if (result == HeapTupleSelfUpdated) + hufd->cmax = tup_cid; + else + hufd->cmax = InvalidCommandId; + hufd->in_place_updated_or_locked = in_place_updated_or_locked; + goto out_locked; + } + + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. + */ + trans_slot_id = PageReserveTransactionSlot(relation, *buffer, + PageGetMaxOffsetNumber(page), + epoch, xid, &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + goto check_tup_satisfies_update; + + if (trans_slot_id == InvalidXactSlotId) + { + LockBuffer(*buffer, BUFFER_LOCK_UNLOCK); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release + * the buffer lock. 
For this we check if ItemId is not deleted and + * refresh the tuple offset position in page. If TID is already + * delete marked due to pruning, then get new ctid, so that we can + * lock the new tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + goto check_tup_satisfies_update; + } + + /* transaction slot must be reserved before locking a tuple */ + Assert(trans_slot_id != InvalidXactSlotId); + + /* + * It's possible that tuple slot is now marked as frozen. Hence, we refetch + * the tuple here. + */ + Assert(!ItemIdIsDeleted(lp)); + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + /* + * If the slot is marked as frozen, the latest modifier of the tuple must be + * frozen. + */ + if (ZHeapTupleHeaderGetXactSlot((ZHeapTupleHeader) (zhtup.t_data)) == ZHTUP_SLOT_FROZEN) + { + tup_trans_slot_id = ZHTUP_SLOT_FROZEN; + tup_xid = InvalidTransactionId; + } + + /* + * If all the members were lockers and are all gone, we can do away + * with the MULTI_LOCKERS bit. + */ + zheap_lock_tuple_guts(relation, *buffer, &zhtup, tup_xid, xid, mode, + lockopr, epoch, tup_trans_slot_id, trans_slot_id, + single_locker_xid, single_locker_trans_slot, + prev_urecptr, cid, !any_multi_locker_member_alive); + + tuple->t_tableOid = RelationGetRelid(relation); + tuple->t_len = zhtup.t_len; + tuple->t_self = zhtup.t_self; + tuple->t_data = palloc0(tuple->t_len); + + memcpy(tuple->t_data, zhtup.t_data, zhtup.t_len); + + result = HeapTupleMayBeUpdated; + +out_locked: + LockBuffer(*buffer, BUFFER_LOCK_UNLOCK); +out_unlocked: + + /* + * Don't update the visibility map here. Locking a tuple doesn't change + * visibility info. + */ + + /* + * Now that we have successfully marked the tuple as locked, we can + * release the lmgr tuple lock, if we had it. + */ + if (have_tuple_lock) + UnlockTupleTuplock(relation, tid, mode); + + return result; +} + +/* + * test_lockmode_for_conflict - Helper function for zheap_lock_updated_tuple. + * + * Given a lockmode held by the transaction identified with the given xid, + * does the current transaction need to wait, fail, or can it continue if + * it wanted to acquire a lock of the given mode (required_mode)? "needwait" + * is set to true if waiting is necessary; if it can continue, then + * HeapTupleMayBeUpdated is returned. To notify the caller if some pending + * rollback is applied, rollback_and_relocked is set to true. + */ +static HTSU_Result +test_lockmode_for_conflict(Relation rel, Buffer buf, ZHeapTuple zhtup, + UndoRecPtr urec_ptr, LockTupleMode old_mode, + TransactionId xid, int trans_slot_id, + LockTupleMode required_mode, bool has_update, + SubTransactionId *subxid, bool *needwait, + bool *rollback_and_relocked) +{ + *needwait = false; + + /* + * Note: we *must* check TransactionIdIsInProgress before + * TransactionIdDidAbort/Commit; see comment at top of tqual.c for an + * explanation. + */ + if (TransactionIdIsCurrentTransactionId(xid)) + { + /* + * The tuple has already been locked by our own transaction. This is + * very rare but can happen if multiple transactions are trying to + * lock an ancient version of the same tuple. 
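+ * Returning HeapTupleSelfUpdated lets the caller skip this tuple
+ * version and continue with the next one in the update chain.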
+ */
+ return HeapTupleSelfUpdated;
+ }
+ else if (TransactionIdIsInProgress(xid))
+ {
+ /*
+ * If the locking transaction is running, what we do depends on
+ * whether the lock modes conflict: if they do, then we must wait for
+ * it to finish; otherwise we can fall through to lock this tuple
+ * version without waiting.
+ */
+ if (DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_mode),
+ HWLOCKMODE_from_locktupmode(required_mode)))
+ {
+ *needwait = true;
+ if (subxid)
+ ZHeapTupleGetSubXid(zhtup, buf, urec_ptr, subxid);
+ }
+
+ /*
+ * If we set needwait above, then this value doesn't matter;
+ * otherwise, this value signals to the caller that it's okay to
+ * proceed.
+ */
+ return HeapTupleMayBeUpdated;
+ }
+ else if (TransactionIdDidAbort(xid))
+ {
+ /*
+ * For an aborted transaction, if the undo actions are not applied
+ * yet, then apply them before modifying the page.
+ */
+ zheap_exec_pending_rollback(rel, buf, trans_slot_id, xid);
+
+ /*
+ * If it was only a locker, then the lock is completely gone now and
+ * we can return success; but if it was an update, then after applying
+ * the pending actions, the tuple might have changed and we must report
+ * that to the caller. This allows the caller to reverify the tuple in
+ * case its values got changed.
+ */
+
+ *rollback_and_relocked = true;
+
+ return HeapTupleMayBeUpdated;
+ }
+ else if (TransactionIdDidCommit(xid))
+ {
+ /*
+ * The other transaction committed. If it was only a locker, then the
+ * lock is completely gone now and we can return success; but if it
+ * was an update, then what we do depends on whether the two lock
+ * modes conflict. If they conflict, then we must report an error to
+ * the caller. But if they don't, we can fall through to allow the
+ * current transaction to lock the tuple.
+ *
+ * Note: the reason we worry about has_update here is because as soon
+ * as a transaction ends, all its locks are gone and meaningless, and
+ * thus we can ignore them; whereas its updates persist. In the
+ * TransactionIdIsInProgress case, above, we don't need to check
+ * because we know the lock is still "alive" and thus a conflict always
+ * needs to be checked.
+ */
+ if (!has_update)
+ return HeapTupleMayBeUpdated;
+
+ if (DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_mode),
+ HWLOCKMODE_from_locktupmode(required_mode)))
+ /* bummer */
+ return HeapTupleUpdated;
+
+ return HeapTupleMayBeUpdated;
+ }
+
+ /* Not in progress, not aborted, not committed -- must have crashed */
+ return HeapTupleMayBeUpdated;
+}
+
+/*
+ * zheap_lock_updated_tuple - Lock all the versions of an updated tuple.
+ *
+ * Fetch the tuple pointed to by tid in rel, reserve a transaction slot on
+ * its page for the given xid and mark the tuple as locked by that xid with
+ * the given mode; if this tuple is updated, recurse to lock the new version
+ * as well. During chain traversal, we might find some intermediate version
+ * which is pruned (because a non-inplace update got committed and only the
+ * line pointer of that version remains), so we need to continue fetching
+ * the newer versions to lock them. The bool rollback_and_relocked is used
+ * to notify the caller that the update was performed by an aborted
+ * transaction and its pending undo actions have been applied here.
+ *
+ * Note that it is important to lock all the versions that are from
+ * non-committed transactions, but if the transaction that has created the
+ * new version is committed, we only care to lock its latest version.
+ * + */ +static HTSU_Result +zheap_lock_updated_tuple(Relation rel, ZHeapTuple tuple, ItemPointer ctid, + TransactionId xid, LockTupleMode mode, + LockOper lockopr, CommandId cid, + bool *rollback_and_relocked) +{ + HTSU_Result result; + ZHeapTuple mytup; + UndoRecPtr prev_urecptr; + Buffer buf; + Page page; + ItemPointerData tupid; + TransactionId tup_xid; + int tup_trans_slot; + TransactionId priorXmax = InvalidTransactionId; + uint32 epoch; + uint64 epoch_xid; + int trans_slot_id; + bool lock_reacquired; + OffsetNumber offnum; + + ItemPointerCopy(ctid, &tupid); + + if (rollback_and_relocked) + *rollback_and_relocked = false; + + for (;;) + { + ZHeapTupleData zhtup; + ItemId lp; + uint16 old_infomask; + UndoRecPtr urec_ptr; + + if (!zheap_fetch(rel, SnapshotAny, ctid, &mytup, &buf, false, NULL)) + { + /* + * if we fail to find the updated version of the tuple, it's + * because it was vacuumed/pruned/rolledback away after its creator + * transaction aborted. So behave as if we got to the end of the + * chain, and there's no further tuple to lock: return success to + * caller. + */ + if (mytup == NULL) + return HeapTupleMayBeUpdated; + + /* + * If we reached the end of the chain, we're done, so return + * success. See EvalPlanQualZFetch for detailed reason. + */ + if (TransactionIdIsValid(priorXmax) && + !ValidateTuplesXact(mytup, SnapshotAny, buf, priorXmax, true)) + return HeapTupleMayBeUpdated; + + /* deleted or moved to another partition, so forget about it */ + if (ZHeapTupleIsMoved(mytup->t_data->t_infomask) || + ItemPointerEquals(&(mytup->t_self), ctid)) + return HeapTupleMayBeUpdated; + + /* updated row should have xid matching this xmax */ + ZHeapTupleGetTransInfo(mytup, buf, NULL, NULL, &priorXmax, NULL, + NULL, true); + + /* continue to lock the next version of tuple */ + continue; + } + +lock_tuple: + urec_ptr = InvalidUndoRecPtr; + + CHECK_FOR_INTERRUPTS(); + + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + /* + * If we reached the end of the chain, we're done, so return + * success. See EvalPlanQualZFetch for detailed reason. + */ + if (TransactionIdIsValid(priorXmax) && + !ValidateTuplesXact(mytup, SnapshotAny, buf, priorXmax, false)) + { + UnlockReleaseBuffer(buf); + return HeapTupleMayBeUpdated; + } + + ZHeapTupleGetTransInfo(mytup, buf, &tup_trans_slot, &epoch_xid, + &tup_xid, NULL, &urec_ptr, false); + old_infomask = mytup->t_data->t_infomask; + + /* + * If this tuple was created by an aborted (sub)transaction, then we + * already locked the last live one in the chain, thus we're done, so + * return success. + */ + if (!IsZHeapTupleModified(old_infomask) && + TransactionIdDidAbort(tup_xid)) + { + result = HeapTupleMayBeUpdated; + goto out_locked; + } + + /* + * If this tuple version has been updated or locked by some concurrent + * transaction(s), what we do depends on whether our lock mode + * conflicts with what those other transactions hold, and also on the + * status of them. + */ + if (IsZHeapTupleModified(old_infomask)) + { + SubTransactionId subxid = InvalidSubTransactionId; + LockTupleMode old_lock_mode; + bool needwait; + bool has_update = false; + + if (ZHeapTupleHasMultiLockers(old_infomask)) + { + List *mlmembers; + ListCell *lc; + TransactionId update_xact = InvalidTransactionId; + + /* + * As we always maintain strongest lock mode on the tuple, it + * must be pointing to the transaction id of the updater. 
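+ * So, if the tuple is not marked as locked-only, tup_xid is the
+ * updater's xid; we remember it so that a conflict with that member is
+ * treated as an update conflict rather than a plain lock conflict.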
+ */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(old_infomask)) + update_xact = tup_xid; + + mlmembers = ZGetMultiLockMembers(rel, mytup, buf, false); + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + if (TransactionIdIsValid(update_xact)) + { + has_update = (update_xact == mlmember->xid) ? + true : false; + } + + result = test_lockmode_for_conflict(rel, + buf, + NULL, + InvalidUndoRecPtr, + mlmember->mode, + mlmember->xid, + mlmember->trans_slot_id, + mode, has_update, + NULL, + &needwait, + rollback_and_relocked); + + /* + * If the update was by some aborted transaction with + * pending rollback, then it's undo actions are applied. + * Now, notify the caller to check for the latest + * copy of the tuple. + */ + if (*rollback_and_relocked) + { + list_free_deep(mlmembers); + goto out_locked; + } + + if (result == HeapTupleSelfUpdated) + { + list_free_deep(mlmembers); + goto next; + } + + if (needwait) + { + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + if (mlmember->subxid != InvalidSubTransactionId) + SubXactLockTableWait(mlmember->xid, mlmember->subxid, + rel, &mytup->t_self, + XLTW_LockUpdated); + else + XactLockTableWait(mlmember->xid, rel, + &mytup->t_self, + XLTW_LockUpdated); + + list_free_deep(mlmembers); + goto lock_tuple; + } + if (result != HeapTupleMayBeUpdated) + { + list_free_deep(mlmembers); + goto out_locked; + } + } + } + else + { + /* + * For a non-multi locker, we first need to compute the + * corresponding lock mode by using the infomask bits. + */ + if (ZHEAP_XID_IS_LOCKED_ONLY(old_infomask)) + { + /* + * We don't expect to lock updated version of a tuple if + * there is only a single locker on the tuple and previous + * modifier is all-visible. + */ + Assert(!(tup_trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo))); + + if (ZHEAP_XID_IS_KEYSHR_LOCKED(old_infomask)) + old_lock_mode = LockTupleKeyShare; + else if (ZHEAP_XID_IS_SHR_LOCKED(old_infomask)) + old_lock_mode = LockTupleShare; + else if (ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(old_infomask)) + old_lock_mode = LockTupleNoKeyExclusive; + else if (ZHEAP_XID_IS_EXCL_LOCKED(old_infomask)) + old_lock_mode = LockTupleExclusive; + else + { + /* LOCK_ONLY can't be present alone */ + pg_unreachable(); + } + } + else + { + has_update = true; + /* it's an update, but which kind? */ + if (old_infomask & ZHEAP_XID_EXCL_LOCK) + old_lock_mode = LockTupleExclusive; + else + old_lock_mode = LockTupleNoKeyExclusive; + } + + result = test_lockmode_for_conflict(rel, buf, mytup, urec_ptr, + old_lock_mode, tup_xid, + tup_trans_slot, mode, + has_update, &subxid, + &needwait, + rollback_and_relocked); + + /* + * If the update was by some aborted transaction with + * pending rollback, then it's undo actions are applied. + * Now, notify the caller to check for the latest + * copy of the tuple. + */ + if (*rollback_and_relocked) + goto out_locked; + + /* + * If the tuple was already locked by ourselves in a previous + * iteration of this (say zheap_lock_tuple was forced to + * restart the locking loop because of a change in xid), then + * we hold the lock already on this tuple version and we don't + * need to do anything; and this is not an error condition + * either. We just need to skip this tuple and continue + * locking the next version in the update chain. 
+ */ + if (result == HeapTupleSelfUpdated) + goto next; + + if (needwait) + { + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + if (subxid != InvalidSubTransactionId) + SubXactLockTableWait(tup_xid, subxid, rel, + &mytup->t_self, + XLTW_LockUpdated); + else + XactLockTableWait(tup_xid, rel, &mytup->t_self, + XLTW_LockUpdated); + goto lock_tuple; + } + if (result != HeapTupleMayBeUpdated) + { + goto out_locked; + } + } + } + + epoch = GetEpochForXid(xid); + offnum = ItemPointerGetOffsetNumber(&mytup->t_self); + + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. + */ + trans_slot_id = PageReserveTransactionSlot(rel, buf, offnum, epoch, xid, + &prev_urecptr, &lock_reacquired); + if (lock_reacquired) + goto lock_tuple; + + if (trans_slot_id == InvalidXactSlotId) + { + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + goto lock_tuple; + } + + /* transaction slot must be reserved before locking a tuple */ + Assert(trans_slot_id != InvalidXactSlotId); + + page = BufferGetPage(buf); + lp = PageGetItemId(page, offnum); + + Assert(ItemIdIsNormal(lp)); + + /* + * It's possible that tuple slot is now marked as frozen. Hence, we refetch + * the tuple here. + */ + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + zhtup.t_tableOid = mytup->t_tableOid; + zhtup.t_self = mytup->t_self; + + /* + * If the slot is marked as frozen, the latest modifier of the tuple must be + * frozen. + */ + if (ZHeapTupleHeaderGetXactSlot((ZHeapTupleHeader) (zhtup.t_data)) == ZHTUP_SLOT_FROZEN) + { + tup_trans_slot = ZHTUP_SLOT_FROZEN; + tup_xid = InvalidTransactionId; + } + + zheap_lock_tuple_guts(rel, buf, &zhtup, tup_xid, xid, mode, lockopr, + epoch, tup_trans_slot, trans_slot_id, + InvalidTransactionId, InvalidXactSlotId, + prev_urecptr, cid, false); + +next: + /* + * if we find the end of update chain, or if the transaction that has + * updated the tuple is aborter, we're done. + */ + if (TransactionIdDidAbort(tup_xid) || + ZHeapTupleIsMoved(mytup->t_data->t_infomask) || + ItemPointerEquals(&mytup->t_self, ctid) || + ZHEAP_XID_IS_LOCKED_ONLY(mytup->t_data->t_infomask)) + { + result = HeapTupleMayBeUpdated; + goto out_locked; + } + + /* + * Updated row should have xid matching this xmax. + * + * XXX Using tup_xid will work as this must be the xid of updater if + * any on the tuple; that is because we always maintain the strongest + * locker information on the tuple. + */ + priorXmax = tup_xid; + + /* + * As we still hold a snapshot to which priorXmax is not visible, neither + * the transaction slot on tuple can be marked as frozen nor the + * corresponding undo be discarded. + */ + Assert(TransactionIdIsValid(priorXmax)); + + /* be tidy */ + zheap_freetuple(mytup); + UnlockReleaseBuffer(buf); + } + + result = HeapTupleMayBeUpdated; + +out_locked: + UnlockReleaseBuffer(buf); + + return result; +} + +/* + * zheap_lock_tuple_guts - Helper function for locking the tuple. + * + * It locks the tuple in given mode, writes an undo and WAL for the + * operation. + * + * It is the responsibility of caller to lock and unlock the buffer ('buf'). 
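+ * The undo record is prepared and inserted here as well; the undo
+ * insertion, the page modification and the WAL logging all happen
+ * inside a single critical section.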
+ */ +static void +zheap_lock_tuple_guts(Relation rel, Buffer buf, ZHeapTuple zhtup, + TransactionId tup_xid, TransactionId xid, + LockTupleMode mode, LockOper lockopr, uint32 epoch, + int tup_trans_slot_id, int trans_slot_id, + TransactionId single_locker_xid, + int single_locker_trans_slot, UndoRecPtr prev_urecptr, + CommandId cid, bool clear_multi_locker) +{ + TransactionId oldestXidHavingUndo; + UndoRecPtr urecptr; + UnpackedUndoRecord undorecord; + int new_trans_slot_id; + uint16 old_infomask, temp_infomask; + uint16 new_infomask = 0; + Page page; + xl_undolog_meta undometa; + bool hasSubXactLock = false; + + page = BufferGetPage(buf); + + /* Compute the new xid and infomask to store into the tuple. */ + old_infomask = zhtup->t_data->t_infomask; + + temp_infomask = old_infomask; + if (ZHeapTupleHasMultiLockers(old_infomask) && clear_multi_locker) + old_infomask &= ~ZHEAP_MULTI_LOCKERS; + compute_new_xid_infomask(zhtup, buf, tup_xid, tup_trans_slot_id, + temp_infomask, xid, trans_slot_id, + single_locker_xid, mode, lockopr, + &new_infomask, &new_trans_slot_id); + + + /* Acquire subtransaction lock, if current transaction is a subtransaction. */ + if (IsSubTransaction()) + { + SubXactLockTableInsert(GetCurrentSubTransactionId()); + hasSubXactLock = true; + } + + /* + * If the last transaction that has updated the tuple is already too + * old, then consider it as frozen which means it is all-visible. This + * ensures that we don't need to store epoch in the undo record to check + * if the undo tuple belongs to previous epoch and hence all-visible. See + * comments atop of file ztqual.c. + */ + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + if (TransactionIdPrecedes(tup_xid, oldestXidHavingUndo)) + tup_xid = FrozenTransactionId; + + /* + * Prepare an undo record. We need to separately store the latest + * transaction id that has changed the tuple to ensure that we don't + * try to process the tuple in undo chain that is already discarded. + * See GetTupleFromUndo. + */ + if (ZHeapTupleHasMultiLockers(new_infomask)) + undorecord.uur_type = UNDO_XID_MULTI_LOCK_ONLY; + else if (lockopr == LockForUpdate) + undorecord.uur_type = UNDO_XID_LOCK_FOR_UPDATE; + else + undorecord.uur_type = UNDO_XID_LOCK_ONLY; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = rel->rd_id; + undorecord.uur_prevxid = tup_xid; + undorecord.uur_xid = xid; + /* + * While locking the tuple, we set the command id as FirstCommandId since + * it doesn't modify the tuple, just updates the infomask. + */ + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = ItemPointerGetBlockNumber(&(zhtup->t_self)); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&(zhtup->t_self)); + + initStringInfo(&undorecord.uur_tuple); + initStringInfo(&undorecord.uur_payload); + + /* + * Here, we are storing zheap tuple header which is required to + * reconstruct the old copy of tuple. + */ + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) zhtup->t_data, + SizeofZHeapTupleHeader); + + /* + * We keep the lock mode in undo record as for multi lockers we can't have + * that information in tuple header. We need lock mode later to detect + * conflicts. 
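+ * The lock mode is stored at the start of the undo payload; a TPD slot
+ * number and/or a subtransaction id may follow it, with uur_info flags
+ * recording which of them are present.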
+ */ + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &mode, + sizeof(LockTupleMode)); + + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + } + + /* + * Store subtransaction id in undo record. See SubXactLockTableWait + * to know why we need to store subtransaction id in undo. + */ + if (hasSubXactLock) + { + SubTransactionId subxid = GetCurrentSubTransactionId(); + + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SUBXACT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &subxid, + sizeof(subxid)); + } + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(rel), + &undometa); + + + START_CRIT_SECTION(); + + InsertPreparedUndo(); + + /* + * We never set the locker slot on the tuple, so pass set_tpd_map_slot flag + * as false from the locker. From all other places it should always be + * passed as true so that the proper slot get set in the TPD offset map if + * its a TPD slot. + */ + PageSetUNDO(undorecord, buf, trans_slot_id, + (lockopr == LockForUpdate) ? true : false, + epoch, xid, urecptr, NULL, 0); + + ZHeapTupleHeaderSetXactSlot(zhtup->t_data, new_trans_slot_id); + zhtup->t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + zhtup->t_data->t_infomask |= new_infomask; + + MarkBufferDirty(buf); + + /* + * Do xlog stuff + */ + if (RelationNeedsWAL(rel)) + { + xl_zheap_lock xlrec; + xl_undo_header xlundohdr; + XLogRecPtr recptr; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + + /* + * Store the information required to generate undo record during + * replay. + */ + xlundohdr.reloid = undorecord.uur_reloid; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = prev_urecptr; + + xlrec.prev_xid = tup_xid; + xlrec.offnum = ItemPointerGetOffsetNumber(&zhtup->t_self); + xlrec.infomask = zhtup->t_data->t_infomask; + xlrec.trans_slot_id = new_trans_slot_id; + xlrec.flags = 0; + if (new_trans_slot_id != trans_slot_id) + { + Assert(new_trans_slot_id == tup_trans_slot_id); + xlrec.flags |= XLZ_LOCK_TRANS_SLOT_FOR_UREC; + } + else if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + xlrec.flags |= XLZ_LOCK_CONTAINS_TPD_SLOT; + + if (hasSubXactLock) + xlrec.flags |= XLZ_LOCK_CONTAINS_SUBXACT; + if (lockopr == LockForUpdate) + xlrec.flags |= XLZ_LOCK_FOR_UPDATE; + +prepare_xlog: + /* LOG undolog meta if this is the first WAL after the checkpoint. */ + LogUndoMetaData(&undometa); + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + XLogBeginInsert(); + XLogRegisterBuffer(0, buf, REGBUF_STANDARD); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + (void) RegisterTPDBuffer(page, 1); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xlrec, SizeOfZHeapLock); + + /* + * We always include old tuple header for undo in WAL record + * irrespective of full page image is taken or not. This is done + * since savings for not including a zheap tuple header are less + * compared to code complexity. However in future, if required we + * can do it similar to what we have done in zheap_update or + * zheap_delete. 
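+ * (Registering the header with XLogRegisterData rather than
+ * XLogRegisterBufData is what keeps it in the WAL record even when a
+ * full page image is taken.)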
+ */
+ XLogRegisterData((char *) undorecord.uur_tuple.data,
+ SizeofZHeapTupleHeader);
+ XLogRegisterData((char *) &mode, sizeof(LockTupleMode));
+ if (xlrec.flags & XLZ_LOCK_TRANS_SLOT_FOR_UREC)
+ XLogRegisterData((char *) &trans_slot_id, sizeof(trans_slot_id));
+ else if (xlrec.flags & XLZ_LOCK_CONTAINS_TPD_SLOT)
+ XLogRegisterData((char *) &tup_trans_slot_id, sizeof(tup_trans_slot_id));
+
+ recptr = XLogInsertExtended(RM_ZHEAP_ID, XLOG_ZHEAP_LOCK, RedoRecPtr,
+ doPageWrites);
+ if (recptr == InvalidXLogRecPtr)
+ {
+ ResetRegisteredTPDBuffers();
+ goto prepare_xlog;
+ }
+
+ PageSetLSN(page, recptr);
+ if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS)
+ TPDPageSetLSN(page, recptr);
+ }
+ END_CRIT_SECTION();
+
+ pfree(undorecord.uur_tuple.data);
+ pfree(undorecord.uur_payload.data);
+ UnlockReleaseUndoBuffers();
+ UnlockReleaseTPDBuffers();
+}
+
+/*
+ * compute_new_xid_infomask - Given the old values of the tuple header's
+ * infomask, compute the new values for the tuple header, which include the
+ * lock mode, the new infomask and the transaction slot.
+ *
+ * We don't clear the multi-lockers bit in this function as for that we need
+ * to ensure that all the lockers are gone. Unfortunately, it is not easy to
+ * do that as we would need to traverse all the undo chains for the current
+ * page, and doing that here, in quite a common code path, doesn't seem
+ * advisable. We clear this bit lazily when we detect a conflict, as at that
+ * point we anyway need to traverse the undo chains for the page.
+ *
+ * We ensure that the tuple always points to the transaction slot of the
+ * latest inserter/updater, except for cases where we lock first and then
+ * update the tuple (aka locks via the EvalPlanQual mechanism). For example,
+ * say after a committed insert/update a new request arrives to lock the
+ * tuple in key-share mode; we will keep the inserter's/updater's slot on the
+ * tuple and set the multi-locker and key-share bits. If the inserter/updater
+ * is already known to have a frozen slot (visible to everyone), we will set
+ * the key-share locker bit and the tuple will indicate a frozen slot.
+ * Similarly, for a new updater, if the tuple has a single locker, then the
+ * undo will have a frozen tuple, and for multi-lockers the updater's undo
+ * will have the previous inserter/updater slot; in both cases the new tuple
+ * will point to the updater's slot. Now, the rollback of a single locker
+ * will set the frozen slot on the tuple, whereas the rollback of a
+ * multi-locker won't change the slot information on the tuple. We don't
+ * want to keep the locker's slot on the tuple because, after a rollback, we
+ * would lose track of the last updater/inserter.
+ *
+ * When we are locking for the purpose of updating the tuple, we don't need
+ * to preserve the previous updater's information and we also keep the latest
+ * slot on the tuple. This is only true when there are no previous lockers on
+ * the tuple.
+ */ +static void +compute_new_xid_infomask(ZHeapTuple zhtup, Buffer buf, TransactionId tup_xid, + int tup_trans_slot, uint16 old_infomask, + TransactionId add_to_xid, int trans_slot, + TransactionId single_locker_xid, LockTupleMode mode, + LockOper lockoper, uint16 *result_infomask, + int *result_trans_slot) +{ + int new_trans_slot; + uint16 new_infomask; + bool old_tuple_has_update = false; + bool is_update = false; + + Assert(TransactionIdIsValid(add_to_xid)); + + new_infomask = 0; + new_trans_slot = trans_slot; + is_update = (lockoper == ForUpdate || lockoper == LockForUpdate); + + if ((IsZHeapTupleModified(old_infomask) && + TransactionIdIsInProgress(tup_xid)) || + ZHeapTupleHasMultiLockers(old_infomask)) + { + ZGetMultiLockInfo(old_infomask, tup_xid, tup_trans_slot, + add_to_xid, &new_infomask, &new_trans_slot, + &mode, &old_tuple_has_update, is_update); + } + else if (!is_update && + TransactionIdIsInProgress(single_locker_xid)) + { + LockTupleMode old_mode; + + /* + * When there is a single in-progress locker on the tuple and previous + * inserter/updater became all visible, we've to set multi-locker flag + * and highest lock mode. If current transaction tries to reacquire + * a lock, we don't set multi-locker flag. + */ + Assert(ZHEAP_XID_IS_LOCKED_ONLY(old_infomask)); + if (single_locker_xid != add_to_xid) + { + new_infomask |= ZHEAP_MULTI_LOCKERS; + new_trans_slot = tup_trans_slot; + } + + old_mode = get_old_lock_mode(old_infomask); + + /* Acquire the strongest of both. */ + if (mode < old_mode) + mode = old_mode; + + /* Keep the old tuple slot as it is */ + new_trans_slot = tup_trans_slot; + } + else if (!is_update && + TransactionIdIsInProgress(tup_xid)) + { + /* + * Normally if the tuple is not modified and the current transaction + * is in progress, the other transaction can't lock the tuple except + * itself. + * + * However, this can happen while locking the updated tuple chain. We + * keep the transaction slot of original tuple as that will allow us to + * check the visibility of tuple by just referring the current + * transaction slot. + */ + Assert((tup_xid == add_to_xid) || (mode == LockTupleKeyShare)); + + if (tup_xid != add_to_xid) + { + new_infomask |= ZHEAP_MULTI_LOCKERS; + new_trans_slot = tup_trans_slot; + } + } + else if (!is_update && + tup_trans_slot == ZHTUP_SLOT_FROZEN) + { + /* + * It's a frozen update or insert, so the locker must not change the + * slot on a tuple. The lockmode to be used on tuple is computed + * below. There could be a single committed/aborted locker (multilocker + * case is handled in the first condition). In that case, we can ignore + * the locker. If the locker is still in progress, it'll be handled in + * above case. + */ + new_trans_slot = ZHTUP_SLOT_FROZEN; + } + else if (!is_update && + !ZHEAP_XID_IS_LOCKED_ONLY(old_infomask) && + tup_trans_slot != ZHTUP_SLOT_FROZEN && + (TransactionIdDidCommit(tup_xid) + || !TransactionIdIsValid(tup_xid))) + { + /* + * It's a committed update or insert, so we gotta preserve him as + * updater of the tuple. Also, indicate that tuple has multiple + * lockers. + * + * Tuple xid could be invalid if the corresponding transaction is + * discarded or the tuple is marked as frozen. The later case is + * handled in the above condition (slot frozen). In the former case, + * we can consider it as a committed update or insert. 
+ */ + old_tuple_has_update = true; + new_infomask |= ZHEAP_MULTI_LOCKERS; + + if (ZHEAP_XID_IS_EXCL_LOCKED(old_infomask)) + new_infomask |= ZHEAP_XID_EXCL_LOCK; + else if (ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(old_infomask)) + new_infomask |= ZHEAP_XID_NOKEY_EXCL_LOCK; + else + { + /* + * Tuple must not be locked in any other mode as we are here + * because either the tuple is updated or inserted and the + * corresponding transaction is committed. + */ + Assert(!(ZHEAP_XID_IS_KEYSHR_LOCKED(old_infomask) || + ZHEAP_XID_IS_SHR_LOCKED(old_infomask))); + } + + if (ZHeapTupleIsInPlaceUpdated(old_infomask)) + new_infomask |= ZHEAP_INPLACE_UPDATED; + else if (ZHeapTupleIsUpdated(old_infomask)) + new_infomask |= ZHEAP_UPDATED; + else + { + /* + * This is a freshly inserted tuple, allow to set the requested + * lock mode on tuple. + */ + old_tuple_has_update = false; + } + + new_trans_slot = tup_trans_slot; + + if (old_tuple_has_update) + goto infomask_is_computed; + } + else if (!is_update && + ZHEAP_XID_IS_LOCKED_ONLY(old_infomask) && + tup_trans_slot != ZHTUP_SLOT_FROZEN && + (TransactionIdDidCommit(tup_xid) + || !TransactionIdIsValid(tup_xid))) + { + LockTupleMode old_mode; + + /* + * This case arises for non-inplace updates when the newly inserted + * tuple is marked as locked-only, but multi-locker bit is not set. + * + * See comments in above condition to know when tup_xid can be + * invalid. + */ + new_infomask |= ZHEAP_MULTI_LOCKERS; + + /* The tuple is locked-only. */ + Assert(!(old_infomask & + (ZHEAP_DELETED | ZHEAP_UPDATED | ZHEAP_INPLACE_UPDATED))); + + old_mode = get_old_lock_mode(old_infomask); + + /* Acquire the strongest of both. */ + if (mode < old_mode) + mode = old_mode; + + /* Keep the old tuple slot as it is */ + new_trans_slot = tup_trans_slot; + } + + if (is_update && !ZHeapTupleHasMultiLockers(new_infomask)) + { + if (lockoper == LockForUpdate) + { + /* + * When we are locking for the purpose of updating the tuple, we + * don't need to preserve previous updater's information. + */ + new_infomask |= ZHEAP_XID_LOCK_ONLY; + if (mode == LockTupleExclusive) + new_infomask |= ZHEAP_XID_EXCL_LOCK; + else + new_infomask |= ZHEAP_XID_NOKEY_EXCL_LOCK; + } + else if (mode == LockTupleExclusive) + new_infomask |= ZHEAP_XID_EXCL_LOCK; + } + else + { + if (!is_update && !old_tuple_has_update) + new_infomask |= ZHEAP_XID_LOCK_ONLY; + switch (mode) + { + case LockTupleKeyShare: + new_infomask |= ZHEAP_XID_KEYSHR_LOCK; + break; + case LockTupleShare: + new_infomask |= ZHEAP_XID_SHR_LOCK; + break; + case LockTupleNoKeyExclusive: + new_infomask |= ZHEAP_XID_NOKEY_EXCL_LOCK; + break; + case LockTupleExclusive: + new_infomask |= ZHEAP_XID_EXCL_LOCK; + break; + default: + elog(ERROR, "invalid lock mode"); + } + } + +infomask_is_computed: + + *result_infomask = new_infomask; + + if (result_trans_slot) + *result_trans_slot = new_trans_slot; + + /* + * We store the reserved transaction slot only when we update the + * tuple. For lock only, we keep the old transaction slot in the + * tuple. + */ + Assert(is_update || new_trans_slot == tup_trans_slot); + } + +/* + * zheap_finish_speculative - mark speculative insertion as successful + * + * To successfully finish a speculative insertion we have to clear speculative + * flag from tuple. See heap_finish_speculative why it is important to clear + * the information of speculative insertion on tuple. 
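+ * Here that amounts to clearing the ZHEAP_SPECULATIVE_INSERT bit in the
+ * tuple's infomask and WAL-logging a confirm record.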
+ */ +void +zheap_finish_speculative(Relation relation, ZHeapTuple tuple) +{ + Buffer buffer; + Page page; + OffsetNumber offnum; + ItemId lp = NULL; + ZHeapTupleHeader zhtup; + + buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(&(tuple->t_self))); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + page = (Page) BufferGetPage(buffer); + + offnum = ItemPointerGetOffsetNumber(&(tuple->t_self)); + if (PageGetMaxOffsetNumber(page) >= offnum) + lp = PageGetItemId(page, offnum); + + if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp)) + elog(ERROR, "invalid lp"); + + zhtup = (ZHeapTupleHeader) PageGetItem(page, lp); + + /* NO EREPORT(ERROR) from here till changes are logged */ + START_CRIT_SECTION(); + + Assert(ZHeapTupleHeaderIsSpeculative(tuple->t_data)); + + MarkBufferDirty(buffer); + + /* Clear the speculative insertion marking from the tuple. */ + zhtup->t_infomask &= ~ZHEAP_SPECULATIVE_INSERT; + + /* XLOG stuff */ + if (RelationNeedsWAL(relation)) + { + xl_zheap_confirm xlrec; + XLogRecPtr recptr; + + xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self); + xlrec.flags = XLZ_SPEC_INSERT_SUCCESS; + + XLogBeginInsert(); + + /* We want the same filtering on this as on a plain insert */ + XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN); + + XLogRegisterData((char *) &xlrec, SizeOfZHeapConfirm); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + recptr = XLogInsert(RM_ZHEAP2_ID, XLOG_ZHEAP_CONFIRM); + + PageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + UnlockReleaseBuffer(buffer); +} + +/* + * zheap_abort_speculative - kill a speculatively inserted tuple + * + * Marks a tuple that was speculatively inserted in the same command as dead. + * That makes it immediately appear as dead to all transactions, including our + * own. In particular, it makes another backend inserting a duplicate key + * value won't unnecessarily wait for our whole transaction to finish (it'll + * just wait for our speculative insertion to finish). + * + * The functionality is same as heap_abort_speculative, but we achieve it + * differently. + */ +void +zheap_abort_speculative(Relation relation, ZHeapTuple tuple) +{ + TransactionId xid = GetTopTransactionId(); + TransactionId current_tup_xid; + ItemPointer tid = &(tuple->t_self); + ItemId lp; + ZHeapTupleHeader zhtuphdr; + Page page; + BlockNumber block; + Buffer buffer; + OffsetNumber offnum; + int out_slot_no PG_USED_FOR_ASSERTS_ONLY; + + Assert(ItemPointerIsValid(tid)); + + block = ItemPointerGetBlockNumber(tid); + buffer = ReadBuffer(relation, block); + page = BufferGetPage(buffer); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + offnum = ItemPointerGetOffsetNumber(tid); + lp = PageGetItemId(page, offnum); + Assert(ItemIdIsNormal(lp)); + + zhtuphdr = (ZHeapTupleHeader) PageGetItem(page, lp); + + /* + * Sanity check that the tuple really is a speculatively inserted tuple, + * inserted by us. + */ + out_slot_no = GetTransactionSlotInfo(buffer, + offnum, + ZHeapTupleHeaderGetXactSlot(zhtuphdr), + NULL, + ¤t_tup_xid, + NULL, + true, + false); + + /* As the transaction is still open, the slot can't be frozen. 
*/ + Assert(out_slot_no != ZHTUP_SLOT_FROZEN); + Assert(current_tup_xid != InvalidTransactionId); + + if (current_tup_xid != xid) + elog(ERROR, "attempted to kill a tuple inserted by another transaction"); + if (!(IsToastRelation(relation) || ZHeapTupleHeaderIsSpeculative(zhtuphdr))) + elog(ERROR, "attempted to kill a non-speculative tuple"); + Assert(!IsZHeapTupleModified(zhtuphdr->t_infomask)); + + START_CRIT_SECTION(); + + /* + * The tuple will become DEAD immediately. Flag that this page is a + * candidate for pruning. The action here is exactly same as what we do + * for rolling back insert. + */ + ItemIdSetDead(lp); + ZPageSetPrunable(page, xid); + + MarkBufferDirty(buffer); + + /* + * XLOG stuff + * + * The WAL records generated here match heap_delete(). The same recovery + * routines are used. + */ + if (RelationNeedsWAL(relation)) + { + xl_zheap_confirm xlrec; + XLogRecPtr recptr; + + xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self); + xlrec.flags = XLZ_SPEC_INSERT_FAILED; + + XLogBeginInsert(); + + XLogRegisterData((char *) &xlrec, SizeOfZHeapConfirm); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + /* No replica identity & replication origin logged */ + + recptr = XLogInsert(RM_ZHEAP2_ID, XLOG_ZHEAP_CONFIRM); + + PageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + if (ZHeapTupleHasExternal(tuple)) + { + Assert(!IsToastRelation(relation)); + ztoast_delete(relation, tuple, true); + } + + /* + * Never need to mark tuple for invalidation, since catalogs don't support + * speculative insertion + */ + + /* Now we can release the buffer */ + ReleaseBuffer(buffer); + + /* count deletion, as we counted the insertion too */ + pgstat_count_heap_delete(relation); +} + +/* + * zheap_freetuple + */ +void +zheap_freetuple(ZHeapTuple zhtup) +{ + pfree(zhtup); +} + +/* + * znocachegetattr - This is same as nocachegetattr except that it takes + * ZHeapTuple as input. + * + * Note that for zheap, cached offsets are not used and we always start + * deforming with the actual byte from where the first attribute starts. See + * atop zheap_compute_data_size. + */ +Datum +znocachegetattr(ZHeapTuple tuple, + int attnum, + TupleDesc tupleDesc) +{ + ZHeapTupleHeader tup = tuple->t_data; + Form_pg_attribute thisatt; + Datum ret_datum = (Datum) 0; + char *tp; /* ptr to data part of tuple */ + bits8 *bp = tup->t_bits; /* ptr to null bitmap in tuple */ + int off; /* current offset within data */ + int i; + + attnum--; + tp = (char *) tup; + + /* + * For each non-null attribute, we have to first account for alignment + * padding before the attr, then advance over the attr based on its + * length. Nulls have no storage and no alignment padding either. 
+ */ + off = tup->t_hoff; + + for (i = 0;; i++) /* loop exit is at "break" */ + { + Form_pg_attribute att = TupleDescAttr(tupleDesc, i); + + if (ZHeapTupleHasNulls(tuple) && att_isnull(i, bp)) + { + continue; /* this cannot be the target att */ + } + + if (att->attlen == -1) + { + off = att_align_pointer(off, att->attalign, -1, + tp + off); + } + else if (!att->attbyval) + { + /* not varlena, so safe to use att_align_nominal */ + off = att_align_nominal(off, att->attalign); + } + + if (i == attnum) + break; + + off = att_addlength_pointer(off, att->attlen, tp + off); + } + + thisatt = TupleDescAttr(tupleDesc, attnum); + if (thisatt->attbyval) + memcpy(&ret_datum, tp + off, thisatt->attlen); + else + ret_datum = PointerGetDatum((char *) (tp + off)); + + return ret_datum; +} + +TransactionId +zheap_fetchinsertxid(ZHeapTuple zhtup, Buffer buffer) +{ + UndoRecPtr urec_ptr; + TransactionId xid = InvalidTransactionId; + int trans_slot_id = InvalidXactSlotId; + int prev_trans_slot_id; + TransactionId result; + BlockNumber blk; + OffsetNumber offnum; + UnpackedUndoRecord *urec; + ZHeapTuple undo_tup; + + prev_trans_slot_id = ZHeapTupleHeaderGetXactSlot(zhtup->t_data); + blk = ItemPointerGetBlockNumber(&zhtup->t_self); + offnum = ItemPointerGetOffsetNumber(&zhtup->t_self); + (void) GetTransactionSlotInfo(buffer, + offnum, + prev_trans_slot_id, + NULL, + NULL, + &urec_ptr, + true, + false); + undo_tup = zhtup; + + while(true) + { + urec = UndoFetchRecord(urec_ptr, blk, offnum, xid, NULL, ZHeapSatisfyUndoRecord); + if (urec != NULL) + { + /* + * If we have valid undo record, then check if we have + * reached the insert log and return the corresponding + * transaction id. + */ + if (urec->uur_type == UNDO_INSERT || + urec->uur_type == UNDO_MULTI_INSERT || + urec->uur_type == UNDO_INPLACE_UPDATE) + { + result = urec->uur_xid; + UndoRecordRelease(urec); + break; + } + + undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id, + NULL, (undo_tup) == (zhtup) ? false : true, + BufferGetPage(buffer)); + + xid = urec->uur_prevxid; + urec_ptr = urec->uur_blkprev; + UndoRecordRelease(urec); + if (!UndoRecPtrIsValid(urec_ptr)) + { + zheap_freetuple(undo_tup); + result = FrozenTransactionId; + break; + } + + + /* + * Change the undo chain if the undo tuple is stamped + * with the different transaction slot. + */ + if (trans_slot_id != prev_trans_slot_id) + { + (void) GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &urec_ptr, + true, + true); + prev_trans_slot_id = trans_slot_id; + } + zhtup = undo_tup; + } + else + { + /* + * Undo record could be null only when it's undo log + * is/about to be discarded. We cannot use any assert + * for checking is the log is actually discarded, since + * UndoFetchRecord can return NULL for the records which + * are not yet discarded but are about to be discarded. + */ + result = FrozenTransactionId; + break; + } + } + + return result; +} + +/* ---------------- + * zheap_getsysattr + * + * Fetch the value of a system attribute for a tuple. + * + * This provides same information as heap_getsysattr, but for zheap tuple. + * ---------------- + */ +Datum +zheap_getsysattr(ZHeapTuple zhtup, Buffer buf, int attnum, + TupleDesc tupleDesc, bool *isnull) +{ + Datum result; + TransactionId xid = InvalidTransactionId; + bool release_buf = false; + + Assert(zhtup); + + /* + * For xmin,xmax,cmin and cmax we may need to fetch the information from + * the undo record, so ensure we have the valid buffer. 
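The deforming loop in znocachegetattr above always walks the tuple from the first attribute, aligning the running offset before each non-null column and then skipping the column's length. Below is a simplified, self-contained model of that walk over fixed-width columns; MiniAttr and ALIGN_TO are stand-ins for the catalog attribute form and the att_align_* macros, and varlena and null handling are left out.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Simplified attribute descriptor; Form_pg_attribute has many more fields. */
    typedef struct MiniAttr
    {
        int16_t attlen;     /* fixed length in bytes (varlena handling omitted) */
        size_t  attalign;   /* required alignment: 1, 2, 4 or 8 */
    } MiniAttr;

    /* Round 'off' up to the next multiple of the power-of-two alignment 'a'. */
    #define ALIGN_TO(off, a) (((off) + ((a) - 1)) & ~((size_t) (a) - 1))

    /* Walk the data area the way the loop above does: align, then skip lengths. */
    size_t mini_attr_offset(const MiniAttr *atts, int natts, int target, size_t off)
    {
        for (int i = 0; i < natts; i++)
        {
            off = ALIGN_TO(off, atts[i].attalign);
            if (i == target)
                return off;
            off += (size_t) atts[i].attlen;
        }
        return (size_t) -1;     /* target attribute number out of range */
    }

    int main(void)
    {
        MiniAttr atts[] = {{4, 4}, {1, 1}, {8, 8}};

        /* the third column starts at offset 8 when deforming begins at offset 0 */
        printf("%zu\n", mini_attr_offset(atts, 3, 2, 0));
        return 0;
    }

Because zheap does not keep cached attribute offsets, this walk starts from the tuple's hoff every time, which is exactly what the loop above does.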
+ */ + if (!BufferIsValid(buf) && + ((attnum == MinTransactionIdAttributeNumber) || + (attnum == MaxTransactionIdAttributeNumber) || + (attnum == MinCommandIdAttributeNumber) || + (attnum == MaxCommandIdAttributeNumber))) + { + Relation rel = relation_open(zhtup->t_tableOid, NoLock); + buf = ReadBuffer(rel, ItemPointerGetBlockNumber(&(zhtup->t_self))); + relation_close(rel, NoLock); + release_buf = true; + } + + /* Currently, no sys attribute ever reads as NULL. */ + *isnull = false; + + switch (attnum) + { + case SelfItemPointerAttributeNumber: + /* pass-by-reference datatype */ + result = PointerGetDatum(&(zhtup->t_self)); + break; + case MinTransactionIdAttributeNumber: + { + /* + * Fixme - Need to check whether we need any handling of epoch here. + */ + uint64 epoch_xid; + ZHeapTupleGetTransInfo(zhtup, buf, NULL, &epoch_xid, &xid, + NULL, NULL, false); + + if (!TransactionIdIsValid(xid) || epoch_xid < + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + xid = FrozenTransactionId; + + result = TransactionIdGetDatum(xid); + } + break; + case MaxTransactionIdAttributeNumber: + case MinCommandIdAttributeNumber: + case MaxCommandIdAttributeNumber: + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("xmax, cmin, and cmax are not supported for zheap tuples"))); + break; + case TableOidAttributeNumber: + result = ObjectIdGetDatum(zhtup->t_tableOid); + break; + default: + elog(ERROR, "invalid attnum: %d", attnum); + result = 0; /* keep compiler quiet */ + break; + } + + if (release_buf) + ReleaseBuffer(buf); + + return result; +} + +/* --------------------- + * zheap_attisnull - returns TRUE if zheap tuple attribute is not present + * --------------------- + */ +bool +zheap_attisnull(ZHeapTuple tup, int attnum, TupleDesc tupleDesc) +{ + if (attnum > (int) ZHeapTupleHeaderGetNatts(tup->t_data)) + return true; + + /* + * We allow a NULL tupledesc for relations not expected to have missing + * values, such as catalog relations and indexes. + */ + Assert(!tupleDesc || attnum <= tupleDesc->natts); + if (attnum > (int) ZHeapTupleHeaderGetNatts(tup->t_data)) + { + if (tupleDesc && TupleDescAttr(tupleDesc, attnum - 1)->atthasmissing) + return false; + else + return true; + } + + if (attnum > 0) + { + if (ZHeapTupleNoNulls(tup)) + return false; + return att_isnull(attnum - 1, tup->t_data->t_bits); + } + + switch (attnum) + { + case TableOidAttributeNumber: + case SelfItemPointerAttributeNumber: + case MinTransactionIdAttributeNumber: + case MinCommandIdAttributeNumber: + case MaxTransactionIdAttributeNumber: + case MaxCommandIdAttributeNumber: + /* these are never null */ + break; + default: + elog(ERROR, "invalid attnum: %d", attnum); + } + + return false; +} + +/* + * Check if the specified attribute's value is same in both given tuples. + * Subroutine for ZHeapDetermineModifiedColumns. + */ +static bool +zheap_tuple_attr_equals(TupleDesc tupdesc, int attrnum, + ZHeapTuple tup1, ZHeapTuple tup2) +{ + Datum value1, + value2; + bool isnull1, + isnull2; + Form_pg_attribute att; + + /* + * If it's a whole-tuple reference, say "not equal". It's not really + * worth supporting this case, since it could only succeed after a no-op + * update, which is hardly a case worth optimizing for. + */ + if (attrnum == 0) + return false; + + /* + * Likewise, automatically say "not equal" for any system attribute other + * than OID and tableOID; we cannot expect these to be consistent in a HOT + * chain, or even to be set correctly yet in the new tuple. 
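zheap_attisnull above relies on the usual PostgreSQL null-bitmap convention: one bit per column, and a set bit means the column is not null, so att_isnull reports null when the bit is clear. A standalone equivalent of that test, assuming that convention:

    #include <stdbool.h>
    #include <stdint.h>

    /* Bit 'att' (0-based) clear in the null bitmap means the column is null. */
    bool mini_att_isnull(int att, const uint8_t *bits)
    {
        return (bits[att >> 3] & (1 << (att & 0x07))) == 0;
    }

The extra range check against ZHeapTupleHeaderGetNatts handles columns added after the tuple was stored: they read as null unless the attribute carries a missing (default) value.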
+ */ + if (attrnum < 0) + { + if (attrnum != TableOidAttributeNumber) + return false; + } + + /* + * Extract the corresponding values. XXX this is pretty inefficient if + * there are many indexed columns. Should HeapDetermineModifiedColumns do + * a single heap_deform_tuple call on each tuple, instead? But that + * doesn't work for system columns ... + */ + value1 = zheap_getattr(tup1, attrnum, tupdesc, &isnull1); + value2 = zheap_getattr(tup2, attrnum, tupdesc, &isnull2); + + /* + * If one value is NULL and other is not, then they are certainly not + * equal + */ + if (isnull1 != isnull2) + return false; + + /* + * If both are NULL, they can be considered equal. + */ + if (isnull1) + return true; + + /* + * We do simple binary comparison of the two datums. This may be overly + * strict because there can be multiple binary representations for the + * same logical value. But we should be OK as long as there are no false + * positives. Using a type-specific equality operator is messy because + * there could be multiple notions of equality in different operator + * classes; furthermore, we cannot safely invoke user-defined functions + * while holding exclusive buffer lock. + */ + if (attrnum <= 0) + { + /* The only allowed system columns are OIDs, so do this */ + return (DatumGetObjectId(value1) == DatumGetObjectId(value2)); + } + else + { + Assert(attrnum <= tupdesc->natts); + att = TupleDescAttr(tupdesc, attrnum - 1); + return datumIsEqual(value1, value2, att->attbyval, att->attlen); + } +} + +/* + * ZHeapDetermineModifiedColumns - Check which columns are being updated. + * This is same as HeapDetermineModifiedColumns except that it takes + * ZHeapTuple as input. + */ +static Bitmapset * +ZHeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols, + ZHeapTuple oldtup, ZHeapTuple newtup) +{ + int attnum; + Bitmapset *modified = NULL; + + while ((attnum = bms_first_member(interesting_cols)) >= 0) + { + attnum += FirstLowInvalidHeapAttributeNumber; + + if (!zheap_tuple_attr_equals(RelationGetDescr(relation), + attnum, oldtup, newtup)) + modified = bms_add_member(modified, + attnum - FirstLowInvalidHeapAttributeNumber); + } + + return modified; +} + +/* + * ----------- + * Zheap transaction information related API's. + * ----------- + */ + +/* + * GetTransactionSlotInfo - Get the required transaction slot info. We also + * return the transaction slot number, if the transaction slot is in TPD entry. + * + * We can directly call this function to get transaction slot info if we are + * sure that the corresponding tuple is not deleted or we don't care if the + * tuple has multi-locker flag in which case we need to call + * ZHeapTupleGetTransInfo. + * + * NoTPDBufLock - See TPDPageGetTransactionSlotInfo. + * TPDSlot - true, if the passed transaction_slot_id is the slot number in TPD + * entry. + */ +int +GetTransactionSlotInfo(Buffer buf, OffsetNumber offset, int trans_slot_id, + uint32 *epoch, TransactionId *xid, + UndoRecPtr *urec_ptr, bool NoTPDBufLock, bool TPDSlot) +{ + ZHeapPageOpaque opaque; + Page page; + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + int out_trans_slot_id = trans_slot_id; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + /* + * Fetch the required information from the transaction slot. The + * transaction slot can either be on the heap page or TPD page. 
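ZHeapDetermineModifiedColumns and zheap_tuple_attr_equals above reduce to a per-column binary comparison of the old and new tuple values. A self-contained sketch of the same idea over plain fixed-width columns is shown below, returning a bitmask of changed columns; the real code works with a Bitmapset and zheap_getattr and special-cases system attributes.

    #include <stdbool.h>
    #include <stdint.h>

    /* Bit 'col' is set in the result iff that column differs (assumes ncols <= 32). */
    uint32_t mini_modified_columns(const int32_t *oldvals, const int32_t *newvals,
                                   const bool *oldnull, const bool *newnull,
                                   int ncols)
    {
        uint32_t modified = 0;

        for (int col = 0; col < ncols; col++)
        {
            if (oldnull[col] != newnull[col])
                modified |= (1u << col);    /* null on one side only: changed */
            else if (!oldnull[col] && oldvals[col] != newvals[col])
                modified |= (1u << col);    /* simple binary comparison */
        }
        return modified;
    }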
+ */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + { + if (epoch) + *epoch = 0; + if (xid) + *xid = InvalidTransactionId; + if (urec_ptr) + *urec_ptr = InvalidUndoRecPtr; + } + else if (trans_slot_id < ZHEAP_PAGE_TRANS_SLOTS || + (trans_slot_id == ZHEAP_PAGE_TRANS_SLOTS && + !ZHeapPageHasTPDSlot(phdr))) + { + if (epoch) + *epoch = opaque->transinfo[trans_slot_id - 1].xid_epoch; + if (xid) + *xid = opaque->transinfo[trans_slot_id - 1].xid; + if (urec_ptr) + *urec_ptr = opaque->transinfo[trans_slot_id - 1].urec_ptr; + } + else + { + Assert((ZHeapPageHasTPDSlot(phdr))); + if (TPDSlot) + { + /* + * The heap page's last transaction slot data is copied over to + * first slot in TPD entry, so we need fetch it from there. See + * AllocateAndFormTPDEntry. + */ + if (trans_slot_id == ZHEAP_PAGE_TRANS_SLOTS) + trans_slot_id = ZHEAP_PAGE_TRANS_SLOTS + 1; + out_trans_slot_id = TPDPageGetTransactionSlotInfo(buf, + trans_slot_id, + InvalidOffsetNumber, + epoch, + xid, + urec_ptr, + NoTPDBufLock, + false); + } + else + { + Assert(offset != InvalidOffsetNumber); + out_trans_slot_id = TPDPageGetTransactionSlotInfo(buf, + trans_slot_id, + offset, + epoch, + xid, + urec_ptr, + NoTPDBufLock, + false); + } + } + + return out_trans_slot_id; +} + +/* + * PageSetUNDO - Set the transaction information pointer for a given + * transaction slot. + */ +void +PageSetUNDO(UnpackedUndoRecord undorecord, Buffer buffer, int trans_slot_id, + bool set_tpd_map_slot, uint32 epoch, TransactionId xid, + UndoRecPtr urecptr, OffsetNumber *usedoff, int ucnt) +{ + ZHeapPageOpaque opaque; + Page page = BufferGetPage(buffer); + PageHeader phdr; + + Assert(trans_slot_id != InvalidXactSlotId); + + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + /* + * Set the required information in the transaction slot. The transaction + * slot can either be on the heap page or TPD page. + * + * During recovery, we set the required information in TPD separately + * only if required. + */ + if (trans_slot_id < ZHEAP_PAGE_TRANS_SLOTS || + (trans_slot_id == ZHEAP_PAGE_TRANS_SLOTS && + !ZHeapPageHasTPDSlot(phdr))) + { + opaque->transinfo[trans_slot_id - 1].xid_epoch = epoch; + opaque->transinfo[trans_slot_id - 1].xid = xid; + opaque->transinfo[trans_slot_id - 1].urec_ptr = urecptr; + } + /* TPD information is set separately during recovery. */ + else if (!InRecovery) + { + if (ucnt <= 0) + { + Assert(ucnt == 0); + + usedoff = &undorecord.uur_offset; + ucnt++; + } + + TPDPageSetUndo(buffer, trans_slot_id, set_tpd_map_slot, epoch, xid, + urecptr, usedoff, ucnt); + } + + elog(DEBUG1, "undo record: TransSlot: %d, Epoch: %d, TransactionId: %d, urec: " UndoRecPtrFormat ", prev_urec: " UINT64_FORMAT ", block: %d, offset: %d, undo_op: %d, xid_tup: %d, reloid: %d", + trans_slot_id, epoch, xid, urecptr, undorecord.uur_blkprev, undorecord.uur_block, undorecord.uur_offset, undorecord.uur_type, + undorecord.uur_prevxid, undorecord.uur_reloid); +} + +/* + * PageSetTransactionSlotInfo - Set the transaction slot info for the given + * slot. + * + * This is similar to PageSetUNDO except that it doesn't need to update offset + * map in TPD. 
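PageSetUNDO above writes three fields into a transaction slot that lives either in the zheap page's special space or in a TPD entry. A miniature version of the in-page case follows; MiniTransInfo mirrors the xid_epoch/xid/urec_ptr triple visible in the code, and the TPD path is not modeled.

    #include <stdint.h>

    /* Mirrors the per-slot fields that PageSetUNDO fills in for an in-page slot. */
    typedef struct MiniTransInfo
    {
        uint32_t xid_epoch;
        uint32_t xid;        /* TransactionId */
        uint64_t urec_ptr;   /* UndoRecPtr of the slot's latest undo record */
    } MiniTransInfo;

    /*
     * Slot numbers carried on tuples are 1-based (0 means "frozen"), so the
     * array index is trans_slot_id - 1, exactly as in the code above.
     */
    void mini_page_set_undo(MiniTransInfo *slots, int trans_slot_id,
                            uint32_t epoch, uint32_t xid, uint64_t urecptr)
    {
        slots[trans_slot_id - 1].xid_epoch = epoch;
        slots[trans_slot_id - 1].xid = xid;
        slots[trans_slot_id - 1].urec_ptr = urecptr;
    }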
+ */ +void +PageSetTransactionSlotInfo(Buffer buf, int trans_slot_id, uint32 epoch, + TransactionId xid, UndoRecPtr urec_ptr) +{ + ZHeapPageOpaque opaque; + Page page; + PageHeader phdr; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + if (trans_slot_id < ZHEAP_PAGE_TRANS_SLOTS || + (trans_slot_id == ZHEAP_PAGE_TRANS_SLOTS && + !ZHeapPageHasTPDSlot(phdr))) + { + opaque->transinfo[trans_slot_id - 1].xid_epoch = epoch; + opaque->transinfo[trans_slot_id - 1].xid = xid; + opaque->transinfo[trans_slot_id - 1].urec_ptr = urec_ptr; + } + else + { + TPDPageSetTransactionSlotInfo(buf, trans_slot_id, epoch, xid, + urec_ptr); + } +} + +/* + * PageGetTransactionSlotId - Get the transaction slot for the given epoch and + * xid. + * + * If the slot is not in the TPD page but the caller has asked to lock the TPD + * buffer than do so. tpd_page_locked will be set to true if the required page + * is locked, false, otherwise. + */ +int +PageGetTransactionSlotId(Relation rel, Buffer buf, uint32 epoch, + TransactionId xid, UndoRecPtr *urec_ptr, + bool keepTPDBufLock, bool locktpd, + bool *tpd_page_locked) +{ + ZHeapPageOpaque opaque; + Page page; + PageHeader phdr; + int slot_no; + int total_slots_in_page; + bool check_tpd; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + if (ZHeapPageHasTPDSlot(phdr)) + { + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS - 1; + check_tpd = true; + } + else + { + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS; + check_tpd = false; + } + + /* Check if the required slot exists on the page. */ + for (slot_no = 0; slot_no < total_slots_in_page; slot_no++) + { + if (opaque->transinfo[slot_no].xid_epoch == epoch && + opaque->transinfo[slot_no].xid == xid) + { + *urec_ptr = opaque->transinfo[slot_no].urec_ptr; + + /* Check if TPD has page slot, then lock TPD page */ + if (locktpd && ZHeapPageHasTPDSlot(phdr)) + { + Assert(tpd_page_locked); + *tpd_page_locked = TPDPageLock(rel, buf); + } + + return slot_no + 1; + } + } + + /* Check if the slot exists on the TPD page. */ + if (check_tpd) + { + int tpd_e_slot; + + tpd_e_slot = TPDPageGetSlotIfExists(rel, buf, InvalidOffsetNumber, + epoch, xid, urec_ptr, + keepTPDBufLock, false); + if (tpd_e_slot != InvalidXactSlotId) + { + /* + * If we get the valid slot then the TPD page must be locked and + * the lock will be retained if asked for. + */ + if (tpd_page_locked) + *tpd_page_locked = keepTPDBufLock; + return tpd_e_slot; + } + } + else + { + /* + * Lock the TPD page if the caller has instructed so and the page + * has tpd slot. + */ + if (locktpd && ZHeapPageHasTPDSlot(phdr)) + { + Assert(tpd_page_locked); + *tpd_page_locked = TPDPageLock(rel, buf); + } + } + + return InvalidXactSlotId; +} + +/* + * PageGetTransactionSlotInfo - Get the transaction slot info for the given + * slot no. + */ +void +PageGetTransactionSlotInfo(Buffer buf, int slot_no, uint32 *epoch, + TransactionId *xid, UndoRecPtr *urec_ptr, + bool keepTPDBufLock) +{ + ZHeapPageOpaque opaque; + Page page; + PageHeader phdr; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + /* + * Fetch the required information from the transaction slot. The + * transaction slot can either be on the heap page or TPD page. 
+ */ + if (slot_no < ZHEAP_PAGE_TRANS_SLOTS || + (slot_no == ZHEAP_PAGE_TRANS_SLOTS && + !ZHeapPageHasTPDSlot(phdr))) + { + if (epoch) + *epoch = opaque->transinfo[slot_no - 1].xid_epoch; + if (xid) + *xid = opaque->transinfo[slot_no - 1].xid; + if (urec_ptr) + *urec_ptr = opaque->transinfo[slot_no - 1].urec_ptr; + } + else + { + Assert((ZHeapPageHasTPDSlot(phdr))); + (void)TPDPageGetTransactionSlotInfo(buf, + slot_no, + InvalidOffsetNumber, + epoch, + xid, + urec_ptr, + false, + true); + } +} + +/* + * PageReserveTransactionSlot - Reserve the transaction slot in page. + * + * This function returns transaction slot number if either the page already + * has some slot that contains the transaction info or there is an empty + * slot or it manages to reuse some existing slot or it manages to get the + * slot in TPD; otherwise retruns InvalidXactSlotId. + * + * Note that we always return array location of slot plus one as zeroth slot + * number is reserved for frozen slot number (ZHTUP_SLOT_FROZEN). + */ +int +PageReserveTransactionSlot(Relation relation, Buffer buf, OffsetNumber offset, + uint32 epoch, TransactionId xid, + UndoRecPtr *urec_ptr, bool *lock_reacquired) +{ + ZHeapPageOpaque opaque; + Page page; + PageHeader phdr; + int latestFreeTransSlot = InvalidXactSlotId; + int slot_no; + int total_slots_in_page; + bool check_tpd; + + *lock_reacquired = false; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + if (ZHeapPageHasTPDSlot(phdr)) + { + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS - 1; + check_tpd = true; + } + else + { + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS; + check_tpd = false; + } + + /* + * For temp relations, we don't have to check all the slots since + * no other backend can access the same relation. If a slot is available, + * we return it from here. Else, we freeze the slot in PageFreezeTransSlots. + * + * XXX For temp tables, oldestXidWithEpochHavingUndo is not relevant as + * the undo for them can be discarded on commit. Hence, comparing xid + * with oldestXidWithEpochHavingUndo during visibility checks can lead to + * incorrect behavior. To avoid that, we can mark the tuple as frozen + * for any previous transaction id. In that way, we don't have to + * compare the previous xid of tuple with oldestXidWithEpochHavingUndo. + */ + if (RELATION_IS_LOCAL(relation)) + { + /* We can't access temp tables of other backends. 
*/ + Assert(!RELATION_IS_OTHER_TEMP(relation)); + + slot_no = 0; + if (opaque->transinfo[slot_no].xid_epoch == epoch && + opaque->transinfo[slot_no].xid == xid) + { + *urec_ptr = opaque->transinfo[slot_no].urec_ptr; + return (slot_no + 1); + } + else if (opaque->transinfo[slot_no].xid == InvalidTransactionId && + latestFreeTransSlot == InvalidXactSlotId) + latestFreeTransSlot = slot_no; + } + else + { + for (slot_no = 0; slot_no < total_slots_in_page; slot_no++) + { + if (opaque->transinfo[slot_no].xid_epoch == epoch && + opaque->transinfo[slot_no].xid == xid) + { + *urec_ptr = opaque->transinfo[slot_no].urec_ptr; + return (slot_no + 1); + } + else if (opaque->transinfo[slot_no].xid == InvalidTransactionId && + latestFreeTransSlot == InvalidXactSlotId) + latestFreeTransSlot = slot_no; + } + } + + /* Check if we already have a slot on the TPD page */ + if (check_tpd) + { + int tpd_e_slot; + + tpd_e_slot = TPDPageGetSlotIfExists(relation, buf, offset, epoch, + xid, urec_ptr, true, true); + if (tpd_e_slot != InvalidXactSlotId) + return tpd_e_slot; + } + + + if (latestFreeTransSlot >= 0) + { + *urec_ptr = opaque->transinfo[latestFreeTransSlot].urec_ptr; + return (latestFreeTransSlot + 1); + } + + /* no transaction slot available, try to reuse some existing slot */ + if (PageFreezeTransSlots(relation, buf, lock_reacquired, NULL, 0)) + { + /* + * If the lock is reacquired inside, then we allow callers to reverify + * the condition whether then can still perform the required + * operation. + */ + if (*lock_reacquired) + return InvalidXactSlotId; + + /* + * TPD entry might get pruned in TPDPageGetSlotIfExists, so recheck + * it. + */ + if (ZHeapPageHasTPDSlot(phdr)) + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS - 1; + else + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS; + + for (slot_no = 0; slot_no < total_slots_in_page; slot_no++) + { + if (opaque->transinfo[slot_no].xid == InvalidTransactionId) + { + *urec_ptr = opaque->transinfo[slot_no].urec_ptr; + return (slot_no + 1); + } + } + + /* + * After freezing transaction slots, we should get atleast one free + * slot. + */ + Assert(false); + } + Assert (!RELATION_IS_LOCAL(relation)); + + /* + * Reserve the transaction slot in TPD. First we check if there already + * exists an TPD entry for this page, then reserve in that, otherwise, + * allocate a new TPD entry and reserve the slot in it. + */ + if (ZHeapPageHasTPDSlot(phdr)) + { + int tpd_e_slot; + + tpd_e_slot = TPDPageReserveTransSlot(relation, buf, offset, + urec_ptr, lock_reacquired); + + if (tpd_e_slot != InvalidXactSlotId) + return tpd_e_slot; + + /* + * Fixme : We should allow to allocate bigger TPD entries or support + * chained TPD entries. + */ + return InvalidXactSlotId; + } + else + { + slot_no = TPDAllocateAndReserveTransSlot(relation, buf, offset, + urec_ptr); + if (slot_no != InvalidXactSlotId) + return slot_no; + } + + /* no transaction slot available */ + return InvalidXactSlotId; +} + +/* + * zheap_freeze_or_invalidate_tuples - Clear the slot information or set + * invalid_xact flags. 
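The first half of PageReserveTransactionSlot above boils down to: reuse the slot that already belongs to this (epoch, xid) if there is one, otherwise remember the first empty slot. The standalone sketch below mirrors only that part; the real function additionally tries to freeze old slots and to fall back to a TPD entry, and the caller later stamps the reserved slot via PageSetUNDO. Types and the empty-slot test (xid == 0, i.e. InvalidTransactionId) are simplifications.

    #include <stdint.h>

    #define MINI_INVALID_XACT_SLOT (-1)    /* stands in for InvalidXactSlotId */

    typedef struct MiniSlot
    {
        uint32_t xid_epoch;
        uint32_t xid;    /* 0 plays the role of InvalidTransactionId */
    } MiniSlot;

    /* Return a 1-based slot number for (epoch, xid), or the first free slot. */
    int mini_reserve_slot(const MiniSlot *slots, int nslots,
                          uint32_t epoch, uint32_t xid)
    {
        int free_slot = MINI_INVALID_XACT_SLOT;

        for (int slot_no = 0; slot_no < nslots; slot_no++)
        {
            if (slots[slot_no].xid_epoch == epoch && slots[slot_no].xid == xid)
                return slot_no + 1;          /* this transaction already has a slot */
            if (slots[slot_no].xid == 0 && free_slot == MINI_INVALID_XACT_SLOT)
                free_slot = slot_no;         /* remember the first empty slot */
        }
        return (free_slot == MINI_INVALID_XACT_SLOT) ? MINI_INVALID_XACT_SLOT
                                                     : free_slot + 1;
    }

Returning the array index plus one matches the convention noted above: slot number 0 is reserved for ZHTUP_SLOT_FROZEN.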
+ *
+ * Process all the tuples on the page and match their transaction slot with
+ * the input slot array.  If a tuple points to one of those slots, then set
+ * the tuple's slot to ZHTUP_SLOT_FROZEN if isFrozen is true; otherwise, set
+ * the ZHEAP_INVALID_XACT_SLOT flag on the tuple.
+ */
+void
+zheap_freeze_or_invalidate_tuples(Buffer buf, int nSlots, int *slots,
+                                  bool isFrozen, bool TPDSlot)
+{
+    OffsetNumber offnum, maxoff;
+    Page page = BufferGetPage(buf);
+    int i;
+
+    /* clear the slot info from tuples */
+    maxoff = PageGetMaxOffsetNumber(page);
+
+    for (offnum = FirstOffsetNumber;
+         offnum <= maxoff;
+         offnum = OffsetNumberNext(offnum))
+    {
+        ZHeapTupleHeader tup_hdr;
+        ItemId itemid;
+        int trans_slot;
+
+        itemid = PageGetItemId(page, offnum);
+
+        if (ItemIdIsDead(itemid))
+            continue;
+
+        if (!ItemIdIsUsed(itemid))
+        {
+            if (!ItemIdHasPendingXact(itemid))
+                continue;
+            trans_slot = ItemIdGetTransactionSlot(itemid);
+        }
+        else if (ItemIdIsDeleted(itemid))
+        {
+            trans_slot = ItemIdGetTransactionSlot(itemid);
+        }
+        else
+        {
+            tup_hdr = (ZHeapTupleHeader) PageGetItem(page, itemid);
+            trans_slot = ZHeapTupleHeaderGetXactSlot(tup_hdr);
+        }
+
+        /* If we are freezing a TPD slot, then get the actual slot from the TPD. */
+        if (TPDSlot)
+        {
+            /* Tuple is not pointing to a TPD slot, so skip it. */
+            if (trans_slot < ZHEAP_PAGE_TRANS_SLOTS)
+                continue;
+
+            /*
+             * If we came here to freeze a TPD slot, then fetch the exact slot
+             * info from the TPD.
+             */
+            trans_slot = TPDPageGetTransactionSlotInfo(buf, trans_slot, offnum,
+                                                       NULL, NULL, NULL, false,
+                                                       false);
+
+            /*
+             * The input slots array always stores the slot index starting
+             * from 0, even for TPD slots.  So convert the slot number into
+             * that index.
+             */
+            trans_slot -= (ZHEAP_PAGE_TRANS_SLOTS + 1);
+        }
+        else
+        {
+            /*
+             * The slot number on the tuple is always the array location of
+             * the slot plus one, so we need to subtract one here before
+             * comparing it with the frozen slots.  See
+             * PageReserveTransactionSlot.
+             */
+            trans_slot -= 1;
+        }
+
+        for (i = 0; i < nSlots; i++)
+        {
+            if (trans_slot == slots[i])
+            {
+                /*
+                 * Set the tuple's transaction slot to frozen to indicate that
+                 * the tuple is all-visible, and mark deleted itemids as dead.
+                 */
+                if (isFrozen)
+                {
+                    if (!ItemIdIsUsed(itemid))
+                    {
+                        /* This must be an unused entry that has xact information. */
+                        Assert(ItemIdHasPendingXact(itemid));
+
+                        /*
+                         * The pending xact must be committed if the
+                         * corresponding slot is being marked as frozen.  So,
+                         * clear the pending xact and transaction slot
+                         * information from the itemid.
+                         */
+                        ItemIdSetUnused(itemid);
+                    }
+                    else if (ItemIdIsDeleted(itemid))
+                    {
+                        /*
+                         * The deleted item must not be visible to anyone if
+                         * the corresponding slot is being marked as frozen.
+                         * So, mark it as dead.
+                         */
+                        ItemIdSetDead(itemid);
+                    }
+                    else
+                    {
+                        tup_hdr = (ZHeapTupleHeader) PageGetItem(page, itemid);
+                        ZHeapTupleHeaderSetXactSlot(tup_hdr, ZHTUP_SLOT_FROZEN);
+                    }
+                }
+                else
+                {
+                    /*
+                     * We just set the invalid xact flag on the tuple/itemid to
+                     * indicate that for this tuple/itemid we need to fetch the
+                     * transaction information from the undo record.  Also, we
+                     * ensure to clear the transaction information from an
+                     * unused itemid.
+                     */
+                    if (!ItemIdIsUsed(itemid))
+                    {
+                        /* This must be an unused entry that has xact information. */
+                        Assert(ItemIdHasPendingXact(itemid));
+
+                        /*
+                         * The pending xact is committed.  So, clear the
+                         * pending xact and transaction slot information from
+                         * the itemid.
+ */ + ItemIdSetUnused(itemid); + } + else if (ItemIdIsDeleted(itemid)) + ItemIdSetInvalidXact(itemid); + else + { + tup_hdr = (ZHeapTupleHeader) PageGetItem(page, itemid); + tup_hdr->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + break; + } + break; + } + } + } +} + +/* + * GetCompletedSlotOffsets + * + * Find all the tuples pointing to the transaction slots for committed + * transactions. + */ +void +GetCompletedSlotOffsets(Page page, int nCompletedXactSlots, + int *completed_slots, + OffsetNumber *offset_completed_slots, + int *numOffsets) +{ + int noffsets = 0; + OffsetNumber offnum, maxoff; + + maxoff = PageGetMaxOffsetNumber(page); + + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ZHeapTupleHeader tup_hdr; + ItemId itemid; + int i, trans_slot; + + itemid = PageGetItemId(page, offnum); + + if (ItemIdIsDead(itemid)) + continue; + + if (!ItemIdIsUsed(itemid)) + { + if (!ItemIdHasPendingXact(itemid)) + continue; + trans_slot = ItemIdGetTransactionSlot(itemid); + } + else if (ItemIdIsDeleted(itemid)) + { + if ((ItemIdGetVisibilityInfo(itemid) & ITEMID_XACT_INVALID)) + continue; + trans_slot = ItemIdGetTransactionSlot(itemid); + } + else + { + tup_hdr = (ZHeapTupleHeader) PageGetItem(page, itemid); + if (ZHeapTupleHasInvalidXact(tup_hdr->t_infomask)) + continue; + trans_slot = ZHeapTupleHeaderGetXactSlot(tup_hdr); + } + + for (i = 0; i < nCompletedXactSlots; i++) + { + /* + * we don't need to include the tuples that have not changed + * since the last time as the special undo record for them can + * be found in the undo chain of their present slot. + */ + if (trans_slot == completed_slots[i]) + { + offset_completed_slots[noffsets++] = offnum; + break; + } + } + } + + *numOffsets = noffsets; +} + +/* + * PageFreezeTransSlots - Make the transaction slots available for reuse. + * + * This function tries to free up some existing transaction slots so that + * they can be reused. To reuse the slot, it needs to ensure one of the below + * conditions: + * (a) the xid is committed, all-visible and doesn't have pending rollback + * to perform. + * (b) if the xid is committed, then ensure to mark a special flag on the + * tuples that are modified by that xid on the current page. + * (c) if the xid is rolledback, then ensure that rollback is performed or + * at least undo actions for this page have been replayed. + * + * For committed/aborted transactions, we simply clear the xid from the + * transaction slot and undo record pointer is kept as it is to ensure that + * we don't break the undo chain for that slot. We also mark the tuples that + * are modified by committed xid with a special flag indicating that slot for + * this tuple is reused. The special flag is just an indication that the + * transaction information of the transaction that has modified the tuple can + * be retrieved from the undo. + * + * If we don't do so, then after that slot got reused for some other + * unrelated transaction, it might become tricky to traverse the undo chain. + * In such a case, it is quite possible that the particular tuple has not + * been modified, but it is still pointing to transaction slot which has been + * reused by new transaction and that transaction is still not committed. + * During the visibility check for such a tuple, it can appear that the tuple + * is modified by current transaction which is clearly wrong and can lead to + * wrong results. 
One such case would be when we try to fetch the commandid + * for that tuple to check the visibility, it will fetch the commandid for a + * different transaction that is already committed. + * + * The basic principle used here is to ensure that we can always fetch the + * transaction information of tuple until it is frozen (committed and + * all-visible). + * + * This also ensures that we are consistent with how other operations work in + * zheap i.e the tuple always reflect the current state. + * + * We don't need any special handling for the tuples that are locked by + * multiple transactions (aka tuples that have MULTI_LOCKERS bit set). + * Basically, we always maintain either strongest lockers or latest lockers + * (when all the lockers are of same mode) transaction slot on the tuple. + * In either case, we should be able to detect the visibility of tuple based + * on the latest locker information. + * + * This function assumes that the caller already has Exclusive lock on the + * buffer. + * + * This function returns true if it manages to free some transaction slot, + * false otherwise. + */ +bool +PageFreezeTransSlots(Relation relation, Buffer buf, bool *lock_reacquired, + TransInfo *transinfo, int num_slots) +{ + uint64 oldestXidWithEpochHavingUndo; + int slot_no; + int *frozen_slots = NULL; + int nFrozenSlots = 0; + int *completed_xact_slots = NULL; + uint16 nCompletedXactSlots = 0; + int *aborted_xact_slots = NULL; + int nAbortedXactSlots = 0; + bool TPDSlot; + Page page; + bool result = false; + + page = BufferGetPage(buf); + + /* + * If the num_slots is 0 then the caller wants to freeze the page slots so + * get the transaction slots information from the page. + */ + if (num_slots == 0) + { + PageHeader phdr; + ZHeapPageOpaque opaque; + + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + if (ZHeapPageHasTPDSlot(phdr)) + num_slots = ZHEAP_PAGE_TRANS_SLOTS - 1; + else + num_slots = ZHEAP_PAGE_TRANS_SLOTS; + + transinfo = opaque->transinfo; + TPDSlot = false; + } + else + { + Assert(num_slots > 0); + TPDSlot = true; + } + + oldestXidWithEpochHavingUndo = pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo); + + frozen_slots = palloc0(num_slots * sizeof(int)); + + /* + * Clear the slot information from tuples. The basic idea is to collect + * all the transaction slots that can be cleared. Then traverse the page + * to see if any tuple has marking for any of the slots, if so, just clear + * the slot information from the tuple. + * + * For temp relations, we can freeze the first slot since no other backend + * can access the same relation. + */ + if (RELATION_IS_LOCAL(relation)) + frozen_slots[nFrozenSlots++] = 0; + else + { + for (slot_no = 0; slot_no < num_slots; slot_no++) + { + uint64 slot_xid_epoch = transinfo[slot_no].xid_epoch; + TransactionId slot_xid = transinfo[slot_no].xid; + + /* + * Transaction slot can be considered frozen if it belongs to previous + * epoch or transaction id is old enough that it is all visible. + */ + slot_xid_epoch = MakeEpochXid(slot_xid_epoch, slot_xid); + + if (slot_xid_epoch < oldestXidWithEpochHavingUndo) + frozen_slots[nFrozenSlots++] = slot_no; + } + } + + if (nFrozenSlots > 0) + { + TransactionId latestxid = InvalidTransactionId; + int i; + int slot_no; + + + START_CRIT_SECTION(); + + /* clear the transaction slot info on tuples */ + zheap_freeze_or_invalidate_tuples(buf, nFrozenSlots, frozen_slots, + true, TPDSlot); + + /* Initialize the frozen slots. 
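The freeze test above compares a slot's transaction against oldestXidWithEpochHavingUndo using a 64-bit epoch-qualified transaction id. Assuming MakeEpochXid packs the epoch into the high 32 bits and the xid into the low 32 bits, which is what this comparison relies on, the check looks like the sketch below.

    #include <stdbool.h>
    #include <stdint.h>

    /* 64-bit epoch-qualified xid: epoch in the high half, xid in the low half. */
    uint64_t mini_make_epoch_xid(uint32_t epoch, uint32_t xid)
    {
        return ((uint64_t) epoch << 32) | xid;
    }

    /* A slot is freezable once its transaction precedes the oldest xid having undo. */
    bool mini_slot_is_freezable(uint32_t slot_epoch, uint32_t slot_xid,
                                uint64_t oldest_xid_with_epoch_having_undo)
    {
        return mini_make_epoch_xid(slot_epoch, slot_xid) <
               oldest_xid_with_epoch_having_undo;
    }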
*/ + if (TPDSlot) + { + for (i = 0; i < nFrozenSlots; i++) + { + int tpd_slot_id; + + slot_no = frozen_slots[i]; + + /* Remember the latest xid. */ + if (TransactionIdFollows(transinfo[slot_no].xid, latestxid)) + latestxid = transinfo[slot_no].xid; + + /* Calculate the actual slot no. */ + tpd_slot_id = slot_no + ZHEAP_PAGE_TRANS_SLOTS + 1; + + /* Initialize the TPD slot. */ + TPDPageSetTransactionSlotInfo(buf, tpd_slot_id, 0, + InvalidTransactionId, + InvalidUndoRecPtr); + } + } + else + { + for (i = 0; i < nFrozenSlots; i++) + { + slot_no = frozen_slots[i]; + + /* Remember the latest xid. */ + if (TransactionIdFollows(transinfo[slot_no].xid, latestxid)) + latestxid = transinfo[slot_no].xid; + + transinfo[slot_no].xid_epoch = 0; + transinfo[slot_no].xid = InvalidTransactionId; + transinfo[slot_no].urec_ptr = InvalidUndoRecPtr; + } + } + + MarkBufferDirty(buf); + + /* + * Xlog Stuff + * + * Log all the frozen_slots number for which we need to clear the + * transaction slot information. Also, note down the latest xid + * corresponding to the frozen slots. This is required to ensure that + * no standby query conflicts with the frozen xids. + */ + if (RelationNeedsWAL(relation)) + { + xl_zheap_freeze_xact_slot xlrec = {0}; + XLogRecPtr recptr; + + XLogBeginInsert(); + + xlrec.nFrozen = nFrozenSlots; + xlrec.lastestFrozenXid = latestxid; + + XLogRegisterData((char *) &xlrec, SizeOfZHeapFreezeXactSlot); + + /* + * Ideally we need the frozen slots information when WAL needs to be + * applied on the page, but in case of the TPD slots freeze we need + * the frozen slot information for both heap page as well as for the + * TPD page. So the problem is that if we register with any one of + * the buffer it might happen that the data did not registered due + * to fpw of that buffer but we need that data for another buffer. + */ + XLogRegisterData((char *) frozen_slots, nFrozenSlots * sizeof(int)); + XLogRegisterBuffer(0, buf, REGBUF_STANDARD); + if (TPDSlot) + RegisterTPDBuffer(page, 1); + + recptr = XLogInsert(RM_ZHEAP_ID, XLOG_ZHEAP_FREEZE_XACT_SLOT); + PageSetLSN(page, recptr); + + if (TPDSlot) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + result = true; + goto cleanup; + } + + Assert(!RELATION_IS_LOCAL(relation)); + completed_xact_slots = palloc0(num_slots * sizeof(int)); + aborted_xact_slots = palloc0(num_slots * sizeof(int)); + + /* + * Try to reuse transaction slots of committed/aborted transactions. This + * is just like above but it will maintain a link to the previous + * transaction undo record in this slot. This is to ensure that if there + * is still any alive snapshot to which this transaction is not visible, + * it can fetch the record from undo and check the visibility. + */ + for (slot_no = 0; slot_no < num_slots; slot_no++) + { + if (!TransactionIdIsInProgress(transinfo[slot_no].xid)) + { + if (TransactionIdDidCommit(transinfo[slot_no].xid)) + completed_xact_slots[nCompletedXactSlots++] = slot_no; + else + aborted_xact_slots[nAbortedXactSlots++] = slot_no; + } + } + + if (nCompletedXactSlots > 0) + { + int i; + int slot_no; + + + START_CRIT_SECTION(); + + /* clear the transaction slot info on tuples */ + zheap_freeze_or_invalidate_tuples(buf, nCompletedXactSlots, + completed_xact_slots, false, TPDSlot); + + /* + * Clear the xid information from the slot but keep the undo record + * pointer as it is so that undo records of the transaction are + * accessible by traversing slot's undo chain even though the slots + * are reused. 
+ */ + if (TPDSlot) + { + for (i = 0; i < nCompletedXactSlots; i++) + { + int tpd_slot_id; + + slot_no = completed_xact_slots[i]; + /* calculate the actual slot no. */ + tpd_slot_id = slot_no + ZHEAP_PAGE_TRANS_SLOTS + 1; + + /* Clear xid from the TPD slot but keep the urec_ptr intact. */ + TPDPageSetTransactionSlotInfo(buf, tpd_slot_id, 0, + InvalidTransactionId, + transinfo[slot_no].urec_ptr); + } + } + else + { + for (i = 0; i < nCompletedXactSlots; i++) + { + slot_no = completed_xact_slots[i]; + transinfo[slot_no].xid_epoch = 0; + transinfo[slot_no].xid = InvalidTransactionId; + } + } + MarkBufferDirty(buf); + + /* + * Xlog Stuff + */ + if (RelationNeedsWAL(relation)) + { + XLogRecPtr recptr; + + XLogBeginInsert(); + + + /* See comments while registering frozen slot. */ + XLogRegisterData((char *) &nCompletedXactSlots, sizeof(uint16)); + XLogRegisterData((char *) completed_xact_slots, nCompletedXactSlots * sizeof(int)); + + XLogRegisterBuffer(0, buf, REGBUF_STANDARD); + + if (TPDSlot) + RegisterTPDBuffer(page, 1); + + recptr = XLogInsert(RM_ZHEAP_ID, XLOG_ZHEAP_INVALID_XACT_SLOT); + PageSetLSN(page, recptr); + + if (TPDSlot) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + result = true; + goto cleanup; + } + else if (nAbortedXactSlots) + { + int i; + int slot_no; + UndoRecPtr *urecptr = palloc(nAbortedXactSlots * sizeof(UndoRecPtr)); + TransactionId *xid = palloc(nAbortedXactSlots * sizeof(TransactionId)); + uint32 *epoch = palloc(nAbortedXactSlots * sizeof(uint32)); + + /* Collect slot information before releasing the lock. */ + for (i = 0; i < nAbortedXactSlots; i++) + { + urecptr[i] = transinfo[aborted_xact_slots[i]].urec_ptr; + xid[i] = transinfo[aborted_xact_slots[i]].xid; + epoch[i] = transinfo[aborted_xact_slots[i]].xid_epoch; + } + + /* + * We need to release and the lock before applying undo actions for a + * page as we might need to traverse the long undo chain for a page. + */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + /* + * Instead of just unlocking the TPD buffer like heap buffer its ok to + * unlock and release, because next time while trying to reserve the + * slot if we get the slot in TPD then anyway we will pin it again. + */ + if (TPDSlot) + UnlockReleaseTPDBuffers(); + + for (i = 0; i < nAbortedXactSlots; i++) + { + slot_no = aborted_xact_slots[i] + 1; + process_and_execute_undo_actions_page(urecptr[i], + relation, + buf, + epoch[i], + xid[i], + slot_no); + } + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + *lock_reacquired = true; + pfree(urecptr); + pfree(xid); + pfree(epoch); + + result = true; + goto cleanup; + } + +cleanup: + if (frozen_slots != NULL) + pfree(frozen_slots); + if (completed_xact_slots != NULL) + pfree(completed_xact_slots); + if (aborted_xact_slots != NULL) + pfree(aborted_xact_slots); + + return result; +} + +/* + * ZHeapTupleGetCid - Retrieve command id from tuple's undo record. + * + * It is expected that the caller of this function has atleast read lock + * on the buffer. + */ +CommandId +ZHeapTupleGetCid(ZHeapTuple zhtup, Buffer buf, UndoRecPtr urec_ptr, + int trans_slot_id) +{ + UnpackedUndoRecord *urec; + UndoRecPtr undo_rec_ptr; + CommandId current_cid; + TransactionId xid; + uint64 epoch_xid; + uint32 epoch; + bool TPDSlot = true; + int out_slot_no; + + /* + * For undo tuple caller will pass the valid slot id otherwise we can get it + * directly from the tuple. 
+ */ + if (trans_slot_id == InvalidXactSlotId) + { + trans_slot_id = ZHeapTupleHeaderGetXactSlot(zhtup->t_data); + TPDSlot = false; + } + + /* + * If urec_ptr is not provided, fetch the latest undo pointer from the page. + */ + if (!UndoRecPtrIsValid(urec_ptr)) + { + out_slot_no = GetTransactionSlotInfo(buf, + ItemPointerGetOffsetNumber(&zhtup->t_self), + trans_slot_id, + &epoch, + &xid, + &undo_rec_ptr, + true, + TPDSlot); + } + else + { + out_slot_no = GetTransactionSlotInfo(buf, + ItemPointerGetOffsetNumber(&zhtup->t_self), + trans_slot_id, + &epoch, + &xid, + NULL, + true, + TPDSlot); + undo_rec_ptr = urec_ptr; + } + + if (out_slot_no == ZHTUP_SLOT_FROZEN) + return InvalidCommandId; + + epoch_xid = (uint64 ) epoch; + epoch_xid = MakeEpochXid(epoch_xid, xid); + + if (epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return InvalidCommandId; + + Assert(UndoRecPtrIsValid(undo_rec_ptr)); + urec = UndoFetchRecord(undo_rec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + if (urec == NULL) + return InvalidCommandId; + + current_cid = urec->uur_cid; + + UndoRecordRelease(urec); + + return current_cid; +} + +/* + * ZHeapTupleGetCtid - Retrieve tuple id from tuple's undo record. + * + * It is expected that caller of this function has atleast read lock + * on the buffer and we call it only for non-inplace-updated tuples. + */ +void +ZHeapTupleGetCtid(ZHeapTuple zhtup, Buffer buf, UndoRecPtr urec_ptr, + ItemPointer ctid) +{ + *ctid = zhtup->t_self; + ZHeapPageGetCtid(ZHeapTupleHeaderGetXactSlot(zhtup->t_data), buf, + urec_ptr, ctid); +} + +/* + * ZHeapTupleGetSubXid - Retrieve subtransaction id from tuple's undo record. + * + * It is expected that caller of this function has atleast read lock. + * + * Note that we don't handle ZHEAP_INVALID_XACT_SLOT as this function is only + * called for in-progress transactions. If we need to call it for some other + * purpose, then we might need to deal with ZHEAP_INVALID_XACT_SLOT. + */ +void +ZHeapTupleGetSubXid(ZHeapTuple zhtup, Buffer buf, UndoRecPtr urec_ptr, + SubTransactionId *subxid) +{ + UnpackedUndoRecord *urec; + + *subxid = InvalidSubTransactionId; + + Assert(UndoRecPtrIsValid(urec_ptr)); + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* + * We mostly expect urec here to be valid as it try to fetch + * subtransactionid of tuples that are visible to the snapshot, so + * corresponding undo record can't be discarded. + * + * In case when it is called while index creation, it might be possible + * that the transaction that updated the tuple is committed and is not + * present the calling transaction's snapshot (it uses snapshotany while + * index creation), hence undo is discarded. + */ + if (urec == NULL) + return; + + if (urec->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SUBXACT) + { + Assert(urec->uur_payload.len > 0); + + /* + * For UNDO_UPDATE, we first store the CTID, then transaction slot + * and after that subtransaction id in payload. For + * UNDO_XID_LOCK_ONLY, we first store the Lockmode, then transaction + * slot and after that subtransaction id. So retrieve accordingly. 
+ */ + if (urec->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + { + if (urec->uur_type == UNDO_UPDATE) + *subxid = *(int *) ((char *) urec->uur_payload.data + + sizeof(ItemPointerData) + sizeof(TransactionId)); + else if (urec->uur_type == UNDO_XID_LOCK_ONLY || + urec->uur_type == UNDO_XID_LOCK_FOR_UPDATE || + urec->uur_type == UNDO_XID_MULTI_LOCK_ONLY) + *subxid = *(int *) ((char *) urec->uur_payload.data + + sizeof(LockTupleMode) + sizeof(TransactionId)); + else + *subxid = *(int *) ((char *) urec->uur_payload.data + + sizeof(TransactionId)); + } + else + { + if (urec->uur_type == UNDO_UPDATE) + *subxid = *(int *) ((char *) urec->uur_payload.data + + sizeof(ItemPointerData)); + else if (urec->uur_type == UNDO_XID_LOCK_ONLY || + urec->uur_type == UNDO_XID_LOCK_FOR_UPDATE || + urec->uur_type == UNDO_XID_MULTI_LOCK_ONLY) + *subxid = *(int *) ((char *) urec->uur_payload.data + + sizeof(LockTupleMode)); + else + *subxid = *(SubTransactionId *) urec->uur_payload.data; + } + } + + UndoRecordRelease(urec); +} + +/* + * ZHeapTupleGetSpecToken - Retrieve speculative token from tuple's undo + * record. + * + * It is expected that caller of this function has atleast read lock + * on the buffer. + */ +void +ZHeapTupleGetSpecToken(ZHeapTuple zhtup, Buffer buf, UndoRecPtr urec_ptr, + uint32 *specToken) +{ + UnpackedUndoRecord *urec; + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* + * We always expect urec to be valid as it try to fetch speculative token + * of tuples for which inserting transaction hasn't been committed. So, + * corresponding undo record can't be discarded. + */ + Assert(urec); + + *specToken = *(uint32 *) urec->uur_payload.data; + + UndoRecordRelease(urec); +} + +/* + * ZHeapTupleGetTransInfo - Retrieve transaction information of transaction + * that has modified the tuple. + * + * nobuflock indicates whether caller has lock on the buffer 'buf'. If nobuflock + * is false, we rely on the supplied tuple zhtup to fetch the slot and undo + * information. Otherwise, we take buffer lock and fetch the actual tuple. + */ +void +ZHeapTupleGetTransInfo(ZHeapTuple zhtup, Buffer buf, int *trans_slot, + uint64 *epoch_xid_out, TransactionId *xid_out, + CommandId *cid_out, UndoRecPtr *urec_ptr_out, + bool nobuflock) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + UndoRecPtr urec_ptr; + uint64 epoch; + uint32 tmp_epoch; + TransactionId xid = InvalidTransactionId; + CommandId cid; + ItemId lp; + Page page; + ItemPointer tid = &(zhtup->t_self); + int trans_slot_id; + OffsetNumber offnum = ItemPointerGetOffsetNumber(tid); + bool is_invalid_slot = false; + + /* + * We are going to access special space in the page to retrieve the + * transaction information and that requires share lock on buffer. + */ + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_SHARE); + + page = BufferGetPage(buf); + lp = PageGetItemId(page, offnum); + Assert(ItemIdIsNormal(lp) || ItemIdIsDeleted(lp)); + if (!ItemIdIsDeleted(lp)) + { + if (nobuflock) + { + /* + * If the tuple is updated such that its transaction slot has + * been changed, then we will never be able to get the correct + * tuple from undo. To avoid, that we get the latest tuple from + * page rather than relying on it's in-memory copy. 
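The payload parsing above is plain offset arithmetic: skip whatever was stored ahead of the subtransaction id (a CTID for updates or a lock mode for lockers, plus an optional transaction slot), then read it. A generic, self-contained version follows; the byte counts are parameters here rather than the real sizeof(ItemPointerData)/sizeof(TransactionId), and memcpy is used to avoid the unaligned reads that the direct pointer casts above assume are safe.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * Read the subtransaction id stored after 'leading_bytes' of payload and,
     * optionally, a transaction slot of 'slot_bytes' bytes.
     */
    uint32_t mini_read_subxid(const unsigned char *payload, size_t leading_bytes,
                              bool has_slot, size_t slot_bytes)
    {
        size_t   off = leading_bytes + (has_slot ? slot_bytes : 0);
        uint32_t subxid;

        memcpy(&subxid, payload + off, sizeof(subxid));    /* avoids unaligned reads */
        return subxid;
    }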
+ */ + zhtup->t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup->t_len = ItemIdGetLength(lp); + tuple = zhtup->t_data; + } + trans_slot_id = ZHeapTupleHeaderGetXactSlot(tuple); + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + goto slot_is_frozen; + trans_slot_id = GetTransactionSlotInfo(buf, offnum, trans_slot_id, + &tmp_epoch, &xid, &urec_ptr, + true, false); + /* + * It is quite possible that the item is showing some + * valid transaction slot, but actual slot has been frozen. + * This can happen when the slot belongs to TPD entry and + * the corresponding TPD entry is pruned. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + goto slot_is_frozen; + if (ZHeapTupleHasInvalidXact(tuple->t_infomask)) + is_invalid_slot = true; + } + else + { + /* + * If it's deleted and pruned, we fetch the slot and undo information + * from the item pointer itself. + */ + trans_slot_id = ItemIdGetTransactionSlot(lp); + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + goto slot_is_frozen; + trans_slot_id = GetTransactionSlotInfo(buf, offnum, trans_slot_id, + &tmp_epoch, &xid, &urec_ptr, + true, false); + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + goto slot_is_frozen; + if (ItemIdGetVisibilityInfo(lp) & ITEMID_XACT_INVALID) + is_invalid_slot = true; + } + + /* + * We need to fetch all the transaction related information from undo + * record for the tuples that point to a slot that gets invalidated for + * reuse at some point of time. See PageFreezeTransSlots. + */ + if (is_invalid_slot) + { + xid = InvalidTransactionId; + FetchTransInfoFromUndo(zhtup, &epoch, &xid, &cid, &urec_ptr, false); + } + else if (ZHeapTupleHasMultiLockers(tuple->t_infomask)) + { + /* + * When we take a lock on the tuple, we never set locker's slot on the + * tuple. However, we use the newly computed infomask for the tuple + * and write its current infomask in undo due to which + * INVALID_XACT_SLOT bit of the tuple will move to undo. In such + * cases, if we need the previous inserter/updater's transaction id, + * we've to skip locker's undo records. + */ + xid = InvalidTransactionId; + FetchTransInfoFromUndo(zhtup, &epoch, &xid, &cid, &urec_ptr, true); + } + else + { + if(cid_out && TransactionIdIsCurrentTransactionId(xid)) + { + lp = PageGetItemId(page, offnum); + if (!ItemIdIsDeleted(lp)) + cid = ZHeapTupleGetCid(zhtup, buf, InvalidUndoRecPtr, InvalidXactSlotId); + else + cid = ZHeapPageGetCid(buf, trans_slot_id, tmp_epoch, xid, + urec_ptr, offnum); + } + epoch = (uint64) tmp_epoch; + } + + goto done; + +slot_is_frozen: + trans_slot_id = ZHTUP_SLOT_FROZEN; + epoch = 0; + xid = InvalidTransactionId; + cid = InvalidCommandId; + urec_ptr = InvalidUndoRecPtr; + +done: + /* Set the value of required parameters. */ + if (trans_slot) + *trans_slot = trans_slot_id; + if (epoch_xid_out) + *epoch_xid_out = MakeEpochXid(epoch, xid); + if (xid_out) + *xid_out = xid; + if (cid_out) + *cid_out = cid; + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + if (urec_ptr_out) + *urec_ptr_out = urec_ptr; + + return; +} + +/* + * ZHeapPageGetCid - Retrieve command id from tuple's undo record. + * + * This is similar to ZHeapTupleGetCid with a difference that here we use + * transaction slot to fetch the appropriate undo record. It is expected that + * the caller of this function has atleast read lock on the buffer. 
+ */ +CommandId +ZHeapPageGetCid(Buffer buf, int trans_slot, uint32 epoch, TransactionId xid, + UndoRecPtr urec_ptr, OffsetNumber off) +{ + UnpackedUndoRecord *urec; + CommandId current_cid; + uint64 epoch_xid; + + epoch_xid = (uint64) epoch; + epoch_xid = MakeEpochXid(epoch_xid, xid); + + if (epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return InvalidCommandId; + + urec = UndoFetchRecord(urec_ptr, + BufferGetBlockNumber(buf), + off, + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + if (urec == NULL) + return InvalidCommandId; + + current_cid = urec->uur_cid; + + UndoRecordRelease(urec); + + return current_cid; +} + + +/* + * ZHeapPageGetCtid - Retrieve tuple id from tuple's undo record. + * + * It is expected that caller of this function has atleast read lock. + */ +void +ZHeapPageGetCtid(int trans_slot, Buffer buf, UndoRecPtr urec_ptr, + ItemPointer ctid) +{ + UnpackedUndoRecord *urec; + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(ctid), + ItemPointerGetOffsetNumber(ctid), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* + * We always expect urec here to be valid as it try to fetch ctid of + * tuples that are visible to the snapshot, so corresponding undo record + * can't be discarded. + */ + Assert(urec); + + /* + * The tuple should be deleted/updated previously. Else, the caller should + * not be calling this function. + */ + Assert(urec->uur_type == UNDO_DELETE || urec->uur_type == UNDO_UPDATE); + + /* + * For a deleted tuple, ctid refers to self. + */ + if (urec->uur_type != UNDO_DELETE) + { + Assert(urec->uur_payload.len > 0); + *ctid = *(ItemPointer) urec->uur_payload.data; + } + + UndoRecordRelease(urec); +} + + +/* + * ValidateTuplesXact - Check if the tuple is modified by priorXmax. + * + * We need to traverse the undo chain of tuple to see if any of its + * prior version is modified by priorXmax. + * + * nobuflock indicates whether caller has lock on the buffer 'buf'. + */ +bool +ValidateTuplesXact(ZHeapTuple tuple, Snapshot snapshot, Buffer buf, + TransactionId priorXmax, bool nobuflock) +{ + ZHeapTupleData zhtup; + UnpackedUndoRecord *urec = NULL; + UndoRecPtr urec_ptr; + ZHeapTuple undo_tup = NULL; + ItemPointer tid = &(tuple->t_self); + ItemId lp; + Page page; + TransactionId xid; + TransactionId prev_undo_xid = InvalidTransactionId; + uint32 epoch; + int trans_slot_id = InvalidXactSlotId; + int prev_trans_slot_id; + OffsetNumber offnum; + bool valid = false; + + /* + * As we are going to access special space in the page to retrieve the + * transaction information share lock on buffer is required. + */ + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_SHARE); + + page = BufferGetPage(buf); + offnum = ItemPointerGetOffsetNumber(tid); + lp = PageGetItemId(page, offnum); + + zhtup.t_tableOid = tuple->t_tableOid; + zhtup.t_self = *tid; + + if(ItemIdIsDead(lp) || !ItemIdHasStorage(lp)) + { + /* + * If the tuple is already removed by Rollbacks/pruning, then we + * don't need to proceed further. + */ + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + return false; + } + else if (!ItemIdIsDeleted(lp)) + { + /* + * If the tuple is updated such that its transaction slot has been + * changed, then we will never be able to get the correct tuple from undo. + * To avoid, that we get the latest tuple from page rather than relying on + * it's in-memory copy. 
+ */ + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + trans_slot_id = ZHeapTupleHeaderGetXactSlot(zhtup.t_data); + trans_slot_id = GetTransactionSlotInfo(buf, offnum, trans_slot_id, + &epoch, &xid, &urec_ptr, true, + false); + } + else + { + ZHeapTuple vis_tuple; + trans_slot_id = ItemIdGetTransactionSlot(lp); + trans_slot_id = GetTransactionSlotInfo(buf, offnum, trans_slot_id, + &epoch, &xid, &urec_ptr, true, + false); + + /* + * XXX for now we shall get a visible undo tuple for the given + * dirty snapshot. The tuple data is needed below in + * CopyTupleFromUndoRecord and some undo records will not have + * tuple data and mask info with them. + * */ + vis_tuple = ZHeapGetVisibleTuple(ItemPointerGetOffsetNumber(tid), + snapshot, buf, NULL); + Assert(vis_tuple != NULL); + zhtup.t_data = vis_tuple->t_data; + zhtup.t_len = vis_tuple->t_len; + } + + /* + * Current xid on tuple must not precede oldestXidHavingUndo as it + * will be greater than priorXmax which was not visible to our + * snapshot. + */ + Assert(trans_slot_id != ZHTUP_SLOT_FROZEN); + + if (TransactionIdEquals(xid, priorXmax)) + { + valid = true; + goto tuple_is_valid; + } + + undo_tup = &zhtup; + + /* + * Current xid on tuple must not precede RecentGlobalXmin as it will be + * greater than priorXmax which was not visible to our snapshot. + */ + Assert(TransactionIdEquals(xid, InvalidTransactionId) || + !TransactionIdPrecedes(xid, RecentGlobalXmin)); + + do + { + prev_trans_slot_id = trans_slot_id; + Assert(prev_trans_slot_id != ZHTUP_SLOT_FROZEN); + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&undo_tup->t_self), + ItemPointerGetOffsetNumber(&undo_tup->t_self), + prev_undo_xid, + NULL, + ZHeapSatisfyUndoRecord); + + /* + * As we still hold a snapshot to which priorXmax is not visible, neither + * the transaction slot on tuple can be marked as frozen nor the + * corresponding undo be discarded. + */ + Assert(urec != NULL); + + if (TransactionIdEquals(urec->uur_xid, priorXmax)) + { + valid = true; + goto tuple_is_valid; + } + + /* don't free the tuple passed by caller */ + undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id, NULL, + (undo_tup) == (&zhtup) ? false : true, + page); + + Assert(!TransactionIdPrecedes(urec->uur_prevxid, RecentGlobalXmin)); + + prev_undo_xid = urec->uur_prevxid; + + /* + * Change the undo chain if the undo tuple is stamped with the different + * transaction slot. + */ + if (prev_trans_slot_id != trans_slot_id) + { + trans_slot_id = GetTransactionSlotInfo(buf, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &urec_ptr, + true, + true); + } + else + urec_ptr = urec->uur_blkprev; + + UndoRecordRelease(urec); + urec = NULL; + } while (UndoRecPtrIsValid(urec_ptr)); + +tuple_is_valid: + if (urec) + UndoRecordRelease(urec); + if (undo_tup && undo_tup != &zhtup) + pfree(undo_tup); + + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + return valid; +} + +/* + * Initialize zheap page. + */ +void +ZheapInitPage(Page page, Size pageSize) +{ + ZHeapPageOpaque opaque; + int i; + + /* + * The size of the opaque space depends on the number of transaction + * slots in a page. We set it to default here. 
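The do-while loop in ValidateTuplesXact above is, at its core, a walk down the tuple's undo chain until a version stamped with priorXmax is found or the chain ends. The toy model below captures just that shape with an in-memory linked list; the real code additionally copies each older tuple version out of the undo record and may switch to a different slot's chain.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* A toy undo record: the xid that created it and a link to the older record. */
    typedef struct MiniUndoRec
    {
        uint32_t xid;
        const struct MiniUndoRec *prev;    /* plays the role of uur_blkprev */
    } MiniUndoRec;

    /* Walk the chain until 'prior_xmax' is found or the undiscarded undo runs out. */
    bool mini_chain_contains_xid(const MiniUndoRec *rec, uint32_t prior_xmax)
    {
        for (; rec != NULL; rec = rec->prev)
        {
            if (rec->xid == prior_xmax)
                return true;
        }
        return false;
    }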
+ */ + PageInit(page, pageSize, ZHEAP_PAGE_TRANS_SLOTS * sizeof(TransInfo)); + + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + for (i = 0; i < ZHEAP_PAGE_TRANS_SLOTS; i++) + { + opaque->transinfo[i].xid_epoch = 0; + opaque->transinfo[i].xid = InvalidTransactionId; + opaque->transinfo[i].urec_ptr = InvalidUndoRecPtr; + } +} + +/* + * zheap_init_meta_page - Initialize the metapage. + */ +void +zheap_init_meta_page(Buffer metabuf, BlockNumber first_blkno, + BlockNumber last_blkno) +{ + ZHeapMetaPage metap; + Page page; + + page = BufferGetPage(metabuf); + PageInit(page, BufferGetPageSize(metabuf), 0); + + metap = ZHeapPageGetMeta(page); + metap->zhm_magic = ZHEAP_MAGIC; + metap->zhm_version = ZHEAP_VERSION; + metap->zhm_first_used_tpd_page = first_blkno; + metap->zhm_last_used_tpd_page = last_blkno; + + /* + * Set pd_lower just past the end of the metadata. This is essential, + * because without doing so, metadata will be lost if xlog.c compresses + * the page. + */ + ((PageHeader) page)->pd_lower = + ((char *) metap + sizeof(ZHeapMetaPageData)) - (char *) page; +} + +/* + * ZheapInitMetaPage - Allocate and initialize the zheap metapage. + */ +void +ZheapInitMetaPage(Relation rel, ForkNumber forkNum) +{ + Buffer buf; + bool use_wal; + + buf = ReadBufferExtended(rel, forkNum, P_NEW, RBM_NORMAL, NULL); + if (BufferGetBlockNumber(buf) != ZHEAP_METAPAGE) + elog(ERROR, "unexpected zheap relation size: %u, should be %u", + BufferGetBlockNumber(buf), ZHEAP_METAPAGE); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + START_CRIT_SECTION(); + + zheap_init_meta_page(buf, InvalidBlockNumber, InvalidBlockNumber); + MarkBufferDirty(buf); + + /* + * WAL log creation of metapage if the relation is persistent, or this is the + * init fork. Init forks for unlogged relations always need to be WAL + * logged. + */ + use_wal = RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM; + + if (use_wal) + log_newpage_buffer(buf, true); + + END_CRIT_SECTION(); + + UnlockReleaseBuffer(buf); +} + +/* + * ----------- + * Zheap scan related API's. + * ----------- + */ + +/* + * zinitscan - same as initscan except for tuple initialization + */ +static void +zinitscan(ZHeapScanDesc scan, ScanKey key, bool keep_startblock) +{ + bool allow_strat; + bool allow_sync; + + /* + * Determine the number of blocks we have to scan. + * + * It is sufficient to do this once at scan start, since any tuples added + * while the scan is in progress will be invisible to my snapshot anyway. + * (That is not true when using a non-MVCC snapshot. However, we couldn't + * guarantee to return tuples added after scan start anyway, since they + * might go into pages we already scanned. To guarantee consistent + * results for a non-MVCC snapshot, the caller must hold some higher-level + * lock that ensures the interesting tuple(s) won't change.) + */ + if (scan->rs_scan.rs_parallel != NULL) + scan->rs_scan.rs_nblocks = scan->rs_scan.rs_parallel->phs_nblocks; + else + scan->rs_scan.rs_nblocks = RelationGetNumberOfBlocks(scan->rs_scan.rs_rd); + + /* + * If the table is large relative to NBuffers, use a bulk-read access + * strategy and enable synchronized scanning (see syncscan.c). Although + * the thresholds for these features could be different, we make them the + * same so that there are only two behaviors to tune rather than four. + * (However, some callers need to be able to disable one or both of these + * behaviors, independently of the size of the table; also there is a GUC + * variable that can disable synchronized scanning.) 
+ * + * Note that heap_parallelscan_initialize has a very similar test; if you + * change this, consider changing that one, too. + */ + if (!RelationUsesLocalBuffers(scan->rs_scan.rs_rd) && + scan->rs_scan.rs_nblocks > NBuffers / 4) + { + allow_strat = scan->rs_scan.rs_allow_strat; + allow_sync = scan->rs_scan.rs_allow_sync; + } + else + allow_strat = allow_sync = false; + + if (allow_strat) + { + /* During a rescan, keep the previous strategy object. */ + if (scan->rs_strategy == NULL) + scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD); + } + else + { + if (scan->rs_strategy != NULL) + FreeAccessStrategy(scan->rs_strategy); + scan->rs_strategy = NULL; + } + + if (scan->rs_scan.rs_parallel != NULL) + { + /* For parallel scan, believe whatever ParallelHeapScanDesc says. */ + scan->rs_scan.rs_syncscan = scan->rs_scan.rs_parallel->phs_syncscan; + } + else if (keep_startblock) + { + /* + * When rescanning, we want to keep the previous startblock setting, + * so that rewinding a cursor doesn't generate surprising results. + * Reset the active syncscan setting, though. + */ + scan->rs_scan.rs_syncscan = (allow_sync && synchronize_seqscans); + } + else if (allow_sync && synchronize_seqscans) + { + scan->rs_scan.rs_syncscan = true; + scan->rs_scan.rs_startblock = ss_get_location(scan->rs_scan.rs_rd, scan->rs_scan.rs_nblocks); + /* Skip metapage */ + if (scan->rs_scan.rs_startblock == ZHEAP_METAPAGE) + scan->rs_scan.rs_startblock = ZHEAP_METAPAGE + 1; + } + else + { + scan->rs_scan.rs_syncscan = false; + scan->rs_scan.rs_startblock = ZHEAP_METAPAGE + 1; + } + + scan->rs_scan.rs_numblocks = InvalidBlockNumber; + scan->rs_inited = false; + scan->rs_cbuf = InvalidBuffer; + scan->rs_cblock = InvalidBlockNumber; + + /* page-at-a-time fields are always invalid when not rs_inited */ + + /* + * copy the scan key, if appropriate + */ + if (key != NULL) + memcpy(scan->rs_scan.rs_key, key, scan->rs_scan.rs_nkeys * sizeof(ScanKeyData)); + + /* + * Currently, we don't have a stats counter for bitmap heap scans (but the + * underlying bitmap index scans will be counted) or sample scans (we only + * update stats for tuple fetches there) + */ + if (!scan->rs_scan.rs_bitmapscan && !scan->rs_scan.rs_samplescan) + pgstat_count_heap_scan(scan->rs_scan.rs_rd); +} + +/* ---------------- + * zheap_rescan - similar to heap_rescan + * ---------------- + */ +void +zheap_rescan(TableScanDesc sscan, ScanKey key, bool set_params, + bool allow_strat, bool allow_sync, bool allow_pagemode) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + + if (set_params) + { + scan->rs_scan.rs_allow_strat = allow_strat; + scan->rs_scan.rs_allow_sync = allow_sync; + scan->rs_scan.rs_pageatatime = allow_pagemode && IsMVCCSnapshot(scan->rs_scan.rs_snapshot); + } + + /* + * unpin scan buffers + */ + if (BufferIsValid(scan->rs_cbuf)) + ReleaseBuffer(scan->rs_cbuf); + + /* + * reinitialize scan descriptor + */ + zinitscan(scan, key, true); + + /* + * reset parallel scan, if present + */ + if (scan->rs_scan.rs_parallel != NULL) + { + ParallelTableScanDesc parallel_scan; + + /* + * Caller is responsible for making sure that all workers have + * finished the scan before calling this. 
+ */ + parallel_scan = scan->rs_scan.rs_parallel; + pg_atomic_write_u64(¶llel_scan->phs_nallocated, 0); + } +} + +/* + * zheap_beginscan - same as heap_beginscan except for tuple initialization + */ +TableScanDesc +zheap_beginscan(Relation relation, Snapshot snapshot, + int nkeys, ScanKey key, + ParallelTableScanDesc parallel_scan, + bool allow_strat, + bool allow_sync, + bool allow_pagemode, + bool is_bitmapscan, + bool is_samplescan, + bool temp_snap) +{ + ZHeapScanDesc scan; + + /* + * increment relation ref count while scanning relation + * + * This is just to make really sure the relcache entry won't go away while + * the scan has a pointer to it. Caller should be holding the rel open + * anyway, so this is redundant in all normal scenarios... + */ + RelationIncrementReferenceCount(relation); + + /* + * allocate and initialize scan descriptor + */ + scan = (ZHeapScanDesc) palloc(sizeof(ZHeapScanDescData)); + + scan->rs_scan.rs_rd = relation; + scan->rs_scan.rs_snapshot = snapshot; + scan->rs_scan.rs_nkeys = nkeys; + scan->rs_scan.rs_bitmapscan = is_bitmapscan; + scan->rs_scan.rs_samplescan = is_samplescan; + scan->rs_strategy = NULL; /* set in zinitscan */ + scan->rs_scan.rs_startblock = 0; /* set in initscan */ + scan->rs_scan.rs_allow_strat = allow_strat; + scan->rs_scan.rs_allow_sync = allow_sync; + scan->rs_scan.rs_temp_snap = temp_snap; + scan->rs_scan.rs_parallel = parallel_scan; + scan->rs_ntuples = 0; // ZBORKED ? + + /* + * we can use page-at-a-time mode if it's an MVCC-safe snapshot + */ + scan->rs_scan.rs_pageatatime = allow_pagemode && snapshot && IsMVCCSnapshot(snapshot); + + /* + * For a seqscan in a serializable transaction, acquire a predicate lock + * on the entire relation. This is required not only to lock all the + * matching tuples, but also to conflict with new insertions into the + * table. In an indexscan, we take page locks on the index pages covering + * the range specified in the scan qual, but in a heap scan there is + * nothing more fine-grained to lock. A bitmap scan is a different story, + * there we have already scanned the index and locked the index pages + * covering the predicate. But in that case we still have to lock any + * matching heap tuples. + */ + if (!is_bitmapscan && snapshot) + PredicateLockRelation(relation, snapshot); + + scan->rs_cztup = NULL; + + + /* + * we do this here instead of in initscan() because heap_rescan also calls + * initscan() and we don't want to allocate memory again + */ + if (nkeys > 0) + scan->rs_scan.rs_key = (ScanKey) palloc(sizeof(ScanKeyData) * nkeys); + else + scan->rs_scan.rs_key = NULL; + + zinitscan(scan, key, false); + + return (TableScanDesc) scan; +} + +/* + * zheap_setscanlimits - restrict range of a zheapscan + * + * startBlk is the page to start at + * numBlks is number of pages to scan (InvalidBlockNumber means "all") + */ +void +zheap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlks) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + + Assert(!scan->rs_inited); /* else too late to change */ + Assert(!scan->rs_scan.rs_syncscan); /* else rs_startblock is + * significant */ + + /* + * Check startBlk is valid (but allow case of zero blocks...). + * Consider meta-page as well. + */ + Assert(startBlk == 0 || startBlk < scan->rs_scan.rs_nblocks || + startBlk == ZHEAP_METAPAGE + 1); + + scan->rs_scan.rs_startblock = startBlk; + scan->rs_scan.rs_numblocks = numBlks; +} + +/* ---------------- + * heap_update_snapshot + * + * Update snapshot info in heap scan descriptor. 
+ * ---------------- + */ +void +zheap_update_snapshot(TableScanDesc sscan, Snapshot snapshot) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + + Assert(IsMVCCSnapshot(snapshot)); + + RegisterSnapshot(snapshot); + scan->rs_scan.rs_snapshot = snapshot; + scan->rs_scan.rs_temp_snap = true; +} + +/* + * zheapgetpage - Same as heapgetpage, but operate on zheap page and + * in page-at-a-time mode, visible tuples are stored in rs_visztuples. + * + * It returns false, if we can't scan the page (like in case of TPD page), + * otherwise, return true. + */ +bool +zheapgetpage(TableScanDesc sscan, BlockNumber page) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + Buffer buffer; + Snapshot snapshot; + Page dp; + int lines; + int ntup; + OffsetNumber lineoff; + ItemId lpp; + bool all_visible; + uint8 vmstatus; + Buffer vmbuffer = InvalidBuffer; + + Assert(page < scan->rs_scan.rs_nblocks); + + /* release previous scan buffer, if any */ + if (BufferIsValid(scan->rs_cbuf)) + { + ReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + } + + // ZBORKED + if (page == ZHEAP_METAPAGE) + { + /* needs to be udpated to keep track of scan position */ + scan->rs_cblock = page; + return false; + } + + /* + * Be sure to check for interrupts at least once per page. Checks at + * higher code levels won't be able to stop a seqscan that encounters many + * pages' worth of consecutive dead tuples. + */ + CHECK_FOR_INTERRUPTS(); + + /* read page using selected strategy */ + buffer = ReadBufferExtended(scan->rs_scan.rs_rd, MAIN_FORKNUM, page, + RBM_NORMAL, scan->rs_strategy); + scan->rs_cblock = page; + + /* + * We must hold share lock on the buffer content while examining tuple + * visibility. Afterwards, however, the tuples we have found to be + * visible are guaranteed good as long as we hold the buffer pin. + */ + LockBuffer(buffer, BUFFER_LOCK_SHARE); + + dp = BufferGetPage(buffer); + + /* + * Skip TPD pages. As of now, the size of special space in TPD pages is + * different from other zheap pages like metapage and regular zheap page, + * however, if that changes, we might need to explicitly store pagetype + * flag somewhere. + * + * Fixme - As an exception, the size of special space for zheap page + * with one transaction slot will match with TPD page's special size. + */ + if (PageGetSpecialSize(dp) == MAXALIGN(sizeof(TPDPageOpaqueData))) + { + UnlockReleaseBuffer(buffer); + return false; + } + else if (!scan->rs_scan.rs_pageatatime) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + scan->rs_cbuf = buffer; + return true; + } + + snapshot = scan->rs_scan.rs_snapshot; + + /* + * Prune and repair fragmentation for the whole page, if possible. + * Fixme - Pruning is required in zheap for deletes, so we need to + * make it work. + */ + /* heap_page_prune_opt(scan->rs_rd, buffer); */ + + TestForOldSnapshot(snapshot, scan->rs_scan.rs_rd, dp); + lines = PageGetMaxOffsetNumber(dp); + ntup = 0; + + /* + * If the all-visible flag indicates that all tuples on the page are + * visible to everyone, we can skip the per-tuple visibility tests. + * + * Note: In hot standby, a tuple that's already visible to all + * transactions in the master might still be invisible to a read-only + * transaction in the standby. We partly handle this problem by tracking + * the minimum xmin of visible tuples as the cut-off XID while marking a + * page all-visible on master and WAL log that along with the visibility + * map SET operation. 
In hot standby, we wait for (or abort) all
+	 * transactions that potentially cannot see one or more tuples on the
+	 * page.  That's how index-only scans work fine in hot standby.
+	 */
+
+	vmstatus = visibilitymap_get_status(scan->rs_scan.rs_rd, page, &vmbuffer);
+
+	all_visible = (vmstatus & VISIBILITYMAP_ALL_VISIBLE) &&
+		!snapshot->takenDuringRecovery;
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+
+	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
+		 lineoff <= lines;
+		 lineoff++, lpp++)
+	{
+		if (ItemIdIsNormal(lpp) || ItemIdIsDeleted(lpp))
+		{
+			ZHeapTuple	loctup = NULL;
+			ZHeapTuple	resulttup = NULL;
+			Size		loctup_len;
+			bool		valid = false;
+			ItemPointerData tid;
+
+			ItemPointerSet(&tid, page, lineoff);
+
+			if (ItemIdIsDeleted(lpp))
+			{
+				if (all_visible)
+				{
+					valid = false;
+					resulttup = NULL;
+				}
+				else
+				{
+					resulttup = ZHeapGetVisibleTuple(lineoff, snapshot, buffer,
+													 NULL);
+					valid = resulttup ? true : false;
+				}
+			}
+			else
+			{
+				loctup_len = ItemIdGetLength(lpp);
+
+				loctup = palloc(ZHEAPTUPLESIZE + loctup_len);
+				loctup->t_data = (ZHeapTupleHeader) ((char *) loctup + ZHEAPTUPLESIZE);
+
+				loctup->t_tableOid = RelationGetRelid(scan->rs_scan.rs_rd);
+				loctup->t_len = loctup_len;
+				loctup->t_self = tid;
+
+				/*
+				 * We always need to make a copy of zheap tuple as once we
+				 * release the buffer, an in-place update can change the tuple.
+				 */
+				memcpy(loctup->t_data,
+					   ((ZHeapTupleHeader) PageGetItem((Page) dp, lpp)),
+					   loctup->t_len);
+
+				if (all_visible)
+				{
+					valid = true;
+					resulttup = loctup;
+				}
+				else
+				{
+					resulttup = ZHeapTupleSatisfies(loctup, snapshot,
+													buffer, NULL);
+					valid = resulttup ? true : false;
+				}
+			}
+
+			/*
+			 * If any prior version is visible, we pass latest visible as
+			 * true. The state of latest version of tuple is determined by
+			 * the called function.
+			 *
+			 * Note that, it's possible that tuple is updated in-place and
+			 * we're seeing some prior version of that. We handle that case
+			 * in ZHeapTupleHasSerializableConflictOut.
+			 */
+			CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, (void *) &tid,
+											buffer, snapshot);
+
+			if (valid)
+				scan->rs_visztuples[ntup++] = resulttup;
+		}
+	}
+
+	UnlockReleaseBuffer(buffer);
+
+	Assert(ntup <= MaxZHeapTuplesPerPage);
+	scan->rs_ntuples = ntup;
+
+	return true;
+}
+
+/* ----------------
+ *	zheapgettup_pagemode - fetch next zheap tuple in page-at-a-time mode
+ *
+ * Note that here we process only regular zheap pages, meta and tpd pages are
+ * skipped.
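+ *
+ * For reference, zheap_getnext() and zheap_getnextslot() below choose
+ * between the two fetch paths based on the page-at-a-time flag, roughly:
+ *
+ *		if (scan->rs_scan.rs_pageatatime)
+ *			zhtup = zheapgettup_pagemode(scan, direction);
+ *		else
+ *			zhtup = zheapgettup(scan, direction);
+ *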
+ * ---------------- + */ +static ZHeapTuple +zheapgettup_pagemode(ZHeapScanDesc scan, + ScanDirection dir) +{ + ZHeapTuple tuple = scan->rs_cztup; + bool backward = ScanDirectionIsBackward(dir); + BlockNumber page; + bool finished; + bool valid; + int lines; + int lineindex; + int linesleft; + int i = 0; + + /* + * calculate next starting lineindex, given scan direction + */ + if (ScanDirectionIsForward(dir)) + { + if (!scan->rs_inited) + { + /* + * return null immediately if relation is empty + */ + if (scan->rs_scan.rs_nblocks == ZHEAP_METAPAGE + 1 || + scan->rs_scan.rs_numblocks == ZHEAP_METAPAGE + 1) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + tuple = NULL; + return tuple; + } + if (scan->rs_scan.rs_parallel != NULL) + { + table_parallelscan_startblock_init(&scan->rs_scan); + + page = table_parallelscan_nextpage(&scan->rs_scan); + + /* Skip metapage */ + if (page == ZHEAP_METAPAGE) + page = table_parallelscan_nextpage(&scan->rs_scan); + + /* Other processes might have already finished the scan. */ + if (page == InvalidBlockNumber) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + tuple = NULL; + return tuple; + } + } + else + page = scan->rs_scan.rs_startblock; /* first page */ + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + goto get_next_page; + + lineindex = 0; + scan->rs_inited = true; + } + else + { + /* continue from previously returned page/tuple */ + page = scan->rs_cblock; /* current page */ + lineindex = scan->rs_cindex + 1; + } + + /*dp = BufferGetPage(scan->rs_cbuf); + TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);*/ + lines = scan->rs_ntuples; + /* page and lineindex now reference the next visible tid */ + + linesleft = lines - lineindex; + } + else if (backward) + { + /* backward parallel scan not supported */ + Assert(scan->rs_scan.rs_parallel == NULL); + + if (!scan->rs_inited) + { + /* + * return null immediately if relation is empty + */ + if (scan->rs_scan.rs_nblocks == ZHEAP_METAPAGE + 1 || + scan->rs_scan.rs_numblocks == ZHEAP_METAPAGE + 1) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + tuple = NULL; + return tuple; + } + + /* + * Disable reporting to syncscan logic in a backwards scan; it's + * not very likely anyone else is doing the same thing at the same + * time, and much more likely that we'll just bollix things for + * forward scanners. + */ + scan->rs_scan.rs_syncscan = false; + /* start from last page of the scan */ + if (scan->rs_scan.rs_startblock > ZHEAP_METAPAGE + 1) + page = scan->rs_scan.rs_startblock - 1; + else + page = scan->rs_scan.rs_nblocks - 1; + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + goto get_next_page; + } + else + { + /* continue from previously returned page/tuple */ + page = scan->rs_cblock; /* current page */ + } + + lines = scan->rs_ntuples; + + if (!scan->rs_inited) + { + lineindex = lines - 1; + scan->rs_inited = true; + } + else + { + lineindex = scan->rs_cindex - 1; + } + /* page and lineindex now reference the previous visible tid */ + + linesleft = lineindex + 1; + } + else + { + /* + * In executor it seems NoMovementScanDirection is nothing but + * do-nothing flag so we should not be here. The else part is still + * here to keep the code as in heapgettup_pagemode. 
+ */ + Assert(false); + return NULL; + } + +get_next_tuple: + /* + * advance the scan until we find a qualifying tuple or run out of stuff + * to scan + */ + while (linesleft > 0) + { + tuple = scan->rs_visztuples[lineindex]; + scan->rs_cindex = lineindex; + return tuple; + } + + /* + * if we get here, it means we've exhausted the items on this page and + * it's time to move to the next. + * For now we shall free all of the zheap tuples stored in rs_visztuples. + * Later a better memory management is required. + */ + for (i = 0; i < scan->rs_ntuples; i++) + zheap_freetuple(scan->rs_visztuples[i]); + scan->rs_ntuples = 0; + +get_next_page: + for (;;) + { + if (backward) + { + finished = (page == scan->rs_scan.rs_startblock) || + (scan->rs_scan.rs_numblocks != InvalidBlockNumber ? --scan->rs_scan.rs_numblocks == 0 : false); + if (page == ZHEAP_METAPAGE + 1) + page = scan->rs_scan.rs_nblocks; + page--; + } + else if (scan->rs_scan.rs_parallel != NULL) + { + page = table_parallelscan_nextpage(&scan->rs_scan); + /* Skip metapage */ + if (page == ZHEAP_METAPAGE) + page = table_parallelscan_nextpage(&scan->rs_scan); + finished = (page == InvalidBlockNumber); + } + else + { + page++; + if (page >= scan->rs_scan.rs_nblocks) + page = 0; + + if (page == ZHEAP_METAPAGE) + { + /* + * Since, we're skipping the metapage, we should update the scan + * location if sync scan is enabled. + */ + if (scan->rs_scan.rs_syncscan) + ss_report_location(scan->rs_scan.rs_rd, page); + page++; + } + + finished = (page == scan->rs_scan.rs_startblock) || + (scan->rs_scan.rs_numblocks != InvalidBlockNumber ? --scan->rs_scan.rs_numblocks == 0 : false); + + /* + * Report our new scan position for synchronization purposes. We + * don't do that when moving backwards, however. That would just + * mess up any other forward-moving scanners. + * + * Note: we do this before checking for end of scan so that the + * final state of the position hint is back at the start of the + * rel. That's not strictly necessary, but otherwise when you run + * the same query multiple times the starting position would shift + * a little bit backwards on every invocation, which is confusing. + * We don't guarantee any specific ordering in general, though. + */ + if (scan->rs_scan.rs_syncscan) + ss_report_location(scan->rs_scan.rs_rd, page); + } + + /* + * return NULL if we've exhausted all the pages + */ + if (finished) + { + if (BufferIsValid(scan->rs_cbuf)) + ReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + scan->rs_cblock = InvalidBlockNumber; + tuple = NULL; + scan->rs_inited = false; + return tuple; + } + + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + continue; + + if (!scan->rs_inited) + scan->rs_inited = true; + lines = scan->rs_ntuples; + linesleft = lines; + if (backward) + lineindex = lines - 1; + else + lineindex = 0; + + goto get_next_tuple; + } +} + +/* + * Similar to heapgettup, but for fetching zheap tuple. + * + * Note that here we process only regular zheap pages, meta and tpd pages are + * skipped. 
+ */ +static ZHeapTuple +zheapgettup(ZHeapScanDesc scan, + ScanDirection dir) +{ + ZHeapTuple tuple = scan->rs_cztup; + Snapshot snapshot = scan->rs_scan.rs_snapshot; + bool backward = ScanDirectionIsBackward(dir); + BlockNumber page; + bool finished; + bool valid; + Page dp; + int lines; + OffsetNumber lineoff; + int linesleft; + ItemId lpp; + + /* + * calculate next starting lineoff, given scan direction + */ + if (ScanDirectionIsForward(dir)) + { + if (!scan->rs_inited) + { + /* + * return null immediately if relation is empty + */ + if (scan->rs_scan.rs_nblocks == ZHEAP_METAPAGE + 1 || + scan->rs_scan.rs_numblocks == ZHEAP_METAPAGE + 1) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + return NULL; + } + if (scan->rs_scan.rs_parallel != NULL) + { + table_parallelscan_startblock_init(&scan->rs_scan); + + page = table_parallelscan_nextpage(&scan->rs_scan); + + /* Skip metapage */ + if (page == ZHEAP_METAPAGE) + page = table_parallelscan_nextpage(&scan->rs_scan); + + /* Other processes might have already finished the scan. */ + if (page == InvalidBlockNumber) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + return NULL; + } + } + else + page = scan->rs_scan.rs_startblock; /* first page */ + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + goto get_next_page; + lineoff = FirstOffsetNumber; /* first offnum */ + scan->rs_inited = true; + } + else + { + /* continue from previously returned page/tuple */ + page = scan->rs_cblock; /* current page */ + lineoff = /* next offnum */ + OffsetNumberNext(ItemPointerGetOffsetNumber(&(tuple->t_self))); + } + + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + + dp = BufferGetPage(scan->rs_cbuf); + TestForOldSnapshot(snapshot, scan->rs_scan.rs_rd, dp); + lines = PageGetMaxOffsetNumber(dp); + /* page and lineoff now reference the physically next tid */ + + linesleft = lines - lineoff + 1; + } + else if (backward) + { + /* backward parallel scan not supported */ + Assert(scan->rs_scan.rs_parallel == NULL); + + if (!scan->rs_inited) + { + /* + * return null immediately if relation is empty + */ + if (scan->rs_scan.rs_nblocks == ZHEAP_METAPAGE + 1 || + scan->rs_scan.rs_numblocks == ZHEAP_METAPAGE + 1) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + return NULL; + } + + /* + * Disable reporting to syncscan logic in a backwards scan; it's + * not very likely anyone else is doing the same thing at the same + * time, and much more likely that we'll just bollix things for + * forward scanners. + */ + scan->rs_scan.rs_syncscan = false; + /* start from last page of the scan */ + if (scan->rs_scan.rs_startblock > ZHEAP_METAPAGE + 1) + page = scan->rs_scan.rs_startblock - 1; + else + page = scan->rs_scan.rs_nblocks - 1; + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + goto get_next_page; + } + else + { + /* continue from previously returned page/tuple */ + page = scan->rs_cblock; /* current page */ + } + + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + + dp = BufferGetPage(scan->rs_cbuf); + TestForOldSnapshot(snapshot, scan->rs_scan.rs_rd, dp); + lines = PageGetMaxOffsetNumber(dp); + + if (!scan->rs_inited) + { + lineoff = lines; /* final offnum */ + scan->rs_inited = true; + } + else + { + lineoff = /* previous offnum */ + OffsetNumberPrev(ItemPointerGetOffsetNumber(&(tuple->t_self))); + } + /* page and lineoff now reference the physically previous tid */ + + linesleft = lineoff; + } + else + { + /* + * In executor it seems NoMovementScanDirection is nothing but + * do-nothing flag so we should not be here. 
The else part is still + * here to keep the code as in heapgettup_pagemode. + */ + Assert(false); + + return NULL; + } + + /* + * advance the scan until we find a qualifying tuple or run out of stuff + * to scan + */ + lpp = PageGetItemId(dp, lineoff); + +get_next_tuple: + while (linesleft > 0) + { + if (ItemIdIsNormal(lpp)) + { + ZHeapTuple tuple = NULL; + ZHeapTuple loctup = NULL; + Size loctup_len; + bool valid = false; + ItemPointerData tid; + + ItemPointerSet(&tid, page, lineoff); + + loctup_len = ItemIdGetLength(lpp); + + loctup = palloc(ZHEAPTUPLESIZE + loctup_len); + loctup->t_data = (ZHeapTupleHeader) ((char *) loctup + ZHEAPTUPLESIZE); + + loctup->t_tableOid = RelationGetRelid(scan->rs_scan.rs_rd); + loctup->t_len = loctup_len; + loctup->t_self = tid; + + /* + * We always need to make a copy of zheap tuple as once we release + * the buffer an in-place update can change the tuple. + */ + memcpy(loctup->t_data, ((ZHeapTupleHeader) PageGetItem((Page) dp, lpp)), loctup->t_len); + + tuple = ZHeapTupleSatisfies(loctup, snapshot, scan->rs_cbuf, NULL); + valid = tuple ? true : false; + + /* + * If any prior version is visible, we pass latest visible as + * true. The state of latest version of tuple is determined by + * the called function. + * + * Note that, it's possible that tuple is updated in-place and + * we're seeing some prior version of that. We handle that case + * in ZHeapTupleHasSerializableConflictOut. + */ + CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, (void *) &tid, + scan->rs_cbuf, snapshot); + + if (valid) + { + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + return tuple; + } + } + + /* + * otherwise move to the next item on the page + */ + --linesleft; + if (backward) + { + --lpp; /* move back in this page's ItemId array */ + --lineoff; + } + else + { + ++lpp; /* move forward in this page's ItemId array */ + ++lineoff; + } + } + + /* + * if we get here, it means we've exhausted the items on this page and + * it's time to move to the next. + */ + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + +get_next_page: + for (;;) + { + /* + * advance to next/prior page and detect end of scan + */ + if (backward) + { + finished = (page == scan->rs_scan.rs_startblock) || + (scan->rs_scan.rs_numblocks != InvalidBlockNumber ? --scan->rs_scan.rs_numblocks == 0 : false); + if (page == ZHEAP_METAPAGE + 1) + page = scan->rs_scan.rs_nblocks; + page--; + } + else if (scan->rs_scan.rs_parallel != NULL) + { + page = table_parallelscan_nextpage(&scan->rs_scan); + /* Skip metapage */ + if (page == ZHEAP_METAPAGE) + page = table_parallelscan_nextpage(&scan->rs_scan); + finished = (page == InvalidBlockNumber); + } + else + { + page++; + if (page >= scan->rs_scan.rs_nblocks) + page = 0; + + if (page == ZHEAP_METAPAGE) + { + /* + * Since, we're skipping the metapage, we should update the scan + * location if sync scan is enabled. + */ + if (scan->rs_scan.rs_syncscan) + ss_report_location(scan->rs_scan.rs_rd, page); + page++; + } + + finished = (page == scan->rs_scan.rs_startblock) || + (scan->rs_scan.rs_numblocks != InvalidBlockNumber ? --scan->rs_scan.rs_numblocks == 0 : false); + + /* + * Report our new scan position for synchronization purposes. We + * don't do that when moving backwards, however. That would just + * mess up any other forward-moving scanners. + * + * Note: we do this before checking for end of scan so that the + * final state of the position hint is back at the start of the + * rel. 
That's not strictly necessary, but otherwise when you run + * the same query multiple times the starting position would shift + * a little bit backwards on every invocation, which is confusing. + * We don't guarantee any specific ordering in general, though. + */ + if (scan->rs_scan.rs_syncscan) + ss_report_location(scan->rs_scan.rs_rd, page); + } + + /* + * return NULL if we've exhausted all the pages + */ + if (finished) + { + if (BufferIsValid(scan->rs_cbuf)) + ReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + scan->rs_cblock = InvalidBlockNumber; + scan->rs_inited = false; + return NULL; + } + + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + continue; + + if (!scan->rs_inited) + scan->rs_inited = true; + + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + + dp = BufferGetPage(scan->rs_cbuf); + TestForOldSnapshot(snapshot, scan->rs_scan.rs_rd, dp); + lines = PageGetMaxOffsetNumber((Page) dp); + linesleft = lines; + if (backward) + { + lineoff = lines; + lpp = PageGetItemId(dp, lines); + } + else + { + lineoff = FirstOffsetNumber; + lpp = PageGetItemId(dp, FirstOffsetNumber); + } + + goto get_next_tuple; + } +} +#ifdef ZHEAPDEBUGALL +#define ZHEAPDEBUG_1 \ + elog(DEBUG2, "zheap_getnext([%s,nkeys=%d],dir=%d) called", \ + RelationGetRelationName(scan->rs_rd), scan->rs_nkeys, (int) direction) +#define ZHEAPDEBUG_2 \ + elog(DEBUG2, "zheap_getnext returning EOS") +#define ZHEAPDEBUG_3 \ + elog(DEBUG2, "zheap_getnext returning tuple") +#else +#define ZHEAPDEBUG_1 +#define ZHEAPDEBUG_2 +#define ZHEAPDEBUG_3 +#endif /* !defined(ZHEAPDEBUGALL) */ + + +ZHeapTuple +zheap_getnext(TableScanDesc sscan, ScanDirection direction) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + ZHeapTuple zhtup = NULL; + + /* Skip metapage */ + if (scan->rs_scan.rs_startblock == ZHEAP_METAPAGE) + scan->rs_scan.rs_startblock = ZHEAP_METAPAGE + 1; + + /* Note: no locking manipulations needed */ + + ZHEAPDEBUG_1; /* zheap_getnext( info ) */ + + /* + * The key will be passed only for catalog table scans and catalog tables + * are always a heap table!. So incase of zheap it should be set to NULL. + */ + Assert (scan->rs_scan.rs_key == NULL); + + if (scan->rs_scan.rs_pageatatime) + zhtup = zheapgettup_pagemode(scan, direction); + else + zhtup = zheapgettup(scan, direction); + + if (zhtup == NULL) + { + ZHEAPDEBUG_2; /* zheap_getnext returning EOS */ + return NULL; + } + + scan->rs_cztup = zhtup; + + /* + * if we get here it means we have a new current scan tuple, so point to + * the proper return buffer and return the tuple. + */ + ZHEAPDEBUG_3; /* zheap_getnext returning tuple */ + + pgstat_count_heap_getnext(scan->rs_scan.rs_rd); + + return zhtup; +} + +TupleTableSlot * +zheap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + ZHeapTuple zhtup = NULL; + + /* Skip metapage */ + if (scan->rs_scan.rs_startblock == ZHEAP_METAPAGE) + scan->rs_scan.rs_startblock = ZHEAP_METAPAGE + 1; + + ZHEAPDEBUG_1; /* zheap_getnext( info ) */ + + /* + * The key will be passed only for catalog table scans and catalog tables + * are always a heap table!. So incase of zheap it should be set to NULL. 
+ */ + Assert (scan->rs_scan.rs_key == NULL); + + if (scan->rs_scan.rs_pageatatime) + zhtup = zheapgettup_pagemode(scan, direction); + else + zhtup = zheapgettup(scan, direction); + + if (zhtup == NULL) + { + ZHEAPDEBUG_2; /* zheap_getnext returning EOS */ + ExecClearTuple(slot); + return NULL; + } + + scan->rs_cztup = zhtup; + + /* + * if we get here it means we have a new current scan tuple, so point to + * the proper return buffer and return the tuple. + */ + ZHEAPDEBUG_3; /* zheap_getnext returning tuple */ + + pgstat_count_heap_getnext(scan->rs_scan.rs_rd); + + return ExecStoreZTuple(zhtup, slot, scan->rs_cbuf, false); +} + +bool +zheap_scan_bitmap_pagescan(TableScanDesc sscan, + TBMIterateResult *tbmres) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + BlockNumber page = tbmres->blockno; + Page dp; + Buffer buffer; + Snapshot snapshot; + int ntup; + + scan->rs_cindex = 0; + scan->rs_ntuples = 0; + + /* + * Ignore any claimed entries past what we think is the end of the + * relation. (This is probably not necessary given that we got at + * least AccessShareLock on the table before performing any of the + * indexscans, but let's be safe.) + */ + if (page >= scan->rs_scan.rs_nblocks) + return false; + + if (page == ZHEAP_METAPAGE) + return false; + + scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf, + scan->rs_scan.rs_rd, + page); + buffer = scan->rs_cbuf; + snapshot = scan->rs_scan.rs_snapshot; + + ntup = 0; + + /* + * We must hold share lock on the buffer content while examining tuple + * visibility. Afterwards, however, the tuples we have found to be + * visible are guaranteed good as long as we hold the buffer pin. + */ + LockBuffer(buffer, BUFFER_LOCK_SHARE); + dp = (Page) BufferGetPage(buffer); + + /* + * Skip TPD pages. As of now, the size of special space in TPD pages is + * different from other zheap pages like metapage and regular zheap page, + * however, if that changes, we might need to explicitly store pagetype + * flag somewhere. + * + * Fixme - As an exception, the size of special space for zheap page + * with one transaction slot will match with TPD page's special size. + */ + if (PageGetSpecialSize(dp) == MAXALIGN(sizeof(TPDPageOpaqueData))) + { + UnlockReleaseBuffer(buffer); + return false; + } + /* + * We need two separate strategies for lossy and non-lossy cases. + */ + if (tbmres->ntuples >= 0) + { + /* + * Bitmap is non-lossy, so we just look through the offsets listed in + * tbmres; + */ + int curslot; + + for (curslot = 0; curslot < tbmres->ntuples; curslot++) + { + OffsetNumber offnum = tbmres->offsets[curslot]; + ItemPointerData tid; + ZHeapTuple ztuple; + + ItemPointerSet(&tid, page, offnum); + ztuple = zheap_search_buffer(&tid, scan->rs_scan.rs_rd, buffer, snapshot, NULL); + if (ztuple != NULL) + scan->rs_visztuples[ntup++] = ztuple; + } + } + else + { + /* + * Bitmap is lossy, so we must examine each item pointer on the page. 
+ */ + OffsetNumber maxoff = PageGetMaxOffsetNumber(dp); + OffsetNumber offnum; + + for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum)) + { + ItemId lpp; + ZHeapTuple loctup = NULL; + ZHeapTuple resulttup = NULL; + Size loctup_len; + bool valid = false; + ItemPointerData tid; + + lpp = PageGetItemId(dp, offnum); + if (!ItemIdIsNormal(lpp)) + continue; + + ItemPointerSet(&tid, page, offnum); + loctup_len = ItemIdGetLength(lpp); + + loctup = palloc(ZHEAPTUPLESIZE + loctup_len); + loctup->t_data = (ZHeapTupleHeader) ((char *) loctup + ZHEAPTUPLESIZE); + + loctup->t_tableOid = RelationGetRelid(scan->rs_scan.rs_rd); + loctup->t_len = loctup_len; + loctup->t_self = tid; + + /* + * We always need to make a copy of zheap tuple as once we release + * the buffer an in-place update can change the tuple. + */ + memcpy(loctup->t_data, ((ZHeapTupleHeader) PageGetItem((Page) dp, lpp)), loctup->t_len); + + resulttup = ZHeapTupleSatisfies(loctup, snapshot, buffer, NULL); + valid = resulttup ? true : false; + + if (valid) + { + PredicateLockTid(scan->rs_scan.rs_rd, &(resulttup->t_self), snapshot, + IsSerializableXact() ? + zheap_fetchinsertxid(resulttup, buffer) : + InvalidTransactionId); + } + + /* + * If any prior version is visible, we pass latest visible as + * true. The state of latest version of tuple is determined by + * the called function. + * + * Note that, it's possible that tuple is updated in-place and + * we're seeing some prior version of that. We handle that case + * in ZHeapTupleHasSerializableConflictOut. + */ + CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, (void *) &tid, + buffer, snapshot); + + if (valid) + scan->rs_visztuples[ntup++] = resulttup; + } + } + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + Assert(ntup <= MaxZHeapTuplesPerPage); + scan->rs_ntuples = ntup; + return true; +} + +bool +zheap_scan_bitmap_pagescan_next(TableScanDesc sscan, struct TupleTableSlot *slot) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + + if (scan->rs_cindex < 0 || scan->rs_cindex >= scan->rs_ntuples) + return false; + + scan->rs_cztup = scan->rs_visztuples[scan->rs_cindex]; + + /* + * Set up the result slot to point to this tuple. We don't need + * to keep the pin on the buffer, since we only scan tuples in page + * mode. + */ + ExecStoreZTuple(scan->rs_cztup, + slot, + InvalidBuffer, + true); + + scan->rs_cindex++; + + return true; +} + +/* + * zheap_search_buffer - search tuple satisfying snapshot + * + * On entry, *tid is the TID of a tuple, and buffer is the buffer holding + * this tuple. We search for the first visible member satisfying the given + * snapshot. If one is found, we return the tuple, in addition to updating + * *tid. Return NULL otherwise. + * + * The caller must already have pin and (at least) share lock on the buffer; + * it is still pinned/locked at exit. Also, We do not report any pgstats + * count; caller may do so if wanted. 
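+ *
+ * A minimal illustrative caller (zheap_search() below does essentially
+ * this) looks roughly like:
+ *
+ *		buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+ *		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ *		ztuple = zheap_search_buffer(tid, relation, buffer, snapshot, NULL);
+ *		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ *		ReleaseBuffer(buffer);
+ *
+ * where ztuple is just a local ZHeapTuple variable.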
+ */ +ZHeapTuple +zheap_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, + Snapshot snapshot, bool *all_dead) +{ + Page dp = (Page) BufferGetPage(buffer); + ItemId lp; + OffsetNumber offnum; + ZHeapTuple loctup = NULL; + ZHeapTupleData loctup_tmp; + ZHeapTuple resulttup = NULL; + Size loctup_len; + + if (all_dead) + *all_dead = false; + + Assert(ItemPointerGetBlockNumber(tid) == BufferGetBlockNumber(buffer)); + offnum = ItemPointerGetOffsetNumber(tid); + /* check for bogus TID */ + if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp)) + return NULL; + + lp = PageGetItemId(dp, offnum); + + /* check for unused or dead items */ + if (!(ItemIdIsNormal(lp) || ItemIdIsDeleted(lp))) + { + if (all_dead) + *all_dead = true; + return NULL; + } + + /* + * If the record is deleted, its place in the page might have been taken + * by another of its kind. Try to get it from the UNDO if it is still + * visible. + */ + if (ItemIdIsDeleted(lp)) + { + resulttup = ZHeapGetVisibleTuple(offnum, snapshot, buffer, all_dead); + } + else + { + loctup_len = ItemIdGetLength(lp); + + loctup = palloc(ZHEAPTUPLESIZE + loctup_len); + loctup->t_data = (ZHeapTupleHeader) ((char *) loctup + ZHEAPTUPLESIZE); + + loctup->t_tableOid = RelationGetRelid(relation); + loctup->t_len = loctup_len; + loctup->t_self = *tid; + + /* + * We always need to make a copy of zheap tuple as once we release the + * buffer an in-place update can change the tuple. + */ + memcpy(loctup->t_data, ((ZHeapTupleHeader) PageGetItem((Page) dp, lp)), loctup->t_len); + + /* If it's visible per the snapshot, we must return it */ + resulttup = ZHeapTupleSatisfies(loctup, snapshot, buffer, NULL); + } + + if (resulttup) + PredicateLockTid(relation, &(resulttup->t_self), snapshot, + IsSerializableXact() ? + zheap_fetchinsertxid(resulttup, buffer) : + InvalidTransactionId); + + /* + * If any prior version is visible, we pass latest visible as + * true. The state of latest version of tuple is determined by + * the called function. + * + * Note that, it's possible that tuple is updated in-place and + * we're seeing some prior version of that. We handle that case + * in ZHeapTupleHasSerializableConflictOut. + */ + CheckForSerializableConflictOut((resulttup != NULL), relation, (void *) tid, + buffer, snapshot); + + if (resulttup) + { + /* set the tid */ + *tid = resulttup->t_self; + } + else if (!ItemIdIsDeleted(lp)) + { + /* + * Temporarily get the copy of tuple from page to check if tuple is + * surely dead. We can't rely on the copy of local tuple (loctup) + * that is prepared for the visibility test as that would have been + * freed. + */ + loctup_tmp.t_tableOid = RelationGetRelid(relation); + loctup_tmp.t_data = (ZHeapTupleHeader) PageGetItem((Page) dp, lp); + loctup_tmp.t_len = ItemIdGetLength(lp); + loctup_tmp.t_self = *tid; + + /* + * If we can't see it, maybe no one else can either. At caller + * request, check whether tuple is dead to all transactions. + */ + if (!resulttup && all_dead && + ZHeapTupleIsSurelyDead(&loctup_tmp, + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo), + buffer)) + *all_dead = true; + } + else + { + /* For deleted item pointers, we've already set the value for all_dead. */ + return NULL; + } + + return resulttup; +} + +/* + * zheap_search - search for a zheap tuple satisfying snapshot. + * + * This is the same API as zheap_search_buffer, except that the caller + * does not provide the buffer containing the page, rather we access it + * locally. 
+ */ +bool +zheap_search(ItemPointer tid, Relation relation, Snapshot snapshot, + bool *all_dead) +{ + Buffer buffer; + ZHeapTuple zheapTuple = NULL; + + buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid)); + LockBuffer(buffer, BUFFER_LOCK_SHARE); + zheapTuple = zheap_search_buffer(tid, relation, buffer, snapshot, all_dead); + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + ReleaseBuffer(buffer); + + return (zheapTuple != NULL); +} + +/* + * zheap_fetch - Fetch a tuple based on TID. + * + * This function is quite similar to heap_fetch with few differences like + * it will always allocate the memory for tuple and do a memcpy of the tuple + * instead of pointing it to disk tuple. It is the responsibility of the + * caller to free the tuple. + */ +bool +zheap_fetch(Relation relation, + Snapshot snapshot, + ItemPointer tid, + ZHeapTuple *tuple, + Buffer *userbuf, + bool keep_buf, + Relation stats_relation) +{ + ZHeapTuple resulttup; + ItemId lp; + Buffer buffer; + Page page; + Size tup_len; + OffsetNumber offnum; + bool valid; + ItemPointerData ctid; + + /* + * Fetch and pin the appropriate page of the relation. + */ + buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid)); + + /* + * Need share lock on buffer to examine tuple commit status. + */ + LockBuffer(buffer, BUFFER_LOCK_SHARE); + page = BufferGetPage(buffer); + + /* + * We'd better check for out-of-range offnum in case of VACUUM since the + * TID was obtained. Exit if this is metapage. + */ + offnum = ItemPointerGetOffsetNumber(tid); + if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(page) || + BufferGetBlockNumber(buffer) == ZHEAP_METAPAGE) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + if (keep_buf) + *userbuf = buffer; + else + { + ReleaseBuffer(buffer); + *userbuf = InvalidBuffer; + } + *tuple = NULL; + return false; + } + + /* + * get the item line pointer corresponding to the requested tid + */ + lp = PageGetItemId(page, offnum); + + /* + * Must check for dead and unused items. + */ + if (!ItemIdIsNormal(lp) && !ItemIdIsDeleted(lp)) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + if (keep_buf) + *userbuf = buffer; + else + { + ReleaseBuffer(buffer); + *userbuf = InvalidBuffer; + } + *tuple = NULL; + return false; + } + + *tuple = NULL; + if (ItemIdIsDeleted(lp)) + { + CommandId tup_cid; + TransactionId tup_xid; + + resulttup = ZHeapGetVisibleTuple(offnum, snapshot, buffer, NULL); + ctid = *tid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + valid = resulttup ? true : false; + } + else + { + /* + * fill in *tuple fields + */ + tup_len = ItemIdGetLength(lp); + + *tuple = palloc(ZHEAPTUPLESIZE + tup_len); + (*tuple)->t_data = (ZHeapTupleHeader) ((char *) (*tuple) + ZHEAPTUPLESIZE); + + (*tuple)->t_tableOid = RelationGetRelid(relation); + (*tuple)->t_len = tup_len; + (*tuple)->t_self = *tid; + + /* + * We always need to make a copy of zheap tuple as once we release + * the lock on buffer an in-place update can change the tuple. + */ + memcpy((*tuple)->t_data, ((ZHeapTupleHeader) PageGetItem(page, lp)), tup_len); + ItemPointerSetInvalid(&ctid); + + /* + * check time qualification of tuple, then release lock + */ + resulttup = ZHeapTupleSatisfies(*tuple, snapshot, buffer, &ctid); + valid = resulttup ? true : false; + } + + if (valid) + PredicateLockTid(relation, &((resulttup)->t_self), snapshot, + IsSerializableXact() ? + zheap_fetchinsertxid(resulttup, buffer) : + InvalidTransactionId); + + /* + * If any prior version is visible, we pass latest visible as + * true. 
The state of latest version of tuple is determined by
+	 * the called function.
+	 *
+	 * Note that, it's possible that tuple is updated in-place and
+	 * we're seeing some prior version of that. We handle that case
+	 * in ZHeapTupleHasSerializableConflictOut.
+	 */
+	CheckForSerializableConflictOut(valid, relation, (void *) tid,
+									buffer, snapshot);
+
+	/*
+	 * Pass back the ctid if the tuple is invisible because it was updated.
+	 * Apart from SnapshotAny, ctid must be changed only when the current
+	 * tuple is not visible.
+	 */
+	if (ItemPointerIsValid(&ctid))
+	{
+		if (snapshot == SnapshotAny || !valid)
+		{
+			*tid = ctid;
+		}
+	}
+
+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	if (valid)
+	{
+		/*
+		 * All checks passed, so return the tuple as valid. Caller is now
+		 * responsible for releasing the buffer.
+		 */
+		*userbuf = buffer;
+		*tuple = resulttup;
+
+		/* Count the successful fetch against appropriate rel, if any */
+		if (stats_relation != NULL)
+			pgstat_count_heap_fetch(stats_relation);
+
+		return true;
+	}
+
+	/* Tuple failed time qual, but maybe caller wants to see it anyway. */
+	if (keep_buf)
+		*userbuf = buffer;
+	else
+	{
+		ReleaseBuffer(buffer);
+		*userbuf = InvalidBuffer;
+	}
+
+	return false;
+}
+
+/*
+ * zheap_fetch_undo_guts
+ *
+ * Main function for fetching the previous version of the tuple from the undo
+ * storage.
+ */
+ZHeapTuple
+zheap_fetch_undo_guts(ZHeapTuple ztuple, Buffer buffer, ItemPointer tid)
+{
+	UnpackedUndoRecord *urec;
+	UndoRecPtr	urec_ptr;
+	ZHeapTuple	undo_tup;
+	int			out_slot_no PG_USED_FOR_ASSERTS_ONLY;
+
+	out_slot_no = GetTransactionSlotInfo(buffer,
+										 ItemPointerGetOffsetNumber(tid),
+										 ZHeapTupleHeaderGetXactSlot(ztuple->t_data),
+										 NULL,
+										 NULL,
+										 &urec_ptr,
+										 true,
+										 false);
+
+	/*
+	 * See the Asserts below to know why the transaction slot can't be frozen.
+	 */
+	Assert(out_slot_no != ZHTUP_SLOT_FROZEN);
+
+	urec = UndoFetchRecord(urec_ptr,
+						   ItemPointerGetBlockNumber(tid),
+						   ItemPointerGetOffsetNumber(tid),
+						   InvalidTransactionId,
+						   NULL,
+						   ZHeapSatisfyUndoRecord);
+
+	/*
+	 * This function is used by triggers to retrieve the previous version of
+	 * the tuple from the undo log. Since the transaction that is updating
+	 * the tuple is still in progress, neither the undo record can be
+	 * discarded nor can its transaction slot be reused.
+	 */
+	Assert(urec != NULL);
+	Assert(urec->uur_type == UNDO_INPLACE_UPDATE);
+
+	undo_tup = CopyTupleFromUndoRecord(urec, NULL, NULL, NULL, false, NULL);
+	UndoRecordRelease(urec);
+
+	return undo_tup;
+}
+
+/*
+ * zheap_fetch_undo
+ *
+ * Fetch the previous version of the tuple from the undo. In case of an
+ * in-place update, the old and new tuples have the same TID, and the trigger
+ * code stores just that TID for fetching both the old and the new tuple; so
+ * this function must be called to fetch the older tuple.
+ */
+bool
+zheap_fetch_undo(Relation relation,
+				 Snapshot snapshot,
+				 ItemPointer tid,
+				 ZHeapTuple *tuple,
+				 Buffer *userbuf,
+				 Relation stats_relation)
+{
+	ZHeapTuple	undo_tup;
+	Buffer		buffer;
+
+	if (!zheap_fetch(relation, snapshot, tid, tuple, &buffer, true, NULL))
+		return false;
+
+	undo_tup = zheap_fetch_undo_guts(*tuple, buffer, tid);
+	zheap_freetuple(*tuple);
+	*tuple = undo_tup;
+
+	ReleaseBuffer(buffer);
+
+	return true;
+}
+
+/*
+ * ZHeapTupleHeaderAdvanceLatestRemovedXid - Advance the latestRemovedXid, if
+ * the tuple is deleted by a transaction greater than latestRemovedXid. This
+ * is required to generate conflicts on Hot Standby.
+ *
+ * If we change this function then we need a similar change in
+ * *_xlog_vacuum_get_latestRemovedXid functions as well.
+ *
+ * This is quite similar to HeapTupleHeaderAdvanceLatestRemovedXid.
+ */
+void
+ZHeapTupleHeaderAdvanceLatestRemovedXid(ZHeapTupleHeader tuple,
+										TransactionId xid,
+										TransactionId *latestRemovedXid)
+{
+	/*
+	 * Ignore tuples inserted by an aborted transaction.
+	 *
+	 * XXX we can ignore the tuple if it was non-in-place updated/deleted
+	 * by the inserting transaction, but for that we need to traverse the
+	 * complete undo chain to find the root tuple; is it really worth it?
+	 */
+	if (TransactionIdDidCommit(xid))
+	{
+		Assert (tuple->t_infomask & ZHEAP_DELETED ||
+				tuple->t_infomask & ZHEAP_UPDATED);
+		if (TransactionIdFollows(xid, *latestRemovedXid))
+			*latestRemovedXid = xid;
+	}
+
+	/* *latestRemovedXid may still be invalid at end */
+}
+
+/*
+ * ----------
+ * Page related API's.  Eventually we might need to split these API's
+ * into a separate file like bufzpage.c or buf_zheap_page.c or
+ * something like that.
+ * ----------
+ */
+
+/*
+ * ZPageAddItemExtended - Add an item to a zheap page.
+ *
+ * This is similar to PageAddItemExtended except for the max tuples that can
+ * be accommodated on a page and the alignment for each item (Ideally, we
+ * don't need to align space between tuples as we always make a copy of the
+ * tuple to support in-place updates. However, there are places in zheap code
+ * where we access the tuple header directly from the page (ex. zheap_delete,
+ * zheap_update, etc.) for which we need them to be aligned at a two-byte
+ * boundary). It additionally handles the itemids that are marked as unused,
+ * but still can't be reused.
+ *
+ * Callers pass a valid input_page only in case they are constructing the
+ * in-memory copy of tuples and then directly sync the page.
+ */
+OffsetNumber
+ZPageAddItemExtended(Buffer buffer,
+					 Page input_page,
+					 Item item,
+					 Size size,
+					 OffsetNumber offsetNumber,
+					 int flags,
+					 bool NoTPDBufLock)
+{
+	Page		page;
+	Size		alignedSize;
+	PageHeader	phdr;
+	int			lower;
+	int			upper;
+	ItemId		itemId;
+	OffsetNumber limit;
+	bool		needshuffle = false;
+
+	/* Either one of buffer or page could be valid. */
+	if (BufferIsValid(buffer))
+	{
+		Assert(!PageIsValid(input_page));
+		page = BufferGetPage(buffer);
+	}
+	else
+	{
+		Assert(PageIsValid(input_page));
+		page = input_page;
+	}
+
+	phdr = (PageHeader) page;
+
+	/*
+	 * Be wary about corrupted page pointers
+	 */
+	if (phdr->pd_lower < SizeOfPageHeaderData ||
+		phdr->pd_lower > phdr->pd_upper ||
+		phdr->pd_upper > phdr->pd_special ||
+		phdr->pd_special > BLCKSZ)
+		ereport(PANIC,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+						phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+	/*
+	 * Select offsetNumber to place the new item at
+	 */
+	limit = OffsetNumberNext(PageGetMaxOffsetNumber(page));
+
+	/* was offsetNumber passed in?
*/ + if (OffsetNumberIsValid(offsetNumber)) + { + /* yes, check it */ + if ((flags & PAI_OVERWRITE) != 0) + { + if (offsetNumber < limit) + { + itemId = PageGetItemId(phdr, offsetNumber); + if (ItemIdIsUsed(itemId) || ItemIdHasStorage(itemId)) + { + elog(WARNING, "will not overwrite a used ItemId"); + return InvalidOffsetNumber; + } + } + } + else + { + if (offsetNumber < limit) + needshuffle = true; /* need to move existing linp's */ + } + } + else + { + /* offsetNumber was not passed in, so find a free slot */ + /* if no free slot, we'll put it at limit (1st open slot) */ + if (PageHasFreeLinePointers(phdr)) + { + bool hasPendingXact = false; + + /* + * Look for "recyclable" (unused) ItemId. We check for no storage + * as well, just to be paranoid --- unused items should never have + * storage. + */ + for (offsetNumber = 1; offsetNumber < limit; offsetNumber++) + { + itemId = PageGetItemId(phdr, offsetNumber); + if (!ItemIdIsUsed(itemId) && !ItemIdHasStorage(itemId)) + { + /* + * We allow Unused entries to be reused only if there is no + * transaction information for the entry or the transaction + * is committed. + */ + if (ItemIdHasPendingXact(itemId)) + { + TransactionId xid; + UndoRecPtr urec_ptr; + int trans_slot_id = ItemIdGetTransactionSlot(itemId); + uint32 epoch; + + /* + * We can't reach here for a valid input page as the + * callers passed it for the pages that wouldn't have + * been pruned. + */ + Assert(!PageIsValid(input_page)); + + /* + * Here, we are relying on the transaction information in + * slot as if the corresponding slot has been reused, then + * transaction information from the entry would have been + * cleared. See PageFreezeTransSlots. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + break; + trans_slot_id = GetTransactionSlotInfo(buffer, offsetNumber, + trans_slot_id, &epoch, &xid, + &urec_ptr, NoTPDBufLock, false); + /* + * It is quite possible that the item is showing some + * valid transaction slot, but actual slot has been frozen. + * This can happen when the slot belongs to TPD entry and + * the corresponding TPD entry is pruned. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + break; + if (TransactionIdIsValid(xid) && + !TransactionIdDidCommit(xid)) + { + hasPendingXact = true; + continue; + } + } + break; + } + } + if (offsetNumber >= limit && !hasPendingXact) + { + /* the hint is wrong, so reset it */ + PageClearHasFreeLinePointers(phdr); + } + } + else + { + /* don't bother searching if hint says there's no free slot */ + offsetNumber = limit; + } + } + + /* Reject placing items beyond the first unused line pointer */ + if (offsetNumber > limit) + { + elog(WARNING, "specified item offset is too large"); + return InvalidOffsetNumber; + } + + /* Reject placing items beyond heap boundary, if heap */ + if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxZHeapTuplesPerPage) + { + elog(WARNING, "can't put more than MaxHeapTuplesPerPage items in a heap page"); + return InvalidOffsetNumber; + } + + /* + * Compute new lower and upper pointers for page, see if it'll fit. + * + * Note: do arithmetic as signed ints, to avoid mistakes if, say, + * size > pd_upper. + */ + if (offsetNumber == limit || needshuffle) + lower = phdr->pd_lower + sizeof(ItemIdData); + else + lower = phdr->pd_lower; + + alignedSize = SHORTALIGN(size); + + upper = (int) phdr->pd_upper - (int) alignedSize; + + if (lower > upper) + return InvalidOffsetNumber; + + /* + * OK to insert the item. First, shuffle the existing pointers if needed. 
+ */ + itemId = PageGetItemId(phdr, offsetNumber); + + if (needshuffle) + memmove(itemId + 1, itemId, + (limit - offsetNumber) * sizeof(ItemIdData)); + + /* set the item pointer */ + ItemIdSetNormal(itemId, upper, size); + + /* + * Items normally contain no uninitialized bytes. Core bufpage consumers + * conform, but this is not a necessary coding rule; a new index AM could + * opt to depart from it. However, data type input functions and other + * C-language functions that synthesize datums should initialize all + * bytes; datumIsEqual() relies on this. Testing here, along with the + * similar check in printtup(), helps to catch such mistakes. + * + * Values of the "name" type retrieved via index-only scans may contain + * uninitialized bytes; see comment in btrescan(). Valgrind will report + * this as an error, but it is safe to ignore. + */ + VALGRIND_CHECK_MEM_IS_DEFINED(item, size); + + /* copy the item's data onto the page */ + memcpy((char *) page + upper, item, size); + + /* adjust page header */ + phdr->pd_lower = (LocationIndex) lower; + phdr->pd_upper = (LocationIndex) upper; + + return offsetNumber; +} + +/* + * PageGetZHeapFreeSpace + * Returns the size of the free (allocatable) space on a zheap page, + * reduced by the space needed for a new line pointer. + * + * This is same as PageGetHeapFreeSpace except for max tuples that can + * be accomodated on a page or the way unused items are dealt. + */ +Size +PageGetZHeapFreeSpace(Page page) +{ + Size space; + + space = PageGetFreeSpace(page); + if (space > 0) + { + OffsetNumber offnum, + nline; + + nline = PageGetMaxOffsetNumber(page); + if (nline >= MaxZHeapTuplesPerPage) + { + if (PageHasFreeLinePointers((PageHeader) page)) + { + /* + * Since this is just a hint, we must confirm that there is + * indeed a free line pointer + */ + for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum)) + { + ItemId lp = PageGetItemId(page, offnum); + + /* + * The unused items that have pending xact information + * can't be reused. + */ + if (!ItemIdIsUsed(lp) && !ItemIdHasPendingXact(lp)) + break; + } + + if (offnum > nline) + { + /* + * The hint is wrong, but we can't clear it here since we + * don't have the ability to mark the page dirty. + */ + space = 0; + } + } + else + { + /* + * Although the hint might be wrong, PageAddItem will believe + * it anyway, so we must believe it too. + */ + space = 0; + } + } + } + return space; +} + +/* + * RelationPutZHeapTuple - Same as RelationPutHeapTuple, but for ZHeapTuple. + */ +static void +RelationPutZHeapTuple(Relation relation, + Buffer buffer, + ZHeapTuple tuple) +{ + OffsetNumber offnum; + + /* Add the tuple to the page. Caller must ensure to have a TPD page lock. */ + offnum = ZPageAddItem(buffer, NULL, (Item) tuple->t_data, tuple->t_len, + InvalidOffsetNumber, false, true, false); + + if (offnum == InvalidOffsetNumber) + elog(PANIC, "failed to add tuple to page"); + + /* Update tuple->t_self to the actual position where it was stored */ + ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum); +} + +/* + * CopyTupleFromUndoRecord + * Extract the tuple from undo record. Deallocate the previous version + * of tuple and form the new version. + * + * trans_slot_id - If non-NULL, then populate it with the transaction slot of + * transaction that has modified the tuple. + * cid - output command id + * free_zhtup - if true, free the previous version of tuple. 
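+ *
+ * Illustrative only: callers that walk an undo chain typically drive this
+ * function in a loop of roughly the following shape (compare the chain
+ * traversal earlier in this file), where blkno and offnum stand for the
+ * tuple's block and offset number:
+ *
+ *		urec = UndoFetchRecord(urec_ptr, blkno, offnum, prev_undo_xid,
+ *							   NULL, ZHeapSatisfyUndoRecord);
+ *		undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id,
+ *										   NULL, free_zhtup, page);
+ *		urec_ptr = urec->uur_blkprev;
+ *		UndoRecordRelease(urec);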
+ */ +ZHeapTuple +CopyTupleFromUndoRecord(UnpackedUndoRecord *urec, ZHeapTuple zhtup, + int *trans_slot_id, CommandId *cid, bool free_zhtup, + Page page) +{ + ZHeapTuple undo_tup; + + switch (urec->uur_type) + { + case UNDO_INSERT: + { + Assert(zhtup != NULL); + + /* + * We need to deal with undo of root tuple only for a special + * case where during non-inplace update operation, we + * propagate the lockers information to the freshly inserted + * tuple. But, we've to make sure the inserted tuple is locked only. + */ + Assert(ZHEAP_XID_IS_LOCKED_ONLY(zhtup->t_data->t_infomask)); + + undo_tup = palloc(ZHEAPTUPLESIZE + zhtup->t_len); + undo_tup->t_data = (ZHeapTupleHeader) ((char *) undo_tup + ZHEAPTUPLESIZE); + + undo_tup->t_tableOid = zhtup->t_tableOid; + undo_tup->t_len = zhtup->t_len; + undo_tup->t_self = zhtup->t_self; + memcpy(undo_tup->t_data, zhtup->t_data, zhtup->t_len); + + /* + * Ensure to clear the visibility related information from + * the tuple. This is required for the cases where the passed + * in tuple has lock only flags set on it. + */ + undo_tup->t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + + /* + * Free the previous version of tuple, see comments in + * UNDO_INPLACE_UPDATE case. + */ + if (free_zhtup) + zheap_freetuple(zhtup); + + /* Retrieve the TPD transaction slot from payload */ + if (trans_slot_id) + { + if (urec->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + *trans_slot_id = *(int *) urec->uur_payload.data; + else + *trans_slot_id = ZHeapTupleHeaderGetXactSlot(undo_tup->t_data); + } + if (cid) + *cid = urec->uur_cid; + } + break; + case UNDO_XID_LOCK_ONLY: + case UNDO_XID_LOCK_FOR_UPDATE: + case UNDO_XID_MULTI_LOCK_ONLY: + { + ZHeapTupleHeader undo_tup_hdr; + + Assert(zhtup != NULL); + + undo_tup_hdr = (ZHeapTupleHeader) urec->uur_tuple.data; + + /* + * For locked tuples, undo tuple data is always same as prior + * tuple's data as we don't modify it. + */ + undo_tup = palloc(ZHEAPTUPLESIZE + zhtup->t_len); + undo_tup->t_data = (ZHeapTupleHeader) ((char *) undo_tup + ZHEAPTUPLESIZE); + + undo_tup->t_tableOid = zhtup->t_tableOid; + undo_tup->t_len = zhtup->t_len; + undo_tup->t_self = zhtup->t_self; + memcpy(undo_tup->t_data, zhtup->t_data, zhtup->t_len); + + /* + * Free the previous version of tuple, see comments in + * UNDO_INPLACE_UPDATE case. + */ + if (free_zhtup) + zheap_freetuple(zhtup); + + /* + * override the tuple header values with values fetched from + * undo record + */ + undo_tup->t_data->t_infomask2 = undo_tup_hdr->t_infomask2; + undo_tup->t_data->t_infomask = undo_tup_hdr->t_infomask; + undo_tup->t_data->t_hoff = undo_tup_hdr->t_hoff; + + /* Retrieve the TPD transaction slot from payload */ + if (trans_slot_id) + { + if (urec->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + { + /* + * We first store the Lockmode and then transaction slot in + * payload, so retrieve it accordingly. + */ + *trans_slot_id = *(int *) ((char *) urec->uur_payload.data + sizeof(LockTupleMode)); + } + else + *trans_slot_id = ZHeapTupleHeaderGetXactSlot(undo_tup->t_data); + } + } + break; + case UNDO_DELETE: + case UNDO_UPDATE: + case UNDO_INPLACE_UPDATE: + { + Size offset = 0; + uint32 undo_tup_len; + + /* + * After this point, the previous version of tuple won't be used. + * If we don't free the previous version, then we might accumulate + * lot of memory when many prior versions needs to be traversed. 
+ * + * XXX One way to save deallocation and allocation of memory is to + * only make a copy of prior version of tuple when it is determined + * that the version is visible to current snapshot. In practise, + * we don't need to traverse many prior versions, so let's be tidy. + */ + undo_tup_len = *((uint32 *) &urec->uur_tuple.data[offset]); + + undo_tup = palloc(ZHEAPTUPLESIZE + undo_tup_len); + undo_tup->t_data = (ZHeapTupleHeader) ((char *) undo_tup + ZHEAPTUPLESIZE); + + memcpy(&undo_tup->t_len, &urec->uur_tuple.data[offset], sizeof(uint32)); + offset += sizeof(uint32); + + memcpy(&undo_tup->t_self, &urec->uur_tuple.data[offset], sizeof(ItemPointerData)); + offset += sizeof(ItemPointerData); + + memcpy(&undo_tup->t_tableOid, &urec->uur_tuple.data[offset], sizeof(Oid)); + offset += sizeof(Oid); + + memcpy(undo_tup->t_data, (ZHeapTupleHeader) &urec->uur_tuple.data[offset], undo_tup_len); + + /* Retrieve the TPD transaction slot from payload */ + if (trans_slot_id) + { + if (urec->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + { + /* + * For UNDO_UPDATE, we first store the CTID and then + * transaction slot, so retrieve it accordingly. + */ + if (urec->uur_type == UNDO_UPDATE) + *trans_slot_id = *(int *) ((char *) urec->uur_payload.data + sizeof(ItemPointerData)); + else + *trans_slot_id = *(int *) urec->uur_payload.data; + } + else + *trans_slot_id = ZHeapTupleHeaderGetXactSlot(undo_tup->t_data); + } + + if (free_zhtup) + zheap_freetuple(zhtup); + } + break; + default: + elog(ERROR, "unsupported undo record type"); + /* + * During tests, we take down the server to notice the error easily. + * This can be removed later. + */ + Assert(0); + } + + /* + * If the undo tuple is pointing to the last slot of the page and the page + * has TPD slots that means the last slot information must move to the + * first slot of the TPD page so change the slot number as per that. + */ + if (page && (*trans_slot_id == ZHEAP_PAGE_TRANS_SLOTS) && + ZHeapPageHasTPDSlot((PageHeader) page)) + *trans_slot_id = ZHEAP_PAGE_TRANS_SLOTS + 1; + + return undo_tup; +} + +/* + * ZHeapGetUsableOffsetRanges + * + * Given a page and a set of tuples, it calculates how many tuples can fit in + * the page and the contiguous ranges of free offsets that can be used/reused + * in the same page to store those tuples. + */ +ZHeapFreeOffsetRanges * +ZHeapGetUsableOffsetRanges(Buffer buffer, + ZHeapTuple *tuples, + int ntuples, + Size saveFreeSpace) +{ + Page page; + PageHeader phdr; + int nthispage; + Size used_space; + Size avail_space; + OffsetNumber limit, offsetNumber; + ZHeapFreeOffsetRanges *zfree_offset_ranges; + + page = BufferGetPage(buffer); + phdr = (PageHeader) page; + + zfree_offset_ranges = (ZHeapFreeOffsetRanges *) + palloc0(sizeof(ZHeapFreeOffsetRanges)); + + zfree_offset_ranges->nranges = 0; + limit = OffsetNumberNext(PageGetMaxOffsetNumber(page)); + avail_space = PageGetExactFreeSpace(page); + nthispage = 0; + used_space = 0; + + if (PageHasFreeLinePointers(phdr)) + { + bool in_range = false; + /* + * Look for "recyclable" (unused) ItemId. We check for no storage + * as well, just to be paranoid --- unused items should never have + * storage. 
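+ *
+ * We stop as soon as offsets have been found for all the tuples we were
+ * asked to place, or as soon as the next tuple no longer fits in the
+ * remaining free space.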
+ */ + for (offsetNumber = 1; offsetNumber < limit; offsetNumber++) + { + ItemId itemId = PageGetItemId(phdr, offsetNumber); + + if (nthispage >= ntuples) + { + /* No more tuples to insert */ + break; + } + if (!ItemIdIsUsed(itemId) && !ItemIdHasStorage(itemId)) + { + ZHeapTuple zheaptup = tuples[nthispage]; + Size needed_space = used_space + zheaptup->t_len + saveFreeSpace; + + /* Check if we can fit this tuple in the page */ + if (avail_space < needed_space) + { + /* No more space to insert tuples in this page */ + break; + } + + used_space += zheaptup->t_len; + nthispage++; + + if (!in_range) + { + /* Start of a new range */ + zfree_offset_ranges->nranges++; + zfree_offset_ranges->startOffset[zfree_offset_ranges->nranges - 1] = offsetNumber; + in_range = true; + } + zfree_offset_ranges->endOffset[zfree_offset_ranges->nranges - 1] = offsetNumber; + } + else + { + in_range = false; + } + } + } + + /* + * Now, there are no free line pointers. Check whether we can insert another + * tuple in the page, then we'll insert another range starting from limit to + * max required offset number. We can decide the actual end offset for this + * range while inserting tuples in the buffer. + */ + if ((limit <= MaxZHeapTuplesPerPage) && (nthispage < ntuples)) + { + ZHeapTuple zheaptup = tuples[nthispage]; + Size needed_space = used_space + sizeof(ItemIdData) + + zheaptup->t_len + saveFreeSpace; + + /* Check if we can fit this tuple + a new offset in the page */ + if (avail_space >= needed_space) + { + OffsetNumber max_required_offset; + int required_tuples = ntuples - nthispage; + + /* + * Choose minimum among MaxOffsetNumber and the maximum offsets + * required for tuples. + */ + max_required_offset = Min(MaxOffsetNumber, (limit + required_tuples)); + + zfree_offset_ranges->nranges++; + zfree_offset_ranges->startOffset[zfree_offset_ranges->nranges - 1] = limit; + zfree_offset_ranges->endOffset[zfree_offset_ranges->nranges - 1] = max_required_offset; + } + } + + return zfree_offset_ranges; +} + +/* + * zheap_multi_insert - insert multiple tuple into a zheap + * + * Similar to heap_multi_insert(), but inserts zheap tuples. + */ +void +zheap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples, + CommandId cid, int options, BulkInsertState bistate) +{ + ZHeapTuple *zheaptuples; + int i; + int ndone; + char *scratch = NULL; + Page page; + bool needwal; + bool need_tuple_data = RelationIsLogicallyLogged(relation); + bool need_cids = RelationIsAccessibleInLogicalDecoding(relation); + Size saveFreeSpace; + TransactionId xid = GetTopTransactionId(); + uint32 epoch = GetEpochForXid(xid); + xl_undolog_meta undometa; + bool lock_reacquired; + bool skip_undo; + + needwal = RelationNeedsWAL(relation); + saveFreeSpace = RelationGetTargetPageFreeSpace(relation, + HEAP_DEFAULT_FILLFACTOR); + /* + * We can skip inserting undo records if the tuples are to be marked + * as frozen. + */ + skip_undo = (options & HEAP_INSERT_FROZEN); + + /* Toast and set header data in all the tuples */ + zheaptuples = palloc(ntuples * sizeof(ZHeapTuple)); + for (i = 0; i < ntuples; i++) + { + zheaptuples[i] = zheap_prepare_insert(relation, ExecGetZHeapTupleFromSlot(slots[i]), options); + + if (slots[i]->tts_tableOid != InvalidOid) + zheaptuples[i]->t_tableOid = slots[i]->tts_tableOid; + } + + /* + * Allocate some memory to use for constructing the WAL record. Using + * palloc() within a critical section is not safe, so we allocate this + * beforehand. 
This has consideration that offset ranges and tuples to be + * stored in page will have size lesser than BLCKSZ. This is true since a + * zheap page contains page header and transaction slots in special area + * which are not stored in scratch area. In future, if we reduce the number + * of transaction slots to one, we may need to allocate twice the BLCKSZ of + * scratch area. + */ + if (needwal) + scratch = palloc(BLCKSZ); + + /* + * See heap_multi_insert to know why checking conflicts is important + * before actually inserting the tuple. + */ + CheckForSerializableConflictIn(relation, NULL, InvalidBuffer); + + ndone = 0; + while (ndone < ntuples) + { + Buffer buffer; + Buffer vmbuffer = InvalidBuffer; + bool all_visible_cleared = false; + int nthispage = 0; + int trans_slot_id; + int ucnt = 0; + UndoRecPtr urecptr = InvalidUndoRecPtr, + prev_urecptr = InvalidUndoRecPtr; + UnpackedUndoRecord *undorecord = NULL; + ZHeapFreeOffsetRanges *zfree_offset_ranges; + OffsetNumber usedoff[MaxOffsetNumber]; + OffsetNumber max_required_offset; + uint8 vm_status; + + CHECK_FOR_INTERRUPTS(); + +reacquire_buffer: + /* + * Find buffer where at least the next tuple will fit. If the page is + * all-visible, this will also pin the requisite visibility map page. + */ + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + + buffer = RelationGetBufferForZTuple(relation, zheaptuples[ndone]->t_len, + InvalidBuffer, options, bistate, + &vmbuffer, NULL); + page = BufferGetPage(buffer); + + /* + * Get the unused offset ranges in the page. This is required for + * deciding the number of undo records to be prepared later. + */ + zfree_offset_ranges = ZHeapGetUsableOffsetRanges(buffer, + &zheaptuples[ndone], + ntuples - ndone, + saveFreeSpace); + + /* + * We've ensured at least one tuple fits in the page. So, there'll be + * at least one offset range. + */ + Assert(zfree_offset_ranges->nranges > 0); + + max_required_offset = + zfree_offset_ranges->endOffset[zfree_offset_ranges->nranges - 1]; + + /* + * If we're not inserting an undo record, we don't have to reserve + * a transaction slot as well. + */ + if (!skip_undo) + { + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. + */ + trans_slot_id = PageReserveTransactionSlot(relation, + buffer, + max_required_offset, + epoch, + xid, + &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + goto reacquire_buffer; + + if (trans_slot_id == InvalidXactSlotId) + { + UnlockReleaseBuffer(buffer); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + goto reacquire_buffer; + } + + /* transaction slot must be reserved before adding tuple to page */ + Assert(trans_slot_id != InvalidXactSlotId); + + /* + * For every contiguous free or new offsets, we insert an undo record. + * In the payload data of each undo record, we store the start and end + * available offset for a contiguous range. 
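+ * The payload itself is filled in only after the tuples have actually been
+ * placed on the page, since the end offset of the last range is not known
+ * until then.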
+ */ + undorecord = (UnpackedUndoRecord *) palloc(zfree_offset_ranges->nranges + * sizeof(UnpackedUndoRecord)); + /* Start UNDO prepare Stuff */ + urecptr = prev_urecptr; + for (i = 0; i < zfree_offset_ranges->nranges; i++) + { + /* prepare an undo record */ + undorecord[i].uur_type = UNDO_MULTI_INSERT; + undorecord[i].uur_info = 0; + undorecord[i].uur_prevlen = 0; /* Fixme - need to figure out how to set this value and then decide whether to WAL log it */ + undorecord[i].uur_reloid = relation->rd_id; + undorecord[i].uur_prevxid = FrozenTransactionId; + undorecord[i].uur_xid = xid; + undorecord[i].uur_cid = cid; + undorecord[i].uur_fork = MAIN_FORKNUM; + undorecord[i].uur_blkprev = urecptr; + undorecord[i].uur_block = BufferGetBlockNumber(buffer); + undorecord[i].uur_tuple.len = 0; + undorecord[i].uur_offset = 0; + undorecord[i].uur_payload.len = 2 * sizeof(OffsetNumber); + } + + UndoSetPrepareSize(undorecord, zfree_offset_ranges->nranges, + InvalidTransactionId, + UndoPersistenceForRelation(relation), &undometa); + + for (i = 0; i < zfree_offset_ranges->nranges; i++) + { + undorecord[i].uur_blkprev = urecptr; + urecptr = PrepareUndoInsert(&undorecord[i], + InvalidTransactionId, + UndoPersistenceForRelation(relation), + NULL); + + initStringInfo(&undorecord[i].uur_payload); + } + + Assert(UndoRecPtrIsValid(urecptr)); + elog(DEBUG1, "Undo record prepared: %d for Block Number: %d", + zfree_offset_ranges->nranges, BufferGetBlockNumber(buffer)); + /* End UNDO prepare Stuff */ + } + + /* + * If there is a valid vmbuffer get its status. The vmbuffer will not + * be valid if operated page is newly extended, see + * RelationGetBufferForZTupleand. Also, anyway by default vm status + * bits are clear for those pages hence no need to clear it again! + */ + vm_status = visibilitymap_get_status(relation, + BufferGetBlockNumber(buffer), &vmbuffer); + + /* + * Lock the TPD page before starting critical section. We might need + * to access it in ZPageAddItemExtended. Note that if the transaction + * slot belongs to TPD entry, then the TPD page must be locked during + * slot reservation. + * + * XXX We can optimize this by avoid taking TPD page lock unless the page + * has some unused item which requires us to fetch the transaction + * information from TPD. + */ + if (trans_slot_id <= ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) page) && + PageHasFreeLinePointers((PageHeader) page)) + TPDPageLock(relation, buffer); + + /* NO EREPORT(ERROR) from here till changes are logged */ + START_CRIT_SECTION(); + + /* + * RelationGetBufferForZTuple has ensured that the first tuple fits. + * Keep calm and put that on the page, and then as many other tuples + * as fit. + */ + nthispage = 0; + for (i = 0; i < zfree_offset_ranges->nranges; i++) + { + OffsetNumber offnum; + + for (offnum = zfree_offset_ranges->startOffset[i]; + offnum <= zfree_offset_ranges->endOffset[i]; + offnum++) + { + ZHeapTuple zheaptup; + + if (ndone + nthispage == ntuples) + break; + + zheaptup = zheaptuples[ndone + nthispage]; + + /* Make sure that the tuple fits in the page. */ + if (PageGetZHeapFreeSpace(page) < zheaptup->t_len + saveFreeSpace) + break; + + if (!(options & HEAP_INSERT_FROZEN)) + ZHeapTupleHeaderSetXactSlot(zheaptup->t_data, trans_slot_id); + + RelationPutZHeapTuple(relation, buffer, zheaptup); + + /* + * Let's make sure that we've decided the offset ranges + * correctly. 
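+ * The offset at which the tuple was actually placed must match the offset
+ * we predicted for it while computing the free offset ranges.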
+ */ + Assert(offnum == ItemPointerGetOffsetNumber(&(zheaptup->t_self))); + + /* track used offsets */ + usedoff[ucnt++] = offnum; + + /* + * We don't use heap_multi_insert for catalog tuples yet, but + * better be prepared... + * Fixme: This won't work as it needs to access cmin/cmax which + * we probably needs to retrieve from TPD or UNDO. + */ + if (needwal && need_cids) + { + /* log_heap_new_cid(relation, heaptup); */ + } + nthispage++; + } + + /* + * Store the offset ranges in undo payload. We've not calculated the + * end offset for the last range previously. Hence, we set it to + * offnum - 1. There is no harm in doing the same for previous undo + * records as well. + */ + zfree_offset_ranges->endOffset[i] = offnum - 1; + if (!skip_undo) + { + appendBinaryStringInfo(&undorecord[i].uur_payload, + (char *) &zfree_offset_ranges->startOffset[i], + sizeof(OffsetNumber)); + appendBinaryStringInfo(&undorecord[i].uur_payload, + (char *) &zfree_offset_ranges->endOffset[i], + sizeof(OffsetNumber)); + } + elog(DEBUG1, "start offset: %d, end offset: %d", + zfree_offset_ranges->startOffset[i], zfree_offset_ranges->endOffset[i]); + } + + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) || + (vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + all_visible_cleared = true; + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), + vmbuffer, VISIBILITYMAP_VALID_BITS); + } + + /* + * XXX Should we set PageSetPrunable on this page ? See heap_insert() + */ + + MarkBufferDirty(buffer); + + if (!skip_undo) + { + /* Insert the undo */ + InsertPreparedUndo(); + + /* + * We're sending the undo record for debugging purpose. So, just send + * the last one. + */ + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + PageSetUNDO(undorecord[zfree_offset_ranges->nranges - 1], + buffer, + trans_slot_id, + true, + epoch, + xid, + urecptr, + usedoff, + ucnt); + } + else + { + PageSetUNDO(undorecord[zfree_offset_ranges->nranges - 1], + buffer, + trans_slot_id, + true, + epoch, + xid, + urecptr, + NULL, + 0); + } + } + + /* XLOG stuff */ + if (needwal) + { + xl_undo_header xlundohdr; + XLogRecPtr recptr; + xl_zheap_multi_insert *xlrec; + uint8 info = XLOG_ZHEAP_MULTI_INSERT; + char *tupledata; + int totaldatalen; + char *scratchptr = scratch; + bool init; + int bufflags = 0; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + + /* + * Store the information required to generate undo record during + * replay. All undo records have same information apart from the + * payload data. Hence, we can copy the same from the last record. + */ + xlundohdr.reloid = relation->rd_id; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = prev_urecptr; + + /* allocate xl_zheap_multi_insert struct from the scratch area */ + xlrec = (xl_zheap_multi_insert *) scratchptr; + xlrec->flags = all_visible_cleared ? 
XLZ_INSERT_ALL_VISIBLE_CLEARED : 0; + if (skip_undo) + xlrec->flags |= XLZ_INSERT_IS_FROZEN; + xlrec->ntuples = nthispage; + scratchptr += SizeOfZHeapMultiInsert; + + /* copy the offset ranges as well */ + memcpy((char *) scratchptr, (char *) &zfree_offset_ranges->nranges, sizeof(int)); + scratchptr += sizeof(int); + for (i = 0; i < zfree_offset_ranges->nranges; i++) + { + memcpy((char *)scratchptr, (char *)&zfree_offset_ranges->startOffset[i], sizeof(OffsetNumber)); + scratchptr += sizeof(OffsetNumber); + memcpy((char *)scratchptr, (char *)&zfree_offset_ranges->endOffset[i], sizeof(OffsetNumber)); + scratchptr += sizeof(OffsetNumber); + } + + /* the rest of the scratch space is used for tuple data */ + tupledata = scratchptr; + + /* + * Write out an xl_multi_insert_tuple and the tuple data itself + * for each tuple. + */ + for (i = 0; i < nthispage; i++) + { + ZHeapTuple zheaptup = zheaptuples[ndone + i]; + xl_multi_insert_ztuple *tuphdr; + int datalen; + + /* xl_multi_insert_tuple needs two-byte alignment. */ + tuphdr = (xl_multi_insert_ztuple *) SHORTALIGN(scratchptr); + scratchptr = ((char *) tuphdr) + SizeOfMultiInsertZTuple; + + tuphdr->t_infomask2 = zheaptup->t_data->t_infomask2; + tuphdr->t_infomask = zheaptup->t_data->t_infomask; + tuphdr->t_hoff = zheaptup->t_data->t_hoff; + + /* write bitmap [+ padding] [+ oid] + data */ + datalen = zheaptup->t_len - SizeofZHeapTupleHeader; + memcpy(scratchptr, + (char *) zheaptup->t_data + SizeofZHeapTupleHeader, + datalen); + tuphdr->datalen = datalen; + scratchptr += datalen; + } + totaldatalen = scratchptr - tupledata; + Assert((scratchptr - scratch) < BLCKSZ); + + if (need_tuple_data) + xlrec->flags |= XLZ_INSERT_CONTAINS_NEW_TUPLE; + + /* + * Signal that this is the last xl_zheap_multi_insert record + * emitted by this call to zheap_multi_insert(). Needed for logical + * decoding so it knows when to cleanup temporary data. + */ + if (ndone + nthispage == ntuples) + xlrec->flags |= XLZ_INSERT_LAST_IN_MULTI; + + /* + * If the page was previously empty, we can reinit the page + * instead of restoring the whole thing. + */ + init = (ItemPointerGetOffsetNumber(&(zheaptuples[ndone]->t_self)) == FirstOffsetNumber && + PageGetMaxOffsetNumber(page) == FirstOffsetNumber + nthispage - 1); + + if (init) + { + info |= XLOG_ZHEAP_INIT_PAGE; + bufflags |= REGBUF_WILL_INIT; + } + + /* + * If we're doing logical decoding, include the new tuple data + * even if we take a full-page image of the page. + */ + if (need_tuple_data) + bufflags |= REGBUF_KEEP_DATA; + +prepare_xlog: + /* LOG undolog meta if this is the first WAL after the checkpoint. */ + LogUndoMetaData(&undometa); + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + + XLogBeginInsert(); + /* copy undo related info in maindata */ + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + /* copy xl_multi_insert_tuple in maindata */ + XLogRegisterData((char *) xlrec, tupledata - scratch); + + /* If we've skipped undo insertion, we don't need a slot in page. 
*/ + if (!skip_undo && trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + xlrec->flags |= XLZ_INSERT_CONTAINS_TPD_SLOT; + XLogRegisterData((char *) &trans_slot_id, sizeof(trans_slot_id)); + } + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags); + + /* copy tuples in block data */ + XLogRegisterBufData(0, tupledata, totaldatalen); + if (xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + (void) RegisterTPDBuffer(page, 1); + + /* filtering by origin on a row level is much more efficient */ + XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN); + + recptr = XLogInsertExtended(RM_ZHEAP_ID, info, RedoRecPtr, + doPageWrites); + if (recptr == InvalidXLogRecPtr) + { + ResetRegisteredTPDBuffers(); + goto prepare_xlog; + } + + PageSetLSN(page, recptr); + if (xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + /* be tidy */ + if (!skip_undo) + { + for (i = 0; i < zfree_offset_ranges->nranges; i++) + pfree(undorecord[i].uur_payload.data); + pfree(undorecord); + } + pfree(zfree_offset_ranges); + + UnlockReleaseBuffer(buffer); + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + + ndone += nthispage; + } + + /* + * We're done with the actual inserts. Check for conflicts again, to + * ensure that all rw-conflicts in to these inserts are detected. Without + * this final check, a sequential scan of the heap may have locked the + * table after the "before" check, missing one opportunity to detect the + * conflict, and then scanned the table before the new tuples were there, + * missing the other chance to detect the conflict. + * + * For heap inserts, we only need to check for table-level SSI locks. Our + * new tuples can't possibly conflict with existing tuple locks, and heap + * page locks are only consolidated versions of tuple locks; they do not + * lock "gaps" as index page locks do. So we don't need to specify a + * buffer when making the call. + */ + CheckForSerializableConflictIn(relation, NULL, InvalidBuffer); + + /* + * If tuples are cachable, mark them for invalidation from the caches in + * case we abort. Note it is OK to do this after releasing the buffer, + * because the heaptuples data structure is all in local memory, not in + * the shared buffer. + */ + if (IsCatalogRelation(relation)) + { + /* + for (i = 0; i < ntuples; i++) + CacheInvalidateHeapTuple(relation, zheaptuples[i], NULL); */ + } + + /* + * Copy t_self fields back to the caller's original tuples. This does + * nothing for untoasted tuples (tuples[i] == heaptuples[i)], but it's + * probably faster to always copy than check. + */ + for (i = 0; i < ntuples; i++) + slots[i]->tts_tid = zheaptuples[i]->t_self; + + pgstat_count_heap_insert(relation, ntuples); +} + +/* + * Mask a zheap page before performing consistency checks on it. + */ +void +zheap_mask(char *pagedata, BlockNumber blkno) +{ + Page page = (Page) pagedata; + + mask_page_lsn_and_checksum(page); + + mask_page_hint_bits(page); + mask_unused_space(page); + + if (PageGetSpecialSize(page) == MAXALIGN(BLCKSZ)) + { + ZHeapMetaPage metap PG_USED_FOR_ASSERTS_ONLY; + metap = ZHeapPageGetMeta(page); + /* It's a meta-page, no need to mask further. */ + Assert(metap->zhm_magic == ZHEAP_MAGIC); + Assert(metap->zhm_version == ZHEAP_VERSION); + return; + } + + if (PageGetSpecialSize(page) == MAXALIGN(sizeof(TPDPageOpaqueData))) + { + /* It's a TPD page, no need to mask further. 
*/ + return; + } +} + +/* + * Per-undorecord callback from UndoFetchRecord to check whether + * an undorecord satisfies the given conditions. + */ +bool +ZHeapSatisfyUndoRecord(UnpackedUndoRecord* urec, BlockNumber blkno, + OffsetNumber offset, TransactionId xid) +{ + Assert(urec != NULL); + Assert(blkno != InvalidBlockNumber); + + if ((urec->uur_block != blkno || + (TransactionIdIsValid(xid) && !TransactionIdEquals(xid, urec->uur_xid)))) + return false; + + switch (urec->uur_type) + { + case UNDO_MULTI_INSERT: + { + OffsetNumber start_offset; + OffsetNumber end_offset; + + start_offset = ((OffsetNumber *) urec->uur_payload.data)[0]; + end_offset = ((OffsetNumber *) urec->uur_payload.data)[1]; + + if (offset >= start_offset && offset <= end_offset) + return true; + } + break; + case UNDO_ITEMID_UNUSED: + { + /* + * We don't expect to check the visibility of any unused item, + * but the undo record of same can be present in chain which + * we need to ignore. + */ + } + break; + default: + { + Assert(offset != InvalidOffsetNumber); + if (urec->uur_offset == offset) + return true; + } + break; + } + + return false; +} + +/* + * zheap_get_latest_tid - get the latest tid of a specified tuple + * + * Functionally, it serves the same purpose as heap_get_latest_tid(), but it + * follows a different way of traversing the ctid chain of updated tuples. + */ +void +zheap_get_latest_tid(Relation relation, + Snapshot snapshot, + ItemPointer tid) +{ + BlockNumber blk; + ItemPointerData ctid; + TransactionId priorXmax; + int tup_len; + + /* this is to avoid Assert failures on bad input */ + if (!ItemPointerIsValid(tid)) + return; + + /* + * Since this can be called with user-supplied TID, don't trust the input + * too much. (RelationGetNumberOfBlocks is an expensive check, so we + * don't check t_ctid links again this way. Note that it would not do to + * call it just once and save the result, either.) + */ + blk = ItemPointerGetBlockNumber(tid); + if (blk >= RelationGetNumberOfBlocks(relation)) + elog(ERROR, "block number %u is out of range for relation \"%s\"", + blk, RelationGetRelationName(relation)); + + /* + * Loop to chase down ctid links. At top of loop, ctid is the tuple we + * need to examine, and *tid is the TID we will return if ctid turns out + * to be bogus. + * + * Note that we will loop until we reach the end of the t_ctid chain. + * Depending on the snapshot passed, there might be at most one visible + * version of the row, but we don't try to optimize for that. + */ + ctid = *tid; + priorXmax = InvalidTransactionId; + for (;;) + { + Buffer buffer; + Page page; + OffsetNumber offnum; + ItemId lp; + ZHeapTuple tp = NULL; + ZHeapTuple resulttup = NULL; + ItemPointerData new_ctid; + uint16 infomask; + + /* + * Read, pin, and lock the page. + */ + buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(&ctid)); + LockBuffer(buffer, BUFFER_LOCK_SHARE); + page = BufferGetPage(buffer); + + /* + * Check for bogus item number. This is not treated as an error + * condition because it can happen while following a ctid link. We + * just assume that the prior tid is OK and return it unchanged. 
+ */ + offnum = ItemPointerGetOffsetNumber(&ctid); + if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(page)) + { + UnlockReleaseBuffer(buffer); + break; + } + lp = PageGetItemId(page, offnum); + if (!ItemIdIsNormal(lp)) + { + UnlockReleaseBuffer(buffer); + break; + } + + /* + * We always need to make a copy of zheap tuple; if an older version is + * returned from the undo record, the passed in tuple gets freed. + */ + tup_len = ItemIdGetLength(lp); + tp = palloc(ZHEAPTUPLESIZE + tup_len); + tp->t_data = (ZHeapTupleHeader) (((char *) tp) + ZHEAPTUPLESIZE); + tp->t_tableOid = RelationGetRelid(relation); + tp->t_len = tup_len; + tp->t_self = ctid; + + memcpy(tp->t_data, ((ZHeapTupleHeader) PageGetItem(page, lp)), + tup_len); + + /* Save the infomask. The tuple might get freed, as mentioned above */ + infomask = tp->t_data->t_infomask; + + /* + * Ensure that the tuple is same as what we are expecting. If the + * the current or any prior version of tuple doesn't contain the + * effect of priorXmax, then the slot must have been recycled and + * reused for an unrelated tuple. This implies that the latest + * version of the row was deleted, so we need do nothing. + */ + if (TransactionIdIsValid(priorXmax) && + !ValidateTuplesXact(tp, snapshot, buffer, priorXmax, false)) + { + UnlockReleaseBuffer(buffer); + break; + } + + /* + * Get the transaction which modified this tuple. Ideally we need to + * get this only when there is a ctid chain to follow. But since the + * visibility function frees the tuple, we have to do this here + * regardless of the existence of a ctid chain. + */ + ZHeapTupleGetTransInfo(tp, buffer, NULL, NULL, &priorXmax, NULL, NULL, + false); + + /* + * Check time qualification of tuple; if visible, set it as the new + * result candidate. + */ + ItemPointerSetInvalid(&new_ctid); + resulttup = ZHeapTupleSatisfies(tp, snapshot, buffer, &new_ctid); + + /* + * If any prior version is visible, we pass latest visible as + * true. The state of latest version of tuple is determined by + * the called function. + * + * Note that, it's possible that tuple is updated in-place and + * we're seeing some prior version of that. We handle that case + * in ZHeapTupleHasSerializableConflictOut. + */ + CheckForSerializableConflictOut((resulttup != NULL), relation, + (void *) &ctid, + buffer, snapshot); + + /* Pass back the tuple ctid if it's visible */ + if (resulttup != NULL) + *tid = ctid; + + /* If there's a valid ctid link, follow it, else we're done. */ + if (!ItemPointerIsValid(&new_ctid) || + ZHEAP_XID_IS_LOCKED_ONLY(infomask) || + ZHeapTupleIsMoved(infomask) || + ItemPointerEquals(&ctid, &new_ctid)) + { + if (resulttup != NULL) + zheap_freetuple(resulttup); + UnlockReleaseBuffer(buffer); + break; + } + + ctid = new_ctid; + + if (resulttup != NULL) + zheap_freetuple(resulttup); + UnlockReleaseBuffer(buffer); + } /* end of loop */ +} + +/* + * Perform XLogInsert for a zheap-visible operation. vm_buffer is the buffer + * containing the corresponding visibility map block. The vm_buffer should + * have already been modified and dirtied. 
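+ *
+ * Only the visibility map buffer is registered with the WAL record; the
+ * zheap buffer's block number is carried in the record itself (heapBlk).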
+ */ +XLogRecPtr +log_zheap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer, + TransactionId cutoff_xid, uint8 vmflags) +{ + xl_zheap_visible xlrec; + XLogRecPtr recptr; + + Assert(BufferIsValid(heap_buffer)); + Assert(BufferIsValid(vm_buffer)); + + xlrec.cutoff_xid = cutoff_xid; + xlrec.flags = vmflags; + xlrec.heapBlk = BufferGetBlockNumber(heap_buffer); + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfZHeapVisible); + + XLogRegisterBuffer(0, vm_buffer, 0); + + recptr = XLogInsert(RM_ZHEAP2_ID, XLOG_ZHEAP_VISIBLE); + + return recptr; +} + +/* + * GetTransactionsSlotsForPage - returns transaction slots for a zheap page + * + * This method returns all the transaction slots for the input zheap page + * including the corresponding TPD page. It also returns the corresponding + * TPD buffer if there is one. + */ +TransInfo * +GetTransactionsSlotsForPage(Relation rel, Buffer buf, int *total_trans_slots, + BlockNumber *tpd_blkno) +{ + Page page; + PageHeader phdr; + TransInfo *tpd_trans_slots; + TransInfo *trans_slots = NULL; + bool tpd_e_pruned; + + *total_trans_slots = 0; + if (tpd_blkno) + *tpd_blkno = InvalidBlockNumber; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + + if (ZHeapPageHasTPDSlot(phdr)) + { + int num_tpd_trans_slots; + + tpd_trans_slots = TPDPageGetTransactionSlots(rel, + buf, + InvalidOffsetNumber, + false, + false, + NULL, + &num_tpd_trans_slots, + NULL, + &tpd_e_pruned, + NULL); + if (!tpd_e_pruned) + { + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + if (tpd_blkno) + *tpd_blkno = last_trans_slot_info.xid_epoch; + + /* + * The last slot in page contains TPD information, so we don't need to + * include it. + */ + *total_trans_slots = num_tpd_trans_slots + ZHEAP_PAGE_TRANS_SLOTS - 1; + trans_slots = (TransInfo *) + palloc(*total_trans_slots * sizeof(TransInfo)); + /* Copy the transaction slots from the page. */ + memcpy(trans_slots, page + phdr->pd_special, + (ZHEAP_PAGE_TRANS_SLOTS - 1) * sizeof(TransInfo)); + /* Copy the transaction slots from the tpd entry. */ + memcpy((char *) trans_slots + ((ZHEAP_PAGE_TRANS_SLOTS - 1) * sizeof(TransInfo)), + tpd_trans_slots, num_tpd_trans_slots * sizeof(TransInfo)); + + pfree(tpd_trans_slots); + } + } + + if (!ZHeapPageHasTPDSlot(phdr) || tpd_e_pruned) + { + Assert (trans_slots == NULL); + + *total_trans_slots = ZHEAP_PAGE_TRANS_SLOTS; + trans_slots = (TransInfo *) + palloc(*total_trans_slots * sizeof(TransInfo)); + memcpy(trans_slots, page + phdr->pd_special, + *total_trans_slots * sizeof(TransInfo)); + } + + Assert(*total_trans_slots >= ZHEAP_PAGE_TRANS_SLOTS); + + return trans_slots; +} + +/* + * CheckAndLockTPDPage - Check and lock the TPD page before starting critical + * section. + * + * We might need to access it in ZPageAddItemExtended. Note that if the + * transaction slot belongs to TPD entry, then the TPD page must be locked during + * slot reservation. Also, if the old buffer and new buffer refers to the + * same TPD page and the old transaction slot corresponds to a TPD slot, + * the TPD page must be locked during slot reservation. + * + * XXX We can optimize this by avoid taking TPD page lock unless the page + * has some unused item which requires us to fetch the transaction + * information from TPD. 
+ */ +static inline void +CheckAndLockTPDPage(Relation relation, int new_trans_slot_id, int old_trans_slot_id, + Buffer newbuf, Buffer oldbuf) +{ + if (new_trans_slot_id <= ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) BufferGetPage(newbuf)) && + PageHasFreeLinePointers((PageHeader)BufferGetPage(newbuf))) + { + /* + * If the old buffer and new buffer refers to the same TPD page + * and the old transaction slot corresponds to a TPD slot, + * we must have locked the TPD page during slot reservation. + */ + if (ZHeapPageHasTPDSlot((PageHeader) BufferGetPage(oldbuf)) && + (old_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS)) + { + Page oldpage, newpage; + ZHeapPageOpaque oldopaque, newopaque; + BlockNumber oldtpdblk, newtpdblk; + + oldpage = BufferGetPage(oldbuf); + newpage = BufferGetPage(newbuf); + oldopaque = (ZHeapPageOpaque) PageGetSpecialPointer(oldpage); + newopaque = (ZHeapPageOpaque) PageGetSpecialPointer(newpage); + + oldtpdblk = oldopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid_epoch; + newtpdblk = newopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid_epoch; + + if (oldtpdblk != newtpdblk) + TPDPageLock(relation, newbuf); + } + else + TPDPageLock(relation, newbuf); + } +} diff --git a/src/backend/access/zheap/zheapam_handler.c b/src/backend/access/zheap/zheapam_handler.c new file mode 100644 index 0000000000..2ecbeccc0d --- /dev/null +++ b/src/backend/access/zheap/zheapam_handler.c @@ -0,0 +1,1867 @@ +/*------------------------------------------------------------------------- + * + * zheapam_handler.c + * zheap table access method code + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/zheap/zheapam_handler.c + * + * + * NOTES + * This file contains the zheap_ routines which implement + * the POSTGRES zheap table access method used for all POSTGRES + * relations. 
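+ * Most routines here are thin wrappers that adapt the zheap_* functions
+ * to the table access method callback signatures.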
+ * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + + +#include "miscadmin.h" + +#include "access/zheap.h" +#include "access/relscan.h" +#include "access/rewritezheap.h" +#include "access/tableam.h" +#include "access/tpd.h" +#include "access/tsmapi.h" +#include "access/visibilitymap.h" +#include "access/zheapscan.h" +#include "access/zheaputils.h" +#include "access/xact.h" +#include "catalog/pg_am_d.h" +#include "catalog/catalog.h" +#include "catalog/storage_xlog.h" +#include "commands/vacuum.h" +#include "pgstat.h" +#include "storage/lmgr.h" +#include "storage/bufpage.h" +#include "storage/bufmgr.h" +#include "storage/predicate.h" +#include "storage/procarray.h" +#include "storage/smgr.h" +#include "utils/builtins.h" +#include "utils/rel.h" +#include "utils/tqual.h" + + +/* ---------------------------------------------------------------- + * storage AM support routines for heapam + * ---------------------------------------------------------------- + */ + +static bool +zheapam_fetch_row_version(Relation relation, + ItemPointer tid, + Snapshot snapshot, + TupleTableSlot *slot, + Relation stats_relation) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + Buffer buffer; + + ExecClearTuple(slot); + + if (zheap_fetch(relation, snapshot, tid, &zslot->tuple, &buffer, false, stats_relation)) + { + ExecStoreZTuple(zslot->tuple, slot, buffer, true); + ReleaseBuffer(buffer); + + slot->tts_tableOid = RelationGetRelid(relation); + + return true; + } + + slot->tts_tableOid = RelationGetRelid(relation); + + return false; +} + +/* + * Insert a heap tuple from a slot, which may contain an OID and speculative + * insertion token. + */ +static void +zheapam_insert(Relation relation, TupleTableSlot *slot, CommandId cid, + int options, BulkInsertState bistate) +{ + ZHeapTuple tuple = ExecGetZHeapTupleFromSlot(slot); + + /* Update the tuple with table oid */ + slot->tts_tableOid = RelationGetRelid(relation); + if (slot->tts_tableOid != InvalidOid) + tuple->t_tableOid = slot->tts_tableOid; + + /* Perform the insertion, and copy the resulting ItemPointer */ + zheap_insert(relation, tuple, cid, options, bistate); + ItemPointerCopy(&tuple->t_self, &slot->tts_tid); +} + +static void +zheapam_insert_speculative(Relation relation, TupleTableSlot *slot, CommandId cid, + int options, BulkInsertState bistate, uint32 specToken) +{ + ZHeapTuple tuple = ExecGetZHeapTupleFromSlot(slot); + + /* Update the tuple with table oid */ + slot->tts_tableOid = RelationGetRelid(relation); + if (slot->tts_tableOid != InvalidOid) + tuple->t_tableOid = slot->tts_tableOid; + +#ifdef ZBORKED + HeapTupleHeaderSetSpeculativeToken(tuple->t_data, specToken); +#endif + + /* Perform the insertion, and copy the resulting ItemPointer */ + zheap_insert(relation, tuple, cid, options, bistate); + ItemPointerCopy(&tuple->t_self, &slot->tts_tid); +} + +static void +zheapam_complete_speculative(Relation relation, TupleTableSlot *slot, uint32 spekToken, + bool succeeded) +{ + ZHeapTuple tuple = ExecGetZHeapTupleFromSlot(slot); + + /* adjust the tuple's state accordingly */ + if (!succeeded) + zheap_finish_speculative(relation, tuple); + else + { + zheap_abort_speculative(relation, tuple); + } +} + + +static HTSU_Result +zheapam_delete(Relation relation, ItemPointer tid, CommandId cid, + Snapshot snapshot, Snapshot crosscheck, bool wait, + HeapUpdateFailureData *hufd, bool changingPart) +{ + /* + * Currently Deleting of index tuples are handled at vacuum, in case + * if the storage 
itself is cleaning the dead tuples by itself, it is + * the time to call the index tuple deletion also. + */ + return zheap_delete(relation, tid, cid, crosscheck, snapshot, wait, hufd, changingPart); +} + + +/* + * Locks tuple and fetches its newest version and TID. + * + * relation - table containing tuple + * tid - TID of tuple to lock + * snapshot - snapshot indentifying required version (used for assert check only) + * slot - tuple to be returned + * cid - current command ID (used for visibility test, and stored into + * tuple's cmax if lock is successful) + * mode - indicates if shared or exclusive tuple lock is desired + * wait_policy - what to do if tuple lock is not available + * flags – indicating how do we handle updated tuples + * *hufd - filled in failure cases + * + * Function result may be: + * HeapTupleMayBeUpdated: lock was successfully acquired + * HeapTupleInvisible: lock failed because tuple was never visible to us + * HeapTupleSelfUpdated: lock failed because tuple updated by self + * HeapTupleUpdated: lock failed because tuple updated by other xact + * HeapTupleDeleted: lock failed because tuple deleted by other xact + * HeapTupleWouldBlock: lock couldn't be acquired and wait_policy is skip + * + * In the failure cases other than HeapTupleInvisible, the routine fills + * *hufd with the tuple's t_ctid, t_xmax (resolving a possible MultiXact, + * if necessary), and t_cmax (the last only for HeapTupleSelfUpdated, + * since we cannot obtain cmax from a combocid generated by another + * transaction). + * See comments for struct HeapUpdateFailureData for additional info. + */ +static HTSU_Result +zheapam_lock_tuple(Relation relation, ItemPointer tid, Snapshot snapshot, + TupleTableSlot *slot, CommandId cid, LockTupleMode mode, + LockWaitPolicy wait_policy, uint8 flags, + HeapUpdateFailureData *hufd) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + HTSU_Result result; + Buffer buffer; + ZHeapTuple tuple = &zslot->tupdata; + bool doWeirdEval = (flags & TUPLE_LOCK_FLAG_WEIRD) != 0; + + hufd->traversed = false; + +retry: + result = zheap_lock_tuple(relation, tid, cid, mode, wait_policy, + (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) ? true : false, + doWeirdEval, + snapshot, tuple, &buffer, hufd); + + if (result == HeapTupleUpdated && + (flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION)) + { + SnapshotData SnapshotDirty; + TransactionId priorXmax = hufd->xmax; + + ReleaseBuffer(buffer); + + /* Should not encounter speculative tuple on recheck */ + Assert(!(tuple->t_data->t_infomask & ZHEAP_SPECULATIVE_INSERT)); + + /* it was updated, so look at the updated version */ + *tid = hufd->ctid; + /* updated row should have xmin matching this xmax */ + priorXmax = hufd->xmax; + + if (ItemPointerEquals(&hufd->ctid, &tuple->t_self) && false) + { + /* tuple was deleted, so give up */ + return HeapTupleDeleted; + } + + /* + * fetch target tuple + * + * Loop here to deal with updated or busy tuples + */ + InitDirtySnapshot(SnapshotDirty); + for (;;) + { + /* check whether next version would be in a different partition */ + if (ItemPointerIndicatesMovedPartitions(&hufd->ctid)) + ereport(ERROR, + (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), + errmsg("tuple to be locked was already moved to another partition due to concurrent update"))); + + if (zheap_fetch(relation, &SnapshotDirty, tid, &tuple, &buffer, true, NULL)) + { + /* + * Ensure that the tuple is same as what we are expecting. 
If the + * the current or any prior version of tuple doesn't contain the + * effect of priorXmax, then the slot must have been recycled and + * reused for an unrelated tuple. This implies that the latest + * version of the row was deleted, so we need do nothing. + */ + if (!ValidateTuplesXact(tuple, &SnapshotDirty, buffer, priorXmax, false)) + { + ReleaseBuffer(buffer); + return HeapTupleDeleted; + } + + /* otherwise xmin should not be dirty... */ + if (TransactionIdIsValid(SnapshotDirty.xmin)) + elog(ERROR, "t_xmin is uncommitted in tuple to be updated"); + + /* + * If tuple is being updated by other (sub)transaction then we have to + * wait for its commit/abort, or die trying. + */ + if (SnapshotDirty.subxid != InvalidSubTransactionId && + TransactionIdIsValid(SnapshotDirty.xmax)) + { + ReleaseBuffer(buffer); + switch (wait_policy) + { + case LockWaitBlock: + SubXactLockTableWait(SnapshotDirty.xmax, + SnapshotDirty.subxid, + relation, &tuple->t_self, + XLTW_FetchUpdated); + break; + case LockWaitSkip: + if (!ConditionalSubXactLockTableWait(SnapshotDirty.xmax, + SnapshotDirty.subxid)) + return result; /* skip instead of waiting */ + break; + case LockWaitError: + if (ConditionalSubXactLockTableWait(SnapshotDirty.xmax, + SnapshotDirty.subxid)) + ereport(ERROR, + (errcode(ERRCODE_LOCK_NOT_AVAILABLE), + errmsg("could not obtain lock on row in relation \"%s\"", + RelationGetRelationName(relation)))); + + break; + } + continue; /* loop back to repeat zheap_fetch */ + } + else if (TransactionIdIsValid(SnapshotDirty.xmax)) + { + ReleaseBuffer(buffer); + switch (wait_policy) + { + case LockWaitBlock: + XactLockTableWait(SnapshotDirty.xmax, relation, + &tuple->t_self, XLTW_FetchUpdated); + break; + case LockWaitSkip: + if (!ConditionalXactLockTableWait(SnapshotDirty.xmax)) + return result; /* skip instead of waiting */ + break; + case LockWaitError: + if (!ConditionalXactLockTableWait(SnapshotDirty.xmax)) + ereport(ERROR, + (errcode(ERRCODE_LOCK_NOT_AVAILABLE), + errmsg("could not obtain lock on row in relation \"%s\"", + RelationGetRelationName(relation)))); + break; + } + continue; /* loop back to repeat zheap_fetch */ + } + + /* + * If tuple was inserted by our own transaction, we have to check + * cmin against es_output_cid: cmin >= current CID means our + * command cannot see the tuple, so we should ignore it. Otherwise + * zheap_lock_tuple() will throw an error, and so would any later + * attempt to update or delete the tuple. (We need not check cmax + * because ZHeapTupleSatisfiesDirty will consider a tuple deleted + * by our transaction dead, regardless of cmax.) We just checked + * that priorXmax == xmin, so we can test that variable instead of + * doing ZHeapTupleHeaderGetXid again. + */ + if (TransactionIdIsCurrentTransactionId(priorXmax)) + { + LockBuffer(buffer, BUFFER_LOCK_SHARE); + /* + * Fixme -If the tuple is updated such that its transaction slot + * has been changed, then we will never be able to get the correct + * tuple from undo. To avoid, that we need to get the latest tuple + * from page rather than relying on it's in-memory copy. See + * ValidateTuplesXact. + */ + if (ZHeapTupleGetCid(tuple, buffer, InvalidUndoRecPtr, + InvalidXactSlotId) >= cid) + { + UnlockReleaseBuffer(buffer); + // ZBORKED: is this correct? 
+ return HeapTupleSelfUpdated; + } + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + } + + doWeirdEval = true; + hufd->traversed = true; + ReleaseBuffer(buffer); + goto retry; + } + + /* + * If we don't get any tuple, the latest version of the row must have + * been deleted, so we need do nothing. + */ + if (tuple == NULL) + { + ReleaseBuffer(buffer); + return HeapTupleDeleted; + } + + /* Ensure that the tuple is same as what we are expecting as above. */ + if (!ValidateTuplesXact(tuple, &SnapshotDirty, buffer, priorXmax, true)) + { + if (BufferIsValid(buffer)) + ReleaseBuffer(buffer); + return HeapTupleDeleted; + } + + /* check whether next version would be in a different partition */ + if (ZHeapTupleIsMoved(tuple->t_data->t_infomask)) + ereport(ERROR, + (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), + errmsg("tuple to be locked was already moved to another partition due to concurrent update"))); + + if (ItemPointerEquals(&(tuple->t_self), tid)) + { + /* deleted, so forget about it */ + ReleaseBuffer(buffer); + return HeapTupleDeleted; + } + + /* updated row should have xid matching this xmax */ + ZHeapTupleGetTransInfo(tuple, buffer, NULL, NULL, &priorXmax, NULL, + NULL, true); + + /* + * As we still hold a snapshot to which priorXmax is not visible, neither + * the transaction slot on tuple can be marked as frozen nor the + * corresponding undo be discarded. + */ + Assert(TransactionIdIsValid(priorXmax)); + + priorXmax = hufd->xmax; + + /* be tidy */ + zheap_freetuple(tuple); + ReleaseBuffer(buffer); + /* loop back to fetch next in chain */ + } + } + + slot->tts_tableOid = RelationGetRelid(relation); + ExecStoreZTuple(tuple, slot, buffer, false); + ReleaseBuffer(buffer); // FIXME: invent option to just transfer pin? + + return result; +} + + +static HTSU_Result +zheapam_update(Relation relation, ItemPointer otid, TupleTableSlot *slot, + CommandId cid, Snapshot snapshot, Snapshot crosscheck, + bool wait, HeapUpdateFailureData *hufd, LockTupleMode *lockmode, + bool *update_indexes) +{ + ZHeapTuple ztuple = ExecGetZHeapTupleFromSlot(slot); + HTSU_Result result; + + /* Update the tuple with table oid */ + if (slot->tts_tableOid != InvalidOid) + ztuple->t_tableOid = slot->tts_tableOid; + + result = zheap_update(relation, otid, ztuple, cid, crosscheck, snapshot, + wait, hufd, lockmode); + ItemPointerCopy(&ztuple->t_self, &slot->tts_tid); + + slot->tts_tableOid = RelationGetRelid(relation); + + /* + * Note: instead of having to update the old index tuples associated with + * the heap tuple, all we do is form and insert new index tuples. This is + * because UPDATEs are actually DELETEs and INSERTs, and index tuple + * deletion is done later by VACUUM (see notes in ExecDelete). All we do + * here is insert new index tuples. -cim 9/27/89 + */ + + /* + * insert index entries for tuple + * + * Note: heap_update returns the tid (location) of the new tuple in the + * t_self field. + * + * If it's a HOT update, we mustn't insert new index entries. 
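+ *
+ * For zheap the analogous case is an in-place update: the TID does not
+ * change, so no new index entries are needed.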
+ */ + *update_indexes = result == HeapTupleMayBeUpdated && + !ZHeapTupleIsInPlaceUpdated(ztuple->t_data->t_infomask); + + return result; +} + +static const TupleTableSlotOps * +zheapam_slot_callbacks(Relation relation) +{ + return &TTSOpsZHeapTuple; +} + +static bool +zheapam_satisfies(Relation rel, TupleTableSlot *slot, Snapshot snapshot) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + Buffer buffer; + Page page; + ItemId lp; + ItemPointer tid; + ZHeapTupleData zhtup; + ZHeapTuple tup; + Oid tableoid; + bool res; + + Assert(TTS_IS_ZHEAP(slot)); + Assert(zslot->tuple); + + tableoid = zslot->tuple->t_tableOid; + tid = &(zslot->tuple->t_self); + + buffer = ReadBuffer(rel, ItemPointerGetBlockNumber(tid)); + LockBuffer(buffer, BUFFER_LOCK_SHARE); + + page = BufferGetPage(buffer); + lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid)); + + /* + * Since the current transaction has inserted/updated the tuple, it + * can't be deleted. + */ + Assert(ItemIdIsNormal(lp)); + + zhtup.t_tableOid = tableoid; + zhtup.t_data = (ZHeapTupleHeader) PageGetItem((Page) page, lp); + zhtup.t_len = ItemIdGetLength(lp); + zhtup.t_self = *tid; + + tup = ZHeapTupleSatisfies(&zhtup, snapshot, buffer, tid); + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + ReleaseBuffer(buffer); + + if (!tup) + { + /* satisfies routine returned no tuple, so clearly invisible */ + res = false; + } + else if (tup->t_len != zslot->tuple->t_len) + { + /* length differs, the input tuple can't be visible */ + res = false; + } + else if (memcmp(tup->t_data, zslot->tuple->t_data, zhtup.t_len) != 0) + { + /* + * ZBORKED: compare tuple contents, to be sure the tuple returned by + * the visibility routine is the input tuple. There *got* to be a + * better solution than this. + */ + res = false; + } + else + res = true; + + if (tup && tup != &zhtup) + pfree(tup); + + return res; +} + +static IndexFetchTableData* +zheapam_begin_index_fetch(Relation rel) +{ + IndexFetchHeapData *hscan = palloc0(sizeof(IndexFetchHeapData)); + + hscan->xs_base.rel = rel; + hscan->xs_cbuf = InvalidBuffer; + //hscan->xs_continue_hot = false; + + return &hscan->xs_base; +} + + +static void +zheapam_reset_index_fetch(IndexFetchTableData* scan) +{ + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan; + + if (BufferIsValid(hscan->xs_cbuf)) + { + ReleaseBuffer(hscan->xs_cbuf); + hscan->xs_cbuf = InvalidBuffer; + } + + //hscan->xs_continue_hot = false; +} + +static void +zheapam_end_index_fetch(IndexFetchTableData* scan) +{ + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan; + + zheapam_reset_index_fetch(scan); + + pfree(hscan); +} + +static bool +zheapam_fetch_follow(struct IndexFetchTableData *scan, + ItemPointer tid, + Snapshot snapshot, + TupleTableSlot *slot, + bool *call_again, bool *all_dead) +{ + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan; + ZHeapTuple zheapTuple = NULL; + + /* + * No HOT chains in zheap. 
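+ * A single fetch per TID is therefore sufficient, and call_again must
+ * never be set on entry.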
+ */ + Assert(!*call_again); + + /* Switch to correct buffer if we don't have it already */ + hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf, + hscan->xs_base.rel, + ItemPointerGetBlockNumber(tid)); + + LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE); + zheapTuple = zheap_search_buffer(tid, hscan->xs_base.rel, + hscan->xs_cbuf, + snapshot, + all_dead); + LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_UNLOCK); + + if (zheapTuple) + { + slot->tts_tableOid = RelationGetRelid(scan->rel); + ExecStoreZTuple(zheapTuple, slot, hscan->xs_cbuf, false); + } + + return zheapTuple != NULL; +} + +/* + * Similar to IndexBuildHeapRangeScan, but for zheap relations. + */ +static double +IndexBuildZHeapRangeScan(Relation heapRelation, + Relation indexRelation, + IndexInfo *indexInfo, + bool allow_sync, + bool anyvisible, + BlockNumber start_blockno, + BlockNumber numblocks, + IndexBuildCallback callback, + void *callback_state, + TableScanDesc sscan) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + bool is_system_catalog; + bool checking_uniqueness; + HeapTuple heapTuple; + ZHeapTuple zheapTuple; + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + double reltuples; + ExprState *predicate; + TupleTableSlot *slot; + ZHeapTupleTableSlot *zslot; + EState *estate; + ExprContext *econtext; + Snapshot snapshot; + TransactionId OldestXmin; + bool need_unregister_snapshot = false; + SubTransactionId subxid_xwait = InvalidSubTransactionId; + + /* + * sanity checks + */ + Assert(OidIsValid(indexRelation->rd_rel->relam)); + Assert(RelationStorageIsZHeap(heapRelation)); + + /* Remember if it's a system catalog */ + is_system_catalog = IsSystemRelation(heapRelation); + + /* See whether we're verifying uniqueness/exclusion properties */ + checking_uniqueness = (indexInfo->ii_Unique || + indexInfo->ii_ExclusionOps != NULL); + + /* + * "Any visible" mode is not compatible with uniqueness checks; make sure + * only one of those is requested. + */ + Assert(!(anyvisible && checking_uniqueness)); + + /* + * Need an EState for evaluation of index expressions and partial-index + * predicates. Also a slot to hold the current tuple. + */ + estate = CreateExecutorState(); + econtext = GetPerTupleExprContext(estate); + slot = table_gimmegimmeslot(heapRelation, NULL); + zslot = (ZHeapTupleTableSlot *) slot; + + /* Arrange for econtext's scan tuple to be the tuple under test */ + econtext->ecxt_scantuple = slot; + + /* Set up execution state for predicate, if any. */ + predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate); + + + heapTuple = (HeapTuple) palloc0(SizeofHeapTupleHeader); + + /* + * Prepare for scan of the base relation. In a normal index build, we use + * SnapshotAny because we must retrieve all tuples and do our own time + * qual checks (because we have to index RECENTLY_DEAD tuples). In a + * concurrent build, or during bootstrap, we take a regular MVCC snapshot + * and index whatever's live according to that. + */ + OldestXmin = InvalidTransactionId; + + /* okay to ignore lazy VACUUMs here */ + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent) + OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM); + + if (!scan) + { + /* + * Serial index build. + * + * Must begin our own heap scan in this case. We may also need to + * register a snapshot whose lifetime is under our direct control. 
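+ *
+ * OldestXmin is invalid only for concurrent builds and during bootstrap;
+ * in that case we scan with a registered MVCC snapshot. Otherwise we use
+ * SnapshotAny and do our own visibility checks with
+ * ZHeapTupleSatisfiesOldestXmin below.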
+ */ + if (!TransactionIdIsValid(OldestXmin)) + { + snapshot = RegisterSnapshot(GetTransactionSnapshot()); + need_unregister_snapshot = true; + } + else + snapshot = SnapshotAny; + + sscan = table_beginscan_strat(heapRelation, /* relation */ + snapshot, /* snapshot */ + 0, /* number of keys */ + NULL, /* scan key */ + true, /* buffer access strategy OK */ + allow_sync); /* syncscan OK? */ + scan = (ZHeapScanDesc) sscan; + } + else + { + /* + * Parallel index build. + * + * Parallel case never registers/unregisters own snapshot. Snapshot + * is taken from parallel heap scan, and is SnapshotAny or an MVCC + * snapshot, based on same criteria as serial case. + */ + Assert(!IsBootstrapProcessingMode()); + Assert(allow_sync); + snapshot = scan->rs_scan.rs_snapshot; + } + + /* + * Must call GetOldestXmin() with SnapshotAny. Should never call + * GetOldestXmin() with MVCC snapshot. (It's especially worth checking + * this for parallel builds, since ambuild routines that support parallel + * builds must work these details out for themselves.) + */ + Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot)); + Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) : + !TransactionIdIsValid(OldestXmin)); + Assert(snapshot == SnapshotAny || !anyvisible); + + /* set our scan endpoints */ + if (!allow_sync) + zheap_setscanlimits(sscan, start_blockno, numblocks); + else + { + /* syncscan can only be requested on whole relation */ + Assert(start_blockno == 0); + start_blockno = ZHEAP_METAPAGE + 1; + Assert(numblocks == InvalidBlockNumber); + } + + reltuples = 0; + + /* + * Scan all tuples in the base relation. + */ + // ZBORKED: move to slot API + while ((zheapTuple = zheap_getnext(sscan, ForwardScanDirection)) != NULL) + { + bool tupleIsAlive; + ZHeapTuple targztuple = NULL; + + CHECK_FOR_INTERRUPTS(); + + if (snapshot == SnapshotAny) + { + /* do our own time qual check */ + bool indexIt; + TransactionId xwait; + + recheck: + + /* + * We could possibly get away with not locking the buffer here, + * since caller should hold ShareLock on the relation, but let's + * be conservative about it. + */ + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + + targztuple = zheap_copytuple(zheapTuple); + switch (ZHeapTupleSatisfiesOldestXmin(&targztuple, OldestXmin, + scan->rs_cbuf, &xwait, &subxid_xwait)) + { + case HEAPTUPLE_DEAD: + /* Definitely dead, we can ignore it */ + indexIt = false; + tupleIsAlive = false; + break; + case HEAPTUPLE_LIVE: + /* Normal case, index and unique-check it */ + indexIt = true; + tupleIsAlive = true; + break; + case HEAPTUPLE_RECENTLY_DEAD: + /* + * If tuple is recently deleted then we must index it + * anyway to preserve MVCC semantics. (Pre-existing + * transactions could try to use the index after we finish + * building it, and may need to see such tuples.) + */ + indexIt = true; + tupleIsAlive = false; + break; + case HEAPTUPLE_INSERT_IN_PROGRESS: + + /* + * In "anyvisible" mode, this tuple is visible and we + * don't need any further checks. + */ + if (anyvisible) + { + indexIt = true; + tupleIsAlive = true; + break; + } + + /* + * Since caller should hold ShareLock or better, normally + * the only way to see this is if it was inserted earlier + * in our own transaction. However, it can happen in + * system catalogs, since we tend to release write lock + * before commit there. Give a warning if neither case + * applies. 
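+ *
+ * Note that in zheap the in-progress inserter may be identified at
+ * subtransaction level, in which case we wait on the subtransaction
+ * lock below.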
+ */ + if (!TransactionIdIsCurrentTransactionId(xwait)) + { + if (!is_system_catalog) + elog(WARNING, "concurrent insert in progress within table \"%s\"", + RelationGetRelationName(heapRelation)); + + /* + * If we are performing uniqueness checks, indexing + * such a tuple could lead to a bogus uniqueness + * failure. In that case we wait for the inserting + * transaction to finish and check again. + */ + if (checking_uniqueness) + { + /* + * Must drop the lock on the buffer before we wait + */ + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + if (subxid_xwait != InvalidSubTransactionId) + SubXactLockTableWait(xwait, subxid_xwait, heapRelation, + &zheapTuple->t_self, + XLTW_InsertIndexUnique); + else + XactLockTableWait(xwait, heapRelation, + &zheapTuple->t_self, + XLTW_InsertIndexUnique); + CHECK_FOR_INTERRUPTS(); + + if (targztuple != NULL) + pfree(targztuple); + + goto recheck; + } + } + + /* + * We must index such tuples, since if the index build + * commits then they're good. + */ + indexIt = true; + tupleIsAlive = true; + break; + case HEAPTUPLE_DELETE_IN_PROGRESS: + + /* + * As with INSERT_IN_PROGRESS case, this is unexpected + * unless it's our own deletion or a system catalog; but + * in anyvisible mode, this tuple is visible. + */ + if (anyvisible) + { + indexIt = true; + tupleIsAlive = false; + break; + } + + if (!TransactionIdIsCurrentTransactionId(xwait)) + { + if (!is_system_catalog) + elog(WARNING, "concurrent insert in progress within table \"%s\"", + RelationGetRelationName(heapRelation)); + + /* + * If we are performing uniqueness checks, indexing + * such a tuple could lead to a bogus uniqueness + * failure. In that case we wait for the inserting + * transaction to finish and check again. + */ + if (checking_uniqueness) + { + /* + * Must drop the lock on the buffer before we wait + */ + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + if (subxid_xwait != InvalidTransactionId) + SubXactLockTableWait(xwait, subxid_xwait, + heapRelation, + &zheapTuple->t_self, + XLTW_InsertIndexUnique); + else + XactLockTableWait(xwait, heapRelation, + &zheapTuple->t_self, + XLTW_InsertIndexUnique); + CHECK_FOR_INTERRUPTS(); + + if (targztuple != NULL) + pfree(targztuple); + + goto recheck; + } + + /* + * Otherwise index it but don't check for uniqueness, + * the same as a RECENTLY_DEAD tuple. + */ + indexIt = true; + } + else + { + /* + * It's a regular tuple deleted by our own xact. Index + * it but don't check for uniqueness, the same as a + * RECENTLY_DEAD tuple. + */ + indexIt = true; + } + /* In any case, exclude the tuple from unique-checking */ + tupleIsAlive = false; + break; + default: + elog(ERROR, "unexpected ZHeapTupleSatisfiesOldestXmin result"); + indexIt = tupleIsAlive = false; /* keep compiler quiet */ + break; + } + + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + + if (!indexIt) + continue; + } + else + { + /* zheap_getnext did the time qual check */ + tupleIsAlive = true; + targztuple = zheapTuple; + } + + reltuples += 1; + + MemoryContextReset(econtext->ecxt_per_tuple_memory); + + /* Set up for predicate or expression evaluation */ + /* ZBORKED: shouldfree = true if !scan->rs_pagescan.rs_pageatatime */ + ExecStoreZTuple(zheapTuple, slot, InvalidBuffer, false); + + /* + * In a partial index, discard tuples that don't satisfy the + * predicate. + */ + if (predicate != NULL) + { + if (!ExecQual(predicate, econtext)) + { + /* + * For SnapshotAny, targztuple is locally palloced above. So, + * free it. 
+ */ + if (snapshot == SnapshotAny && targztuple != NULL) + pfree(targztuple); + continue; + } + } + + /* + * For the current tuple, extract all the attributes we use in this + * index, and note which are null. This also performs evaluation + * of any expressions needed. + * + * NOTE: We can't free the zheap tuple fetched by the scan method before + * next iteration since this tuple is also referenced by scan->rs_cztup. + * which is used by zheap scan API's to fetch the next tuple. But, for + * forming and creating the index, we've to store the correct version of + * the tuple in the slot. Hence, after forming the index and calling the + * callback function, we restore the zheap tuple fetched by the scan + * method in the slot. + */ + zslot->tuple = targztuple; + FormIndexDatum(indexInfo, + slot, + estate, + values, + isnull); + + /* + * FIXME: buildCallback functions accepts heaptuple as an argument. But, + * it needs only the tid. So, we set t_self for the zheap tuple and call + * the AM's callback. + */ + heapTuple->t_self = zheapTuple->t_self; + + /* Call the AM's callback routine to process the tuple */ + callback(indexRelation, heapTuple, values, isnull, tupleIsAlive, + callback_state); + + zslot->tuple = zheapTuple; + + /* + * For SnapshotAny, targztuple is locally palloced above. So, + * free it. + */ + if (snapshot == SnapshotAny && targztuple != NULL) + pfree(targztuple); + } + + table_endscan(sscan); + + /* we can now forget our snapshot, if set and registered by us */ + if (need_unregister_snapshot) + UnregisterSnapshot(snapshot); + + ExecDropSingleTupleTableSlot(slot); + + pfree(heapTuple); + + /* These may have been pointing to the now-gone estate */ + indexInfo->ii_ExpressionsState = NIL; + indexInfo->ii_PredicateState = NULL; + + return reltuples; +} + +/* + * validate_index_zheapscan - second table scan for concurrent index build + * + * This has much code in common with IndexBuildZHeapScan, but it's enough + * different that it seems cleaner to have two routines not one. + */ +static void +validate_index_zheapscan(Relation heapRelation, + Relation indexRelation, + IndexInfo *indexInfo, + Snapshot snapshot, + ValidateIndexState *state) +{ + TableScanDesc sscan; + HeapScanDesc scan; + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + ExprState *predicate; + TupleTableSlot *slot; + EState *estate; + ExprContext *econtext; + bool in_index[MaxHeapTuplesPerPage]; + + /* state variables for the merge */ + ItemPointer indexcursor = NULL; + ItemPointerData decoded; + bool tuplesort_empty = false; + + /* + * sanity checks + */ + Assert(OidIsValid(indexRelation->rd_rel->relam)); + + /* + * Need an EState for evaluation of index expressions and partial-index + * predicates. Also a slot to hold the current tuple. + */ + estate = CreateExecutorState(); + econtext = GetPerTupleExprContext(estate); + slot = table_gimmegimmeslot(heapRelation, NULL); + + /* Arrange for econtext's scan tuple to be the tuple under test */ + econtext->ecxt_scantuple = slot; + + /* Set up execution state for predicate, if any. */ + predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate); + + /* + * Prepare for scan of the base relation. We need just those tuples + * satisfying the passed-in reference snapshot. We must disable syncscan + * here, because it's critical that we read from block zero forward to + * match the sorted TIDs. 
+ */ + sscan = table_beginscan_strat(heapRelation, /* relation */ + snapshot, /* snapshot */ + 0, /* number of keys */ + NULL, /* scan key */ + true, /* buffer access strategy OK */ + false); /* syncscan not OK */ + scan = (HeapScanDesc) sscan; + + /* + * Scan all tuples matching the snapshot. + */ + while (zheap_getnextslot(sscan, ForwardScanDirection, slot)) + { + OffsetNumber offnum = ItemPointerGetOffsetNumber(&slot->tts_tid); + + CHECK_FOR_INTERRUPTS(); + + state->htups += 1; + + /* + * "merge" by skipping through the index tuples until we find or pass + * the current tuple. + */ + while (!tuplesort_empty && + (!indexcursor || + ItemPointerCompare(indexcursor, &slot->tts_tid) < 0)) + { + Datum ts_val; + bool ts_isnull; + + if (indexcursor) + { + /* + * Remember index items seen earlier on the current heap page + */ + if (ItemPointerGetBlockNumber(indexcursor) == scan->rs_cblock) + in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true; + } + + tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true, + &ts_val, &ts_isnull, NULL); + Assert(tuplesort_empty || !ts_isnull); + if (!tuplesort_empty) + { + itemptr_decode(&decoded, DatumGetInt64(ts_val)); + indexcursor = &decoded; + + /* If int8 is pass-by-ref, free (encoded) TID Datum memory */ +#ifndef USE_FLOAT8_BYVAL + pfree(DatumGetPointer(ts_val)); +#endif + } + else + { + /* Be tidy */ + indexcursor = NULL; + } + } + + /* + * If the tuplesort has overshot *and* we didn't see a match earlier, + * then this tuple is missing from the index, so insert it. + */ + if ((tuplesort_empty || + ItemPointerCompare(indexcursor, &slot->tts_tid) > 0) && + !in_index[offnum - 1]) + { + + /* Set up for predicate or expression evaluation */ + + /* + * In a partial index, discard tuples that don't satisfy the + * predicate. + */ + if (predicate != NULL) + { + if (!ExecQual(predicate, econtext)) + continue; + } + + /* + * For the current heap tuple, extract all the attributes we use + * in this index, and note which are null. This also performs + * evaluation of any expressions needed. + */ + FormIndexDatum(indexInfo, + slot, + estate, + values, + isnull); + + /* + * You'd think we should go ahead and build the index tuple here, + * but some index AMs want to do further processing on the data + * first. So pass the values[] and isnull[] arrays, instead. + */ + + /* + * If the tuple is already committed dead, you might think we + * could suppress uniqueness checking, but this is no longer true + * in the presence of HOT, because the insert is actually a proxy + * for a uniqueness check on the whole HOT-chain. That is, the + * tuple we have here could be dead because it was already + * HOT-updated, and if so the updating transaction will not have + * thought it should insert index entries. The index AM will + * check the whole HOT-chain and correctly detect a conflict if + * there is one. + */ + + index_insert(indexRelation, + values, + isnull, + &slot->tts_tid, + heapRelation, + indexInfo->ii_Unique ? 
+ UNIQUE_CHECK_YES : UNIQUE_CHECK_NO, + indexInfo); + + state->tups_inserted += 1; + + MemoryContextReset(econtext->ecxt_per_tuple_memory); + } + } + + table_endscan(sscan); + + ExecDropSingleTupleTableSlot(slot); + + FreeExecutorState(estate); + + /* These may have been pointing to the now-gone estate */ + indexInfo->ii_ExpressionsState = NIL; + indexInfo->ii_PredicateState = NULL; +} + +static void +zheapam_scan_analyze_next_block(TableScanDesc sscan, BlockNumber blockno, BufferAccessStrategy bstrategy) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + + /* + * We must maintain a pin on the target page's buffer to ensure that + * the maxoffset value stays good (else concurrent VACUUM might delete + * tuples out from under us). Hence, pin the page until we are done + * looking at it. We also choose to hold sharelock on the buffer + * throughout --- we could release and re-acquire sharelock for each + * tuple, but since we aren't doing much work per tuple, the extra + * lock traffic is probably better avoided. + */ + scan->rs_cblock = blockno; + scan->rs_cindex = FirstOffsetNumber; + if (blockno != ZHEAP_METAPAGE) + { + scan->rs_cbuf = ReadBufferExtended(scan->rs_scan.rs_rd, MAIN_FORKNUM, blockno, + RBM_NORMAL, bstrategy); + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + } +} + +static bool +zheapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, double *liverows, double *deadrows, TupleTableSlot *slot) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + Page targpage; + OffsetNumber maxoffset; + ZHeapTupleTableSlot *zslot; + + Assert(TTS_IS_ZHEAP(slot)); + zslot = (ZHeapTupleTableSlot *) slot; + + if (scan->rs_cblock == ZHEAP_METAPAGE) + return false; + + targpage = BufferGetPage(scan->rs_cbuf); + maxoffset = PageGetMaxOffsetNumber(targpage); + + /* Skip TPD pages for zheap relations. */ + if (PageGetSpecialSize(targpage) == sizeof(TPDPageOpaqueData)) + { + UnlockReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + + return false; + } + + /* Inner loop over all tuples on the selected page */ + for (; scan->rs_cindex <= maxoffset; scan->rs_cindex++) + { + ItemId itemid; + ZHeapTuple targtuple = &zslot->tupdata; + Size targztuple_len; + bool sample_it = false; + TransactionId xid; + + itemid = PageGetItemId(targpage, scan->rs_cindex); + + /* + * For zheap, we need to count delete committed rows towards + * dead rows which would have been same, if the tuple was + * present in heap. + */ + if (ItemIdIsDeleted(itemid)) + { + *deadrows += 1; + continue; + } + + /* + * We ignore unused and redirect line pointers. DEAD line + * pointers should be counted as dead, because we need vacuum to + * run to get rid of them. Note that this rule agrees with the + * way that heap_page_prune() counts things. 
+ */ + if (!ItemIdIsNormal(itemid)) + { + if (ItemIdIsDead(itemid)) + *deadrows += 1; + continue; + } + + targztuple_len = ItemIdGetLength(itemid); + targtuple->t_len = targztuple_len; + targtuple->t_data = palloc(targztuple_len); + targtuple->t_tableOid = RelationGetRelid(scan->rs_scan.rs_rd); + + ItemPointerSet(&targtuple->t_self, scan->rs_cblock, scan->rs_cindex); + memcpy(targtuple->t_data, + ((ZHeapTupleHeader) PageGetItem((Page) targpage, itemid)), + targztuple_len); + + switch (ZHeapTupleSatisfiesOldestXmin(&targtuple, OldestXmin, scan->rs_cbuf, &xid, NULL)) + { + case HEAPTUPLE_LIVE: + sample_it = true; + *liverows += 1; + break; + + case HEAPTUPLE_DEAD: + case HEAPTUPLE_RECENTLY_DEAD: + /* Count dead and recently-dead rows */ + *deadrows += 1; + break; + + case HEAPTUPLE_INSERT_IN_PROGRESS: + + /* + * Insert-in-progress rows are not counted. We assume + * that when the inserting transaction commits or aborts, + * it will send a stats message to increment the proper + * count. This works right only if that transaction ends + * after we finish analyzing the table; if things happen + * in the other order, its stats update will be + * overwritten by ours. However, the error will be large + * only if the other transaction runs long enough to + * insert many tuples, so assuming it will finish after us + * is the safer option. + * + * A special case is that the inserting transaction might + * be our own. In this case we should count and sample + * the row, to accommodate users who load a table and + * analyze it in one transaction. (pgstat_report_analyze + * has to adjust the numbers we send to the stats + * collector to make this come out right.) + */ + if (TransactionIdIsCurrentTransactionId(xid)) + { + sample_it = true; + *liverows += 1; + } + break; + + case HEAPTUPLE_DELETE_IN_PROGRESS: + + /* + * We count delete-in-progress rows as still live, using + * the same reasoning given above; but we don't bother to + * include them in the sample. + * + * If the delete was done by our own transaction, however, + * we must count the row as dead to make + * pgstat_report_analyze's stats adjustments come out + * right. (Note: this works out properly when the row was + * both inserted and deleted in our xact.) + */ + if (TransactionIdIsCurrentTransactionId(xid)) + *deadrows += 1; + else + *liverows += 1; + break; + + default: + elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result"); + break; + } + + if (sample_it) + { + ExecStoreZTuple(targtuple, slot, InvalidBuffer, false); + scan->rs_cindex++; + + /* note that we leave the buffer locked here! 
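The share lock and pin taken in zheapam_scan_analyze_next_block are deliberately carried over to the next call; they are only dropped in the page-exhausted path below, so the page contents and its max offset stay stable while the caller consumes the sampled tuple.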
*/ + return true; + } + } + + /* Now release the lock and pin on the page */ + UnlockReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + + return false; +} + +static bool +zheap_scan_sample_next_block(TableScanDesc sscan, struct SampleScanState *scanstate) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + TsmRoutine *tsm = scanstate->tsmroutine; + BlockNumber blockno; + + /* return false immediately if relation is empty */ + if (scan->rs_scan.rs_nblocks == 0) + return false; + +nextblock: + if (tsm->NextSampleBlock) + { + blockno = tsm->NextSampleBlock(scanstate, scan->rs_scan.rs_nblocks); + scan->rs_cblock = blockno; + } + else + { + /* scanning table sequentially */ + + if (scan->rs_cblock == InvalidBlockNumber) + { + Assert(!scan->rs_inited); + blockno = scan->rs_scan.rs_startblock; + } + else + { + Assert(scan->rs_inited); + + blockno = scan->rs_cblock + 1; + + if (blockno >= scan->rs_scan.rs_nblocks) + { + /* wrap to begining of rel, might not have started at 0 */ + blockno = 0; + } + + /* + * Report our new scan position for synchronization purposes. + * + * Note: we do this before checking for end of scan so that the + * final state of the position hint is back at the start of the + * rel. That's not strictly necessary, but otherwise when you run + * the same query multiple times the starting position would shift + * a little bit backwards on every invocation, which is confusing. + * We don't guarantee any specific ordering in general, though. + */ + if (scan->rs_scan.rs_syncscan) + ss_report_location(scan->rs_scan.rs_rd, blockno); + + if (blockno == scan->rs_scan.rs_startblock) + { + blockno = InvalidBlockNumber; + } + } + } + + if (!BlockNumberIsValid(blockno)) + { + if (BufferIsValid(scan->rs_cbuf)) + ReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + scan->rs_cblock = InvalidBlockNumber; + scan->rs_inited = false; + + return false; + } + + scan->rs_inited = true; + + /* + * If the target block isn't valid, e.g. because it's a tpd page, got to + * the next block. + */ + if (!zheapgetpage(sscan, blockno)) + { + CHECK_FOR_INTERRUPTS(); + goto nextblock; + } + + return true; +} + +static bool +zheap_scan_sample_next_tuple(TableScanDesc sscan, struct SampleScanState *scanstate, TupleTableSlot *slot) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + TsmRoutine *tsm = scanstate->tsmroutine; + BlockNumber blockno = scan->rs_cblock; + bool pagemode = scan->rs_scan.rs_pageatatime; + Page page; + bool all_visible; + OffsetNumber maxoffset; + uint8 vmstatus; + Buffer vmbuffer = InvalidBuffer; + + ExecClearTuple(slot); + + if (scan->rs_cindex == -1) + return false; + + /* + * When not using pagemode, we must lock the buffer during tuple + * visibility checks. + */ + if (!pagemode) + { + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + page = (Page) BufferGetPage(scan->rs_cbuf); + maxoffset = PageGetMaxOffsetNumber(page); + + if (!scan->rs_scan.rs_snapshot->takenDuringRecovery) + { + vmstatus = visibilitymap_get_status(scan->rs_scan.rs_rd, + BufferGetBlockNumber(scan->rs_cbuf), + &vmbuffer); + + all_visible = vmstatus; + + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + } + else + all_visible = false; + } + else + { + maxoffset = scan->rs_ntuples; + } + + for (;;) + { + OffsetNumber tupoffset; + + CHECK_FOR_INTERRUPTS(); + + /* Ask the tablesample method which tuples to check on this page. 
*/ + tupoffset = tsm->NextSampleTuple(scanstate, + blockno, + maxoffset); + + if (OffsetNumberIsValid(tupoffset)) + { + ZHeapTuple tuple; + + if (!pagemode) + { + ItemId itemid; + bool visible; + ZHeapTuple loctup = NULL; + Size loctup_len; + ItemPointerData tid; + + /* Skip invalid tuple pointers. */ + itemid = PageGetItemId(page, tupoffset); + if (!ItemIdIsNormal(itemid)) + continue; + + ItemPointerSet(&tid, blockno, tupoffset); + loctup_len = ItemIdGetLength(itemid); + + loctup = palloc(ZHEAPTUPLESIZE + loctup_len); + loctup->t_data = (ZHeapTupleHeader) ((char *) loctup + + ZHEAPTUPLESIZE); + + loctup->t_tableOid = RelationGetRelid(scan->rs_scan.rs_rd); + loctup->t_len = loctup_len; + loctup->t_self = tid; + + /* + * We always need to make a copy of zheap tuple as once we release + * the buffer an in-place update can change the tuple. + */ + memcpy(loctup->t_data, + ((ZHeapTupleHeader) PageGetItem((Page) page, itemid)), + loctup->t_len); + + if (all_visible) + { + tuple = loctup; + visible = true; + } + else + { + tuple = ZHeapTupleSatisfies(loctup, + scan->rs_scan.rs_snapshot, + scan->rs_cbuf, + NULL); + visible = (tuple != NULL); + } + + /* + * If any prior version is visible, we pass latest visible as + * true. The state of latest version of tuple is determined by + * the called function. + * + * Note that, it's possible that tuple is updated in-place and + * we're seeing some prior version of that. We handle that case + * in ZHeapTupleHasSerializableConflictOut. + */ + CheckForSerializableConflictOut(visible, scan->rs_scan.rs_rd, (void *) &tid, + scan->rs_cbuf, scan->rs_scan.rs_snapshot); + + /* Try next tuple from same page. */ + if (!visible) + continue; + + ExecStoreZTuple(tuple, slot, InvalidBuffer, false); + + /* Found visible tuple, return it. */ + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + + /* Count successfully-fetched tuples as heap fetches */ + pgstat_count_heap_getnext(scan->rs_scan.rs_rd); + + return true; + } + else + { + tuple = scan->rs_visztuples[tupoffset - 1]; + if (tuple == NULL) + continue; + + ExecStoreZTuple(tuple, slot, InvalidBuffer, false); + + return true; + } + } + else + { + /* + * If we get here, it means we've exhausted the items on this page and + * it's time to move to the next. + */ + if (!pagemode) + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + + break; + } + } + + return false; +} + +static void +zheap_copy_for_cluster(Relation OldHeap, Relation NewHeap, Relation OldIndex, + bool use_sort, + TransactionId OldestXmin, TransactionId FreezeXid, MultiXactId MultiXactCutoff, + double *num_tuples, double *tups_vacuumed, double *tups_recently_dead) +{ + RewriteZheapState rwstate; + IndexScanDesc indexScan; + TableScanDesc heapScan; + bool use_wal; + Tuplesortstate *tuplesort; + TupleDesc oldTupDesc = RelationGetDescr(OldHeap); + TupleDesc newTupDesc = RelationGetDescr(NewHeap); + TupleTableSlot *slot; + int natts; + Datum *values; + bool *isnull; + + /* + * We need to log the copied data in WAL iff WAL archiving/streaming is + * enabled AND it's a WAL-logged rel. 
+ */ + use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap); + + /* use_wal off requires smgr_targblock be initially invalid */ + Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber); + + /* Preallocate values/isnull arrays */ + natts = newTupDesc->natts; + values = (Datum *) palloc(natts * sizeof(Datum)); + isnull = (bool *) palloc(natts * sizeof(bool)); + + /* Initialize the rewrite operation */ + rwstate = begin_zheap_rewrite(OldHeap, NewHeap, OldestXmin, FreezeXid, + MultiXactCutoff, use_wal); + + + /* Set up sorting if wanted */ + if (use_sort) + tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex, + maintenance_work_mem, + NULL, false); + else + tuplesort = NULL; + + /* + * Prepare to scan the OldHeap. + * + * We don't have a way to copy visibility information in zheap, so we + * just copy LIVE tuples. See comments atop rewritezheap.c + * + * While scanning, we skip meta and tpd pages (done by *getnext API's) + * which is okay because we mark the tuples as frozen. However, when we + * extend current implementation to copy visibility information of tuples, + * we would require to copy meta page and or TPD page information as well + */ + if (OldIndex != NULL && !use_sort) + { + heapScan = NULL; + indexScan = index_beginscan(OldHeap, OldIndex, GetTransactionSnapshot(), 0, 0); + index_rescan(indexScan, NULL, 0, NULL, 0); + } + else + { + heapScan = table_beginscan(OldHeap, GetTransactionSnapshot(), 0, (ScanKey) NULL); + indexScan = NULL; + } + + slot = table_gimmegimmeslot(OldHeap, NULL); + + /* + * Scan through the OldHeap, either in OldIndex order or sequentially; + * copy each tuple into the NewHeap, or transiently to the tuplesort + * module. Note that we don't bother sorting dead tuples (they won't get + * to the new table anyway). While scanning, we skip meta and tpd pages + * (done by *getnext API's) which is okay because we mark the tuples as + * frozen. However, when we extend current implementation to copy + * visibility information of tuples, we would require to copy meta page + * and or TPD page information as well. + */ + for (;;) + { + CHECK_FOR_INTERRUPTS(); + + if (indexScan != NULL) + { + if (!index_getnext_slot(indexScan, ForwardScanDirection, slot)) + break; + + /* Since we used no scan keys, should never need to recheck */ + if (indexScan->xs_recheck) + elog(ERROR, "CLUSTER does not support lossy index conditions"); + } + else + { + if (!table_scan_getnextslot(heapScan, ForwardScanDirection, slot)) + break; + } + + num_tuples += 1; + if (tuplesort != NULL) + tuplesort_puttupleslot(tuplesort, slot); + else + reform_and_rewrite_ztuple(ExecGetZHeapTupleFromSlot(slot), oldTupDesc, newTupDesc, + values, isnull, rwstate); + } + + if (indexScan != NULL) + index_endscan(indexScan); + if (heapScan != NULL) + table_endscan(heapScan); + + ExecDropSingleTupleTableSlot(slot); + + /* + * In scan-and-sort mode, complete the sort, then read out all live tuples + * from the tuplestore and write them to the new relation. 
+ */ + if (tuplesort != NULL) + { + tuplesort_performsort(tuplesort); + + for (;;) + { + HeapTuple heapTuple; + ZHeapTuple zheapTuple; + + CHECK_FOR_INTERRUPTS(); + + heapTuple = tuplesort_getheaptuple(tuplesort, true); + if (heapTuple == NULL) + break; + + zheapTuple = heap_to_zheap(heapTuple, oldTupDesc); + + reform_and_rewrite_ztuple(zheapTuple, oldTupDesc, newTupDesc, + values, isnull, rwstate); + /* be tidy */ + pfree(zheapTuple); + } + + tuplesort_end(tuplesort); + } + + /* Write out any remaining tuples, and fsync if needed */ + end_zheap_rewrite(rwstate); + + /* Clean up */ + pfree(values); + pfree(isnull); +} + +/* + * Set up an init fork for an unlogged table so that it can be correctly + * reinitialized on restart. An immediate sync is required even if the + * page has been logged, because the write did not go through + * shared_buffers and therefore a concurrent checkpoint may have moved + * the redo pointer past our xlog record. Recovery may as well remove it + * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE + * record. Therefore, logging is necessary even if wal_level=minimal. + */ +static void +zheap_create_init_fork(Relation rel) +{ + Assert(rel->rd_rel->relkind == RELKIND_RELATION || + rel->rd_rel->relkind == RELKIND_MATVIEW || + rel->rd_rel->relkind == RELKIND_TOASTVALUE); + RelationOpenSmgr(rel); + smgrcreate(rel->rd_smgr, INIT_FORKNUM, false); + log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM); + smgrimmedsync(rel->rd_smgr, INIT_FORKNUM); + + /* ZBORKED: This causes separate WAL, which doesn't seem optimal */ + ZheapInitMetaPage(rel, INIT_FORKNUM); +} + +static const TableAmRoutine zheapam_methods = { + .type = T_TableAmRoutine, + + .slot_callbacks = zheapam_slot_callbacks, + + .snapshot_satisfies = zheapam_satisfies, + + .scan_begin = zheap_beginscan, + .scansetlimits = zheap_setscanlimits, + .scan_getnextslot = zheap_getnextslot, + .scan_end = heap_endscan, + .scan_rescan = zheap_rescan, + .scan_update_snapshot = zheap_update_snapshot, + + .scan_bitmap_pagescan = zheap_scan_bitmap_pagescan, + .scan_bitmap_pagescan_next = zheap_scan_bitmap_pagescan_next, + + .scan_sample_next_block = zheap_scan_sample_next_block, + .scan_sample_next_tuple = zheap_scan_sample_next_tuple, + + .tuple_fetch_row_version = zheapam_fetch_row_version, + .tuple_fetch_follow = zheapam_fetch_follow, + .tuple_insert = zheapam_insert, + .tuple_insert_speculative = zheapam_insert_speculative, + .tuple_complete_speculative = zheapam_complete_speculative, + .tuple_delete = zheapam_delete, + .tuple_update = zheapam_update, + .tuple_lock = zheapam_lock_tuple, + .multi_insert = zheap_multi_insert, + + .tuple_get_latest_tid = zheap_get_latest_tid, + + .relation_vacuum = lazy_vacuum_zheap_rel, + .scan_analyze_next_block = zheapam_scan_analyze_next_block, + .scan_analyze_next_tuple = zheapam_scan_analyze_next_tuple, + .relation_copy_for_cluster = zheap_copy_for_cluster, + .relation_create_init_fork = zheap_create_init_fork, + .relation_sync = heap_sync, + + .begin_index_fetch = zheapam_begin_index_fetch, + .reset_index_fetch = zheapam_reset_index_fetch, + .end_index_fetch = zheapam_end_index_fetch, + + .index_build_range_scan = IndexBuildZHeapRangeScan, + .index_validate_scan = validate_index_zheapscan +}; + +Datum +zheap_tableam_handler(PG_FUNCTION_ARGS) +{ + PG_RETURN_POINTER(&zheapam_methods); +} diff --git a/src/backend/access/zheap/zheapamutils.c b/src/backend/access/zheap/zheapamutils.c new file mode 100644 index 0000000000..f4bf32d56b --- /dev/null +++ 
b/src/backend/access/zheap/zheapamutils.c @@ -0,0 +1,121 @@ +/*------------------------------------------------------------------------- + * + * zheapamutils.c + * zheap utility method code + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/zheap/zheapamutils.c + * + * + * INTERFACE ROUTINES + * zheap_to_heap - convert zheap tuple to heap tuple + * + * NOTES + * This file contains utility functions for the zheap + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/htup_details.h" +#include "access/xact.h" +#include "access/zheap.h" +#include "access/zheaputils.h" +#include "storage/bufmgr.h" + +/* + * zheap_to_heap + * + * convert zheap tuple to heap tuple + */ +HeapTuple +zheap_to_heap(ZHeapTuple ztuple, TupleDesc tupDesc) +{ + HeapTuple tuple; + Datum *values = palloc0(sizeof(Datum) * tupDesc->natts); + bool *nulls = palloc0(sizeof(bool) * tupDesc->natts); + + zheap_deform_tuple(ztuple, tupDesc, values, nulls); + tuple = heap_form_tuple(tupDesc, values, nulls); + tuple->t_self = ztuple->t_self; + tuple->t_tableOid = ztuple->t_tableOid; + + pfree(values); + pfree(nulls); + + return tuple; +} + +/* + * zheap_to_heap + * + * convert zheap tuple to a minimal tuple + */ +MinimalTuple +zheap_to_minimal(ZHeapTuple ztuple, TupleDesc tupDesc) +{ + MinimalTuple tuple; + Datum *values = palloc0(sizeof(Datum) * tupDesc->natts); + bool *nulls = palloc0(sizeof(bool) * tupDesc->natts); + + zheap_deform_tuple(ztuple, tupDesc, values, nulls); + tuple = heap_form_minimal_tuple(tupDesc, values, nulls); + + pfree(values); + pfree(nulls); + + return tuple; +} + +/* + * heap_to_zheap + * + * convert heap tuple to zheap tuple + */ +ZHeapTuple +heap_to_zheap(HeapTuple tuple, TupleDesc tupDesc) +{ + ZHeapTuple ztuple; + Datum *values = palloc0(sizeof(Datum) * tupDesc->natts); + bool *nulls = palloc0(sizeof(bool) * tupDesc->natts); + + heap_deform_tuple(tuple, tupDesc, values, nulls); + ztuple = zheap_form_tuple(tupDesc, values, nulls); + ztuple->t_self = tuple->t_self; + ztuple->t_tableOid = tuple->t_tableOid; + + pfree(values); + pfree(nulls); + + return ztuple; +} + +/* ---------------- + * zheap_copytuple + * + * returns a copy of an entire tuple + * + * The ZHeapTuple struct, tuple header, and tuple data are all allocated + * as a single palloc() block. + * ---------------- + */ +ZHeapTuple +zheap_copytuple(ZHeapTuple tuple) +{ + ZHeapTuple newTuple; + + if (!ZHeapTupleIsValid(tuple) || tuple->t_data == NULL) + return NULL; + + newTuple = (ZHeapTuple) palloc(ZHEAPTUPLESIZE + tuple->t_len); + newTuple->t_len = tuple->t_len; + newTuple->t_self = tuple->t_self; + newTuple->t_tableOid = tuple->t_tableOid; + newTuple->t_data = (ZHeapTupleHeader) ((char *) newTuple + ZHEAPTUPLESIZE); + memcpy((char *) newTuple->t_data, (char *) tuple->t_data, tuple->t_len); + return newTuple; +} diff --git a/src/backend/access/zheap/zheapamxlog.c b/src/backend/access/zheap/zheapamxlog.c new file mode 100644 index 0000000000..0caa82d18d --- /dev/null +++ b/src/backend/access/zheap/zheapamxlog.c @@ -0,0 +1,2198 @@ +/*------------------------------------------------------------------------- + * + * zheapamxlog.c + * WAL replay logic for zheap. 
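As a usage illustration of the conversion helpers defined in zheapamutils.c above, here is a minimal round-trip sketch (not part of the patch; ztuple_roundtrip is a hypothetical name, a valid Relation and ZHeapTuple are assumed, and it relies only on calls defined in that file plus heap_freetuple and ItemPointerEquals):

/*
 * Sketch: convert a zheap tuple to a heap tuple and back.  Both
 * converters deform and re-form the tuple and preserve t_self and
 * t_tableOid, so the round trip yields an equivalent tuple.
 */
static ZHeapTuple
ztuple_roundtrip(Relation rel, ZHeapTuple ztup)
{
	TupleDesc	tupdesc = RelationGetDescr(rel);
	HeapTuple	htup = zheap_to_heap(ztup, tupdesc);
	ZHeapTuple	copy = heap_to_zheap(htup, tupdesc);

	Assert(ItemPointerEquals(&copy->t_self, &ztup->t_self));
	heap_freetuple(htup);

	return copy;
}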
+ * + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/zheap/zheapamxlog.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/visibilitymap.h" +#include "access/xlog.h" +#include "access/xlogutils.h" +#include "access/zheap.h" +#include "access/zheapam_xlog.h" +#include "storage/standby.h" +#include "storage/freespace.h" + +static void +zheap_xlog_insert(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr = (xl_undo_header *) XLogRecGetData(record); + xl_zheap_insert *xlrec; + Buffer buffer; + Page page; + union + { + ZHeapTupleHeaderData hdr; + char data[MaxZHeapTupleSize]; + } tbuf; + ZHeapTupleHeader zhtup; + UnpackedUndoRecord undorecord; + UndoRecPtr urecptr = InvalidUndoRecPtr; + xl_zheap_header xlhdr; + uint32 newlen; + RelFileNode target_node; + BlockNumber blkno; + ItemPointerData target_tid; + XLogRedoAction action; + int *tpd_trans_slot_id = NULL; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + bool skip_undo; + + xlrec = (xl_zheap_insert *) ((char *) xlundohdr + SizeOfUndoHeader); + if (xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + tpd_trans_slot_id = (int *) ((char *) xlrec + SizeOfZHeapInsert); + + XLogRecGetBlockTag(record, 0, &target_node, NULL, &blkno); + ItemPointerSetBlockNumber(&target_tid, blkno); + ItemPointerSetOffsetNumber(&target_tid, xlrec->offnum); + + /* + * The visibility map may need to be fixed even if the heap page is + * already up-to-date. + */ + if (xlrec->flags & XLZ_INSERT_ALL_VISIBLE_CLEARED) + { + Relation reln = CreateFakeRelcacheEntry(target_node); + Buffer vmbuffer = InvalidBuffer; + + visibilitymap_pin(reln, blkno, &vmbuffer); + visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + FreeFakeRelcacheEntry(reln); + } + + /* + * We can skip inserting undo records if the tuples are to be marked + * as frozen. + */ + skip_undo = (xlrec->flags & XLZ_INSERT_IS_FROZEN); + + if (!skip_undo) + { + /* prepare an undo record */ + undorecord.uur_type = UNDO_INSERT; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = xlundohdr->reloid; + undorecord.uur_prevxid = FrozenTransactionId; + undorecord.uur_xid = xid; + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = xlundohdr->blkprev; + undorecord.uur_block = ItemPointerGetBlockNumber(&target_tid); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&target_tid); + undorecord.uur_payload.len = 0; + undorecord.uur_tuple.len = 0; + + /* + * For speculative insertions, we store the dummy speculative token in the + * undorecord so that, the size of undorecord in DO function matches with + * the size of undorecord in REDO function. This ensures that, for INSERT + * ... ON CONFLICT statements, the assert condition used later in this + * function to ensure that the undo pointer in DO and REDO function remains + * the same is true. However, it might not be useful in the REDO function as + * it is just required in the master node to detect conflicts for insert ... + * on conflict. + * + * Fixme - Once we have undo consistency checker that we can remove the + * assertion as well dummy speculative token. 
+ */ + if (xlrec->flags & XLZ_INSERT_IS_SPECULATIVE) + { + uint32 dummy_specToken = 1; + + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *)&dummy_specToken, + sizeof(uint32)); + } + else + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + InsertPreparedUndo(); + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). + */ + + Assert(urecptr == xlundohdr->urec_ptr); + } + + /* + * If we inserted the first and only tuple on the page, re-initialize the + * page from scratch. + */ + if (XLogRecGetInfo(record) & XLOG_ZHEAP_INIT_PAGE) + { + /* It is asked for page init, insert should not have tpd slot. */ + Assert(!(xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT)); + buffer = XLogInitBufferForRedo(record, 0); + page = BufferGetPage(buffer); + ZheapInitPage(page, BufferGetPageSize(buffer)); + action = BLK_NEEDS_REDO; + } + else + action = XLogReadBufferForRedo(record, 0, &buffer); + if (action == BLK_NEEDS_REDO) + { + Size datalen; + char *data; + int trans_slot_id; + + page = BufferGetPage(buffer); + + if (PageGetMaxOffsetNumber(page) + 1 < xlrec->offnum) + elog(PANIC, "invalid max offset number"); + + data = XLogRecGetBlockData(record, 0, &datalen); + + newlen = datalen - SizeOfZHeapHeader; + Assert(datalen > SizeOfZHeapHeader && newlen <= MaxZHeapTupleSize); + memcpy((char *) &xlhdr, data, SizeOfZHeapHeader); + data += SizeOfZHeapHeader; + + zhtup = &tbuf.hdr; + MemSet((char *) zhtup, 0, SizeofZHeapTupleHeader); + /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */ + memcpy((char *) zhtup + SizeofZHeapTupleHeader, + data, + newlen); + newlen += SizeofZHeapTupleHeader; + zhtup->t_infomask2 = xlhdr.t_infomask2; + zhtup->t_infomask = xlhdr.t_infomask; + zhtup->t_hoff = xlhdr.t_hoff; + + if (ZPageAddItem(buffer, NULL, (Item) zhtup, newlen, xlrec->offnum, + true, true, true) == InvalidOffsetNumber) + elog(PANIC, "failed to add tuple"); + + if (!skip_undo) + { + if (tpd_trans_slot_id) + trans_slot_id = *tpd_trans_slot_id; + else + trans_slot_id = ZHeapTupleHeaderGetXactSlot(zhtup); + + PageSetUNDO(undorecord, buffer, trans_slot_id, false, xid_epoch, + xid, urecptr, NULL, 0); + } + + PageSetLSN(page, lsn); + + MarkBufferDirty(buffer); + } + + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + /* We can't have a valid transaction slot when we are skipping undo. */ + Assert(!skip_undo); + + /* + * We need to replay the record for TPD only when this record contains + * slot from TPD. 
+ */ + Assert(xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT); + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + TPDPageSetUndo(buffer, + *tpd_trans_slot_id, + true, + xid_epoch, + xid, + urecptr, + &undorecord.uur_offset, + 1); + TPDPageSetLSN(BufferGetPage(buffer), lsn); + } + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); +} + +static void +zheap_xlog_delete(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr = (xl_undo_header *) XLogRecGetData(record); + Size recordlen = XLogRecGetDataLen(record); + xl_zheap_delete *xlrec; + Buffer buffer; + Page page; + ZHeapTupleData zheaptup; + UnpackedUndoRecord undorecord; + UndoRecPtr urecptr; + RelFileNode target_node; + BlockNumber blkno; + ItemPointerData target_tid; + XLogRedoAction action; + Relation reln; + ItemId lp = NULL; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + int *tpd_trans_slot_id = NULL; + bool hasPayload = false; + + xlrec = (xl_zheap_delete *) ((char *) xlundohdr + SizeOfUndoHeader); + if (xlrec->flags & XLZ_DELETE_CONTAINS_TPD_SLOT) + tpd_trans_slot_id = (int *) ((char *) xlrec + SizeOfZHeapDelete); + + XLogRecGetBlockTag(record, 0, &target_node, NULL, &blkno); + ItemPointerSetBlockNumber(&target_tid, blkno); + ItemPointerSetOffsetNumber(&target_tid, xlrec->offnum); + + reln = CreateFakeRelcacheEntry(target_node); + + /* + * The visibility map may need to be fixed even if the heap page is + * already up-to-date. + */ + if (xlrec->flags & XLZ_DELETE_ALL_VISIBLE_CLEARED) + { + Buffer vmbuffer = InvalidBuffer; + + visibilitymap_pin(reln, blkno, &vmbuffer); + visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + } + + action = XLogReadBufferForRedo(record, 0, &buffer); + + page = BufferGetPage(buffer); + + if (PageGetMaxOffsetNumber(page) >= xlrec->offnum) + lp = PageGetItemId(page, xlrec->offnum); + + if (PageGetMaxOffsetNumber(page) < xlrec->offnum || !ItemIdIsNormal(lp)) + elog(PANIC, "invalid lp"); + + zheaptup.t_tableOid = RelationGetRelid(reln); + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + zheaptup.t_self = target_tid; + + /* + * If the WAL stream contains undo tuple, then replace it with the + * explicitly stored tuple. 
+ */ + if (xlrec->flags & XLZ_HAS_DELETE_UNDOTUPLE) + { + char *data; + xl_zheap_header xlhdr; + union + { + ZHeapTupleHeaderData hdr; + char data[MaxZHeapTupleSize]; + } tbuf; + ZHeapTupleHeader zhtup; + Size datalen; + + if (xlrec->flags & XLZ_DELETE_CONTAINS_TPD_SLOT) + { + data = (char *) xlrec + SizeOfZHeapDelete + + sizeof(*tpd_trans_slot_id); + datalen = recordlen - SizeOfUndoHeader - SizeOfZHeapDelete - + SizeOfZHeapHeader - sizeof(*tpd_trans_slot_id); + } + else + { + data = (char *) xlrec + SizeOfZHeapDelete; + datalen = recordlen - SizeOfUndoHeader - SizeOfZHeapDelete - + SizeOfZHeapHeader; + } + memcpy((char *) &xlhdr, data, SizeOfZHeapHeader); + data += SizeOfZHeapHeader; + + zhtup = &tbuf.hdr; + MemSet((char *) zhtup, 0, SizeofZHeapTupleHeader); + /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */ + memcpy((char *) zhtup + SizeofZHeapTupleHeader, + data, + datalen); + datalen += SizeofZHeapTupleHeader; + zhtup->t_infomask2 = xlhdr.t_infomask2; + zhtup->t_infomask = xlhdr.t_infomask; + zhtup->t_hoff = xlhdr.t_hoff; + + zheaptup.t_data = zhtup; + zheaptup.t_len = datalen; + } + + /* prepare an undo record */ + undorecord.uur_type = UNDO_DELETE; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = xlundohdr->reloid; + undorecord.uur_prevxid = xlrec->prevxid; + undorecord.uur_xid = xid; + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = xlundohdr->blkprev; + undorecord.uur_block = ItemPointerGetBlockNumber(&target_tid); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&target_tid); + + initStringInfo(&undorecord.uur_tuple); + + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_len, + sizeof(uint32)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_self, + sizeof(ItemPointerData)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_tableOid, + sizeof(Oid)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) zheaptup.t_data, + zheaptup.t_len); + + if (xlrec->flags & XLZ_DELETE_CONTAINS_TPD_SLOT) + { + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) tpd_trans_slot_id, + sizeof(*tpd_trans_slot_id)); + hasPayload = true; + } + + /* + * For sub-tranasctions, we store the dummy contains subxact token in the + * undorecord so that, the size of undorecord in DO function matches with + * the size of undorecord in REDO function. This ensures that, for + * sub-transactions, the assert condition used later in this + * function to ensure that the undo pointer in DO and REDO function remains + * the same is true. + */ + if (xlrec->flags & XLZ_DELETE_CONTAINS_SUBXACT) + { + SubTransactionId dummy_subXactToken = 1; + + if (!hasPayload) + { + initStringInfo(&undorecord.uur_payload); + hasPayload = true; + } + + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &dummy_subXactToken, + sizeof(SubTransactionId)); + } + + if (!hasPayload) + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + InsertPreparedUndo(); + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). 
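The appendBinaryStringInfo() sequence above packs the old tuple as length, TID, table OID and then the raw tuple data; the update replay below repeats the same layout. A sketch of the shared shape, with append_ztuple_to_undo as a hypothetical helper name (not part of the patch):

static void
append_ztuple_to_undo(UnpackedUndoRecord *urec, ZHeapTuple ztup)
{
	/* Pack the tuple in the layout the DO and REDO paths both expect. */
	initStringInfo(&urec->uur_tuple);
	appendBinaryStringInfo(&urec->uur_tuple,
						   (char *) &ztup->t_len, sizeof(uint32));
	appendBinaryStringInfo(&urec->uur_tuple,
						   (char *) &ztup->t_self, sizeof(ItemPointerData));
	appendBinaryStringInfo(&urec->uur_tuple,
						   (char *) &ztup->t_tableOid, sizeof(Oid));
	appendBinaryStringInfo(&urec->uur_tuple,
						   (char *) ztup->t_data, ztup->t_len);
}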
+ */ + Assert (urecptr == xlundohdr->urec_ptr); + + if (action == BLK_NEEDS_REDO) + { + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + ZHeapTupleHeaderSetXactSlot(zheaptup.t_data, xlrec->trans_slot_id); + zheaptup.t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + zheaptup.t_data->t_infomask = xlrec->infomask; + + if (xlrec->flags & XLZ_DELETE_IS_PARTITION_MOVE) + ZHeapTupleHeaderSetMovedPartitions(zheaptup.t_data); + + PageSetUNDO(undorecord, buffer, xlrec->trans_slot_id, + false, xid_epoch, xid, urecptr, NULL, 0); + + /* Mark the page as a candidate for pruning */ + ZPageSetPrunable(page, XLogRecGetXid(record)); + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + TPDPageSetUndo(buffer, + xlrec->trans_slot_id, + true, + xid_epoch, + xid, + urecptr, + &undorecord.uur_offset, + 1); + TPDPageSetLSN(page, lsn); + } + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + /* be tidy */ + pfree(undorecord.uur_tuple.data); + if (undorecord.uur_payload.len > 0) + pfree(undorecord.uur_payload.data); + + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + FreeFakeRelcacheEntry(reln); +} + +static void +zheap_xlog_update(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr; + xl_undo_header *xlnewundohdr = NULL; + xl_zheap_header xlhdr; + Size recordlen; + Size freespace = 0; + xl_zheap_update *xlrec; + Buffer oldbuffer, newbuffer; + Page oldpage, newpage; + ZHeapTupleData oldtup; + ZHeapTupleHeader newtup; + union + { + ZHeapTupleHeaderData hdr; + char data[MaxZHeapTupleSize]; + } tbuf; + UnpackedUndoRecord undorecord, newundorecord; + UndoRecPtr urecptr = InvalidUndoRecPtr; + UndoRecPtr newurecptr = InvalidUndoRecPtr; + RelFileNode rnode; + BlockNumber oldblk, newblk; + ItemPointerData oldtid, newtid; + XLogRedoAction oldaction, newaction; + Relation reln; + ItemId lp = NULL; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + int *old_tup_trans_slot_id = NULL; + int *new_trans_slot_id = NULL; + int trans_slot_id; + bool inplace_update; + + xlundohdr = (xl_undo_header *) XLogRecGetData(record); + xlrec = (xl_zheap_update *) ((char *) xlundohdr + SizeOfUndoHeader); + recordlen = XLogRecGetDataLen(record); + + if (xlrec->flags & XLZ_UPDATE_OLD_CONTAINS_TPD_SLOT) + { + old_tup_trans_slot_id = (int *) ((char *) xlrec + SizeOfZHeapUpdate); + } + if (xlrec->flags & XLZ_NON_INPLACE_UPDATE) + { + inplace_update = false; + if (old_tup_trans_slot_id) + xlnewundohdr = (xl_undo_header *) ((char *) old_tup_trans_slot_id + sizeof(*old_tup_trans_slot_id)); + else + xlnewundohdr = (xl_undo_header *) ((char *) xlrec + SizeOfZHeapUpdate); + + if (xlrec->flags & XLZ_UPDATE_NEW_CONTAINS_TPD_SLOT) + new_trans_slot_id = (int *) ((char *) xlnewundohdr + SizeOfUndoHeader); + } + else + { + inplace_update = true; + } + + XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk); + if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk)) + { + /* inplace updates are never done across pages */ + Assert(!inplace_update); + } + else + oldblk = newblk; + + ItemPointerSet(&oldtid, oldblk, xlrec->old_offnum); + ItemPointerSet(&newtid, newblk, xlrec->new_offnum); + + reln = CreateFakeRelcacheEntry(rnode); + + /* + * The visibility map may need to be fixed even if the zheap page is + * already up-to-date. 
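The pin/clear/release sequence used here also appears in the insert and delete replay routines above, and again for the new page further down. A sketch of how it could be shared, with zheap_redo_clear_vm as a hypothetical name (not part of the patch; it uses only calls already present in these routines):

static void
zheap_redo_clear_vm(Relation reln, BlockNumber blkno)
{
	Buffer		vmbuffer = InvalidBuffer;

	/* Clear all visibility map bits for the affected block during redo. */
	visibilitymap_pin(reln, blkno, &vmbuffer);
	visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
	ReleaseBuffer(vmbuffer);
}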
+ */ + if (xlrec->flags & XLZ_UPDATE_OLD_ALL_VISIBLE_CLEARED) + { + Buffer vmbuffer = InvalidBuffer; + + visibilitymap_pin(reln, oldblk, &vmbuffer); + visibilitymap_clear(reln, oldblk, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + } + + oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1, &oldbuffer); + + oldpage = BufferGetPage(oldbuffer); + + if (PageGetMaxOffsetNumber(oldpage) >= xlrec->old_offnum) + lp = PageGetItemId(oldpage, xlrec->old_offnum); + + if (PageGetMaxOffsetNumber(oldpage) < xlrec->old_offnum || !ItemIdIsNormal(lp)) + elog(PANIC, "invalid lp"); + + oldtup.t_tableOid = RelationGetRelid(reln); + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(oldpage, lp); + oldtup.t_len = ItemIdGetLength(lp); + oldtup.t_self = oldtid; + + /* + * If the WAL stream contains undo tuple, then replace it with the + * explicitly stored tuple. + */ + if (xlrec->flags & XLZ_HAS_UPDATE_UNDOTUPLE) + { + ZHeapTupleHeader zhtup; + Size datalen; + char *data; + + /* There is an additional undo header for non-inplace-update. */ + if (inplace_update) + { + if (old_tup_trans_slot_id) + { + data = (char *) ((char *) old_tup_trans_slot_id + sizeof(*old_tup_trans_slot_id)); + datalen = recordlen - SizeOfUndoHeader - SizeOfZHeapUpdate - + sizeof(*old_tup_trans_slot_id) - SizeOfZHeapHeader; + } + else + { + data = (char *) xlrec + SizeOfZHeapUpdate; + datalen = recordlen - SizeOfUndoHeader - SizeOfZHeapUpdate - SizeOfZHeapHeader; + } + } + else + { + if (old_tup_trans_slot_id && new_trans_slot_id) + { + datalen = recordlen - (2 * SizeOfUndoHeader) - SizeOfZHeapUpdate - + sizeof(*old_tup_trans_slot_id) - sizeof(*new_trans_slot_id) - + SizeOfZHeapHeader; + data = (char *) ((char *) new_trans_slot_id + sizeof(*new_trans_slot_id)); + } + else if (new_trans_slot_id) + { + datalen = recordlen - (2 * SizeOfUndoHeader) - SizeOfZHeapUpdate - + sizeof(*new_trans_slot_id) - SizeOfZHeapHeader; + data = (char *) ((char *) new_trans_slot_id + sizeof(*new_trans_slot_id)); + } + else if (old_tup_trans_slot_id) + { + datalen = recordlen - (2 * SizeOfUndoHeader) - SizeOfZHeapUpdate - + sizeof(*old_tup_trans_slot_id) - SizeOfZHeapHeader; + data = (char *) xlnewundohdr + SizeOfUndoHeader; + } + else + { + datalen = recordlen - (2 * SizeOfUndoHeader) - SizeOfZHeapUpdate - + SizeOfZHeapHeader; + data = (char *) xlnewundohdr + SizeOfUndoHeader; + } + } + + memcpy((char *) &xlhdr, data, SizeOfZHeapHeader); + data += SizeOfZHeapHeader; + + zhtup = &tbuf.hdr; + MemSet((char *) zhtup, 0, SizeofZHeapTupleHeader); + /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */ + memcpy((char *) zhtup + SizeofZHeapTupleHeader, + data, + datalen); + datalen += SizeofZHeapTupleHeader; + zhtup->t_infomask2 = xlhdr.t_infomask2; + zhtup->t_infomask = xlhdr.t_infomask; + zhtup->t_hoff = xlhdr.t_hoff; + + oldtup.t_data = zhtup; + oldtup.t_len = datalen; + } + + /* prepare an undo record */ + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = xlundohdr->reloid; + undorecord.uur_prevxid = xlrec->prevxid; + undorecord.uur_xid = xid; + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = xlundohdr->blkprev; + undorecord.uur_block = ItemPointerGetBlockNumber(&oldtid); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&oldtid); + undorecord.uur_payload.len = 0; + + initStringInfo(&undorecord.uur_tuple); + + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &oldtup.t_len, + sizeof(uint32)); + appendBinaryStringInfo(&undorecord.uur_tuple, 
+ (char *) &oldtup.t_self, + sizeof(ItemPointerData)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &oldtup.t_tableOid, + sizeof(Oid)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) oldtup.t_data, + oldtup.t_len); + + if (inplace_update) + { + bool hasPayload = false; + + undorecord.uur_type = UNDO_INPLACE_UPDATE; + if (old_tup_trans_slot_id) + { + Assert(*old_tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS); + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) old_tup_trans_slot_id, + sizeof(*old_tup_trans_slot_id)); + hasPayload = true; + } + + /* + * For sub-tranasctions, we store the dummy contains subxact token in the + * undorecord so that, the size of undorecord in DO function matches with + * the size of undorecord in REDO function. This ensures that, for + * sub-transactions, the assert condition used later in this + * function to ensure that the undo pointer in DO and REDO function remains + * the same is true. + */ + if (xlrec->flags & XLZ_UPDATE_CONTAINS_SUBXACT) + { + SubTransactionId dummy_subXactToken = 1; + + if (!hasPayload) + { + initStringInfo(&undorecord.uur_payload); + hasPayload = true; + } + + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &dummy_subXactToken, + sizeof(SubTransactionId)); + } + + if (!hasPayload) + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + } + else + { + UnpackedUndoRecord undorec[2]; + + undorecord.uur_type = UNDO_UPDATE; + initStringInfo(&undorecord.uur_payload); + /* update new tuple location in undo record */ + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &newtid, + sizeof(ItemPointerData)); + /* add the TPD slot id */ + if (old_tup_trans_slot_id) + { + Assert(*old_tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) old_tup_trans_slot_id, + sizeof(*old_tup_trans_slot_id)); + } + + /* + * For sub-tranasctions, we store the dummy contains subxact token in the + * undorecord so that, the size of undorecord in DO function matches with + * the size of undorecord in REDO function. This ensures that, for + * sub-transactions, the assert condition used later in this + * function to ensure that the undo pointer in DO and REDO function remains + * the same is true. 
+ */ + if (xlrec->flags & XLZ_UPDATE_CONTAINS_SUBXACT) + { + SubTransactionId dummy_subXactToken = 1; + + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &dummy_subXactToken, + sizeof(SubTransactionId)); + } + + /* prepare an undo record for new tuple */ + newundorecord.uur_type = UNDO_INSERT; + newundorecord.uur_info = 0; + newundorecord.uur_prevlen = 0; + newundorecord.uur_reloid = xlnewundohdr->reloid; + newundorecord.uur_prevxid = xid; + newundorecord.uur_xid = xid; + newundorecord.uur_cid = FirstCommandId; + newundorecord.uur_fork = MAIN_FORKNUM; + newundorecord.uur_blkprev = xlnewundohdr->blkprev; + newundorecord.uur_block = ItemPointerGetBlockNumber(&newtid); + newundorecord.uur_offset = ItemPointerGetOffsetNumber(&newtid); + newundorecord.uur_tuple.len = 0; + + if (new_trans_slot_id) + { + Assert(*new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS); + initStringInfo(&newundorecord.uur_payload); + appendBinaryStringInfo(&newundorecord.uur_payload, + (char *) new_trans_slot_id, + sizeof(*new_trans_slot_id)); + } + else + newundorecord.uur_payload.len = 0; + + undorec[0] = undorecord; + undorec[1] = newundorecord; + + UndoSetPrepareSize(undorec, 2, xid, UNDO_PERMANENT, NULL); + undorecord = undorec[0]; + newundorecord = undorec[1]; + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + newurecptr = PrepareUndoInsert(&newundorecord, xid, UNDO_PERMANENT, NULL); + + Assert (newurecptr == xlnewundohdr->urec_ptr); + } + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). + */ + Assert (urecptr == xlundohdr->urec_ptr); + + InsertPreparedUndo(); + + /* Ensure old tuple points to the tuple in page. */ + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(oldpage, lp); + oldtup.t_len = ItemIdGetLength(lp); + + /* First deal with old tuple */ + if (oldaction == BLK_NEEDS_REDO) + { + oldtup.t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + oldtup.t_data->t_infomask = xlrec->old_infomask; + ZHeapTupleHeaderSetXactSlot(oldtup.t_data, xlrec->old_trans_slot_id); + + if (oldblk != newblk) + PageSetUNDO(undorecord, oldbuffer, xlrec->old_trans_slot_id, + false, xid_epoch, xid, urecptr, NULL, + 0); + + /* Mark the page as a candidate for pruning */ + if (!inplace_update) + ZPageSetPrunable(oldpage, XLogRecGetXid(record)); + + PageSetLSN(oldpage, lsn); + MarkBufferDirty(oldbuffer); + } + + /* + * Read the page the new tuple goes into, if different from old. + */ + if (oldblk == newblk) + { + newbuffer = oldbuffer; + newaction = oldaction; + } + else if (XLogRecGetInfo(record) & XLOG_ZHEAP_INIT_PAGE) + { + newbuffer = XLogInitBufferForRedo(record, 0); + newpage = (Page) BufferGetPage(newbuffer); + ZheapInitPage(newpage, BufferGetPageSize(newbuffer)); + newaction = BLK_NEEDS_REDO; + } + else + newaction = XLogReadBufferForRedo(record, 0, &newbuffer); + + newpage = BufferGetPage(newbuffer); + + /* + * The visibility map may need to be fixed even if the zheap page is + * already up-to-date. 
+ */ + if (xlrec->flags & XLZ_UPDATE_NEW_ALL_VISIBLE_CLEARED) + { + Buffer vmbuffer = InvalidBuffer; + + visibilitymap_pin(reln, newblk, &vmbuffer); + visibilitymap_clear(reln, newblk, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + } + + if (newaction == BLK_NEEDS_REDO) + { + uint16 prefixlen = 0, + suffixlen = 0; + char *newp; + char *recdata; + char *recdata_end; + Size datalen; + Size tuplen; + uint32 newlen; + + if (PageGetMaxOffsetNumber(newpage) + 1 < xlrec->new_offnum) + elog(PANIC, "invalid max offset number"); + + recdata = XLogRecGetBlockData(record, 0, &datalen); + recdata_end = recdata + datalen; + + if (xlrec->flags & XLZ_UPDATE_PREFIX_FROM_OLD) + { + Assert(newblk == oldblk); + memcpy(&prefixlen, recdata, sizeof(uint16)); + recdata += sizeof(uint16); + } + if (xlrec->flags & XLZ_UPDATE_SUFFIX_FROM_OLD) + { + Assert(newblk == oldblk); + memcpy(&suffixlen, recdata, sizeof(uint16)); + recdata += sizeof(uint16); + } + + memcpy((char *) &xlhdr, recdata, SizeOfZHeapHeader); + recdata += SizeOfZHeapHeader; + + tuplen = recdata_end - recdata; + Assert(tuplen <= MaxZHeapTupleSize); + + newtup = &tbuf.hdr; + MemSet((char *) newtup, 0, SizeofZHeapTupleHeader); + + /* + * Reconstruct the new tuple using the prefix and/or suffix from the + * old tuple, and the data stored in the WAL record. + */ + newp = (char *) newtup + SizeofZHeapTupleHeader; + if (prefixlen > 0) + { + int len; + + /* copy bitmap [+ padding] [+ oid] from WAL record */ + len = xlhdr.t_hoff - SizeofZHeapTupleHeader; + memcpy(newp, recdata, len); + recdata += len; + newp += len; + + /* copy prefix from old tuple */ + memcpy(newp, (char *) oldtup.t_data + oldtup.t_data->t_hoff, prefixlen); + newp += prefixlen; + + /* copy new tuple data from WAL record */ + len = tuplen - (xlhdr.t_hoff - SizeofZHeapTupleHeader); + memcpy(newp, recdata, len); + recdata += len; + newp += len; + } + else + { + /* + * copy bitmap [+ padding] [+ oid] + data from record, all in one + * go + */ + memcpy(newp, recdata, tuplen); + recdata += tuplen; + newp += tuplen; + } + Assert(recdata == recdata_end); + + /* copy suffix from old tuple */ + if (suffixlen > 0) + memcpy(newp, (char *) oldtup.t_data + oldtup.t_len - suffixlen, suffixlen); + + newlen = SizeofZHeapTupleHeader + tuplen + prefixlen + suffixlen; + newtup->t_infomask2 = xlhdr.t_infomask2; + newtup->t_infomask = xlhdr.t_infomask; + newtup->t_hoff = xlhdr.t_hoff; + if (new_trans_slot_id) + trans_slot_id = *new_trans_slot_id; + else + trans_slot_id = ZHeapTupleHeaderGetXactSlot(newtup); + + if (inplace_update) + { + /* + * For inplace updates, we copy the entire data portion including the + * tuple header. + */ + ItemIdChangeLen(lp, newlen); + memcpy((char *) oldtup.t_data, (char *) newtup, newlen); + + if (newlen < oldtup.t_len) + { + /* new tuple is smaller, a prunable cadidate */ + Assert (oldpage == newpage); + ZPageSetPrunable(newpage, XLogRecGetXid(record)); + } + + PageSetUNDO(undorecord, newbuffer, xlrec->old_trans_slot_id, + false, xid_epoch, xid, urecptr, + NULL, 0); + } + else + { + if (ZPageAddItem(newbuffer, NULL, (Item) newtup, newlen, xlrec->new_offnum, + true, true, true) == InvalidOffsetNumber) + elog(PANIC, "failed to add tuple"); + PageSetUNDO((newbuffer == oldbuffer) ? 
undorecord : newundorecord, + newbuffer, trans_slot_id, false, xid_epoch, xid, + newurecptr, NULL, 0); + } + + freespace = PageGetHeapFreeSpace(newpage); /* needed to update FSM below */ + + PageSetLSN(newpage, lsn); + MarkBufferDirty(newbuffer); + } + + /* replay the record for tpd buffer corresponding to oldbuf */ + if (XLogRecHasBlockRef(record, 2)) + { + if (XLogReadTPDBuffer(record, 2) == BLK_NEEDS_REDO) + { + OffsetNumber usedoff[2]; + int ucnt; + + if (!inplace_update && newbuffer == oldbuffer) + { + usedoff[0] = undorecord.uur_offset; + usedoff[1] = newundorecord.uur_offset; + ucnt = 2; + } + else + { + usedoff[0] = undorecord.uur_offset; + ucnt = 1; + } + if (xlrec->old_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + if (inplace_update) + { + TPDPageSetUndo(oldbuffer, + xlrec->old_trans_slot_id, + true, + xid_epoch, + xid, + urecptr, + usedoff, + ucnt); + } + else + { + TPDPageSetUndo(oldbuffer, + xlrec->old_trans_slot_id, + true, + xid_epoch, + xid, + (oldblk == newblk) ? newurecptr : urecptr, + usedoff, + ucnt); + } + TPDPageSetLSN(oldpage, lsn); + } + } + } + + /* replay the record for tpd buffer corresponding to newbuf */ + if (XLogRecHasBlockRef(record, 3)) + { + if (XLogReadTPDBuffer(record, 3) == BLK_NEEDS_REDO) + { + TPDPageSetUndo(newbuffer, + *new_trans_slot_id, + true, + xid_epoch, + xid, + newurecptr, + &newundorecord.uur_offset, + 1); + TPDPageSetLSN(newpage, lsn); + } + } + else if (new_trans_slot_id && (*new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS)) + { + TPDPageSetUndo(newbuffer, + *new_trans_slot_id, + true, + xid_epoch, + xid, + newurecptr, + &newundorecord.uur_offset, + 1); + TPDPageSetLSN(newpage, lsn); + } + if (BufferIsValid(newbuffer) && newbuffer != oldbuffer) + UnlockReleaseBuffer(newbuffer); + if (BufferIsValid(oldbuffer)) + UnlockReleaseBuffer(oldbuffer); + + /* be tidy */ + pfree(undorecord.uur_tuple.data); + if (undorecord.uur_payload.len > 0) + pfree(undorecord.uur_payload.data); + + if (!inplace_update && newundorecord.uur_payload.len > 0) + pfree(newundorecord.uur_payload.data); + + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + FreeFakeRelcacheEntry(reln); + + /* + * Update the freespace. We don't need to update it for inplace updates as + * they won't freeup any space or consume any extra space assuming the new + * tuple is about the same size as the old one. See heap_xlog_update. + */ + if (newaction == BLK_NEEDS_REDO && !inplace_update && freespace < BLCKSZ / 5) + XLogRecordPageWithFreeSpace(rnode, newblk, freespace); +} + +static void +zheap_xlog_freeze_xact_slot(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer buffer; + Page page; + xl_zheap_freeze_xact_slot *xlrec = + (xl_zheap_freeze_xact_slot *) XLogRecGetData(record); + XLogRedoAction action, tpdaction = -1; + int *frozen; + int i; + bool hasTPDSlot = false; + + /* There must be some frozen slots.*/ + Assert(xlrec->nFrozen > 0); + + /* + * In Hot Standby mode, ensure that no running query conflicts with the + * frozen xids. + */ + if (InHotStandby) + { + RelFileNode rnode; + + /* + * FIXME: We need some handling for transaction wraparound. 
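+ * (Presumably the concern is that lastestFrozenXid is compared against
+ * standby snapshots as a plain TransactionId, without its epoch, so a
+ * comparison that spans a wraparound could cancel the wrong queries or
+ * none at all.)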
+ */ + TransactionId lastestFrozenXid = xlrec->lastestFrozenXid; + + XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL); + ResolveRecoveryConflictWithSnapshot(lastestFrozenXid, rnode); + } + + frozen = (int *) ((char *) xlrec + SizeOfZHeapFreezeXactSlot); + + action = XLogReadBufferForRedo(record, 0, &buffer); + if (XLogRecHasBlockRef(record, 1)) + { + tpdaction = XLogReadTPDBuffer(record, 1); + hasTPDSlot = true; + } + + page = BufferGetPage(buffer); + + if (action == BLK_NEEDS_REDO) + { + ZHeapPageOpaque opaque; + int slot_no; + if (hasTPDSlot) + { + zheap_freeze_or_invalidate_tuples(buffer, xlrec->nFrozen, frozen, + true, true); + } + else + { + zheap_freeze_or_invalidate_tuples(buffer, xlrec->nFrozen, frozen, + true, false); + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + /* Initialize the frozen slots. */ + for (i = 0; i < xlrec->nFrozen; i++) + { + slot_no = frozen[i]; + + opaque->transinfo[slot_no].xid_epoch = 0; + opaque->transinfo[slot_no].xid = InvalidTransactionId; + opaque->transinfo[slot_no].urec_ptr = InvalidUndoRecPtr; + } + } + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + + if (tpdaction == BLK_NEEDS_REDO) + { + /* Initialize the frozen slots. */ + for (i = 0; i < xlrec->nFrozen; i++) + { + int tpd_slot_id; + + /* Calculate the actual slot no. */ + tpd_slot_id = frozen[i] + ZHEAP_PAGE_TRANS_SLOTS + 1; + + /* Clear slot information from the TPD slot. */ + TPDPageSetTransactionSlotInfo(buffer, tpd_slot_id, 0, + InvalidTransactionId, + InvalidUndoRecPtr); + } + + TPDPageSetLSN(page, lsn); + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + UnlockReleaseTPDBuffers(); +} + +static void +zheap_xlog_invalid_xact_slot(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer buffer; + Page page; + char *data = XLogRecGetData(record); + uint16 nCompletedSlots; + XLogRedoAction action, tpdaction = -1; + int *completed_slots; + int i; + bool hasTPDSlot = false; + + nCompletedSlots = *(uint16 *) data; + + /* There must be some frozen slots.*/ + Assert(nCompletedSlots > 0); + + completed_slots = (int *) ((char *) data + sizeof(uint16)); + + action = XLogReadBufferForRedo(record, 0, &buffer); + if (XLogRecHasBlockRef(record, 1)) + { + tpdaction = XLogReadTPDBuffer(record, 1); + hasTPDSlot = true; + } + page = BufferGetPage(buffer); + + if (action == BLK_NEEDS_REDO) + { + ZHeapPageOpaque opaque; + int slot_no; + + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + /* clear the transaction slot info on tuples. */ + if (hasTPDSlot) + { + zheap_freeze_or_invalidate_tuples(buffer, nCompletedSlots, + completed_slots, false, true); + } + else + { + zheap_freeze_or_invalidate_tuples(buffer, nCompletedSlots, + completed_slots, false, false); + + /* Clear xid from the slots. */ + for (i = 0; i < nCompletedSlots; i++) + { + slot_no = completed_slots[i]; + opaque->transinfo[slot_no].xid_epoch = 0; + opaque->transinfo[slot_no].xid = InvalidTransactionId; + } + } + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + if (tpdaction == BLK_NEEDS_REDO) + { + TransInfo *tpd_slots; + + /* + * Read TPD slot array. So that we can keep the slot urec_ptr + * intact while clearing the transaction id from the slot. + */ + tpd_slots = TPDPageGetTransactionSlots(NULL, buffer, + InvalidOffsetNumber, + true, false, NULL, NULL, + NULL, NULL, NULL); + + for (i = 0; i < nCompletedSlots; i++) + { + int tpd_slot_id; + + /* Calculate the actual slot no. 
*/ + tpd_slot_id = completed_slots[i] + ZHEAP_PAGE_TRANS_SLOTS + 1; + + /* Clear the XID information from the TPD. */ + TPDPageSetTransactionSlotInfo(buffer, tpd_slot_id, 0, + InvalidTransactionId, + tpd_slots[completed_slots[i]].urec_ptr); + } + + TPDPageSetLSN(page, lsn); + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + UnlockReleaseTPDBuffers(); +} + +static void +zheap_xlog_lock(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr = (xl_undo_header *) XLogRecGetData(record); + xl_zheap_lock *xlrec; + Buffer buffer; + Page page; + ZHeapTupleData zheaptup; + char *tup_hdr; + UnpackedUndoRecord undorecord; + UndoRecPtr urecptr; + RelFileNode target_node; + BlockNumber blkno; + ItemPointerData target_tid; + XLogRedoAction action; + Relation reln; + ItemId lp = NULL; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + int *trans_slot_for_urec = NULL; + int *tup_trans_slot_id = NULL; + int undo_slot_no; + + xlrec = (xl_zheap_lock *) ((char *) xlundohdr + SizeOfUndoHeader); + + XLogRecGetBlockTag(record, 0, &target_node, NULL, &blkno); + ItemPointerSet(&target_tid, blkno, xlrec->offnum); + + reln = CreateFakeRelcacheEntry(target_node); + action = XLogReadBufferForRedo(record, 0, &buffer); + page = BufferGetPage(buffer); + + if (PageGetMaxOffsetNumber(page) >= xlrec->offnum) + lp = PageGetItemId(page, xlrec->offnum); + + if (PageGetMaxOffsetNumber(page) < xlrec->offnum || !ItemIdIsNormal(lp)) + elog(PANIC, "invalid lp"); + + zheaptup.t_tableOid = RelationGetRelid(reln); + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + zheaptup.t_self = target_tid; + + /* + * WAL stream contains undo tuple header, replace it with the explicitly + * stored tuple header. + */ + tup_hdr = (char *) xlrec + SizeOfZHeapLock; + + /* prepare an undo record */ + if (ZHeapTupleHasMultiLockers(xlrec->infomask)) + undorecord.uur_type = UNDO_XID_MULTI_LOCK_ONLY; + else if (xlrec->flags & XLZ_LOCK_FOR_UPDATE) + undorecord.uur_type = UNDO_XID_LOCK_FOR_UPDATE; + else + undorecord.uur_type = UNDO_XID_LOCK_ONLY; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = xlundohdr->reloid; + undorecord.uur_prevxid = xlrec->prev_xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = xlundohdr->blkprev; + undorecord.uur_block = ItemPointerGetBlockNumber(&target_tid); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&target_tid); + + initStringInfo(&undorecord.uur_payload); + initStringInfo(&undorecord.uur_tuple); + appendBinaryStringInfo(&undorecord.uur_tuple, + tup_hdr, + SizeofZHeapTupleHeader); + + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) (tup_hdr + SizeofZHeapTupleHeader), + sizeof(LockTupleMode)); + + if (xlrec->flags & XLZ_LOCK_TRANS_SLOT_FOR_UREC) + { + trans_slot_for_urec = (int *) ((char *) tup_hdr + + SizeofZHeapTupleHeader + sizeof(LockTupleMode)); + if (xlrec->trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &(xlrec->trans_slot_id), + sizeof(int)); + } + else if (xlrec->flags & XLZ_LOCK_CONTAINS_TPD_SLOT) + { + tup_trans_slot_id = (int *) ((char *) tup_hdr + + SizeofZHeapTupleHeader + sizeof(LockTupleMode)); + /* + * We must have logged the tuple's original transaction slot if it is a TPD + * slot. 
+ */ + Assert(*tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) tup_trans_slot_id, + sizeof(*tup_trans_slot_id)); + } + + /* + * For sub-tranasctions, we store the dummy contains subxact token in the + * undorecord so that, the size of undorecord in DO function matches with + * the size of undorecord in REDO function. This ensures that, for + * sub-transactions, the assert condition used later in this + * function to ensure that the undo pointer in DO and REDO function remains + * the same is true. + */ + if (xlrec->flags & XLZ_LOCK_CONTAINS_SUBXACT) + { + SubTransactionId dummy_subXactToken = 1; + + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &dummy_subXactToken, + sizeof(SubTransactionId)); + } + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + InsertPreparedUndo(); + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). + */ + Assert (urecptr == xlundohdr->urec_ptr); + + if (trans_slot_for_urec) + undo_slot_no = *trans_slot_for_urec; + else + undo_slot_no = xlrec->trans_slot_id; + + if (action == BLK_NEEDS_REDO) + { + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + ZHeapTupleHeaderSetXactSlot(zheaptup.t_data, xlrec->trans_slot_id); + zheaptup.t_data->t_infomask = xlrec->infomask; + PageSetUNDO(undorecord, buffer, undo_slot_no, false, xid_epoch, + xid, urecptr, NULL, 0); + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + TPDPageSetUndo(buffer, + undo_slot_no, + (xlrec->flags & XLZ_LOCK_FOR_UPDATE) ? true : false, + xid_epoch, + xid, + urecptr, + &undorecord.uur_offset, + 1); + TPDPageSetLSN(page, lsn); + } + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + /* be tidy */ + pfree(undorecord.uur_tuple.data); + pfree(undorecord.uur_payload.data); + + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + FreeFakeRelcacheEntry(reln); +} + +static void +zheap_xlog_multi_insert(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr; + xl_zheap_multi_insert *xlrec; + RelFileNode rnode; + BlockNumber blkno; + Buffer buffer; + Page page; + union + { + ZHeapTupleHeaderData hdr; + char data[MaxZHeapTupleSize]; + } tbuf; + ZHeapTupleHeader zhtup; + uint32 newlen; + UnpackedUndoRecord *undorecord = NULL; + UndoRecPtr urecptr = InvalidUndoRecPtr, + prev_urecptr = InvalidUndoRecPtr; + int i; + int nranges; + int ucnt = 0; + OffsetNumber usedoff[MaxOffsetNumber]; + bool isinit = (XLogRecGetInfo(record) & XLOG_ZHEAP_INIT_PAGE) != 0; + XLogRedoAction action; + char *ranges_data; + int *tpd_trans_slot_id = NULL; + Size ranges_data_size = 0; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + ZHeapFreeOffsetRanges *zfree_offset_ranges; + bool skip_undo; + + xlundohdr = (xl_undo_header *) XLogRecGetData(record); + xlrec = (xl_zheap_multi_insert *) ((char *) xlundohdr + SizeOfUndoHeader); + + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); + + /* + * The visibility map may need to be fixed even if the heap page is + * already up-to-date. 
+ */ + if (xlrec->flags & XLZ_INSERT_ALL_VISIBLE_CLEARED) + { + Relation reln = CreateFakeRelcacheEntry(rnode); + Buffer vmbuffer = InvalidBuffer; + + visibilitymap_pin(reln, blkno, &vmbuffer); + visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + FreeFakeRelcacheEntry(reln); + } + + if (isinit) + { + /* It is asked for page init, insert should not have tpd slot. */ + Assert(!(xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT)); + buffer = XLogInitBufferForRedo(record, 0); + page = BufferGetPage(buffer); + ZheapInitPage(page, BufferGetPageSize(buffer)); + action = BLK_NEEDS_REDO; + } + else + action = XLogReadBufferForRedo(record, 0, &buffer); + + /* allocate the information related to offset ranges */ + ranges_data = (char *) xlrec + SizeOfZHeapMultiInsert; + + /* fetch number of distinct ranges */ + nranges = *(int *) ranges_data; + ranges_data += sizeof(int); + ranges_data_size += sizeof(int); + + zfree_offset_ranges = (ZHeapFreeOffsetRanges *) palloc0(sizeof(ZHeapFreeOffsetRanges)); + Assert(nranges > 0); + for (i = 0; i < nranges; i++) + { + memcpy(&zfree_offset_ranges->startOffset[i],(char *) ranges_data, sizeof(OffsetNumber)); + ranges_data += sizeof(OffsetNumber); + memcpy(&zfree_offset_ranges->endOffset[i],(char *) ranges_data, sizeof(OffsetNumber)); + ranges_data += sizeof(OffsetNumber); + } + + /* + * We can skip inserting undo records if the tuples are to be marked + * as frozen. + */ + skip_undo= (xlrec->flags & XLZ_INSERT_IS_FROZEN); + if (!skip_undo) + { + undorecord = (UnpackedUndoRecord *) palloc(nranges * sizeof(UnpackedUndoRecord)); + + /* Start UNDO prepare Stuff */ + prev_urecptr = xlundohdr->blkprev; + urecptr = prev_urecptr; + + for (i = 0; i < nranges; i++) + { + /* prepare an undo record */ + undorecord[i].uur_type = UNDO_MULTI_INSERT; + undorecord[i].uur_info = 0; + undorecord[i].uur_prevlen = 0; + undorecord[i].uur_reloid = xlundohdr->reloid; + undorecord[i].uur_prevxid = xid; + undorecord[i].uur_prevxid = FrozenTransactionId; + undorecord[i].uur_cid = FirstCommandId; + undorecord[i].uur_fork = MAIN_FORKNUM; + undorecord[i].uur_blkprev = urecptr; + undorecord[i].uur_block = blkno; + undorecord[i].uur_offset = 0; + undorecord[i].uur_tuple.len = 0; + undorecord[i].uur_payload.len = 2 * sizeof(OffsetNumber); + initStringInfo(&undorecord[i].uur_payload); + appendBinaryStringInfo(&undorecord[i].uur_payload, + (char *) ranges_data, + 2 * sizeof(OffsetNumber)); + + ranges_data += undorecord[i].uur_payload.len; + ranges_data_size += undorecord[i].uur_payload.len; + } + + UndoSetPrepareSize(undorecord, nranges, xid, + UNDO_PERMANENT, NULL); + for (i = 0; i < nranges; i++) + { + undorecord[i].uur_blkprev = urecptr; + urecptr = PrepareUndoInsert(&undorecord[i], xid, UNDO_PERMANENT, NULL); + } + + elog(DEBUG1, "Undo record prepared: %d for Block Number: %d", + nranges, blkno); + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). 
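+ * Redo prepares the same undo records, in the same order and of the same
+ * size, as the DO path did, so PrepareUndoInsert must hand back exactly
+ * the undo record pointer that was stamped into the WAL header; a
+ * mismatch would mean the undo log has diverged from the master.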
+ */ + Assert (urecptr == xlundohdr->urec_ptr); + + InsertPreparedUndo(); + } + + /* Get the tpd transaction slot number */ + if (xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + { + tpd_trans_slot_id = (int *) ((char *) xlrec + SizeOfZHeapMultiInsert + + ranges_data_size); + } + + /* Apply the wal for data */ + if (action == BLK_NEEDS_REDO) + { + char *tupdata; + char *endptr; + int trans_slot_id = 0; + int prev_trans_slot_id PG_USED_FOR_ASSERTS_ONLY; + Size len; + OffsetNumber offnum; + int j = 0; + bool first_time = true; + + prev_trans_slot_id = -1; + page = BufferGetPage(buffer); + + /* Tuples are stored as block data */ + tupdata = XLogRecGetBlockData(record, 0, &len); + endptr = tupdata + len; + + offnum = zfree_offset_ranges->startOffset[j]; + for (i = 0; i < xlrec->ntuples; i++) + { + xl_multi_insert_ztuple *xlhdr; + + /* + * If we're reinitializing the page, the tuples are stored in + * order from FirstOffsetNumber. Otherwise there's an array of + * offsets in the WAL record, and the tuples come after that. + */ + if (isinit) + offnum = FirstOffsetNumber + i; + else + { + /* + * Change the offset range if we've reached the end of current + * range. + */ + if (offnum > zfree_offset_ranges->endOffset[j]) + { + j++; + offnum = zfree_offset_ranges->startOffset[j]; + } + } + if (PageGetMaxOffsetNumber(page) + 1 < offnum) + elog(PANIC, "invalid max offset number"); + + xlhdr = (xl_multi_insert_ztuple *) SHORTALIGN(tupdata); + tupdata = ((char *) xlhdr) + SizeOfMultiInsertZTuple; + + newlen = xlhdr->datalen; + Assert(newlen <= MaxZHeapTupleSize); + zhtup = &tbuf.hdr; + MemSet((char *) zhtup, 0, SizeofZHeapTupleHeader); + /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */ + memcpy((char *) zhtup + SizeofZHeapTupleHeader, + (char *) tupdata, + newlen); + tupdata += newlen; + + newlen += SizeofZHeapTupleHeader; + zhtup->t_infomask2 = xlhdr->t_infomask2; + zhtup->t_infomask = xlhdr->t_infomask; + zhtup->t_hoff = xlhdr->t_hoff; + + if (ZPageAddItem(buffer, NULL, (Item) zhtup, newlen, offnum, + true, true, true) == InvalidOffsetNumber) + elog(PANIC, "failed to add tuple"); + + /* track used offsets */ + usedoff[ucnt++] = offnum; + + /* increase the offset to store next tuple */ + offnum++; + + if (!skip_undo) + { + if (tpd_trans_slot_id) + trans_slot_id = *tpd_trans_slot_id; + else + trans_slot_id = ZHeapTupleHeaderGetXactSlot(zhtup); + if (first_time) + { + prev_trans_slot_id = trans_slot_id; + first_time = false; + } + else + { + /* All the tuples must refer to same transaction slot. */ + Assert(prev_trans_slot_id == trans_slot_id); + prev_trans_slot_id = trans_slot_id; + } + } + } + + if (!skip_undo) + PageSetUNDO(undorecord[nranges-1], buffer, trans_slot_id, false, + xid_epoch, xid, urecptr, NULL, 0); + + PageSetLSN(page, lsn); + + MarkBufferDirty(buffer); + + if (tupdata != endptr) + elog(ERROR, "total tuple length mismatch"); + } + + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + /* + * We need to replay the record for TPD only when this record contains + * slot from TPD. 
+ */ + Assert(xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT); + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + /* prepare for the case where the data page is restored as is */ + if (ucnt == 0) + { + for (i = 0; i < nranges; i++) + { + OffsetNumber start_off, + end_off; + + start_off = ((OffsetNumber *) undorecord[i].uur_payload.data)[0]; + end_off = ((OffsetNumber *) undorecord[i].uur_payload.data)[1]; + + while (start_off <= end_off) + usedoff[ucnt++] = start_off++; + } + } + + TPDPageSetUndo(buffer, + *tpd_trans_slot_id, + true, + xid_epoch, + xid, + urecptr, + usedoff, + ucnt); + TPDPageSetLSN(BufferGetPage(buffer), lsn); + } + } + + /* be tidy */ + if (!skip_undo) + { + for (i = 0; i < nranges; i++) + pfree(undorecord[i].uur_payload.data); + pfree(undorecord); + } + pfree(zfree_offset_ranges); + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); +} + +/* + * Handles ZHEAP_CLEAN record type + */ +static void +zheap_xlog_clean(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_zheap_clean *xlrec = (xl_zheap_clean *) XLogRecGetData(record); + Buffer buffer; + Size freespace = 0; + RelFileNode rnode; + BlockNumber blkno; + XLogRedoAction action; + OffsetNumber *target_offnum; + Size *space_required; + + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); + + /* + * We're about to remove tuples. In Hot Standby mode, ensure that there's + * no queries running for which the removed tuples are still visible. + * + * Not all ZHEAP_CLEAN records remove tuples with xids, so we only want to + * conflict on the records that cause MVCC failures for user queries. If + * latestRemovedXid is invalid, skip conflict processing. + */ + if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid)) + ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode); + + /* + * If we have a full-page image, restore it (using a cleanup lock) and + * we're done. + */ + action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, + &buffer); + if (action == BLK_NEEDS_REDO) + { + Page page = (Page)BufferGetPage(buffer); + OffsetNumber *end; + OffsetNumber *deleted; + OffsetNumber *nowdead; + OffsetNumber *nowunused; + OffsetNumber tmp_target_off; + int ndeleted; + int ndead; + int nunused; + Size datalen; + Size tmp_spc_rqd; + + deleted = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen); + + ndeleted = xlrec->ndeleted; + ndead = xlrec->ndead; + end = (OffsetNumber *) ((char *) deleted + datalen); + nowdead = deleted + (ndeleted * 2); + nowunused = nowdead + ndead; + nunused = (end - nowunused); + Assert(nunused >= 0); + + /* Update all item pointers per the record, and repair fragmentation */ + if (xlrec->flags & XLZ_CLEAN_CONTAINS_OFFSET) + { + target_offnum = (OffsetNumber *) ((char *) xlrec + SizeOfZHeapClean); + space_required = (Size *) ((char *) target_offnum + sizeof(OffsetNumber)); + } + else + { + target_offnum = &tmp_target_off; + *target_offnum = InvalidOffsetNumber; + space_required = &tmp_spc_rqd; + *space_required = 0; + } + + zheap_page_prune_execute(buffer, *target_offnum, deleted, ndeleted, + nowdead, ndead, nowunused, nunused); + + if (xlrec->flags & XLZ_CLEAN_ALLOW_PRUNING) + { + bool pruned PG_USED_FOR_ASSERTS_ONLY = false; + Page tmppage = NULL; + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + * See comments atop zheap_page_prune_guts. 
+ */ + tmppage = PageGetTempPageCopy(BufferGetPage(buffer)); + ZPageRepairFragmentation(buffer, tmppage, *target_offnum, + *space_required, true, &pruned); + + /* + * Pruning must be successful at redo time, otherwise the page + * contents on master and standby might differ. + */ + Assert(pruned); + + /* be tidy. */ + pfree(tmppage); + } + + freespace = PageGetZHeapFreeSpace(page); /* needed to update FSM below */ + + /* + * Note: we don't worry about updating the page's prunability hints. + * At worst this will cause an extra prune cycle to occur soon. + */ + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + /* + * Update the FSM as well. + * + * XXX: Don't do this if the page was restored from full page image. We + * don't bother to update the FSM in that case, it doesn't need to be + * totally accurate anyway. + */ + if (action == BLK_NEEDS_REDO) + XLogRecordPageWithFreeSpace(rnode, blkno, freespace); +} + +/* + * Handles XLOG_ZHEAP_CONFIRM record type + */ +static void +zheap_xlog_confirm(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_zheap_confirm *xlrec = (xl_zheap_confirm *) XLogRecGetData(record); + Buffer buffer; + Page page; + OffsetNumber offnum; + ItemId lp = NULL; + ZHeapTupleHeader zhtup; + + if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO) + { + page = BufferGetPage(buffer); + + offnum = xlrec->offnum; + if (PageGetMaxOffsetNumber(page) >= offnum) + lp = PageGetItemId(page, offnum); + + if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp)) + elog(PANIC, "invalid lp"); + + zhtup = (ZHeapTupleHeader) PageGetItem(page, lp); + + if (xlrec->flags == XLZ_SPEC_INSERT_SUCCESS) + { + /* Confirm tuple as actually inserted */ + zhtup->t_infomask &= ~ZHEAP_SPECULATIVE_INSERT; + } + else + { + Assert(xlrec->flags == XLZ_SPEC_INSERT_FAILED); + ItemIdSetDead(lp); + ZPageSetPrunable(page, XLogRecGetXid(record)); + } + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); +} + +/* + * Handles XLOG_ZHEAP_UNUSED record type + */ +static void +zheap_xlog_unused(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr; + xl_zheap_unused *xlrec; + UnpackedUndoRecord undorecord; + UndoRecPtr urecptr; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + uint16 i, uncnt; + Buffer buffer; + OffsetNumber *unused; + Size freespace = 0; + RelFileNode rnode; + BlockNumber blkno; + XLogRedoAction action; + + xlundohdr = (xl_undo_header *) XLogRecGetData(record); + xlrec = (xl_zheap_unused *) ((char *) xlundohdr + SizeOfUndoHeader); + /* extract the information related to unused offsets */ + unused = (OffsetNumber *) ((char *) xlrec + SizeOfZHeapUnused); + uncnt = xlrec->nunused; + + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); + + /* + * We're about to remove tuples. In Hot Standby mode, ensure that there's + * no queries running for which the removed tuples are still visible. + * + * Not all ZHEAP_UNUSED records remove tuples with xids, so we only want to + * conflict on the records that cause MVCC failures for user queries. If + * latestRemovedXid is invalid, skip conflict processing. 
+ */ + if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid)) + ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode); + + /* prepare an undo record */ + undorecord.uur_type = UNDO_ITEMID_UNUSED; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = xlundohdr->reloid; + undorecord.uur_prevxid = xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = xlundohdr->blkprev; + undorecord.uur_block = blkno; + undorecord.uur_offset = 0; + undorecord.uur_tuple.len = 0; + undorecord.uur_payload.len = uncnt * sizeof(OffsetNumber); + undorecord.uur_payload.data = + (char *) palloc(uncnt * sizeof(OffsetNumber)); + memcpy(undorecord.uur_payload.data, + (char *) unused, + undorecord.uur_payload.len); + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + InsertPreparedUndo(); + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). + */ + Assert (urecptr == xlundohdr->urec_ptr); + + /* + * If we have a full-page image, restore it (using a cleanup lock) and + * we're done. + */ + action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, + &buffer); + if (action == BLK_NEEDS_REDO) + { + Page page = (Page) BufferGetPage(buffer); + + // ZBORKED: unsigned type, can't be smaller, compiler laments + // Assert(uncnt >= 0); + + for (i = 0; i < uncnt; i++) + { + ItemId itemid; + + itemid = PageGetItemId(page, unused[i]); + ItemIdSetUnusedExtended(itemid, xlrec->trans_slot_id); + } + + PageSetUNDO(undorecord, buffer, xlrec->trans_slot_id, false, xid_epoch, + xid, urecptr, NULL, 0); + + if (xlrec->flags & XLZ_UNUSED_ALLOW_PRUNING) + { + bool pruned PG_USED_FOR_ASSERTS_ONLY = false; + Page tmppage = NULL; + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + * See comments atop zheap_page_prune_guts. + */ + tmppage = PageGetTempPageCopy(BufferGetPage(buffer)); + ZPageRepairFragmentation(buffer, tmppage, InvalidOffsetNumber, + 0, true, &pruned); + + /* + * Pruning must be successful at redo time, otherwise the page + * contents on master and standby might differ. + */ + Assert(pruned); + + pfree(tmppage); + } + + freespace = PageGetZHeapFreeSpace(page); /* needed to update FSM below */ + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + /* + * We need to replay the record for TPD only when this record contains + * slot from TPD. + */ + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + TPDPageSetUndo(buffer, + xlrec->trans_slot_id, + true, + xid_epoch, + xid, + urecptr, + unused, + uncnt); + TPDPageSetLSN(BufferGetPage(buffer), lsn); + } + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + + /* + * Update the FSM as well. + * + * XXX: Don't do this if the page was restored from full page image. We + * don't bother to update the FSM in that case, it doesn't need to be + * totally accurate anyway. + */ + if (action == BLK_NEEDS_REDO) + XLogRecordPageWithFreeSpace(rnode, blkno, freespace); +} + +/* + * Replay XLOG_ZHEAP_VISIBLE record. 
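+ * This touches only the visibility map: after resolving any Hot Standby
+ * conflicts against cutoff_xid, it sets the requested visibility map bits
+ * for the zheap block recorded in heapBlk. The zheap page itself is not
+ * modified here.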
+ */ +static void +zheap_xlog_visible(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_zheap_visible *xlrec = (xl_zheap_visible *) XLogRecGetData(record); + Buffer vmbuffer = InvalidBuffer; + RelFileNode rnode; + + XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL); + + /* + * If there are any Hot Standby transactions running that have an xmin + * horizon old enough that this page isn't all-visible for them, they + * might incorrectly decide that an index-only scan can skip a zheap fetch. + * + * NB: It might be better to throw some kind of "soft" conflict here that + * forces any index-only scan that is in flight to perform zheap fetches, + * rather than killing the transaction outright. + */ + if (InHotStandby) + ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode); + + if (XLogReadBufferForRedoExtended(record, 0, RBM_ZERO_ON_ERROR, false, + &vmbuffer) == BLK_NEEDS_REDO) + { + Page vmpage = BufferGetPage(vmbuffer); + Relation reln; + BlockNumber blkno = xlrec->heapBlk;; + + /* initialize the page if it was read as zeros */ + if (PageIsNew(vmpage)) + PageInit(vmpage, BLCKSZ, 0); + + /* + * XLogReadBufferForRedoExtended locked the buffer. But + * visibilitymap_set will handle locking itself. + */ + LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK); + + reln = CreateFakeRelcacheEntry(rnode); + visibilitymap_pin(reln, blkno, &vmbuffer); + + /* + * Don't set the bit if replay has already passed this point. + * + * It might be safe to do this unconditionally; if replay has passed + * this point, we'll replay at least as far this time as we did + * before, and if this bit needs to be cleared, the record responsible + * for doing so should be again replayed, and clear it. For right + * now, out of an abundance of conservatism, we use the same test here + * we did for the zheap page. If this results in a dropped bit, no + * real harm is done; and the next VACUUM will fix it. 
+ */ + if (lsn > PageGetLSN(vmpage)) + visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer, + xlrec->cutoff_xid, xlrec->flags); + + ReleaseBuffer(vmbuffer); + FreeFakeRelcacheEntry(reln); + } + else if (BufferIsValid(vmbuffer)) + UnlockReleaseBuffer(vmbuffer); +} + +void +zheap_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info & XLOG_ZHEAP_OPMASK) + { + case XLOG_ZHEAP_INSERT: + zheap_xlog_insert(record); + break; + case XLOG_ZHEAP_DELETE: + zheap_xlog_delete(record); + break; + case XLOG_ZHEAP_UPDATE: + zheap_xlog_update(record); + break; + case XLOG_ZHEAP_FREEZE_XACT_SLOT: + zheap_xlog_freeze_xact_slot(record); + break; + case XLOG_ZHEAP_INVALID_XACT_SLOT: + zheap_xlog_invalid_xact_slot(record); + break; + case XLOG_ZHEAP_LOCK: + zheap_xlog_lock(record); + break; + case XLOG_ZHEAP_MULTI_INSERT: + zheap_xlog_multi_insert(record); + break; + case XLOG_ZHEAP_CLEAN: + zheap_xlog_clean(record); + break; + default: + elog(PANIC, "zheap_redo: unknown op code %u", info); + } +} + +void +zheap2_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info & XLOG_ZHEAP_OPMASK) + { + case XLOG_ZHEAP_CONFIRM: + zheap_xlog_confirm(record); + break; + case XLOG_ZHEAP_UNUSED: + zheap_xlog_unused(record); + break; + case XLOG_ZHEAP_VISIBLE: + zheap_xlog_visible(record); + break; + default: + elog(PANIC, "zheap2_redo: unknown op code %u", info); + } +} diff --git a/src/backend/access/zheap/zhio.c b/src/backend/access/zheap/zhio.c new file mode 100644 index 0000000000..df6656bbf7 --- /dev/null +++ b/src/backend/access/zheap/zhio.c @@ -0,0 +1,403 @@ +/*------------------------------------------------------------------------- + * + * zhio.c + * POSTGRES zheap access method input/output code. + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/zheap/zhio.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/tpd.h" +#include "access/visibilitymap.h" +#include "access/zheap.h" +#include "access/zhio.h" +#include "access/zhtup.h" +#include "storage/bufmgr.h" +#include "storage/freespace.h" +#include "storage/lmgr.h" +#include "storage/smgr.h" + +/* + * RelationGetBufferForZTuple + * + * Returns pinned and exclusive-locked buffer of a page in given relation + * with free space >= given len. + * + * This is quite similar to RelationGetBufferForTuple except for zheap + * specific handling. If the last page where tuple needs to be inserted is a + * TPD page, we skip it and directly extend the relation. We could instead + * check the previous page, but scanning relation backwards could be costly, + * so we avoid it for now. As we don't align tuples in zheap, use actual + * length to find the required buffer. + */ +Buffer +RelationGetBufferForZTuple(Relation relation, Size len, + Buffer otherBuffer, int options, + BulkInsertState bistate, + Buffer *vmbuffer, Buffer *vmbuffer_other) +{ + bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM); + Buffer buffer = InvalidBuffer; + Page page; + Size pageFreeSpace = 0, + saveFreeSpace = 0; + BlockNumber targetBlock, + otherBlock; + bool needLock = false; + bool recheck = true; + bool tpdPage = false; + + /* Bulk insert is not supported for updates, only inserts. 
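+ * A caller therefore passes either a BulkInsertState (insert path) or an
+ * otherBuffer (update path), never both, which is what the assertion
+ * below checks.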
*/ + Assert(otherBuffer == InvalidBuffer || !bistate); + + len = SHORTALIGN(len); + + /* + * If we're gonna fail for oversize tuple, do it right away + */ + if (len > MaxZHeapTupleSize) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("row is too big: size %zu, maximum size %zu", + len, MaxZHeapTupleSize))); + + /* Compute desired extra freespace due to fillfactor option */ + saveFreeSpace = RelationGetTargetPageFreeSpace(relation, + HEAP_DEFAULT_FILLFACTOR); + + if (otherBuffer != InvalidBuffer) + otherBlock = BufferGetBlockNumber(otherBuffer); + else + otherBlock = InvalidBlockNumber; /* just to keep compiler quiet */ + + /* + * We first try to put the tuple on the same page we last inserted a tuple + * on, as cached in the BulkInsertState or relcache entry. If that + * doesn't work, we ask the Free Space Map to locate a suitable page. + * Since the FSM's info might be out of date, we have to be prepared to + * loop around and retry multiple times. (To insure this isn't an infinite + * loop, we must update the FSM with the correct amount of free space on + * each page that proves not to be suitable.) If the FSM has no record of + * a page with enough free space, we give up and extend the relation. + * + * When use_fsm is false, we either put the tuple onto the existing target + * page or extend the relation. + */ + if (len + saveFreeSpace > MaxZHeapTupleSize) + { + /* can't fit, don't bother asking FSM */ + targetBlock = InvalidBlockNumber; + use_fsm = false; + } + else if (bistate && bistate->current_buf != InvalidBuffer) + targetBlock = BufferGetBlockNumber(bistate->current_buf); + else + targetBlock = RelationGetTargetBlock(relation); + + if (targetBlock == InvalidBlockNumber && use_fsm) + { + /* + * We have no cached target page, so ask the FSM for an initial + * target. + */ + targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); + + /* + * If the FSM knows nothing of the rel, try the last page before we + * give up and extend. This avoids one-tuple-per-page syndrome during + * bootstrapping or in a recently-started system. + */ + if (targetBlock == InvalidBlockNumber) + { + BlockNumber nblocks = RelationGetNumberOfBlocks(relation); + + /* + * In zheap, first page is always a meta page, so we need to + * skip it for tuple insertions. + */ + if (nblocks > ZHEAP_METAPAGE + 1) + targetBlock = nblocks - 1; + } + } + +loop: + while (targetBlock != InvalidBlockNumber) + { + /* + * Read and exclusive-lock the target block, as well as the other + * block if one was given, taking suitable care with lock ordering and + * the possibility they are the same block. + * + * If the page-level all-visible flag is set, caller will need to + * clear both that and the corresponding visibility map bit. However, + * by the time we return, we'll have x-locked the buffer, and we don't + * want to do any I/O while in that state. So we check the bit here + * before taking the lock, and pin the page if it appears necessary. + * Checking without the lock creates a risk of getting the wrong + * answer, so we'll have to recheck after acquiring the lock. 
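+ * The branches below differ mainly in lock ordering: whenever two buffers
+ * are involved, the lower-numbered block is locked first, so that two
+ * backends acquiring the same pair of pages cannot deadlock against each
+ * other.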
+ */ + if (otherBuffer == InvalidBuffer) + { + /* easy case */ + buffer = ReadBufferBI(relation, targetBlock, bistate); + visibilitymap_pin(relation, targetBlock, vmbuffer); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + } + else if (otherBlock == targetBlock) + { + /* also easy case */ + buffer = otherBuffer; + visibilitymap_pin(relation, targetBlock, vmbuffer); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + } + else if (otherBlock < targetBlock) + { + /* lock other buffer first */ + buffer = ReadBuffer(relation, targetBlock); + visibilitymap_pin(relation, targetBlock, vmbuffer); + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + } + else + { + /* lock target buffer first */ + buffer = ReadBuffer(relation, targetBlock); + visibilitymap_pin(relation, targetBlock, vmbuffer); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + } + + if (PageGetSpecialSize(BufferGetPage(buffer)) == MAXALIGN(sizeof(TPDPageOpaqueData))) + { + tpdPage = true; + page = BufferGetPage(buffer); + + /* If the tpd page is empty, then we can use it as an empty zheap page. */ + if (PageIsEmpty(page)) + { + ZheapInitPage(page, BufferGetPageSize(buffer)); + tpdPage = false; + } + } + + if (!tpdPage) + { + /* + * We now have the target page (and the other buffer, if any) pinned + * and locked. However, since our initial PageIsAllVisible checks + * were performed before acquiring the lock, the results might now be + * out of date, either for the selected victim buffer, or for the + * other buffer passed by the caller. In that case, we'll need to + * give up our locks, go get the pin(s) we failed to get earlier, and + * re-lock. That's pretty painful, but hopefully shouldn't happen + * often. + * + * Note that there's a small possibility that we didn't pin the page + * above but still have the correct page pinned anyway, either because + * we've already made a previous pass through this loop, or because + * caller passed us the right page anyway. + * + * Note also that it's possible that by the time we get the pin and + * retake the buffer locks, the visibility map bit will have been + * cleared by some other backend anyway. In that case, we'll have + * done a bit of extra work for no gain, but there's no real harm + * done. + * + * Fixme: GetVisibilityMapPins use PageIsAllVisible which is not + * required for zheap, so either we need to rewrite that function or + * somehow avoid the usage of that call. + */ + if (otherBuffer == InvalidBuffer || buffer <= otherBuffer) + GetVisibilityMapPins(relation, buffer, otherBuffer, + targetBlock, otherBlock, vmbuffer, + vmbuffer_other); + else + GetVisibilityMapPins(relation, otherBuffer, buffer, + otherBlock, targetBlock, vmbuffer_other, + vmbuffer); + + /* + * Now we can check to see if there's enough free space here. If so, + * we're done. + */ + page = BufferGetPage(buffer); + pageFreeSpace = PageGetZHeapFreeSpace(page); + if (len + saveFreeSpace <= pageFreeSpace) + { + /* use this page as future insert target, too */ + RelationSetTargetBlock(relation, targetBlock); + return buffer; + } + } + + /* + * Not enough space or a tpd page, so we must give up our page locks + * and pin (if any) and prepare to look elsewhere. We don't care + * which order we unlock the two buffers in, so this can be slightly + * simpler than the code above. 
+ */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + if (otherBuffer == InvalidBuffer) + ReleaseBuffer(buffer); + else if (otherBlock != targetBlock) + { + LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK); + ReleaseBuffer(buffer); + } + + /* + * If this a tpd page or FSM doesn't need to be updated, always fall + * out of the loop and extend. + */ + if (!use_fsm || tpdPage) + break; + + /* + * Update FSM as to condition of this page, and ask for another page + * to try. + */ + targetBlock = RecordAndGetPageWithFreeSpace(relation, + targetBlock, + pageFreeSpace, + len + saveFreeSpace); + } + + /* + * Have to extend the relation. + * + * We have to use a lock to ensure no one else is extending the rel at the + * same time, else we will both try to initialize the same new page. We + * can skip locking for new or temp relations, however, since no one else + * could be accessing them. + */ + needLock = !RELATION_IS_LOCAL(relation); + +recheck: + /* + * If we need the lock but are not able to acquire it immediately, we'll + * consider extending the relation by multiple blocks at a time to manage + * contention on the relation extension lock. However, this only makes + * sense if we're using the FSM; otherwise, there's no point. + */ + if (needLock) + { + if (!use_fsm) + LockRelationForExtension(relation, ExclusiveLock); + else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock)) + { + /* Couldn't get the lock immediately; wait for it. */ + LockRelationForExtension(relation, ExclusiveLock); + + /* + * Check if some other backend has extended a block for us while + * we were waiting on the lock. + */ + targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); + + /* + * If some other waiter has already extended the relation, we + * don't need to do so; just use the existing freespace. + */ + if (targetBlock != InvalidBlockNumber) + { + UnlockRelationForExtension(relation, ExclusiveLock); + goto loop; + } + + /* Time to bulk-extend. */ + RelationAddExtraBlocks(relation, bistate); + } + } + + /* + * In addition to whatever extension we performed above, we always add at + * least one block to satisfy our own request. + * + * XXX This does an lseek - rather expensive - but at the moment it is the + * only way to accurately determine how many blocks are in a relation. Is + * it worth keeping an accurate file length in shared memory someplace, + * rather than relying on the kernel to do it for us? + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* + * We can be certain that locking the otherBuffer first is OK, since it + * must have a lower page number. We don't lock other buffer while holding + * extension lock. See comments below. + */ + if (otherBuffer != InvalidBuffer && !needLock) + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Now acquire lock on the new page. + */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Release the file-extension lock; it's now OK for someone else to extend + * the relation some more. Note that we cannot release this lock before + * we have buffer lock on the new page, or we risk a race condition + * against vacuumlazy.c --- see comments therein. + */ + if (needLock) + UnlockRelationForExtension(relation, ExclusiveLock); + + /* + * We need to initialize the empty new page. Double-check that it really + * is empty (this should never happen, but if it does we don't want to + * risk wiping out valid data). 
+ */ + page = BufferGetPage(buffer); + + if (!PageIsNew(page)) + elog(ERROR, "page %u of relation \"%s\" should be empty but is not", + BufferGetBlockNumber(buffer), + RelationGetRelationName(relation)); + + Assert(BufferGetBlockNumber(buffer) != ZHEAP_METAPAGE); + ZheapInitPage(page, BufferGetPageSize(buffer)); + + /* + * We don't acquire lock on otherBuffer while holding extension lock as it + * can create a deadlock against extending TPD entry where we take extension + * lock while holding the heap buffer lock. See TPDAllocatePageAndAddEntry. + */ + if (needLock && + otherBuffer != InvalidBuffer && + BufferGetBlockNumber(buffer) > otherBlock) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + recheck = true; + } + if (len > PageGetZHeapFreeSpace(page)) + { + if (recheck) + goto recheck; + + /* We should not get here given the test at the top */ + elog(PANIC, "tuple is too big: size %zu", len); + } + + /* + * Remember the new page as our target for future insertions. + * + * XXX should we enter the new page into the free space map immediately, + * or just keep it for this backend's exclusive use in the short run + * (until VACUUM sees it)? Seems to depend on whether you expect the + * current backend to make more insertions or not, which is probably a + * good bet most of the time. So for now, don't add it to FSM yet. + */ + RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); + + return buffer; +} diff --git a/src/backend/access/zheap/zmultilocker.c b/src/backend/access/zheap/zmultilocker.c new file mode 100644 index 0000000000..7b39917a32 --- /dev/null +++ b/src/backend/access/zheap/zmultilocker.c @@ -0,0 +1,853 @@ +/*------------------------------------------------------------------------- + * + * zmultilocker.c + * zheap multi locker code + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/zheap/zmultilocker.c + * + * NOTES + * This file contains functions for the multi locker facilit1y of zheap. + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/xact.h" +#include "access/zmultilocker.h" +#include "storage/bufmgr.h" +#include "storage/buf_internals.h" +#include "storage/proc.h" + +static bool IsZMultiLockListMember(List *members, ZMultiLockMember *mlmember); + +/* + * ZGetMultiLockMembersForCurrentXact - Return the strongest lock mode held by + * the current transaction on a given tuple. + */ +List * +ZGetMultiLockMembersForCurrentXact(ZHeapTuple zhtup, int trans_slot, + UndoRecPtr urec_ptr) +{ + ZHeapTuple undo_tup; + UnpackedUndoRecord *urec = NULL; + ZMultiLockMember *mlmember; + List *multilockmembers = NIL; + int trans_slot_id = -1; + uint8 uur_type; + + undo_tup = zhtup; + do + { + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* If undo is discarded, we can't proceed further. */ + if (!urec) + break; + + /* If we encounter a different transaction, we shouldn't go ahead. 
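+ * Once the chain reaches an undo record written by some other
+ * transaction we stop looking, since this function only reports what the
+ * current transaction has done to the tuple.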
*/ + if (!TransactionIdIsCurrentTransactionId(urec->uur_xid)) + break; + + + uur_type = urec->uur_type; + + if (uur_type == UNDO_INSERT || uur_type == UNDO_MULTI_INSERT) + { + /* + * We are done, once we are at the end of current chain. We + * consider the chain has ended when we reach the root tuple. + */ + break; + } + + /* don't free the tuple passed by caller */ + undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id, NULL, + (undo_tup) == (zhtup) ? false : true, + NULL); + + if (uur_type == UNDO_XID_LOCK_ONLY || + uur_type == UNDO_XID_LOCK_FOR_UPDATE || + uur_type == UNDO_XID_MULTI_LOCK_ONLY) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->trans_slot_id = trans_slot; + mlmember->mode = *((LockTupleMode *) urec->uur_payload.data); + multilockmembers = lappend(multilockmembers, mlmember); + } + else if (uur_type == UNDO_UPDATE || + uur_type == UNDO_INPLACE_UPDATE) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->trans_slot_id = trans_slot; + + if (ZHEAP_XID_IS_EXCL_LOCKED(undo_tup->t_data->t_infomask)) + mlmember->mode = LockTupleExclusive; + else + mlmember->mode = LockTupleNoKeyExclusive; + + multilockmembers = lappend(multilockmembers, mlmember); + } + else if (uur_type == UNDO_DELETE) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->trans_slot_id = trans_slot; + mlmember->mode = LockTupleExclusive; + multilockmembers = lappend(multilockmembers, mlmember); + } + else + { + /* Should not reach here */ + Assert(0); + } + + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + { + /* + * We are done, once the the undo record suggests that prior + * record is already discarded. + * + * Note that we record the lock mode for all these cases because + * the lock mode stored in undo tuple is for the current + * transaction. + */ + break; + } + urec_ptr = urec->uur_blkprev; + + UndoRecordRelease(urec); + urec = NULL; + } while (UndoRecPtrIsValid(urec_ptr)); + + if (urec) + { + UndoRecordRelease(urec); + urec = NULL; + } + if (undo_tup && undo_tup != zhtup) + pfree(undo_tup); + + return multilockmembers; +} + +/* + * ZGetMultiLockMembers - Return the list of members that have locked a + * particular tuple. + * + * This function returns the list of in-progress, committed or aborted + * transactions. The purpose of returning committed or aborted transactions + * is that some of the callers want to take some specific action for + * such transactions if they have updated the tuple. + */ +List * +ZGetMultiLockMembers(Relation rel, ZHeapTuple zhtup, Buffer buf, + bool nobuflock) +{ + ZHeapTuple undo_tup; + UnpackedUndoRecord *urec = NULL; + UndoRecPtr urec_ptr; + ZMultiLockMember *mlmember; + List *multilockmembers = NIL; + TransInfo *trans_slots = NULL; + TransactionId xid; + SubTransactionId subxid = InvalidSubTransactionId; + uint64 epoch_xid; + uint32 epoch; + int prev_trans_slot_id, + trans_slot_id; + uint8 uur_type; + int slot_no; + int total_trans_slots = 0; + BlockNumber tpd_blkno = InvalidBlockNumber; + + if (nobuflock) + { + ItemId lp; + + LockBuffer(buf, BUFFER_LOCK_SHARE); + lp = PageGetItemId(BufferGetPage(buf), + ItemPointerGetOffsetNumber(&zhtup->t_self)); + /* + * It is quite possible that once we reacquire the lock on buffer, + * some other backend would have deleted the tuple and in such case, + * we don't need to do anything. 
However, the tuple can't be pruned + * because the current snapshot must predates the transaction that + * removes the tuple. + */ + Assert(!ItemIdIsDead(lp)); + if (ItemIdIsDeleted(lp)) + { + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + return NIL; + } + } + + trans_slots = GetTransactionsSlotsForPage(rel, buf, &total_trans_slots, + &tpd_blkno); + + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + for (slot_no = 0; slot_no < total_trans_slots; slot_no++) + { + bool first_urp = true; + epoch = trans_slots[slot_no].xid_epoch; + xid = trans_slots[slot_no].xid; + + epoch_xid = MakeEpochXid((uint64)epoch, xid); + + /* + * We need to process the undo chain only for in-progress + * transactions. + */ + if (epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + continue; + + urec_ptr = trans_slots[slot_no].urec_ptr; + trans_slot_id = slot_no + 1; + undo_tup = zhtup; + + /* + * If the page contains TPD slots and it's not pruned, the last slot + * contains the information about the corresponding TPD entry. + * Hence, if current slot refers to some TPD slot, we should skip + * the last slot in the page by increasing the slot index by 1. + */ + if ((trans_slot_id >= ZHEAP_PAGE_TRANS_SLOTS) && + BlockNumberIsValid(tpd_blkno)) + trans_slot_id += 1; + + do + { + UndoLogControl *log = NULL; + + /* + * After we release the buffer lock, the transaction can be + * rolled-back and undo record poiner can be re-winded. Ensure + * that undo record pointer is sane by acquiring rewind lock so + * that undo worker can't rewind it concurrently. + * + * It is sufficient to verify the first undo record of slot as + * the previous one's can't be re-wounded. + * + * If we already have a buf LOCK, then there is no need to verify + * undo record pointer as rollback can't rewind till the undo actions + * are applied. + */ + if (nobuflock && first_urp) + { + log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr)); + + /* + * Acquire rewind lock to prevent rewinding the undo record + * pointer while we are fetching the undo record. + */ + LWLockAcquire(&log->rewind_lock, LW_SHARED); + + /* Lock the buffer */ + LockBuffer(buf, BUFFER_LOCK_SHARE); + + /* + * We can release the buffer lock after reading the slot + * information as we already hold the rewind lock, so the undo + * can't be re-winded. Although, it can be discarded but we have + * handling for the same. + */ + trans_slot_id = GetTransactionSlotInfo(buf, + InvalidOffsetNumber, + trans_slot_id, + &epoch, + &xid, + &urec_ptr, + true, + true); + + /* Release the buffer lock */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + epoch_xid = MakeEpochXid((uint64)epoch, xid); + + /* + * We need to process the undo chain only for in-progress + * transactions. + */ + if (epoch_xid < pg_atomic_read_u64( + &ProcGlobal->oldestXidWithEpochHavingUndo)) + { + LWLockRelease(&log->rewind_lock); + break; + } + } + + prev_trans_slot_id = trans_slot_id; + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&undo_tup->t_self), + ItemPointerGetOffsetNumber(&undo_tup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + if (nobuflock && first_urp) + { + first_urp = false; + LWLockRelease(&log->rewind_lock); + } + + /* If undo is discarded, we can't proceed further. */ + if (!urec) + break; + + ZHeapTupleGetSubXid(undo_tup, buf, urec_ptr, &subxid); + + /* + * Exclude undo records inserted by my own transaction. We neither + * need to check conflicts with them nor need to wait for them. 
+ */ + if (TransactionIdEquals(urec->uur_xid, GetTopTransactionIdIfAny())) + { + urec_ptr = urec->uur_blkprev; + UndoRecordRelease(urec); + urec = NULL; + continue; + } + + uur_type = urec->uur_type; + + if (uur_type == UNDO_INSERT || uur_type == UNDO_MULTI_INSERT) + { + /* + * We are done, once we are at the end of current chain. We + * consider the chain has ended when we reach the root tuple. + */ + break; + } + + /* don't free the tuple passed by caller */ + undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id, + NULL, (undo_tup) == (zhtup) ? false : true, + BufferGetPage(buf)); + + if (uur_type == UNDO_XID_LOCK_ONLY || + uur_type == UNDO_XID_LOCK_FOR_UPDATE || + uur_type == UNDO_XID_MULTI_LOCK_ONLY) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->subxid = subxid; + mlmember->trans_slot_id = prev_trans_slot_id; + mlmember->mode = *((LockTupleMode *) urec->uur_payload.data); + multilockmembers = lappend(multilockmembers, mlmember); + } + else if (uur_type == UNDO_UPDATE || + uur_type == UNDO_INPLACE_UPDATE) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->subxid = subxid; + mlmember->trans_slot_id = prev_trans_slot_id; + + if (ZHEAP_XID_IS_EXCL_LOCKED(undo_tup->t_data->t_infomask)) + mlmember->mode = LockTupleExclusive; + else + mlmember->mode = LockTupleNoKeyExclusive; + + multilockmembers = lappend(multilockmembers, mlmember); + } + else if (uur_type == UNDO_DELETE) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->subxid = subxid; + mlmember->trans_slot_id = prev_trans_slot_id; + mlmember->mode = LockTupleExclusive; + multilockmembers = lappend(multilockmembers, mlmember); + } + else + { + /* Should not reach here */ + Assert(0); + } + + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + trans_slot_id != prev_trans_slot_id) + { + /* + * We are done, once the the undo record suggests that prior + * record is already discarded or the prior record belongs to + * a different transaction slot chain. + */ + break; + } + + /* + * We allow to move backwards in the chain even when we + * encountered undo record of committed transaction + * (ZHeapTupleHasInvalidXact(undo_tup->t_data)). + */ + urec_ptr = urec->uur_blkprev; + + UndoRecordRelease(urec); + urec = NULL; + } while (UndoRecPtrIsValid(urec_ptr)); + + if (urec) + { + UndoRecordRelease(urec); + urec = NULL; + } + + if (undo_tup && undo_tup != zhtup) + pfree(undo_tup); + } + + /* be tidy */ + pfree(trans_slots); + + return multilockmembers; +} + +/* + * ZMultiLockMembersWait - Wait for all the members to end. + * + * This function also applies the undo actions for aborted transactions. 
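+ * It returns false only when 'nowait' is set and some conflicting member
+ * could not be waited upon. If 'remaining' is given, it reports how many
+ * members were skipped: the current transaction's own locks plus
+ * in-progress lockers whose mode does not conflict with 'required_mode'.
+ * A rough usage sketch (illustrative only, not taken from an actual
+ * caller; 'upd_xact_aborted' is a local bool):
+ *
+ *   members = ZGetMultiLockMembers(rel, zhtup, buf, true);
+ *   ZMultiLockMembersWait(rel, members, zhtup, buf, InvalidTransactionId,
+ *                         LockTupleExclusive, false, XLTW_Lock,
+ *                         NULL, &upd_xact_aborted);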
+ */ +bool +ZMultiLockMembersWait(Relation rel, List *mlmembers, ZHeapTuple zhtup, + Buffer buf, TransactionId update_xact, + LockTupleMode required_mode, bool nowait, + XLTW_Oper oper, int *remaining, bool *upd_xact_aborted) +{ + bool result = true; + ListCell *lc; + BufferDesc *bufhdr PG_USED_FOR_ASSERTS_ONLY; + int remain = 0; + + bufhdr = GetBufferDescriptor(buf - 1); + /* buffer must be unlocked */ + Assert(!LWLockHeldByMe(BufferDescriptorGetContentLock(bufhdr))); + + *upd_xact_aborted = false; + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + TransactionId memxid = mlmember->xid; + SubTransactionId memsubxid = mlmember->subxid; + LockTupleMode memmode = mlmember->mode; + + if (TransactionIdIsCurrentTransactionId(memxid)) + { + remain++; + continue; + } + + if (!DoLockModesConflict(HWLOCKMODE_from_locktupmode(memmode), + HWLOCKMODE_from_locktupmode(required_mode))) + { + if (remaining && TransactionIdIsInProgress(memxid)) + remain++; + continue; + } + + /* + * This member conflicts with our multi, so we have to sleep (or + * return failure, if asked to avoid waiting.) + */ + if (memsubxid != InvalidSubTransactionId) + { + if (nowait) + { + result = ConditionalSubXactLockTableWait(memxid, memsubxid); + if (!result) + break; + } + else + SubXactLockTableWait(memxid, memsubxid, rel, &zhtup->t_self, + oper); + } + else if (nowait) + { + result = ConditionalXactLockTableWait(memxid); + if (!result) + break; + } + else + XactLockTableWait(memxid, rel, &zhtup->t_self, oper); + + /* + * For aborted transaction, if the undo actions are not applied yet, + * then apply them before modifying the page. + */ + if (TransactionIdDidAbort(memxid)) + { + LockBuffer(buf, BUFFER_LOCK_SHARE); + zheap_exec_pending_rollback(rel, buf, mlmember->trans_slot_id, memxid); + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + if (TransactionIdIsValid(update_xact) && memxid == update_xact) + *upd_xact_aborted = true; + } + } + + if (remaining) + *remaining = remain; + + return result; +} + +/* + * ConditionalZMultiLockMembersWait + * As above, but only lock if we can get the lock without blocking. + */ +bool +ConditionalZMultiLockMembersWait(Relation rel, List *mlmembers, + Buffer buf, TransactionId update_xact, + LockTupleMode required_mode, int *remaining, + bool *upd_xact_aborted) +{ + return ZMultiLockMembersWait(rel, mlmembers, NULL, buf, update_xact, + required_mode, true, XLTW_None, remaining, + upd_xact_aborted); +} + +/* + * ZIsAnyMultiLockMemberRunning - Check if any multi lock member is running. + * + * Returns true, if any member of the multi lock is running, false otherwise. + * + * Unlike heap, we don't consider current transaction's lockers to decide + * if the lockers of multi lock are running. In heap, any lock taken by + * subtransaction is recorded separetly in the multixact, so that it can + * detect if the subtransaction is rolled back. Now as the lock information + * is tracked at subtransaction level, we can't ignore the lockers for + * subtransactions of current top-level transaction. For zheap, rollback to + * subtransaction will rewind the undo and the lockers information will + * be automatically removed, so we don't need to track subtransaction lockers + * separately and hence we can ignore lockers of current top-level + * transaction. 
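+ * + * The check itself is straightforward: lockers in local buffers are ignored + * (no other session can access them), and otherwise we simply walk the + * mlmembers list and return true as soon as TransactionIdIsInProgress() + * reports a running member.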
+ */ +bool +ZIsAnyMultiLockMemberRunning(List *mlmembers, ZHeapTuple zhtup, Buffer buf) +{ + ListCell *lc; + BufferDesc *bufhdr PG_USED_FOR_ASSERTS_ONLY; + + bufhdr = GetBufferDescriptor(buf - 1); + + /* + * Local buffers can't be accesed by other sessions. + */ + if (BufferIsLocal(buf)) + return false; + + /* buffer must be locked by caller */ + Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufhdr))); + + if (list_length(mlmembers) <= 0) + { + elog(DEBUG2, "ZIsRunning: no members"); + return false; + } + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + TransactionId memxid = mlmember->xid; + + if (TransactionIdIsInProgress(memxid)) + { + elog(DEBUG2, "ZIsRunning: member %d is running", memxid); + return true; + } + } + + elog(DEBUG2, "ZIsRunning: no members are running"); + + return false; +} + +/* + * IsZMultiLockListMember - Returns true iff mlmember is a member of list + * mlmembers. Equality is determined by comparing all the variables of + * member. + */ +static bool +IsZMultiLockListMember(List *members, ZMultiLockMember *mlmember) +{ + ListCell *lc; + + foreach(lc, members) + { + ZMultiLockMember *lc_member = (ZMultiLockMember *) lfirst(lc); + + if (lc_member->xid == mlmember->xid && + lc_member->trans_slot_id == mlmember->trans_slot_id && + lc_member->mode == mlmember->mode) + return true; + } + + return false; +} + +/* + * ZMultiLockMembersSame - Returns true, iff all the members in list2 list + * are present in list1 list + */ +bool +ZMultiLockMembersSame(List *list1, List *list2) +{ + ListCell *lc; + + if (list_length(list2) > list_length(list1)) + return false; + + foreach(lc, list2) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + if (!IsZMultiLockListMember(list1, mlmember)) + return false; + } + + return true; +} + +/* + * ZGetMultiLockInfo - Helper function for compute_new_xid_infomask to + * get the multi lockers information. + */ +void +ZGetMultiLockInfo(uint16 old_infomask, TransactionId tup_xid, + int tup_trans_slot, TransactionId add_to_xid, + uint16 *new_infomask, int *new_trans_slot, + LockTupleMode *mode, bool *old_tuple_has_update, + bool is_update) +{ + LockTupleMode old_mode; + + old_mode = get_old_lock_mode(old_infomask); + + /* We want to propagate the updaters information for lockers only. */ + if (!is_update && IsZHeapTupleModified(old_infomask) && + !ZHEAP_XID_IS_LOCKED_ONLY(old_infomask)) + { + *old_tuple_has_update = true; + + if (ZHeapTupleIsInPlaceUpdated(old_infomask)) + { + *new_infomask |= ZHEAP_INPLACE_UPDATED; + } + else + { + Assert(ZHeapTupleIsUpdated(old_infomask)); + *new_infomask |= ZHEAP_UPDATED; + } + } + + if (tup_xid == add_to_xid) + { + if (ZHeapTupleHasMultiLockers(old_infomask)) + *new_infomask |= ZHEAP_MULTI_LOCKERS; + + /* acquire the strongest of both */ + if (*mode < old_mode) + *mode = old_mode; + } + else + { + *new_infomask |= ZHEAP_MULTI_LOCKERS; + + /* + * Acquire the strongest of both and keep the transaction slot of + * the stronger lock. + */ + if (*mode < old_mode) + { + *mode = old_mode; + } + + /* For lockers, we want to store the updater's transaction slot. */ + if (!is_update) + *new_trans_slot = tup_trans_slot; + } +} + +/* + * GetLockerTransInfo - Retrieve the transaction information of single locker + * from undo. + * + * If the locker is already committed or too-old, we consider as if it didn't + * exist at all. + * + * The caller must have a lock on the buffer (buf). 
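+ * + * Returns true and fills whichever of the output parameters (trans_slot, + * epoch_xid_out, xid_out, cid_out, urec_ptr_out) are requested when such a + * locker is found; otherwise returns false and leaves the output parameters + * untouched.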
+ */ +bool +GetLockerTransInfo(Relation rel, ZHeapTuple zhtup, Buffer buf, + int *trans_slot, uint64 *epoch_xid_out, + TransactionId *xid_out, CommandId *cid_out, + UndoRecPtr *urec_ptr_out) +{ + UnpackedUndoRecord *urec = NULL; + UndoRecPtr urec_ptr; + UndoRecPtr save_urec_ptr = InvalidUndoRecPtr; + TransInfo *trans_slots = NULL; + TransactionId xid; + CommandId cid = InvalidCommandId; + uint64 epoch; + uint64 epoch_xid; + int trans_slot_id; + uint8 uur_type; + int slot_no; + int total_trans_slots = 0; + bool found = false; + BlockNumber tpd_blkno; + + trans_slots = GetTransactionsSlotsForPage(rel, buf, &total_trans_slots, + &tpd_blkno); + + for (slot_no = 0; slot_no < total_trans_slots; slot_no++) + { + epoch = trans_slots[slot_no].xid_epoch; + xid = trans_slots[slot_no].xid; + + epoch_xid = MakeEpochXid(epoch, xid); + + /* + * We need to process the undo chain only for in-progress + * transactions. + */ + if (epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo) || + (!TransactionIdIsInProgress(xid) && TransactionIdDidCommit(xid))) + continue; + + save_urec_ptr = urec_ptr = trans_slots[slot_no].urec_ptr; + + do + { + UndoRecPtr out_urec_ptr PG_USED_FOR_ASSERTS_ONLY; + + out_urec_ptr = InvalidUndoRecPtr; + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + InvalidTransactionId, + &out_urec_ptr, + ZHeapSatisfyUndoRecord); + + /* + * We couldn't find any undo record for the tuple corresponding + * to current slot. + */ + if (urec == NULL) + { + /* Make sure we've reached the end of current undo chain. */ + Assert(!out_urec_ptr); + break; + } + + cid = urec->uur_cid; + + /* + * If the current transaction has locked the tuple, then we don't need + * to process the undo records. + */ + if (TransactionIdEquals(urec->uur_xid, GetTopTransactionIdIfAny())) + { + found = true; + break; + } + + uur_type = urec->uur_type; + + if (uur_type == UNDO_INSERT || uur_type == UNDO_MULTI_INSERT) + { + /* + * We are done, once we are at the end of current chain. We + * consider the chain has ended when we reach the root tuple. + */ + break; + } + + if (uur_type == UNDO_XID_LOCK_ONLY || + uur_type == UNDO_XID_LOCK_FOR_UPDATE) + { + found = true; + break; + } + + if (xid != urec->uur_xid) + { + /* + * We are done, once the the undo record suggests that prior + * tuple version is modified by a different transaction. + */ + break; + } + + urec_ptr = urec->uur_blkprev; + + UndoRecordRelease(urec); + urec = NULL; + } while (UndoRecPtrIsValid(urec_ptr)); + + if (urec) + { + UndoRecordRelease(urec); + urec = NULL; + } + + if (found) + { + /* Transaction slots in the page start from 1. */ + trans_slot_id = slot_no + 1; + + /* + * If the page contains TPD slots and it's not pruned, the last slot + * contains the information about the corresponding TPD entry. + * Hence, if current slot refers to some TPD slot, we should skip + * the last slot in the page by increasing the slot index by 1. + */ + if ((trans_slot_id >= ZHEAP_PAGE_TRANS_SLOTS) && + BlockNumberIsValid(tpd_blkno)) + trans_slot_id += 1; + + break; + } + } + + /* be tidy */ + pfree(trans_slots); + + /* + * If found, we return the corresponding transaction information. Else, we + * return the same information as passed as arguments. + */ + if (found) + { + /* Set the value of required parameters. 
*/ + if (trans_slot) + *trans_slot = trans_slot_id; + if (epoch_xid_out) + *epoch_xid_out = MakeEpochXid(epoch, xid); + if (xid_out) + *xid_out = xid; + if (cid_out) + *cid_out = cid; + if (urec_ptr_out) + *urec_ptr_out = save_urec_ptr; + } + + return found; +} diff --git a/src/backend/access/zheap/ztqual.c b/src/backend/access/zheap/ztqual.c new file mode 100644 index 0000000000..0f615fa03b --- /dev/null +++ b/src/backend/access/zheap/ztqual.c @@ -0,0 +1,2582 @@ +/*------------------------------------------------------------------------- + * + * ztqual.c + * POSTGRES "time qualification" code, ie, ztuple visibility rules. + * + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/utils/time/ztqual.c + * + * The core idea to check if the tuple is all-visible is to see if it is + * modified by transaction smaller than oldestXidWithEpochHavingUndo (aka + * there is no undo pending for the transaction) or if the transaction slot + * is frozen. For undo tuples, we additionally check if the transaction id + * of a transaction that has modified the tuple is FrozenTransactionId. The + * idea is we will always check the visibility of latest tuple based on + * epoch+xid and undo tuple's visibility based on xid. If the heap tuple is + * not all-visible (epoch+xid is not older than oldestXidWithEpochHavingUndo), + * then the xid corresponding to undo tuple must be in the range of 2-billion + * transactions with oldestXidHavingUndo (xid part in + * oldestXidWithEpochHavingUndo). This is true because we don't allow undo + * records older than 2-billion transactions. + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/subtrans.h" +#include "access/xact.h" +#include "access/zheap.h" +#include "access/zheaputils.h" +#include "access/zmultilocker.h" +#include "storage/bufmgr.h" +#include "storage/proc.h" +#include "storage/procarray.h" +#include "utils/tqual.h" +#include "utils/ztqual.h" +#include "storage/proc.h" + + +static ZHeapTuple GetTupleFromUndo(UndoRecPtr urec_ptr, ZHeapTuple zhtup, + Snapshot snapshot, Buffer buffer, + ItemPointer ctid, int trans_slot_id, + TransactionId prev_undo_xid); +static ZHeapTuple +GetTupleFromUndoForAbortedXact(UndoRecPtr urec_ptr, Buffer buffer, int trans_slot, + ZHeapTuple ztuple,TransactionId *xid); + +/* + * FetchTransInfoFromUndo - Retrieve transaction information of transaction + * that has modified the undo tuple. + */ +void +FetchTransInfoFromUndo(ZHeapTuple undo_tup, uint64 *epoch, TransactionId *xid, + CommandId *cid, UndoRecPtr *urec_ptr, bool skip_lockers) +{ + UnpackedUndoRecord *urec; + UndoRecPtr urec_ptr_out = InvalidUndoRecPtr; + TransactionId undo_tup_xid; + +fetch_prior_undo: + undo_tup_xid = *xid; + + /* + * The transaction slot referred by the undo tuple could have been reused + * multiple times, so to ensure that we have fetched the right undo record + * we need to verify that the undo record contains xid same as the xid + * that has modified the tuple. + */ + urec = UndoFetchRecord(*urec_ptr, + ItemPointerGetBlockNumber(&undo_tup->t_self), + ItemPointerGetOffsetNumber(&undo_tup->t_self), + undo_tup_xid, + &urec_ptr_out, + ZHeapSatisfyUndoRecord); + + /* + * The undo tuple must be visible, if the undo record containing + * the information of the last transaction that has updated the + * tuple is discarded. 
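+ * In that case we report that there is no interesting modifier at all: the + * epoch is set to 0 and the xid, cid and undo record pointer are set to + * their Invalid values below, which callers treat as all-visible.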
+ */ + if (urec == NULL) + { + if (epoch) + *epoch = 0; + if (xid) + *xid = InvalidTransactionId; + if (cid) + *cid = InvalidCommandId; + if (urec_ptr) + *urec_ptr = InvalidUndoRecPtr; + return; + } + + /* + * If we reach here, this means the transaction id that has + * last modified this tuple must be in 2-billion xid range + * of oldestXidHavingUndo, so we can get compute its epoch + * as we do for current transaction. + */ + if (epoch) + *epoch = GetEpochForXid(urec->uur_xid); + *xid = urec->uur_xid; + *cid = urec->uur_cid; + *urec_ptr = urec_ptr_out; + + if (skip_lockers && + (urec->uur_type == UNDO_XID_LOCK_ONLY || + urec->uur_type == UNDO_XID_LOCK_FOR_UPDATE || + urec->uur_type == UNDO_XID_MULTI_LOCK_ONLY)) + { + *xid = InvalidTransactionId; + *urec_ptr = urec->uur_blkprev; + UndoRecordRelease(urec); + goto fetch_prior_undo; + } + + UndoRecordRelease(urec); +} + +/* + * ZHeapPageGetNewCtid + * + * This should be called for ctid which is already set deleted to get the new + * ctid, xid and cid which modified the given one. + */ +void +ZHeapPageGetNewCtid(Buffer buffer, ItemPointer ctid, TransactionId *xid, + CommandId *cid) +{ + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + int trans_slot; + int vis_info; + uint64 epoch; + ItemId lp; + Page page; + OffsetNumber offnum = ItemPointerGetOffsetNumber(ctid); + int out_slot_no PG_USED_FOR_ASSERTS_ONLY; + + page = BufferGetPage(buffer); + lp = PageGetItemId(page, offnum); + + Assert(ItemIdIsDeleted(lp)); + + trans_slot = ItemIdGetTransactionSlot(lp); + vis_info = ItemIdGetVisibilityInfo(lp); + + if (vis_info & ITEMID_XACT_INVALID) + { + ZHeapTupleData undo_tup; + ItemPointerSetBlockNumber(&undo_tup.t_self, + BufferGetBlockNumber(buffer)); + ItemPointerSetOffsetNumber(&undo_tup.t_self, offnum); + + /* + * We need undo record pointer to fetch the transaction information + * from undo. + */ + out_slot_no = GetTransactionSlotInfo(buffer, offnum, trans_slot, + (uint32 *) &epoch, xid, &urec_ptr, + true, false); + *xid = InvalidTransactionId; + FetchTransInfoFromUndo(&undo_tup, &epoch, xid, cid, &urec_ptr, false); + } + else + { + out_slot_no = GetTransactionSlotInfo(buffer, offnum, trans_slot, + (uint32 *) &epoch, xid, &urec_ptr, + true, false); + *cid = ZHeapPageGetCid(buffer, trans_slot, (uint32) epoch, *xid, + urec_ptr, offnum); + } + + /* + * We always expect non-frozen transaction slot here as the caller tries + * to fetch the ctid of tuples that are visible to the snapshot, so + * corresponding undo record can't be discarded. + */ + Assert(out_slot_no != ZHTUP_SLOT_FROZEN); + + ZHeapPageGetCtid(trans_slot, buffer, urec_ptr, ctid); +} + +/* + * GetVisibleTupleIfAny + * + * This is a helper function for GetTupleFromUndoWithOffset. 
+ */ +static ZHeapTuple +GetVisibleTupleIfAny(UndoRecPtr prev_urec_ptr, ZHeapTuple undo_tup, + Snapshot snapshot, Buffer buffer, TransactionId xid, + int trans_slot_id, CommandId cid) +{ + int undo_oper = -1; + TransactionId oldestXidHavingUndo; + + if (undo_tup->t_data->t_infomask & ZHEAP_INPLACE_UPDATED) + { + undo_oper = ZHEAP_INPLACE_UPDATED; + } + else if (undo_tup->t_data->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + undo_oper = ZHEAP_XID_LOCK_ONLY; + } + else + { + /* we can't further operate on deleted or non-inplace-updated tuple */ + Assert(!(undo_tup->t_data->t_infomask & ZHEAP_DELETED) || + !(undo_tup->t_data->t_infomask & ZHEAP_UPDATED)); + } + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + + /* + * We need to fetch all the transaction related information from undo + * record for the tuples that point to a slot that gets invalidated for + * reuse at some point of time. See PageFreezeTransSlots. + */ + if ((trans_slot_id != ZHTUP_SLOT_FROZEN) && + !TransactionIdEquals(xid, FrozenTransactionId) && + !TransactionIdPrecedes(xid, oldestXidHavingUndo)) + { + if (ZHeapTupleHasInvalidXact(undo_tup->t_data->t_infomask)) + { + FetchTransInfoFromUndo(undo_tup, NULL, &xid, &cid, &prev_urec_ptr, false); + } + /* + * If we already have a valid cid then don't fetch it from the undo. + * This is the case when an old locker got transferred to the newly + * inserted tuple of a non-inplace update. In that case the undo chain + * will not have a separate undo record for the locker, so we have to + * use the cid we got from the insert undo record because in this + * case the actual previous version of the locker is insert only and + * that is what we are interested in. + */ + else if (cid == InvalidCommandId) + { + /* + * We don't use prev_undo_xid to fetch the undo record for cid as it is + * required only when the transaction is the current transaction, in which + * case there is no risk of transaction chain switching, so we are safe. It + * might be better to move this check near its usage, but that will + * make the code look ugly, so keeping it here. + */ + cid = ZHeapTupleGetCid(undo_tup, buffer, prev_urec_ptr, trans_slot_id); + } + } + + /* + * The tuple must be all visible if the transaction slot is cleared or + * the latest xid that has changed the tuple is so old that it is all-visible + * or it precedes the smallest xid that has undo.
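+ * Concretely, the test below treats the tuple as all-visible when the slot + * is ZHTUP_SLOT_FROZEN, the xid is FrozenTransactionId, or the xid precedes + * oldestXidHavingUndo (so no undo can remain for it).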
+ */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + return undo_tup; + + if (undo_oper == ZHEAP_INPLACE_UPDATED || + undo_oper == ZHEAP_XID_LOCK_ONLY) + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (IsMVCCSnapshot(snapshot) && cid >= snapshot->curcid) + { + /* updated/locked after scan started */ + return GetTupleFromUndo(prev_urec_ptr, + undo_tup, + snapshot, + buffer, + NULL, + trans_slot_id, + xid); + } + else + return undo_tup; /* updated before scan started */ + } + else if (IsMVCCSnapshot(snapshot) && XidInMVCCSnapshot(xid, snapshot)) + return GetTupleFromUndo(prev_urec_ptr, + undo_tup, + snapshot, + buffer, + NULL, + trans_slot_id, + xid); + else if (!IsMVCCSnapshot(snapshot) && TransactionIdIsInProgress(xid)) + return GetTupleFromUndo(prev_urec_ptr, + undo_tup, + snapshot, + buffer, + NULL, + trans_slot_id, + xid); + else if (TransactionIdDidCommit(xid)) + return undo_tup; + else + return GetTupleFromUndo(prev_urec_ptr, + undo_tup, + snapshot, + buffer, + NULL, + trans_slot_id, + xid); + } + else /* undo tuple is the root tuple */ + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (IsMVCCSnapshot(snapshot) && cid >= snapshot->curcid) + return NULL; /* inserted after scan started */ + else + return undo_tup; /* inserted before scan started */ + } + else if (IsMVCCSnapshot(snapshot) && XidInMVCCSnapshot(xid, snapshot)) + return NULL; + else if (!IsMVCCSnapshot(snapshot) && TransactionIdIsInProgress(xid)) + return NULL; + else if (TransactionIdDidCommit(xid)) + return undo_tup; + else + return NULL; + } +} + +/* + * GetTupleFromUndoForAbortedXact + * + * This is used to fetch the prior committed version of the tuple which is + * modified by an aborted xact. + * + * It returns the prior committed version of the tuple, if available. Else, + * returns NULL. + * + * The caller must send a palloc'ed tuple. This function can get a tuple + * from undo to return in which case it will free the memory passed by + * the caller. + * + * xid is an output parameter. It is set to the latest committed xid that + * inserted/in-place-updated the tuple. If the aborted transaction inserted + * the tuple itself, we return the same transaction id. The caller *should* + * handle the same scenario. + */ +static ZHeapTuple +GetTupleFromUndoForAbortedXact(UndoRecPtr urec_ptr, Buffer buffer, int trans_slot, + ZHeapTuple ztuple,TransactionId *xid) +{ + ZHeapTuple undo_tup = ztuple; + UnpackedUndoRecord *urec; + UndoRecPtr prev_urec_ptr; + TransactionId prev_undo_xid PG_USED_FOR_ASSERTS_ONLY; + TransactionId oldestXidHavingUndo = InvalidTransactionId; + int trans_slot_id; + int prev_trans_slot_id = trans_slot; + + prev_undo_xid = InvalidTransactionId; +fetch_prior_undo_record: + prev_urec_ptr = InvalidUndoRecPtr; + trans_slot_id = InvalidXactSlotId; + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&undo_tup->t_self), + ItemPointerGetOffsetNumber(&undo_tup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* If undo is discarded, then current tuple is visible. */ + if (urec == NULL) + return undo_tup; + + /* Here, we free the previous version and palloc a new tuple from undo. 
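+ * (CopyTupleFromUndoRecord is called below with its free flag - the fifth + * argument - set to true, so the tuple passed in is released once the copy + * built from the undo record is available.)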
*/ + undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id, NULL, + true, BufferGetPage(buffer)); + + prev_urec_ptr = urec->uur_blkprev; + *xid = urec->uur_prevxid; + + UndoRecordRelease(urec); + + /* we can't further operate on deleted or non-inplace-updated tuple */ + Assert(!((undo_tup->t_data->t_infomask & ZHEAP_DELETED) || + (undo_tup->t_data->t_infomask & ZHEAP_UPDATED))); + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(*xid, FrozenTransactionId) || + TransactionIdPrecedes(*xid, oldestXidHavingUndo)) + { + return undo_tup; + } + + /* + * If we got a tuple modified by a committed transaction, return it. + */ + if (TransactionIdDidCommit(*xid)) + return undo_tup; + + /* + * If the tuple points to a slot that gets invalidated for reuse at some + * point of time, then undo_tup is the latest committed version of the tuple. + */ + if (ZHeapTupleHasInvalidXact(undo_tup->t_data->t_infomask)) + return undo_tup; + + /* + * If the undo tuple is stamped with a different transaction, then either + * the previous transaction is committed or tuple must be locked only. In both + * cases, we can return the tuple fetched from undo. + */ + if (trans_slot_id != prev_trans_slot_id) + { + (void) GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &prev_urec_ptr, + true, + true); + FetchTransInfoFromUndo(undo_tup, NULL, xid, NULL, &prev_urec_ptr, false); + + Assert(TransactionIdDidCommit(*xid) || + ZHEAP_XID_IS_LOCKED_ONLY(undo_tup->t_data->t_infomask)); + + return undo_tup; + } + + /* transaction must be aborted. */ + Assert(!TransactionIdIsCurrentTransactionId(*xid)); + Assert(!TransactionIdIsInProgress(*xid)); + Assert(TransactionIdDidAbort(*xid)); + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(*xid, FrozenTransactionId) || + TransactionIdPrecedes(*xid, oldestXidHavingUndo)) + { + return undo_tup; + } + + /* + * We can't have two aborted transaction with pending rollback state for + * the same tuple. + */ + Assert(!TransactionIdIsValid(prev_undo_xid) || + TransactionIdEquals(prev_undo_xid, *xid)); + + /* + * If undo tuple is the root tuple inserted by the aborted transaction, + * we don't have to process any further. The tuple is not visible to us. + */ + if (!IsZHeapTupleModified(undo_tup->t_data->t_infomask)) + { + /* before leaving, free the allocated memory */ + pfree(undo_tup); + return NULL; + } + + urec_ptr = prev_urec_ptr; + prev_undo_xid = *xid; + prev_trans_slot_id = trans_slot_id; + + goto fetch_prior_undo_record; + + /* not reachable */ + Assert(0); + return NULL; +} + +/* + * GetTupleFromUndo + * + * Fetch the record from undo and determine if previous version of tuple + * is visible for the given snapshot. If there exists a visible version + * of tuple in undo, then return the same, else return NULL. 
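+ * + * The traversal is iterative: each undo record yields the prior version of + * the tuple via CopyTupleFromUndoRecord, and we loop (goto + * fetch_prior_undo_record) until a version visible to the snapshot is found + * or the root version of the tuple is reached.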
+ * + * During undo chain traversal, we need to ensure that we switch the undo + * chain if the current version of undo tuple is modified by a transaction + * that is different from transaction that has modified the previous version + * of undo tuple. This is primarily done because undo chain for a particular + * tuple is formed based on the transaction id that has modified the tuple. + * + * Also we don't need to process the chain if the latest xid that has changed + * the tuple precedes smallest xid that has undo. + */ +static ZHeapTuple +GetTupleFromUndo(UndoRecPtr urec_ptr, ZHeapTuple zhtup, + Snapshot snapshot, Buffer buffer, + ItemPointer ctid, int trans_slot, + TransactionId prev_undo_xid) +{ + UnpackedUndoRecord *urec; + ZHeapTuple undo_tup; + UndoRecPtr prev_urec_ptr; + TransactionId xid; + CommandId cid; + int undo_oper; + TransactionId oldestXidHavingUndo; + int trans_slot_id; + int prev_trans_slot_id = trans_slot; + + + /* + * tuple is modified after the scan is started, fetch the prior record + * from undo to see if it is visible. + */ +fetch_prior_undo_record: + prev_urec_ptr = InvalidUndoRecPtr; + cid = InvalidCommandId; + undo_oper = -1; + trans_slot_id = InvalidXactSlotId; + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + prev_undo_xid, + NULL, + ZHeapSatisfyUndoRecord); + + /* If undo is discarded, then current tuple is visible. */ + if (urec == NULL) + return zhtup; + + undo_tup = CopyTupleFromUndoRecord(urec, zhtup, &trans_slot_id, &cid, true, + BufferGetPage(buffer)); + prev_urec_ptr = urec->uur_blkprev; + xid = urec->uur_prevxid; + + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. + */ + if (ctid) + { + if (urec->uur_type == UNDO_UPDATE) + *ctid = *((ItemPointer) urec->uur_payload.data); + else + *ctid = undo_tup->t_self; + } + + UndoRecordRelease(urec); + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + return undo_tup; + + /* + * Change the undo chain if the undo tuple is stamped with the different + * transaction. + */ + if (trans_slot_id != prev_trans_slot_id) + { + /* + * It is quite possible that the tuple is showing some valid + * transaction slot, but actual slot has been frozen. This can happen + * when the slot belongs to TPD entry and the corresponding TPD entry + * is pruned. 
+ */ + trans_slot_id = GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &prev_urec_ptr, + true, + true); + } + + if (undo_tup->t_data->t_infomask & ZHEAP_INPLACE_UPDATED) + { + undo_oper = ZHEAP_INPLACE_UPDATED; + } + else if (undo_tup->t_data->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + undo_oper = ZHEAP_XID_LOCK_ONLY; + } + else + { + /* we can't further operate on deleted or non-inplace-updated tuple */ + Assert(!((undo_tup->t_data->t_infomask & ZHEAP_DELETED) || + (undo_tup->t_data->t_infomask & ZHEAP_UPDATED))); + } + + /* + * We need to fetch all the transaction related information from undo + * record for the tuples that point to a slot that gets invalidated for + * reuse at some point of time. See PageFreezeTransSlots. + */ + if (ZHeapTupleHasInvalidXact(undo_tup->t_data->t_infomask)) + { + FetchTransInfoFromUndo(undo_tup, NULL, &xid, &cid, &prev_urec_ptr, false); + } + else if (cid == InvalidCommandId) + { + CommandId cur_cid = GetCurrentCommandId(false); + + /* + * If the current command doesn't need to modify any tuple and the + * snapshot used is not of any previous command, then it can see all the + * modifications made by current transactions till now. So, we don't even + * attempt to fetch CID from undo in such cases. + */ + if (!GetCurrentCommandIdUsed() && cur_cid == snapshot->curcid) + { + cid = InvalidCommandId; + } + else + { + /* + * we don't use prev_undo_xid to fetch the undo record for cid as it is + * required only when transaction is current transaction in which case + * there is no risk of transaction chain switching, so we are safe. It + * might be better to move this check near to it's usage, but that will + * make code look ugly, so keeping it here. + */ + cid = ZHeapTupleGetCid(undo_tup, buffer, prev_urec_ptr, trans_slot_id); + } + } + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + return undo_tup; + + if (undo_oper == ZHEAP_INPLACE_UPDATED || + undo_oper == ZHEAP_XID_LOCK_ONLY) + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (IsMVCCSnapshot(snapshot) && cid >= snapshot->curcid) + { + /* + * Updated after scan started, need to fetch prior tuple + * in undo chain. 
+ */ + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + + goto fetch_prior_undo_record; + } + else + return undo_tup; /* updated before scan started */ + } + else if (IsMVCCSnapshot(snapshot) && XidInMVCCSnapshot(xid, snapshot)) + { + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + + goto fetch_prior_undo_record; + } + else if (!IsMVCCSnapshot(snapshot) && TransactionIdIsInProgress(xid)) + { + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + + goto fetch_prior_undo_record; + } + else if (TransactionIdDidCommit(xid)) + return undo_tup; + else + { + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + + goto fetch_prior_undo_record; + } + } + else /* undo tuple is the root tuple */ + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (IsMVCCSnapshot(snapshot) && cid >= snapshot->curcid) + return NULL; /* inserted after scan started */ + else + return undo_tup; /* inserted before scan started */ + } + else if (IsMVCCSnapshot(snapshot) && XidInMVCCSnapshot(xid, snapshot)) + return NULL; + else if (!IsMVCCSnapshot(snapshot) && TransactionIdIsInProgress(xid)) + return NULL; + else if (TransactionIdDidCommit(xid)) + return undo_tup; + else + return NULL; + } + + /* we should never reach here */ + return NULL; +} + +/* + * GetTupleFromUndoWithOffset + * + * This is similar to GetTupleFromUndo with a difference that it takes + * line offset as an input. This is a special purpose function that + * is written to fetch visible version of deleted tuple that has been + * pruned to a deleted line pointer. + */ +static ZHeapTuple +GetTupleFromUndoWithOffset(UndoRecPtr urec_ptr, Snapshot snapshot, + Buffer buffer, OffsetNumber off, int trans_slot) +{ + UnpackedUndoRecord *urec; + ZHeapTuple undo_tup; + UndoRecPtr prev_urec_ptr = InvalidUndoRecPtr; + TransactionId xid, oldestXidHavingUndo; + CommandId cid = InvalidCommandId; + int trans_slot_id = InvalidXactSlotId; + int prev_trans_slot_id = trans_slot; + + + /* + * tuple is modified after the scan is started, fetch the prior record + * from undo to see if it is visible. + */ + urec = UndoFetchRecord(urec_ptr, + BufferGetBlockNumber(buffer), + off, + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* need to ensure that undo record contains complete tuple */ + Assert(urec->uur_type == UNDO_DELETE || urec->uur_type == UNDO_UPDATE); + undo_tup = CopyTupleFromUndoRecord(urec, NULL, &trans_slot_id, &cid, false, + BufferGetPage(buffer)); + prev_urec_ptr = urec->uur_blkprev; + xid = urec->uur_prevxid; + + UndoRecordRelease(urec); + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + return undo_tup; + + /* + * Change the undo chain if the undo tuple is stamped with the different + * transaction. 
+ */ + if (trans_slot_id != prev_trans_slot_id) + { + trans_slot_id = GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &prev_urec_ptr, + true, + true); + } + + return GetVisibleTupleIfAny(prev_urec_ptr, undo_tup, + snapshot, buffer, xid, trans_slot_id, cid); +} + +/* + * UndoTupleSatisfiesUpdate + * + * Returns true, if there exists a visible version of zhtup in undo, + * false otherwise. + * + * This function returns ctid for the undo tuple which will be always + * same as the ctid of zhtup except for non-in-place update case. + * + * The Undo chain traversal follows similar protocol as mentioned atop + * GetTupleFromUndo. + */ +static bool +UndoTupleSatisfiesUpdate(UndoRecPtr urec_ptr, ZHeapTuple zhtup, + CommandId curcid, Buffer buffer, + ItemPointer ctid, int trans_slot, + TransactionId prev_undo_xid, bool free_zhtup, + bool *in_place_updated_or_locked) +{ + UnpackedUndoRecord *urec; + ZHeapTuple undo_tup; + UndoRecPtr prev_urec_ptr; + TransactionId xid, oldestXidHavingUndo; + CommandId cid; + int trans_slot_id; + int prev_trans_slot_id = trans_slot; + int undo_oper; + bool result; + + /* + * tuple is modified after the scan is started, fetch the prior record + * from undo to see if it is visible. + */ +fetch_prior_undo_record: + undo_tup = NULL; + prev_urec_ptr = InvalidUndoRecPtr; + cid = InvalidCommandId; + trans_slot_id = InvalidXactSlotId; + undo_oper = -1; + result = false; + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + prev_undo_xid, + NULL, + ZHeapSatisfyUndoRecord); + + /* If undo is discarded, then current tuple is visible. */ + if (urec == NULL) + { + result = true; + goto result_available; + } + + undo_tup = CopyTupleFromUndoRecord(urec, zhtup, &trans_slot_id, &cid, + free_zhtup, BufferGetPage(buffer)); + prev_urec_ptr = urec->uur_blkprev; + xid = urec->uur_prevxid; + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. + */ + if (ctid) + { + if (urec->uur_type == UNDO_UPDATE) + *ctid = *((ItemPointer) urec->uur_payload.data); + else + *ctid = undo_tup->t_self; + } + + if (undo_tup->t_data->t_infomask & ZHEAP_INPLACE_UPDATED) + { + undo_oper = ZHEAP_INPLACE_UPDATED; + *in_place_updated_or_locked = true; + } + else if (undo_tup->t_data->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + undo_oper = ZHEAP_XID_LOCK_ONLY; + *in_place_updated_or_locked = true; + } + else + { + /* we can't further operate on deleted or non-inplace-updated tuple */ + Assert(!(undo_tup->t_data->t_infomask & ZHEAP_DELETED) || + !(undo_tup->t_data->t_infomask & ZHEAP_UPDATED)); + } + + UndoRecordRelease(urec); + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + { + result = true; + goto result_available; + } + + /* + * Change the undo chain if the undo tuple is stamped with the different + * transaction slot. + */ + if (trans_slot_id != prev_trans_slot_id) + { + /* + * It is quite possible that the tuple is showing some valid + * transaction slot, but actual slot has been frozen. 
This can happen + * when the slot belongs to TPD entry and the corresponding TPD entry + * is pruned. + */ + trans_slot_id = GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &prev_urec_ptr, + true, + true); + } + + /* + * We need to fetch all the transaction related information from undo + * record for the tuples that point to a slot that gets invalidated for + * reuse at some point of time. See PageFreezeTransSlots. + */ + if (ZHeapTupleHasInvalidXact(undo_tup->t_data->t_infomask)) + { + FetchTransInfoFromUndo(undo_tup, NULL, &xid, &cid, &prev_urec_ptr, false); + } + else if (cid == InvalidCommandId) + { + CommandId cur_comm_cid = GetCurrentCommandId(false); + + /* + * If the current command doesn't need to modify any tuple and the + * snapshot used is not of any previous command, then it can see all the + * modifications made by current transactions till now. So, we don't even + * attempt to fetch CID from undo in such cases. + */ + if (!GetCurrentCommandIdUsed() && cur_comm_cid == curcid) + { + cid = InvalidCommandId; + } + else + { + /* + * we don't use prev_undo_xid to fetch the undo record for cid as it is + * required only when transaction is current transaction in which case + * there is no risk of transaction chain switching, so we are safe. It + * might be better to move this check near to it's usage, but that will + * make code look ugly, so keeping it here. + */ + cid = ZHeapTupleGetCid(undo_tup, buffer, prev_urec_ptr, trans_slot_id); + } + } + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + { + result = true; + goto result_available; + } + + if (undo_oper == ZHEAP_INPLACE_UPDATED || + undo_oper == ZHEAP_XID_LOCK_ONLY) + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid >= curcid) + { + /* + * Updated after scan started, need to fetch prior tuple + * in undo chain. + */ + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + free_zhtup = true; + + goto fetch_prior_undo_record; + } + else + result = true; /* updated before scan started */ + } + else if (TransactionIdIsInProgress(xid)) + { + /* Note the values required to fetch prior tuple in undo chain. */ + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + free_zhtup = true; + + goto fetch_prior_undo_record; + } + else if (TransactionIdDidCommit(xid)) + result = true; + else + { + /* Note the values required to fetch prior tuple in undo chain. 
*/ + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + free_zhtup = true; + + goto fetch_prior_undo_record; + } + } + else /* undo tuple is the root tuple */ + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid >= curcid) + result = false; /* inserted after scan started */ + else + result = true; /* inserted before scan started */ + } + else if (TransactionIdIsInProgress(xid)) + result = false; + else if (TransactionIdDidCommit(xid)) + result = true; + else + result = false; + } + +result_available: + if (undo_tup) + pfree(undo_tup); + return result; +} + +/* + * ZHeapTupleSatisfiesMVCC + * + * Returns the visible version of tuple if any, NULL otherwise. We need to + * traverse undo record chains to determine the visibility of tuple. In + * this function we first need to determine the visibility of the modified + * tuple and if it is not visible, then we need to fetch the prior version + * of the tuple from the undo chain and decide based on its visibility. The undo + * chain needs to be traversed till we reach the root version of the tuple. + * + * Here, we consider the effects of: + * all transactions committed as of the time of the given snapshot + * previous commands of this transaction + * + * Does _not_ include: + * transactions shown as in-progress by the snapshot + * transactions started after the snapshot was taken + * changes made by the current command + * + * The tuple will be considered visible iff the latest operation on the tuple is + * an insert, an in-place update, or a lock, and the transaction that performed + * that operation is the current transaction (and the operation was performed + * by some previous command) or is committed. + * + * We traverse the undo chain to get the visible tuple if any, in case the + * latest transaction that has operated on the tuple is shown as in-progress + * by the snapshot, or started after the snapshot was taken, or is the current + * transaction and the changes were made by the current command. + * + * For aborted transactions, we need to fetch the visible tuple from undo. + * Now, it is possible that the actions corresponding to an aborted transaction + * have been applied, but the xid is still present in the slot; however, we + * should never get such an xid here. + * + * For multilockers, the strongest locker information is always present on + * the tuple. So for updaters, we don't need anything special as the tuple + * visibility will be determined based on the transaction information present + * on the tuple. For the lockers-only case, we need to determine if the original + * inserter is visible to the snapshot. + */ +ZHeapTuple +ZHeapTupleSatisfiesMVCC(ZHeapTuple zhtup, Snapshot snapshot, + Buffer buffer, ItemPointer ctid) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + TransactionId xid; + CommandId *cid; + CommandId cur_cid = GetCurrentCommandId(false); + CommandId tmp_cid; + uint64 epoch_xid; + int trans_slot; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* + * If the current command doesn't need to modify any tuple and the + * snapshot used is not of any previous command, then it can see all the + * modifications made by current transactions till now. So, we don't even + * attempt to fetch CID from undo in such cases.
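+ * (That is the '!GetCurrentCommandIdUsed() && cur_cid == snapshot->curcid' + * test below; when it holds we pass cid = NULL and all of the curcid + * comparisons are skipped.)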
+ */ + if (!GetCurrentCommandIdUsed() && cur_cid == snapshot->curcid) + { + cid = NULL; + } + else + { + cid = &tmp_cid; + *cid = InvalidCommandId; + } + + /* Get transaction info */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot, &epoch_xid, &xid, cid, + &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. Transaction slot can also be considered + * frozen if it belongs to previous epoch. + */ + if (trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return NULL; + + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid && *cid >= snapshot->curcid) + { + /* deleted after scan started, get previous tuple from undo */ + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + } + else + { + /* + * For non-inplace-updates, ctid needs to be retrieved from + * undo record if required. If the tuple is moved to another + * partition, then we don't need ctid. + */ + if (ctid && + tuple->t_infomask & ZHEAP_UPDATED && + !ZHeapTupleIsMoved(tuple->t_infomask)) + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + + return NULL; /* deleted before scan started */ + } + } + else if (XidInMVCCSnapshot(xid, snapshot)) + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + else if (TransactionIdDidCommit(xid)) + { + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. If the tuple is moved to another + * partition, then we don't need ctid. + */ + if (ctid && + !ZHeapTupleIsMoved(tuple->t_infomask) && + tuple->t_infomask & ZHEAP_UPDATED) + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + + return NULL; /* tuple is deleted */ + } + else /* transaction is aborted */ + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + } + else if (tuple->t_infomask & ZHEAP_INPLACE_UPDATED || + tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + /* + * The tuple is updated/locked and must be all visible if the + * transaction slot is cleared or latest xid that has changed the + * tuple precedes smallest xid that has undo. + */ + if (trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; /* tuple is updated */ + + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid && *cid >= snapshot->curcid) + { + /* + * updated/locked after scan started, get previous tuple from + * undo. + */ + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + } + else + return zhtup; /* updated before scan started */ + } + else if (XidInMVCCSnapshot(xid, snapshot)) + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + else if (TransactionIdDidCommit(xid)) + return zhtup; /* tuple is updated */ + else /* transaction is aborted */ + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + } + + /* + * The tuple must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. 
+ */ + if (trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; + + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid && *cid >= snapshot->curcid) + return NULL; /* inserted after scan started */ + else + return zhtup; /* inserted before scan started */ + } + else if (XidInMVCCSnapshot(xid, snapshot)) + return NULL; + else if (TransactionIdDidCommit(xid)) + return zhtup; + else + return NULL; + + return NULL; +} + +/* + * ZHeapGetVisibleTuple + * + * This function is called for tuple that is deleted but not all-visible. It + * returns NULL, if the last transaction that has modified the tuple is + * visible to snapshot or if none of the versions of tuple is visible, + * otherwise visible version tuple if any. + * + * The caller must ensure that it passes the line offset for a tuple that is + * marked as deleted. + */ +ZHeapTuple +ZHeapGetVisibleTuple(OffsetNumber off, Snapshot snapshot, Buffer buffer, bool *all_dead) +{ + Page page; + UndoRecPtr urec_ptr; + TransactionId xid; + CommandId cid; + ItemId lp; + uint64 epoch, epoch_xid; + uint32 tmp_epoch; + int trans_slot; + int vis_info; + + if (all_dead) + *all_dead = false; + + page = BufferGetPage(buffer); + lp = PageGetItemId(page, off); + Assert(ItemIdIsDeleted(lp)); + + trans_slot = ItemIdGetTransactionSlot(lp); + vis_info = ItemIdGetVisibilityInfo(lp); + + /* + * We need to fetch all the transaction related information from undo + * record for the tuples that point to a slot that gets invalidated for + * reuse at some point of time. See PageFreezeTransSlots. + */ +check_trans_slot: + if (trans_slot != ZHTUP_SLOT_FROZEN) + { + if (vis_info & ITEMID_XACT_INVALID) + { + ZHeapTupleData undo_tup; + ItemPointerSetBlockNumber(&undo_tup.t_self, + BufferGetBlockNumber(buffer)); + ItemPointerSetOffsetNumber(&undo_tup.t_self, off); + + /* + * We need undo record pointer to fetch the transaction information + * from undo. + */ + trans_slot = GetTransactionSlotInfo(buffer, off, trans_slot, + &tmp_epoch, &xid, &urec_ptr, + true, false); + /* + * It is quite possible that the tuple is showing some valid + * transaction slot, but actual slot has been frozen. This can happen + * when the slot belongs to TPD entry and the corresponding TPD entry + * is pruned. + */ + if (trans_slot == ZHTUP_SLOT_FROZEN) + goto check_trans_slot; + + xid = InvalidTransactionId; + FetchTransInfoFromUndo(&undo_tup, &epoch, &xid, &cid, &urec_ptr, false); + } + else + { + trans_slot = GetTransactionSlotInfo(buffer, off, trans_slot, + &tmp_epoch, &xid, &urec_ptr, + true, false); + if (trans_slot == ZHTUP_SLOT_FROZEN) + goto check_trans_slot; + + epoch = (uint64) tmp_epoch; + cid = ZHeapPageGetCid(buffer, trans_slot, tmp_epoch, xid, urec_ptr, off); + } + } + else + { + epoch = 0; + xid = InvalidTransactionId; + cid = InvalidCommandId; + urec_ptr = InvalidUndoRecPtr; + } + + epoch_xid = MakeEpochXid(epoch, xid); + + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. Transaction slot can also be considered + * frozen if it belongs to previous epoch. 
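+ * The single 64-bit comparison below works because MakeEpochXid combines the + * epoch and the xid into one value (conceptually, the epoch in the high 32 + * bits and the xid in the low 32 bits), so ordering against + * oldestXidWithEpochHavingUndo stays meaningful across epochs.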
+ */ + if (trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + { + if (all_dead) + *all_dead = true; + return NULL; + } + + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid >= snapshot->curcid) + { + /* deleted after scan started, get previous tuple from undo */ + return GetTupleFromUndoWithOffset(urec_ptr, + snapshot, + buffer, + off, + trans_slot); + } + else + return NULL; /* deleted before scan started */ + } + else if (IsMVCCSnapshot(snapshot) && XidInMVCCSnapshot(xid, snapshot)) + return GetTupleFromUndoWithOffset(urec_ptr, + snapshot, + buffer, + off, + trans_slot); + else if (!IsMVCCSnapshot(snapshot) && TransactionIdIsInProgress(xid)) + return GetTupleFromUndoWithOffset(urec_ptr, + snapshot, + buffer, + off, + trans_slot); + else if (TransactionIdDidCommit(xid)) + return NULL; /* tuple is deleted */ + else /* transaction is aborted */ + return GetTupleFromUndoWithOffset(urec_ptr, + snapshot, + buffer, + off, + trans_slot); + + return NULL; +} + +/* + * ZHeapTupleSatisfiesUpdate + * + * The return values for this API are the same as for HeapTupleSatisfiesUpdate. + * However, there is a notable difference in the way the visibility of tuples + * is determined: we need to traverse undo record chains to determine the + * visibility of a tuple. + * + * For multilockers, the visibility can be determined by the information + * present on the tuple. See ZHeapTupleSatisfiesMVCC. Also, this API returns + * HeapTupleMayBeUpdated if the strongest locker is committed, which means + * the caller needs to take care of waiting for other lockers in such a case. + * + * ctid - returns the ctid of visible tuple if the tuple is either deleted or + * updated. ctid needs to be retrieved from undo tuple. + * trans_slot - returns the transaction slot of the transaction that has + * modified the visible tuple. + * xid - returns the xid that has modified the visible tuple. + * subxid - returns the subtransaction id, if any, that has modified the + * visible tuple. We fetch the subxid from undo only when it is required, + * i.e. when the caller would wait on it to finish. + * cid - returns the cid of visible tuple. + * single_locker_xid - returns the xid of a single in-progress locker, if any. + * single_locker_trans_slot - returns the transaction slot of a single + * in-progress locker, if any. + * lock_allowed - allow caller to lock the tuple if it is in-place updated + * in_place_updated - returns whether the current visible version of tuple is + * updated in place. + */ +HTSU_Result +ZHeapTupleSatisfiesUpdate(Relation rel, ZHeapTuple zhtup, CommandId curcid, + Buffer buffer, ItemPointer ctid, int *trans_slot, + TransactionId *xid, SubTransactionId *subxid, + CommandId *cid, TransactionId *single_locker_xid, + int *single_locker_trans_slot, bool free_zhtup, + bool lock_allowed, Snapshot snapshot, + bool *in_place_updated_or_locked) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + uint64 epoch_xid; + CommandId cur_comm_cid = GetCurrentCommandId(false); + bool visible; + + *single_locker_xid = InvalidTransactionId; + *single_locker_trans_slot = InvalidXactSlotId; + *in_place_updated_or_locked = false; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* + * If the current command doesn't need to modify any tuple and the + * snapshot used is not of any previous command, then it can see all the + * modifications made by current transactions till now.
So, we don't even + * attempt to fetch CID from undo in such cases. + */ + if (!GetCurrentCommandIdUsed() && cur_comm_cid == curcid) + { + cid = NULL; + } + + /* Get transaction info */ + ZHeapTupleGetTransInfo(zhtup, buffer, trans_slot, &epoch_xid, xid, cid, + &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted or non-inplace-updated and must be all visible + * if the transaction slot is cleared or latest xid that has changed + * the tuple precedes smallest xid that has undo. However, that is + * not possible at this stage as the tuple has already passed snapshot + * check. + */ + Assert(!(*trans_slot == ZHTUP_SLOT_FROZEN && + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo))); + + if (TransactionIdIsCurrentTransactionId(*xid)) + { + if (cid && *cid >= curcid) + { + /* deleted after scan started, check previous tuple from undo */ + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + if (visible) + return HeapTupleSelfUpdated; + else + return HeapTupleInvisible; + } + else + return HeapTupleInvisible; /* deleted before scan started */ + } + else if (TransactionIdIsInProgress(*xid)) + { + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + + if (visible) + { + if (subxid) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, subxid); + + return HeapTupleBeingUpdated; + } + else + return HeapTupleInvisible; + } + else if (TransactionIdDidCommit(*xid)) + { + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. If the tuple is moved to another + * partition, then we don't need ctid. + */ + if (ctid && + !ZHeapTupleIsMoved(tuple->t_infomask) && + tuple->t_infomask & ZHEAP_UPDATED) + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + + /* tuple is deleted or non-inplace-updated */ + return HeapTupleUpdated; + } + else /* transaction is aborted */ + { + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + + /* + * If updating transaction id is aborted and the tuple is visible + * then return HeapTupleBeingUpdated, so that caller can apply the + * undo before modifying the page. Here, we don't need to fetch + * subtransaction id as it is only possible for top-level xid to + * have pending undo actions. + */ + if (visible) + return HeapTupleBeingUpdated; + else + return HeapTupleInvisible; + } + } + else if (tuple->t_infomask & ZHEAP_INPLACE_UPDATED || + tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + *in_place_updated_or_locked = true; + + /* + * The tuple is updated/locked and must be all visible if the + * transaction slot is cleared or latest xid that has touched the + * tuple precedes smallest xid that has undo. If there is a single + * locker on the tuple, then we fetch the lockers transaction info + * from undo as we never store lockers slot on tuple. See + * compute_new_xid_infomask for more details about lockers. 
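+ * GetLockerTransInfo below performs that lookup; if an in-progress or + * aborted single locker is found we report HeapTupleBeingUpdated so that + * the caller can resolve the lock conflict (or apply pending undo) first.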
+ */ + if (*trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + { + bool found = false; + + if (ZHEAP_XID_IS_LOCKED_ONLY(tuple->t_infomask) && + !ZHeapTupleHasMultiLockers(tuple->t_infomask)) + found = GetLockerTransInfo(rel, zhtup, buffer, single_locker_trans_slot, + NULL, single_locker_xid, NULL, NULL); + if (!found) + return HeapTupleMayBeUpdated; + else + { + /* + * If there is a single locker in-progress/aborted locker, + * it's safe to return being updated so that the caller + * check for lock conflicts or perform rollback if necessary. + * + * If the single locker is our current transaction, then also + * we return beging updated. + */ + return HeapTupleBeingUpdated; + } + } + + if (TransactionIdIsCurrentTransactionId(*xid)) + { + if (cid && *cid >= curcid) + { + /* + * updated/locked after scan started, check previous tuple + * from undo + */ + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + if (visible) + { + if (ZHEAP_XID_IS_LOCKED_ONLY(tuple->t_infomask)) + return HeapTupleBeingUpdated; + else + return HeapTupleSelfUpdated; + } + } + else + { + if (ZHEAP_XID_IS_LOCKED_ONLY(tuple->t_infomask)) + { + /* + * Locked before scan; caller can check if it is locked + * in lock mode higher or equal to the required mode, then + * it can skip locking the tuple. + */ + return HeapTupleBeingUpdated; + } + else + return HeapTupleMayBeUpdated; /* updated before scan started */ + } + } + else if (TransactionIdIsInProgress(*xid)) + { + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + + if (visible) + { + if (subxid) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, subxid); + + return HeapTupleBeingUpdated; + } + else + return HeapTupleInvisible; + } + else if (TransactionIdDidCommit(*xid)) + { + /* if tuple is updated and not in our snapshot, then allow to update it. */ + if (lock_allowed || !XidInMVCCSnapshot(*xid, snapshot)) + return HeapTupleMayBeUpdated; + else + return HeapTupleUpdated; + } + else /* transaction is aborted */ + { + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + + /* + * If updating transaction id is aborted and the tuple is visible + * then return HeapTupleBeingUpdated, so that caller can apply the + * undo before modifying the page. Here, we don't need to fetch + * subtransaction id as it is only possible for top-level xid to + * have pending undo actions. + */ + if (visible) + return HeapTupleBeingUpdated; + else + return HeapTupleInvisible; + } + } + + /* + * The tuple must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. 
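+	 *
+	 * As a rough sketch (the exact composition of epoch_xid is an assumption
+	 * here, not something this comment spells out): the 64-bit value packs
+	 * the epoch into the high 32 bits and the xid into the low 32 bits, e.g.
+	 *
+	 *		uint64		epoch_xid = ((uint64) epoch << 32) | xid;
+	 *
+	 * so the unsigned comparison against oldestXidWithEpochHavingUndo below
+	 * stays correct across xid wraparound.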
+ */ + if (*trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return HeapTupleMayBeUpdated; + + if (TransactionIdIsCurrentTransactionId(*xid)) + { + if (cid && *cid >= curcid) + return HeapTupleInvisible; /* inserted after scan started */ + else + return HeapTupleMayBeUpdated; /* inserted before scan started */ + } + else if (TransactionIdIsInProgress(*xid)) + return HeapTupleInvisible; + else if (TransactionIdDidCommit(*xid)) + return HeapTupleMayBeUpdated; + else + return HeapTupleInvisible; + + return HeapTupleInvisible; +} + +/* + * ZHeapTupleIsSurelyDead + * + * Similar to HeapTupleIsSurelyDead, but for zheap tuples. + */ +bool +ZHeapTupleIsSurelyDead(ZHeapTuple zhtup, uint64 OldestXmin, Buffer buffer) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + TransactionId xid; + uint64 epoch_xid; + int trans_slot_id; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* Get transaction id */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot_id, &epoch_xid, &xid, NULL, + NULL, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || epoch_xid < OldestXmin) + return true; + } + + return false; /* Tuple is still alive */ +} + +/* + * ZHeapTupleSatisfiesSelf + * Returns the visible version of tuple (including effects of previous + * commands in current transactions) if any, NULL otherwise. + * + * Here, we consider the effects of: + * all committed transactions (as of the current instant) + * previous commands of this transaction + * changes made by the current command + * + * The tuple will be considered visible iff: + * Latest operation on tuple is Insert, In-Place update or tuple is + * locked and the transaction that has performed operation is current + * transaction or is committed. + * + * If the transaction is in progress, then we fetch the tuple from undo. + */ +ZHeapTuple +ZHeapTupleSatisfiesSelf(ZHeapTuple zhtup, Snapshot snapshot, + Buffer buffer, ItemPointer ctid) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + TransactionId xid; + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + uint64 epoch_xid; + int trans_slot_id; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* Get transaction id */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot_id, &epoch_xid, &xid, + NULL, &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. 
+ */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return NULL; + + if (TransactionIdIsCurrentTransactionId(xid)) + return NULL; + else if (TransactionIdIsInProgress(xid)) + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot_id, + InvalidTransactionId); + else if (TransactionIdDidCommit(xid)) + { + /* tuple is deleted or non-inplace-updated */ + return NULL; + } + else /* transaction is aborted */ + { + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot_id, + InvalidTransactionId); + } + } + else if (tuple->t_infomask & ZHEAP_INPLACE_UPDATED || + tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + /* + * The tuple is updated/locked and must be all visible if the + * transaction slot is cleared or latest xid that has changed the + * tuple precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; + + if (TransactionIdIsCurrentTransactionId(xid)) + { + return zhtup; + } + else if (TransactionIdIsInProgress(xid)) + { + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot_id, + InvalidTransactionId); + } + else if (TransactionIdDidCommit(xid)) + { + return zhtup; + } + else /* transaction is aborted */ + { + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot_id, + InvalidTransactionId); + } + } + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple precedes smallest xid that has + * undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; + + if (TransactionIdIsCurrentTransactionId(xid)) + return zhtup; + else if (TransactionIdIsInProgress(xid)) + { + return NULL; + } + else if (TransactionIdDidCommit(xid)) + return zhtup; + else + { + /* Inserting transaction is aborted. */ + return NULL; + } + + return NULL; +} + +/* + * ZHeapTupleSatisfiesDirty + * Returns the visible version of tuple (including effects of open + * transactions) if any, NULL otherwise. + * + * Here, we consider the effects of: + * all committed and in-progress transactions (as of the current instant) + * previous commands of this transaction + * changes made by the current command + * + * This is essentially like ZHeapTupleSatisfiesSelf as far as effects of + * the current transaction and committed/aborted xacts are concerned. + * However, we also include the effects of other xacts still in progress. + * + * The tuple will be considered visible iff: + * (a) Latest operation on tuple is Delete or non-inplace-update and the + * current transaction is in progress. + * (b) Latest operation on tuple is Insert, In-Place update or tuple is + * locked and the transaction that has performed operation is current + * transaction or is in-progress or is committed. 
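+ *
+ * As an illustrative sketch only (variable names are hypothetical and real
+ * callers do more bookkeeping), code that must wait out a concurrent writer
+ * typically consumes the snapshot fields this routine fills in:
+ *
+ *		SnapshotData DirtySnapshot;
+ *
+ *		InitDirtySnapshot(DirtySnapshot);
+ *		tup = ZHeapTupleSatisfiesDirty(zhtup, &DirtySnapshot, buf, NULL);
+ *		if (tup != NULL && TransactionIdIsValid(DirtySnapshot.xmax))
+ *			XactLockTableWait(DirtySnapshot.xmax, rel, &tup->t_self,
+ *							  XLTW_Update);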
+ */ +ZHeapTuple +ZHeapTupleSatisfiesDirty(ZHeapTuple zhtup, Snapshot snapshot, + Buffer buffer, ItemPointer ctid) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + TransactionId xid; + uint64 epoch_xid; + UndoRecPtr urec_ptr; + int trans_slot_id; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + snapshot->xmin = snapshot->xmax = InvalidTransactionId; + snapshot->subxid = InvalidSubTransactionId; + snapshot->speculativeToken = 0; + + /* Get transaction id */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot_id, &epoch_xid, &xid, NULL, + &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return NULL; + + if (TransactionIdIsCurrentTransactionId(xid)) + { + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. If the tuple is moved to another + * partition, then we don't need ctid. + */ + if (ctid && + !ZHeapTupleIsMoved(tuple->t_infomask) && + tuple->t_infomask & ZHEAP_UPDATED) + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + return NULL; + } + else if (TransactionIdIsInProgress(xid)) + { + snapshot->xmax = xid; + if (UndoRecPtrIsValid(urec_ptr)) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, &snapshot->subxid); + return zhtup; /* in deletion by other */ + } + else if (TransactionIdDidCommit(xid)) + { + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. If the tuple is moved to another + * partition, then we don't need ctid. + */ + if (ctid && + !ZHeapTupleIsMoved(tuple->t_infomask) && + tuple->t_infomask & ZHEAP_UPDATED) + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + + /* tuple is deleted or non-inplace-updated */ + return NULL; + } + else /* transaction is aborted */ + { + return GetTupleFromUndo(urec_ptr, zhtup, snapshot, buffer, ctid, + trans_slot_id, InvalidTransactionId); + } + } + else if (tuple->t_infomask & ZHEAP_INPLACE_UPDATED || + tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + /* + * The tuple is updated/locked and must be all visible if the + * transaction slot is cleared or latest xid that has changed the + * tuple precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; /* tuple is updated */ + + if (TransactionIdIsCurrentTransactionId(xid)) + return zhtup; + else if (TransactionIdIsInProgress(xid)) + { + if (!ZHEAP_XID_IS_LOCKED_ONLY(tuple->t_infomask)) + { + snapshot->xmax = xid; + if (UndoRecPtrIsValid(urec_ptr)) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, &snapshot->subxid); + } + return zhtup; /* being updated */ + } + else if (TransactionIdDidCommit(xid)) + return zhtup; /* tuple is updated by someone else */ + else /* transaction is aborted */ + { + /* Here we need to fetch the tuple from undo. */ + return GetTupleFromUndo(urec_ptr, zhtup, snapshot, buffer, ctid, + trans_slot_id, InvalidTransactionId); + } + } + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple precedes smallest xid that has + * undo. 
+ */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; + + if (TransactionIdIsCurrentTransactionId(xid)) + return zhtup; + else if (TransactionIdIsInProgress(xid)) + { + /* Return the speculative token to caller. */ + if (ZHeapTupleHeaderIsSpeculative(tuple)) + { + ZHeapTupleGetSpecToken(zhtup, buffer, urec_ptr, + &snapshot->speculativeToken); + + Assert(snapshot->speculativeToken != 0); + } + + snapshot->xmin = xid; + if (UndoRecPtrIsValid(urec_ptr)) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, &snapshot->subxid); + return zhtup; /* in insertion by other */ + } + else if (TransactionIdDidCommit(xid)) + return zhtup; + else + { + /* + * Since the transaction that inserted the tuple is aborted. So, it's + * not visible to any transaction. + */ + return NULL; + } + + return NULL; +} + +/* + * ZHeapTupleSatisfiesAny + * Dummy "satisfies" routine: any tuple satisfies SnapshotAny. + */ +ZHeapTuple +ZHeapTupleSatisfiesAny(ZHeapTuple zhtup, Snapshot snapshot, Buffer buffer, + ItemPointer ctid) +{ + /* Callers can expect ctid to be populated. */ + if (ctid && + !ZHeapTupleIsMoved(zhtup->t_data->t_infomask) && + ZHeapTupleIsUpdated(zhtup->t_data->t_infomask)) + { + UndoRecPtr urec_ptr; + int out_slot_no PG_USED_FOR_ASSERTS_ONLY; + + out_slot_no = GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&zhtup->t_self), + ZHeapTupleHeaderGetXactSlot(zhtup->t_data), + NULL, + NULL, + &urec_ptr, + true, + false); + /* + * We always expect non-frozen transaction slot here as the caller tries + * to fetch the ctid of tuples that are visible to the snapshot, so + * corresponding undo record can't be discarded. + */ + Assert(out_slot_no != ZHTUP_SLOT_FROZEN); + + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + } + return zhtup; +} + +/* + * ZHeapTupleSatisfiesOldestXmin + * The tuple will be considered visible if it is visible to any open + * transaction. + * + * ztuple is an input/output parameter. The caller must send the palloc'ed + * data. This function can get a tuple from undo to return in which case it + * will free the memory passed by the caller. + * + * xid is an output parameter. It is set to the latest committed/in-progress + * xid that inserted/modified the tuple. + * If the latest transaction for the tuple aborted, we fetch a prior committed + * version of the tuple and return the prior comitted xid and status as + * HEAPTUPLE_LIVE. + * If the latest transaction for the tuple aborted and it also inserted + * the tuple, we return the aborted transaction id and status as + * HEAPTUPLE_DEAD. In this case, the caller *should* never mark the + * corresponding item id as dead. Because, when undo action for the same will + * be performed, we need the item pointer. 
+ */ +HTSV_Result +ZHeapTupleSatisfiesOldestXmin(ZHeapTuple *ztuple, TransactionId OldestXmin, + Buffer buffer, TransactionId *xid, + SubTransactionId *subxid) +{ + ZHeapTuple zhtup = *ztuple; + ZHeapTupleHeader tuple = zhtup->t_data; + UndoRecPtr urec_ptr; + uint64 epoch_xid; + int trans_slot_id; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* Get transaction id */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot_id, &epoch_xid, xid, NULL, + &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return HEAPTUPLE_DEAD; + + if (TransactionIdIsCurrentTransactionId(*xid)) + return HEAPTUPLE_DELETE_IN_PROGRESS; + else if (TransactionIdIsInProgress(*xid)) + { + /* Get Sub transaction id */ + if (subxid) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, subxid); + + return HEAPTUPLE_DELETE_IN_PROGRESS; + } + else if (TransactionIdDidCommit(*xid)) + { + /* + * Deleter committed, but perhaps it was recent enough that some open + * transactions could still see the tuple. + */ + if (!TransactionIdPrecedes(*xid, OldestXmin)) + return HEAPTUPLE_RECENTLY_DEAD; + + /* Otherwise, it's dead and removable */ + return HEAPTUPLE_DEAD; + } + else /* transaction is aborted */ + { + /* + * For aborted transactions, we need to fetch the tuple from undo + * chain. + */ + *ztuple = GetTupleFromUndoForAbortedXact(urec_ptr, buffer, + trans_slot_id, zhtup, xid); + if (*ztuple != NULL) + return HEAPTUPLE_LIVE; + else + { + /* + * If the transaction that inserted the tuple got aborted, + * we should return the aborted transaction id. + */ + return HEAPTUPLE_DEAD; + } + } + } + else if (tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + /* + * We can't take any decision if the tuple is marked as locked-only. + * It's possible that inserted transaction took a lock on the tuple + * Later, if it rolled back, we should return HEAPTUPLE_DEAD, or if + * it's still in progress, we should return HEAPTUPLE_INSERT_IN_PROGRESS. + * Similarly, if the inserted transaction got committed, we should return + * HEAPTUPLE_LIVE. + * The subsequent checks already takes care of all these possible + * scenarios, so we don't need any extra checks here. + */ + } + + /* The tuple is either a newly inserted tuple or is in-place updated. */ + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple precedes smallest xid that has + * undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return HEAPTUPLE_LIVE; + + if (TransactionIdIsCurrentTransactionId(*xid)) + return HEAPTUPLE_INSERT_IN_PROGRESS; + else if (TransactionIdIsInProgress(*xid)) + { + /* Get Sub transaction id */ + if (subxid) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, subxid); + return HEAPTUPLE_INSERT_IN_PROGRESS; /* in insertion by other */ + } + else if (TransactionIdDidCommit(*xid)) + return HEAPTUPLE_LIVE; + else /* transaction is aborted */ + { + if (tuple->t_infomask & ZHEAP_INPLACE_UPDATED) + { + /* + * For aborted transactions, we need to fetch the tuple from undo + * chain. 
+ */ + *ztuple = GetTupleFromUndoForAbortedXact(urec_ptr, buffer, + trans_slot_id, zhtup, xid); + if (*ztuple != NULL) + return HEAPTUPLE_LIVE; + } + /* + * If the transaction that inserted the tuple got aborted, we should + * return the aborted transaction id. + */ + return HEAPTUPLE_DEAD; + } + + return HEAPTUPLE_LIVE; +} + +/* + * ZHeapTupleSatisfiesNonVacuumable + * + * True if tuple might be visible to some transaction; false if it's + * surely dead to everyone, ie, vacuumable. + * + * This is an interface to ZHeapTupleSatisfiesOldestXmin that meets the + * SnapshotSatisfiesFunc API, so it can be used through a Snapshot. + * snapshot->xmin must have been set up with the xmin horizon to use. + */ +ZHeapTuple +ZHeapTupleSatisfiesNonVacuumable(ZHeapTuple ztup, Snapshot snapshot, + Buffer buffer, ItemPointer ctid) +{ + TransactionId xid; + + return (ZHeapTupleSatisfiesOldestXmin(&ztup, snapshot->xmin, buffer, &xid, NULL) + != HEAPTUPLE_DEAD) ? ztup : NULL; +} + +/* + * ZHeapTupleSatisfiesVacuum + * Similar to ZHeapTupleSatisfiesOldestXmin, but it behaves differently for + * handling aborted transaction. + * + * For aborted transactions, we don't fetch any prior committed version of the + * tuple. Instead, we return ZHEAPTUPLE_ABORT_IN_PROGRESS and return the aborted + * xid. The caller should avoid such tuple for any kind of prunning/vacuuming. + */ +ZHTSV_Result +ZHeapTupleSatisfiesVacuum(ZHeapTuple zhtup, TransactionId OldestXmin, + Buffer buffer, TransactionId *xid) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + UndoRecPtr urec_ptr; + uint64 epoch_xid; + int trans_slot_id; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* Get transaction id */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot_id, &epoch_xid, xid, NULL, + &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return ZHEAPTUPLE_DEAD; + + if (TransactionIdIsCurrentTransactionId(*xid)) + return ZHEAPTUPLE_DELETE_IN_PROGRESS; + else if (TransactionIdIsInProgress(*xid)) + { + return ZHEAPTUPLE_DELETE_IN_PROGRESS; + } + else if (TransactionIdDidCommit(*xid)) + { + /* + * Deleter committed, but perhaps it was recent enough that some open + * transactions could still see the tuple. + */ + if (!TransactionIdPrecedes(*xid, OldestXmin)) + return ZHEAPTUPLE_RECENTLY_DEAD; + + /* Otherwise, it's dead and removable */ + return ZHEAPTUPLE_DEAD; + } + else /* transaction is aborted */ + { + return ZHEAPTUPLE_ABORT_IN_PROGRESS; + } + } + else if (tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + /* + * "Deleting" xact really only locked it, so the tuple is live in any + * case. + */ + return ZHEAPTUPLE_LIVE; + } + + /* The tuple is either a newly inserted tuple or is in-place updated. */ + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple precedes smallest xid that has + * undo. 
+	 */
+	if (trans_slot_id == ZHTUP_SLOT_FROZEN ||
+		epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo))
+		return ZHEAPTUPLE_LIVE;
+
+	if (TransactionIdIsCurrentTransactionId(*xid))
+		return ZHEAPTUPLE_INSERT_IN_PROGRESS;
+	else if (TransactionIdIsInProgress(*xid))
+		return ZHEAPTUPLE_INSERT_IN_PROGRESS;	/* in insertion by other */
+	else if (TransactionIdDidCommit(*xid))
+		return ZHEAPTUPLE_LIVE;
+	else	/* transaction is aborted */
+	{
+		return ZHEAPTUPLE_ABORT_IN_PROGRESS;
+	}
+
+	return ZHEAPTUPLE_LIVE;
+}
+
+/*
+ * ZHeapTupleSatisfiesToast
+ *
+ * True iff zheap tuple is valid as a TOAST row.
+ *
+ * Unlike heap, we don't need checks for VACUUM moving conditions as those
+ * are for pre-9.0 and don't apply to zheap.  For aborted speculative
+ * inserts, we always mark the row as dead, so we don't need any check for
+ * that.  So, here we can rely on the fact that if you can see the main
+ * table row that contains a TOAST reference, you should be able to see the
+ * TOASTed value.
+ */
+ZHeapTuple
+ZHeapTupleSatisfiesToast(ZHeapTuple zhtup, Snapshot snapshot,
+						 Buffer buffer, ItemPointer ctid)
+{
+	Assert(ItemPointerIsValid(&zhtup->t_self));
+	Assert(zhtup->t_tableOid != InvalidOid);
+
+	return zhtup;
+}
+
+
+ZHeapTuple
+ZHeapTupleSatisfies(ZHeapTuple stup, Snapshot snapshot, Buffer buffer, ItemPointer ctid)
+{
+	switch (snapshot->visibility_type)
+	{
+		case MVCC_VISIBILITY:
+			return ZHeapTupleSatisfiesMVCC(stup, snapshot, buffer, ctid);
+			break;
+		case SELF_VISIBILITY:
+			return ZHeapTupleSatisfiesSelf(stup, snapshot, buffer, ctid);
+			break;
+		case ANY_VISIBILITY:
+			return ZHeapTupleSatisfiesAny(stup, snapshot, buffer, ctid);
+			break;
+		case TOAST_VISIBILITY:
+			return ZHeapTupleSatisfiesToast(stup, snapshot, buffer, ctid);
+			break;
+		case DIRTY_VISIBILITY:
+			return ZHeapTupleSatisfiesDirty(stup, snapshot, buffer, ctid);
+			break;
+		case HISTORIC_MVCC_VISIBILITY:
+			// ZBORKED: need a better error message
+			elog(PANIC, "unimplemented");
+			break;
+		case NON_VACUUMABLE_VISIBILTY:
+			return ZHeapTupleSatisfiesNonVacuumable(stup, snapshot, buffer, ctid);
+			break;
+		default:
+			Assert(0);
+			break;
+	}
+}
+
+/*
+ * This is a helper function for CheckForSerializableConflictOut.
+ *
+ * Check to see whether the tuple has been written to by a concurrent
+ * transaction, either to create it not visible to us, or to delete it
+ * while it is visible to us.  The "visible" bool indicates whether the
+ * tuple is visible to us, while ZHeapTupleSatisfiesOldestXmin checks what
+ * else is going on with it.  The caller should have a share lock on the
+ * buffer.
+ */
+bool
+ZHeapTupleHasSerializableConflictOut(bool visible, Relation relation,
+									 ItemPointer tid, Buffer buffer,
+									 TransactionId *xid)
+{
+	HTSV_Result htsvResult;
+	ItemId		lp;
+	OffsetNumber offnum;
+	Page		dp;
+	ZHeapTuple	tuple;
+	Size		tuple_len;
+	bool		tuple_inplace_updated = false;
+	Snapshot	snap;
+
+	Assert(ItemPointerGetBlockNumber(tid) == BufferGetBlockNumber(buffer));
+	offnum = ItemPointerGetOffsetNumber(tid);
+	dp = BufferGetPage(buffer);
+
+	/* check for bogus TID */
+	Assert (offnum >= FirstOffsetNumber &&
+			offnum <= PageGetMaxOffsetNumber(dp));
+
+	lp = PageGetItemId(dp, offnum);
+
+	/* check for unused or dead items */
+	Assert (ItemIdIsNormal(lp) || ItemIdIsDeleted(lp));
+
+	/*
+	 * If the record is deleted and pruned, its place in the page might have
+	 * been taken by another of its kind.
+	 */
+	if (ItemIdIsDeleted(lp))
+	{
+		/*
+		 * If the tuple is still visible to us, then we have a conflict,
+		 * because the transaction that deleted the tuple has already
+		 * committed.
+		 */
+		if (visible)
+		{
+			snap = GetTransactionSnapshot();
+			tuple = ZHeapGetVisibleTuple(offnum, snap, buffer, NULL);
+			ZHeapTupleGetTransInfo(tuple, buffer, NULL, NULL, xid,
+								   NULL, NULL, false);
+			pfree(tuple);
+			return true;
+		}
+		else
+			return false;
+	}
+
+	tuple_len = ItemIdGetLength(lp);
+	tuple = palloc(ZHEAPTUPLESIZE + tuple_len);
+	tuple->t_data = (ZHeapTupleHeader) ((char *) tuple + ZHEAPTUPLESIZE);
+	tuple->t_tableOid = RelationGetRelid(relation);
+	tuple->t_len = tuple_len;
+	ItemPointerSet(&tuple->t_self, ItemPointerGetBlockNumber(tid), offnum);
+	memcpy(tuple->t_data,
+		   ((ZHeapTupleHeader) PageGetItem((Page) dp, lp)), tuple_len);
+
+	if (tuple->t_data->t_infomask & ZHEAP_INPLACE_UPDATED)
+		tuple_inplace_updated = true;
+
+	htsvResult = ZHeapTupleSatisfiesOldestXmin(&tuple, TransactionXmin, buffer, xid, NULL);
+	pfree(tuple);
+	switch (htsvResult)
+	{
+		case HEAPTUPLE_LIVE:
+			if (tuple_inplace_updated)
+			{
+				/*
+				 * We can't rely on the caller's visibility information for
+				 * in-place updated tuples because it considers the tuple as
+				 * visible if any version of the tuple is visible, whereas we
+				 * want to know the status of the current tuple.  In case of
+				 * aborted transactions, it is quite possible that the
+				 * rollback actions aren't yet applied and we need the status
+				 * of the last committed transaction;
+				 * ZHeapTupleSatisfiesOldestXmin returns us that information.
+				 */
+				if (XidIsConcurrent(*xid))
+					visible = false;
+			}
+			if (visible)
+				return false;
+			break;
+		case HEAPTUPLE_RECENTLY_DEAD:
+			if (!visible)
+				return false;
+			break;
+		case HEAPTUPLE_DELETE_IN_PROGRESS:
+			break;
+		case HEAPTUPLE_INSERT_IN_PROGRESS:
+			break;
+		case HEAPTUPLE_DEAD:
+			return false;
+		default:
+
+			/*
+			 * The only way to get to this default clause is if a new value is
+			 * added to the enum type without adding it to this switch
+			 * statement.  That's a bug, so elog.
+			 */
+			elog(ERROR, "unrecognized return value from ZHeapTupleSatisfiesOldestXmin: %u", htsvResult);
+
+			/*
+			 * In spite of having all enum values covered and calling elog on
+			 * this default, some compilers think this is a code path which
+			 * allows xid to be used below without initialization.  Silence
+			 * that warning.
+			 */
+			*xid = InvalidTransactionId;
+	}
+	Assert(TransactionIdIsValid(*xid));
+	Assert(TransactionIdFollowsOrEquals(*xid, TransactionXmin));
+
+	/*
+	 * Find top level xid.  Bail out if xid is too early to be a conflict, or
+	 * if it's our own xid.
+	 */
+	if (TransactionIdEquals(*xid, GetTopTransactionIdIfAny()))
+		return false;
+	if (TransactionIdPrecedes(*xid, TransactionXmin))
+		return false;
+
+	return true;
+}
diff --git a/src/backend/access/zheap/ztuptoaster.c b/src/backend/access/zheap/ztuptoaster.c
new file mode 100644
index 0000000000..2ce75f387d
--- /dev/null
+++ b/src/backend/access/zheap/ztuptoaster.c
@@ -0,0 +1,990 @@
+/*-------------------------------------------------------------------------
+ *
+ * ztuptoaster.c
+ *	  Support routines for external and compressed storage of
+ *	  variable size attributes.
+ *
+ * This file shares most of its functionality with tuptoaster.c, except
+ * that the tuples are in zheap format and are stored in the zheap storage
+ * engine.  Even if we keep this as separate code, the common parts still
+ * need to be extracted.
+ *
+ * The benefit of storing toast data in zheap is that it avoids bloat in
+ * toast storage.
The tuple space can be immediately reclaimed once the + * deleting transaction is committed. + * + * Copyright (c) 2000-2018, PostgreSQL Global Development Group + * + * + * IDENTIFICATION + * src/backend/access/heap/ztuptoaster.c + * + * + * INTERFACE ROUTINES + * ztoast_insert_or_update - + * Try to make a given tuple fit into one page by compressing + * or moving off attributes + * + * ztoast_delete - + * Reclaim toast storage when a tuple is deleted + * + * + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include +#include + +#include "access/genam.h" +#include "access/heapam.h" +#include "access/tuptoaster.h" +#include "access/xact.h" +#include "catalog/catalog.h" +#include "common/pg_lzcompress.h" +#include "miscadmin.h" +#include "utils/expandeddatum.h" +#include "utils/fmgroids.h" +#include "utils/rel.h" +#include "utils/snapmgr.h" +#include "utils/typcache.h" +#include "utils/tqual.h" +#include "access/zheap.h" +#include "access/zheaputils.h" + +static void ztoast_delete_datum(Relation rel, Datum value, bool is_speculative); +static Datum ztoast_save_datum(Relation rel, Datum value, + struct varlena *oldexternal, int options); + +/* ---------- + * ztoast_insert_or_update - + * Just like toast_insert_or_update but for zheap relations. + */ + +ZHeapTuple +ztoast_insert_or_update(Relation rel, ZHeapTuple newtup, ZHeapTuple oldtup, + int options) +{ + ZHeapTuple result_tuple; + TupleDesc tupleDesc; + int numAttrs; + int i; + + bool need_change = false; + bool need_free = false; + bool need_delold = false; + bool has_nulls = false; + + Size maxDataLen; + Size hoff; + + char toast_action[MaxHeapAttributeNumber]; + bool toast_isnull[MaxHeapAttributeNumber]; + bool toast_oldisnull[MaxHeapAttributeNumber]; + Datum toast_values[MaxHeapAttributeNumber]; + Datum toast_oldvalues[MaxHeapAttributeNumber]; + struct varlena *toast_oldexternal[MaxHeapAttributeNumber]; + int32 toast_sizes[MaxHeapAttributeNumber]; + bool toast_free[MaxHeapAttributeNumber]; + bool toast_delold[MaxHeapAttributeNumber]; + + /* + * Ignore the INSERT_SPECULATIVE option. Speculative insertions/super + * deletions just normally insert/delete the toast values. It seems + * easiest to deal with that here, instead on, potentially, multiple + * callers. + */ + options &= ~HEAP_INSERT_SPECULATIVE; + + /* + * We should only ever be called for tuples of plain relations or + * materialized views --- recursing on a toast rel is bad news. + */ + Assert(rel->rd_rel->relkind == RELKIND_RELATION || + rel->rd_rel->relkind == RELKIND_MATVIEW); + + /* + * Get the tuple descriptor and break down the tuple(s) into fields. + */ + tupleDesc = rel->rd_att; + numAttrs = tupleDesc->natts; + + Assert(numAttrs <= MaxHeapAttributeNumber); + zheap_deform_tuple(newtup, tupleDesc, toast_values, toast_isnull); + if (oldtup != NULL) + zheap_deform_tuple(oldtup, tupleDesc, toast_oldvalues, toast_oldisnull); + + /* ---------- + * Then collect information about the values given + * + * NOTE: toast_action[i] can have these values: + * ' ' default handling + * 'p' already processed --- don't touch it + * 'x' incompressible, but OK to move off + * + * NOTE: toast_sizes[i] is only made valid for varlena attributes with + * toast_action[i] different from 'p'. 
+ * ---------- + */ + memset(toast_action, ' ', numAttrs * sizeof(char)); + memset(toast_oldexternal, 0, numAttrs * sizeof(struct varlena *)); + memset(toast_free, 0, numAttrs * sizeof(bool)); + memset(toast_delold, 0, numAttrs * sizeof(bool)); + + for (i = 0; i < numAttrs; i++) + { + Form_pg_attribute att = TupleDescAttr(tupleDesc, i); + struct varlena *old_value; + struct varlena *new_value; + + if (oldtup != NULL) + { + /* + * For UPDATE get the old and new values of this attribute + */ + old_value = (struct varlena *) DatumGetPointer(toast_oldvalues[i]); + new_value = (struct varlena *) DatumGetPointer(toast_values[i]); + + /* + * If the old value is stored on disk, check if it has changed so + * we have to delete it later. + */ + if (att->attlen == -1 && !toast_oldisnull[i] && + VARATT_IS_EXTERNAL_ONDISK(old_value)) + { + if (toast_isnull[i] || !VARATT_IS_EXTERNAL_ONDISK(new_value) || + memcmp((char *) old_value, (char *) new_value, + VARSIZE_EXTERNAL(old_value)) != 0) + { + /* + * The old external stored value isn't needed any more + * after the update + */ + toast_delold[i] = true; + need_delold = true; + } + else + { + /* + * This attribute isn't changed by this update so we reuse + * the original reference to the old value in the new + * tuple. + */ + toast_action[i] = 'p'; + continue; + } + } + } + else + { + /* + * For INSERT simply get the new value + */ + new_value = (struct varlena *) DatumGetPointer(toast_values[i]); + } + + /* + * Handle NULL attributes + */ + if (toast_isnull[i]) + { + toast_action[i] = 'p'; + has_nulls = true; + continue; + } + + /* + * Now look at varlena attributes + */ + if (att->attlen == -1) + { + /* + * If the table's attribute says PLAIN always, force it so. + */ + if (att->attstorage == 'p') + toast_action[i] = 'p'; + + /* + * We took care of UPDATE above, so any external value we find + * still in the tuple must be someone else's that we cannot reuse + * (this includes the case of an out-of-line in-memory datum). + * Fetch it back (without decompression, unless we are forcing + * PLAIN storage). If necessary, we'll push it out as a new + * external value below. + */ + if (VARATT_IS_EXTERNAL(new_value)) + { + toast_oldexternal[i] = new_value; + if (att->attstorage == 'p') + new_value = heap_tuple_untoast_attr(new_value); + else + new_value = heap_tuple_fetch_attr(new_value); + toast_values[i] = PointerGetDatum(new_value); + toast_free[i] = true; + need_change = true; + need_free = true; + } + + /* + * Remember the size of this attribute + */ + toast_sizes[i] = VARSIZE_ANY(new_value); + } + else + { + /* + * Not a varlena attribute, plain storage always + */ + toast_action[i] = 'p'; + } + } + + /* ---------- + * Compress and/or save external until data fits into target length + * + * 1: Inline compress attributes with attstorage 'x', and store very + * large attributes with attstorage 'x' or 'e' external immediately + * 2: Store attributes with attstorage 'x' or 'e' external + * 3: Inline compress attributes with attstorage 'm' + * 4: Store attributes with attstorage 'm' external + * ---------- + */ + + /* compute header overhead --- this should match heap_form_tuple() */ + hoff = SizeofZHeapTupleHeader; + if (has_nulls) + hoff += BITMAPLEN(numAttrs); + + /* now convert to a limit on the tuple data size */ + maxDataLen = RelationGetToastTupleTarget(rel, TOAST_TUPLE_TARGET) - hoff; + + /* + * Look for attributes with attstorage 'x' to compress. Also find large + * attributes with attstorage 'x' or 'e', and store them external. 
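+	 *
+	 * As a rough illustration (assuming the default 8kB block size, where the
+	 * toast tuple target works out to roughly 2kB): the passes below keep
+	 * shrinking the row until
+	 *
+	 *		zheap_compute_data_size(tupleDesc, toast_values, toast_isnull,
+	 *								hoff) <= maxDataLen
+	 *
+	 * so a single multi-megabyte text column is compressed first and, if the
+	 * row still does not fit, pushed out to the toast relation.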
+ */ + while (zheap_compute_data_size(tupleDesc, + toast_values, toast_isnull, hoff) > maxDataLen) + { + int biggest_attno = -1; + int32 biggest_size = MAXALIGN(TOAST_POINTER_SIZE); + Datum old_value; + Datum new_value; + + /* + * Search for the biggest yet unprocessed internal attribute + */ + for (i = 0; i < numAttrs; i++) + { + Form_pg_attribute att = TupleDescAttr(tupleDesc, i); + + if (toast_action[i] != ' ') + continue; + if (VARATT_IS_EXTERNAL(DatumGetPointer(toast_values[i]))) + continue; /* can't happen, toast_action would be 'p' */ + if (VARATT_IS_COMPRESSED(DatumGetPointer(toast_values[i]))) + continue; + if (att->attstorage != 'x' && att->attstorage != 'e') + continue; + if (toast_sizes[i] > biggest_size) + { + biggest_attno = i; + biggest_size = toast_sizes[i]; + } + } + + if (biggest_attno < 0) + break; + + /* + * Attempt to compress it inline, if it has attstorage 'x' + */ + i = biggest_attno; + if (TupleDescAttr(tupleDesc, i)->attstorage == 'x') + { + old_value = toast_values[i]; + new_value = toast_compress_datum(old_value); + + if (DatumGetPointer(new_value) != NULL) + { + /* successful compression */ + if (toast_free[i]) + pfree(DatumGetPointer(old_value)); + toast_values[i] = new_value; + toast_free[i] = true; + toast_sizes[i] = VARSIZE(DatumGetPointer(toast_values[i])); + need_change = true; + need_free = true; + } + else + { + /* incompressible, ignore on subsequent compression passes */ + toast_action[i] = 'x'; + } + } + else + { + /* has attstorage 'e', ignore on subsequent compression passes */ + toast_action[i] = 'x'; + } + + /* + * If this value is by itself more than maxDataLen (after compression + * if any), push it out to the toast table immediately, if possible. + * This avoids uselessly compressing other fields in the common case + * where we have one long field and several short ones. + * + * XXX maybe the threshold should be less than maxDataLen? + */ + if (toast_sizes[i] > maxDataLen && + rel->rd_rel->reltoastrelid != InvalidOid) + { + old_value = toast_values[i]; + toast_action[i] = 'p'; + toast_values[i] = ztoast_save_datum(rel, toast_values[i], + toast_oldexternal[i], options); + if (toast_free[i]) + pfree(DatumGetPointer(old_value)); + toast_free[i] = true; + need_change = true; + need_free = true; + } + } + + /* + * Second we look for attributes of attstorage 'x' or 'e' that are still + * inline. But skip this if there's no toast table to push them to. 
+ */ + while (zheap_compute_data_size(tupleDesc, + toast_values, toast_isnull, hoff) > maxDataLen && + rel->rd_rel->reltoastrelid != InvalidOid) + { + int biggest_attno = -1; + int32 biggest_size = MAXALIGN(TOAST_POINTER_SIZE); + Datum old_value; + + /*------ + * Search for the biggest yet inlined attribute with + * attstorage equals 'x' or 'e' + *------ + */ + for (i = 0; i < numAttrs; i++) + { + Form_pg_attribute att = TupleDescAttr(tupleDesc, i); + + if (toast_action[i] == 'p') + continue; + if (VARATT_IS_EXTERNAL(DatumGetPointer(toast_values[i]))) + continue; /* can't happen, toast_action would be 'p' */ + if (att->attstorage != 'x' && att->attstorage != 'e') + continue; + if (toast_sizes[i] > biggest_size) + { + biggest_attno = i; + biggest_size = toast_sizes[i]; + } + } + + if (biggest_attno < 0) + break; + + /* + * Store this external + */ + i = biggest_attno; + old_value = toast_values[i]; + toast_action[i] = 'p'; + toast_values[i] = ztoast_save_datum(rel, toast_values[i], + toast_oldexternal[i], options); + if (toast_free[i]) + pfree(DatumGetPointer(old_value)); + toast_free[i] = true; + + need_change = true; + need_free = true; + } + + /* + * Round 3 - this time we take attributes with storage 'm' into + * compression + */ + while (zheap_compute_data_size(tupleDesc, + toast_values, toast_isnull, hoff) > maxDataLen) + { + int biggest_attno = -1; + int32 biggest_size = MAXALIGN(TOAST_POINTER_SIZE); + Datum old_value; + Datum new_value; + + /* + * Search for the biggest yet uncompressed internal attribute + */ + for (i = 0; i < numAttrs; i++) + { + if (toast_action[i] != ' ') + continue; + if (VARATT_IS_EXTERNAL(DatumGetPointer(toast_values[i]))) + continue; /* can't happen, toast_action would be 'p' */ + if (VARATT_IS_COMPRESSED(DatumGetPointer(toast_values[i]))) + continue; + if (TupleDescAttr(tupleDesc, i)->attstorage != 'm') + continue; + if (toast_sizes[i] > biggest_size) + { + biggest_attno = i; + biggest_size = toast_sizes[i]; + } + } + + if (biggest_attno < 0) + break; + + /* + * Attempt to compress it inline + */ + i = biggest_attno; + old_value = toast_values[i]; + new_value = toast_compress_datum(old_value); + + if (DatumGetPointer(new_value) != NULL) + { + /* successful compression */ + if (toast_free[i]) + pfree(DatumGetPointer(old_value)); + toast_values[i] = new_value; + toast_free[i] = true; + toast_sizes[i] = VARSIZE(DatumGetPointer(toast_values[i])); + need_change = true; + need_free = true; + } + else + { + /* incompressible, ignore on subsequent compression passes */ + toast_action[i] = 'x'; + } + } + + /* + * Finally we store attributes of type 'm' externally. At this point we + * increase the target tuple size, so that 'm' attributes aren't stored + * externally unless really necessary. 
+ */ + maxDataLen = TOAST_TUPLE_TARGET_MAIN - hoff; + + while (zheap_compute_data_size(tupleDesc, + toast_values, toast_isnull, hoff) > maxDataLen && + rel->rd_rel->reltoastrelid != InvalidOid) + { + int biggest_attno = -1; + int32 biggest_size = MAXALIGN(TOAST_POINTER_SIZE); + Datum old_value; + + /*-------- + * Search for the biggest yet inlined attribute with + * attstorage = 'm' + *-------- + */ + for (i = 0; i < numAttrs; i++) + { + if (toast_action[i] == 'p') + continue; + if (VARATT_IS_EXTERNAL(DatumGetPointer(toast_values[i]))) + continue; /* can't happen, toast_action would be 'p' */ + if (TupleDescAttr(tupleDesc, i)->attstorage != 'm') + continue; + if (toast_sizes[i] > biggest_size) + { + biggest_attno = i; + biggest_size = toast_sizes[i]; + } + } + + if (biggest_attno < 0) + break; + + /* + * Store this external + */ + i = biggest_attno; + old_value = toast_values[i]; + toast_action[i] = 'p'; + toast_values[i] = ztoast_save_datum(rel, toast_values[i], + toast_oldexternal[i], options); + if (toast_free[i]) + pfree(DatumGetPointer(old_value)); + toast_free[i] = true; + + need_change = true; + need_free = true; + } + + /* + * In the case we toasted any values, we need to build a new heap tuple + * with the changed values. + */ + if (need_change) + { + ZHeapTupleHeader olddata = newtup->t_data; + ZHeapTupleHeader new_data; + int32 new_header_len; + int32 new_data_len; + int32 new_tuple_len; + + /* + * Calculate the new size of the tuple. + * + * Note: we used to assume here that the old tuple's t_hoff must equal + * the new_header_len value, but that was incorrect. The old tuple + * might have a smaller-than-current natts, if there's been an ALTER + * TABLE ADD COLUMN since it was stored; and that would lead to a + * different conclusion about the size of the null bitmap, or even + * whether there needs to be one at all. + */ + new_header_len = SizeofZHeapTupleHeader; + if (has_nulls) + new_header_len += BITMAPLEN(numAttrs); + new_data_len = zheap_compute_data_size(tupleDesc, + toast_values, toast_isnull, hoff); + new_tuple_len = new_header_len + new_data_len; + + /* + * Allocate and zero the space needed, and fill ZHeapTupleData fields. + */ + result_tuple = (ZHeapTuple) palloc0(ZHEAPTUPLESIZE + new_tuple_len); + result_tuple->t_len = new_tuple_len; + result_tuple->t_self = newtup->t_self; + result_tuple->t_tableOid = newtup->t_tableOid; + new_data = (ZHeapTupleHeader) ((char *) result_tuple + ZHEAPTUPLESIZE); + result_tuple->t_data = new_data; + + /* + * Copy the existing tuple header, but adjust natts and t_hoff. + */ + memcpy(new_data, olddata, SizeofZHeapTupleHeader); + ZHeapTupleHeaderSetNatts(new_data, numAttrs); + new_data->t_hoff = new_header_len; + + /* Copy over the data, and fill the null bitmap if needed */ + zheap_fill_tuple(tupleDesc, + toast_values, + toast_isnull, + (char *) new_data + new_header_len, + new_data_len, + &(new_data->t_infomask), + has_nulls ? new_data->t_bits : NULL); + } + else + result_tuple = newtup; + + /* + * Free allocated temp values + */ + if (need_free) + for (i = 0; i < numAttrs; i++) + if (toast_free[i]) + pfree(DatumGetPointer(toast_values[i])); + + /* + * Delete external values from the old tuple + */ + if (need_delold) + for (i = 0; i < numAttrs; i++) + if (toast_delold[i]) + ztoast_delete_datum(rel, toast_oldvalues[i], false); + + return result_tuple; +} + +/* + * ztoast_save_datum + * Just like toast_save_datum but for zheap relations. 
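+ *
+ * As in the heap version, the value is sliced into rows of the toast
+ * relation holding at most TOAST_MAX_CHUNK_SIZE payload bytes each
+ * (chunk_id, chunk_seq, chunk_data), so a sketch of the amount of work for
+ * a payload of data_todo bytes is
+ *
+ *		nchunks = (data_todo + TOAST_MAX_CHUNK_SIZE - 1) / TOAST_MAX_CHUNK_SIZE;
+ *
+ * i.e. one zheap_insert() into the toast relation plus one index_insert()
+ * per chunk, in the loop below.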
+ */ +static Datum +ztoast_save_datum(Relation rel, Datum value, + struct varlena *oldexternal, int options) +{ + Relation toastrel; + Relation *toastidxs; + ZHeapTuple toasttup; + TupleDesc toasttupDesc; + Datum t_values[3]; + bool t_isnull[3]; + CommandId mycid = GetCurrentCommandId(true); + struct varlena *result; + struct varatt_external toast_pointer; + union + { + struct varlena hdr; + /* this is to make the union big enough for a chunk: */ + char data[TOAST_MAX_CHUNK_SIZE + VARHDRSZ]; + /* ensure union is aligned well enough: */ + int32 align_it; + } chunk_data; + int32 chunk_size; + int32 chunk_seq = 0; + char *data_p; + int32 data_todo; + Pointer dval = DatumGetPointer(value); + int num_indexes; + int validIndex; + + Assert(!VARATT_IS_EXTERNAL(value)); + + /* + * Open the toast relation and its indexes. We can use the index to check + * uniqueness of the OID we assign to the toasted item, even though it has + * additional columns besides OID. + */ + toastrel = heap_open(rel->rd_rel->reltoastrelid, RowExclusiveLock); + toasttupDesc = toastrel->rd_att; + + /* The toast table of zheap table should also be of zheap type */ + Assert (RelationStorageIsZHeap(toastrel)); + + /* Open all the toast indexes and look for the valid one */ + validIndex = toast_open_indexes(toastrel, + RowExclusiveLock, + &toastidxs, + &num_indexes); + + /* + * Get the data pointer and length, and compute va_rawsize and va_extsize. + * + * va_rawsize is the size of the equivalent fully uncompressed datum, so + * we have to adjust for short headers. + * + * va_extsize is the actual size of the data payload in the toast records. + */ + if (VARATT_IS_SHORT(dval)) + { + data_p = VARDATA_SHORT(dval); + data_todo = VARSIZE_SHORT(dval) - VARHDRSZ_SHORT; + toast_pointer.va_rawsize = data_todo + VARHDRSZ; /* as if not short */ + toast_pointer.va_extsize = data_todo; + } + else if (VARATT_IS_COMPRESSED(dval)) + { + data_p = VARDATA(dval); + data_todo = VARSIZE(dval) - VARHDRSZ; + /* rawsize in a compressed datum is just the size of the payload */ + toast_pointer.va_rawsize = VARRAWSIZE_4B_C(dval) + VARHDRSZ; + toast_pointer.va_extsize = data_todo; + /* Assert that the numbers look like it's compressed */ + Assert(VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer)); + } + else + { + data_p = VARDATA(dval); + data_todo = VARSIZE(dval) - VARHDRSZ; + toast_pointer.va_rawsize = VARSIZE(dval); + toast_pointer.va_extsize = data_todo; + } + + /* + * Insert the correct table OID into the result TOAST pointer. + * + * Normally this is the actual OID of the target toast table, but during + * table-rewriting operations such as CLUSTER, we have to insert the OID + * of the table's real permanent toast table instead. rd_toastoid is set + * if we have to substitute such an OID. + */ + if (OidIsValid(rel->rd_toastoid)) + toast_pointer.va_toastrelid = rel->rd_toastoid; + else + toast_pointer.va_toastrelid = RelationGetRelid(toastrel); + + /* + * Choose an OID to use as the value ID for this toast value. + * + * Normally we just choose an unused OID within the toast table. But + * during table-rewriting operations where we are preserving an existing + * toast table OID, we want to preserve toast value OIDs too. So, if + * rd_toastoid is set and we had a prior external value from that same + * toast table, re-use its value ID. 
If we didn't have a prior external + * value (which is a corner case, but possible if the table's attstorage + * options have been changed), we have to pick a value ID that doesn't + * conflict with either new or existing toast value OIDs. + */ + if (!OidIsValid(rel->rd_toastoid)) + { + /* normal case: just choose an unused OID */ + toast_pointer.va_valueid = + GetNewOidWithIndex(toastrel, + RelationGetRelid(toastidxs[validIndex]), + (AttrNumber) 1); + } + else + { + /* rewrite case: check to see if value was in old toast table */ + toast_pointer.va_valueid = InvalidOid; + if (oldexternal != NULL) + { + struct varatt_external old_toast_pointer; + + Assert(VARATT_IS_EXTERNAL_ONDISK(oldexternal)); + /* Must copy to access aligned fields */ + VARATT_EXTERNAL_GET_POINTER(old_toast_pointer, oldexternal); + if (old_toast_pointer.va_toastrelid == rel->rd_toastoid) + { + /* This value came from the old toast table; reuse its OID */ + toast_pointer.va_valueid = old_toast_pointer.va_valueid; + + /* + * There is a corner case here: the table rewrite might have + * to copy both live and recently-dead versions of a row, and + * those versions could easily reference the same toast value. + * When we copy the second or later version of such a row, + * reusing the OID will mean we select an OID that's already + * in the new toast table. Check for that, and if so, just + * fall through without writing the data again. + * + * While annoying and ugly-looking, this is a good thing + * because it ensures that we wind up with only one copy of + * the toast value when there is only one copy in the old + * toast table. Before we detected this case, we'd have made + * multiple copies, wasting space; and what's worse, the + * copies belonging to already-deleted heap tuples would not + * be reclaimed by VACUUM. + */ + if (toastrel_valueid_exists(toastrel, + toast_pointer.va_valueid)) + { + /* Match, so short-circuit the data storage loop below */ + data_todo = 0; + } + } + } + if (toast_pointer.va_valueid == InvalidOid) + { + /* + * new value; must choose an OID that doesn't conflict in either + * old or new toast table + */ + do + { + toast_pointer.va_valueid = + GetNewOidWithIndex(toastrel, + RelationGetRelid(toastidxs[validIndex]), + (AttrNumber) 1); + } while (toastid_valueid_exists(rel->rd_toastoid, + toast_pointer.va_valueid)); + } + } + + /* + * Initialize constant parts of the tuple data + */ + t_values[0] = ObjectIdGetDatum(toast_pointer.va_valueid); + t_values[2] = PointerGetDatum(&chunk_data); + t_isnull[0] = false; + t_isnull[1] = false; + t_isnull[2] = false; + + /* + * Split up the item into chunks + */ + while (data_todo > 0) + { + int i; + + CHECK_FOR_INTERRUPTS(); + + /* + * Calculate the size of this chunk + */ + chunk_size = Min(TOAST_MAX_CHUNK_SIZE, data_todo); + + /* + * Build a tuple and store it + */ + t_values[1] = Int32GetDatum(chunk_seq++); + SET_VARSIZE(&chunk_data, chunk_size + VARHDRSZ); + memcpy(VARDATA(&chunk_data), data_p, chunk_size); + toasttup = zheap_form_tuple(toasttupDesc, t_values, t_isnull); + + zheap_insert(toastrel, toasttup, mycid, options, NULL); + + /* + * Create the index entry. We cheat a little here by not using + * FormIndexDatum: this relies on the knowledge that the index columns + * are the same as the initial columns of the table for all the + * indexes. We also cheat by not providing an IndexInfo: this is okay + * for now because btree doesn't need one, but we might have to be + * more honest someday. 
+ * + * Note also that there had better not be any user-created index on + * the TOAST table, since we don't bother to update anything else. + */ + for (i = 0; i < num_indexes; i++) + { + /* Only index relations marked as ready can be updated */ + if (IndexIsReady(toastidxs[i]->rd_index)) + index_insert(toastidxs[i], t_values, t_isnull, + &(toasttup->t_self), + toastrel, + toastidxs[i]->rd_index->indisunique ? + UNIQUE_CHECK_YES : UNIQUE_CHECK_NO, + NULL); + } + + /* + * Free memory + */ + zheap_freetuple(toasttup); + + /* + * Move on to next chunk + */ + data_todo -= chunk_size; + data_p += chunk_size; + } + + /* + * Done - close toast relation and its indexes + */ + toast_close_indexes(toastidxs, num_indexes, RowExclusiveLock); + heap_close(toastrel, RowExclusiveLock); + + /* + * Create the TOAST pointer value that we'll return + */ + result = (struct varlena *) palloc(TOAST_POINTER_SIZE); + SET_VARTAG_EXTERNAL(result, VARTAG_ONDISK); + memcpy(VARDATA_EXTERNAL(result), &toast_pointer, sizeof(toast_pointer)); + + return PointerGetDatum(result); +} + +/* + * ztoast_delete_datum + * Just like toast_delete_datum but for zheap relations. + */ +static void +ztoast_delete_datum(Relation rel, Datum value, bool is_speculative) +{ + struct varlena *attr = (struct varlena *) DatumGetPointer(value); + struct varatt_external toast_pointer; + Relation toastrel; + Relation *toastidxs; + ScanKeyData toastkey; + SysScanDesc toastscan; + HeapTuple toasttup; + int num_indexes; + int validIndex; + SnapshotData SnapshotToast; + + if (!VARATT_IS_EXTERNAL_ONDISK(attr)) + return; + + /* Must copy to access aligned fields */ + VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr); + + /* + * Open the toast relation and its indexes + */ + toastrel = heap_open(toast_pointer.va_toastrelid, RowExclusiveLock); + + /* The toast table of zheap table should also be of zheap type */ + Assert (RelationStorageIsZHeap(toastrel)); + + /* Fetch valid relation used for process */ + validIndex = toast_open_indexes(toastrel, + RowExclusiveLock, + &toastidxs, + &num_indexes); + + /* + * Setup a scan key to find chunks with matching va_valueid + */ + ScanKeyInit(&toastkey, + (AttrNumber) 1, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(toast_pointer.va_valueid)); + + /* + * Find all the chunks. (We don't actually care whether we see them in + * sequence or not, but since we've already locked the index we might as + * well use systable_beginscan_ordered.) 
+	 */
+	init_toast_snapshot(&SnapshotToast);
+	toastscan = systable_beginscan_ordered(toastrel, toastidxs[validIndex],
+										   &SnapshotToast, 1, &toastkey);
+	while ((toasttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
+	{
+		/*
+		 * Have a chunk, delete it
+		 */
+		if (!is_speculative)
+			simple_zheap_delete(toastrel, &toasttup->t_self, &SnapshotToast);
+		else
+		{
+			TupleDesc	tupdesc = toastrel->rd_att;
+			zheap_abort_speculative(toastrel, heap_to_zheap(toasttup, tupdesc));
+		}
+	}
+
+	/*
+	 * End scan and close relations
+	 */
+	systable_endscan_ordered(toastscan);
+	toast_close_indexes(toastidxs, num_indexes, RowExclusiveLock);
+	heap_close(toastrel, RowExclusiveLock);
+}
+
+/* ----------
+ * ztoast_delete -
+ *
+ *	Cascaded delete toast-entries on DELETE
+ * ----------
+ */
+void
+ztoast_delete(Relation rel, ZHeapTuple oldtup, bool is_speculative)
+{
+	TupleDesc	tupleDesc;
+	int			numAttrs;
+	int			i;
+	Datum		toast_values[MaxHeapAttributeNumber];
+	bool		toast_isnull[MaxHeapAttributeNumber];
+
+	/*
+	 * We should only ever be called for tuples of plain relations or
+	 * materialized views --- recursing on a toast rel is bad news.
+	 */
+	Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+		   rel->rd_rel->relkind == RELKIND_MATVIEW);
+
+	/*
+	 * Get the tuple descriptor and break down the tuple into fields.
+	 *
+	 * NOTE: it's debatable whether to use heap_deform_tuple() here or just
+	 * heap_getattr() only the varlena columns.  The latter could win if there
+	 * are few varlena columns and many non-varlena ones.  However,
+	 * heap_deform_tuple costs only O(N) while the heap_getattr way would cost
+	 * O(N^2) if there are many varlena columns, so it seems better to err on
+	 * the side of linear cost.  (We won't even be here unless there's at
+	 * least one varlena column, by the way.)
+	 */
+	tupleDesc = rel->rd_att;
+	numAttrs = tupleDesc->natts;
+
+	Assert(numAttrs <= MaxHeapAttributeNumber);
+	zheap_deform_tuple(oldtup, tupleDesc, toast_values, toast_isnull);
+
+	/*
+	 * Check for external stored attributes and delete them from the secondary
+	 * relation.
+	 */
+	for (i = 0; i < numAttrs; i++)
+	{
+		if (TupleDescAttr(tupleDesc, i)->attlen == -1)
+		{
+			Datum		value = toast_values[i];
+
+			if (toast_isnull[i])
+				continue;
+			else if (VARATT_IS_EXTERNAL_ONDISK(PointerGetDatum(value)))
+				ztoast_delete_datum(rel, value, is_speculative);
+		}
+	}
+}
diff --git a/src/backend/access/zheap/zvacuumlazy.c b/src/backend/access/zheap/zvacuumlazy.c
new file mode 100644
index 0000000000..7b60591b21
--- /dev/null
+++ b/src/backend/access/zheap/zvacuumlazy.c
@@ -0,0 +1,1462 @@
+/*-------------------------------------------------------------------------
+ *
+ * zvacuumlazy.c
+ *	  Concurrent ("lazy") vacuuming.
+ *
+ *
+ * The lazy vacuum in zheap uses two passes to clean up dead tuples in the
+ * heap and the indexes.  It reclaims all the dead items in the heap in the
+ * first pass, writing an undo record for each such item, and then cleans
+ * the indexes in the second pass.  The undo is written so that, if there is
+ * any error while cleaning the indexes, we can roll back the operation and
+ * mark the entries in the heap as dead.
+ *
+ * The other important aspect that is ensured in this system is that we
+ * don't allow item ids that are marked as unused to be reused until the
+ * transaction that has marked them unused is committed.
+ *
+ * The dead tuple tracking works in the same way as in heap.  See
+ * vacuumlazy.c.
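+ *
+ * In outline (an illustrative sketch, not the exact control flow of the code
+ * below):
+ *
+ *		for each zheap page containing dead tuples:
+ *			collect the dead item pointers;
+ *			write undo and mark those items unused
+ *				(see lazy_vacuum_zpage_with_undo);
+ *		for each index on the relation:
+ *			delete the index entries pointing to the reclaimed items;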
+ * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/commands/zvacuumlazy.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include + +#include "access/genam.h" +#include "access/tpd.h" +#include "access/visibilitymap.h" +#include "access/xact.h" +#include "access/zhtup.h" +#include "utils/ztqual.h" +#include "access/zheapam_xlog.h" +#include "access/zheaputils.h" +#include "commands/dbcommands.h" +#include "commands/vacuum.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "portability/instr_time.h" +#include "postmaster/autovacuum.h" +#include "storage/bufmgr.h" +#include "storage/freespace.h" +#include "storage/lmgr.h" +#include "storage/procarray.h" +#include "utils/lsyscache.h" +#include "utils/memutils.h" +#include "utils/pg_rusage.h" + +/* + * Before we consider skipping a page that's marked as clean in + * visibility map, we must've seen at least this many clean pages. + */ +#define SKIP_PAGES_THRESHOLD ((BlockNumber) 32) + +/* A few variables that don't seem worth passing around as parameters */ +static int elevel = -1; +static TransactionId OldestXmin; +static BufferAccessStrategy vac_strategy; + +/* + * Guesstimation of number of dead tuples per page. This is used to + * provide an upper limit to memory allocated when vacuuming small + * tables. + */ +#define LAZY_ALLOC_TUPLES MaxZHeapTuplesPerPage + +/* non-export function prototypes */ +static int +lazy_vacuum_zpage(Relation onerel, BlockNumber blkno, Buffer buffer, + int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer); +static int +lazy_vacuum_zpage_with_undo(Relation onerel, BlockNumber blkno, Buffer buffer, + int tupindex, LVRelStats *vacrelstats, + Buffer *vmbuffer, + TransactionId *global_visibility_cutoff_xid); +static void +lazy_space_zalloc(LVRelStats *vacrelstats, BlockNumber relblocks); +static void +lazy_scan_zheap(Relation onerel, int options, LVRelStats *vacrelstats, + Relation *Irel, int nindexes, + BufferAccessStrategy vac_strategy, bool aggressive); +static bool +zheap_page_is_all_visible(Relation rel, Buffer buf, + TransactionId *visibility_cutoff_xid); + +/* + * lazy_vacuum_zpage() -- free dead tuples on a page + * and repair its fragmentation. + * + * Caller must hold pin and buffer exclusive lock on the buffer. + * + * tupindex is the index in vacrelstats->dead_tuples of the first dead + * tuple for this page. We assume the rest follow sequentially. + * The return value is the first tupindex after the tuples of this page. + */ +static int +lazy_vacuum_zpage(Relation onerel, BlockNumber blkno, Buffer buffer, + int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer) +{ + Page page = BufferGetPage(buffer); + Page tmppage; + OffsetNumber unused[MaxOffsetNumber]; + int uncnt = 0; + TransactionId visibility_cutoff_xid; + bool pruned = false; + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + * See comments atop zheap_page_prune_guts. + */ + tmppage = PageGetTempPageCopy(page); + + /* + * Lock the TPD page before starting critical section. We might need + * to access it during page repair fragmentation. 
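+ *
+ * The TPD buffers locked here are released again by the call to
+ * UnlockReleaseTPDBuffers() once the critical section is over.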
+ */ + if (ZHeapPageHasTPDSlot((PageHeader) page)) + TPDPageLock(onerel, buffer); + + START_CRIT_SECTION(); + + for (; tupindex < vacrelstats->num_dead_tuples; tupindex++) + { + BlockNumber tblk; + OffsetNumber toff; + ItemId itemid; + + tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]); + if (tblk != blkno) + break; /* past end of tuples for this block */ + toff = ItemPointerGetOffsetNumber(&vacrelstats->dead_tuples[tupindex]); + itemid = PageGetItemId(page, toff); + ItemIdSetUnused(itemid); + unused[uncnt++] = toff; + } + + ZPageRepairFragmentation(buffer, tmppage, InvalidOffsetNumber, 0, false, + &pruned); + + /* + * Mark buffer dirty before we write WAL. + */ + MarkBufferDirty(buffer); + + /* XLOG stuff */ + if (RelationNeedsWAL(onerel)) + { + XLogRecPtr recptr; + + recptr = log_zheap_clean(onerel, buffer, InvalidOffsetNumber, 0, + NULL, 0, NULL, 0, + unused, uncnt, + vacrelstats->latestRemovedXid, pruned); + PageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + /* be tidy */ + pfree(tmppage); + UnlockReleaseTPDBuffers(); + + /* + * Now that we have removed the dead tuples from the page, once again + * check if the page has become all-visible. The page is already marked + * dirty, exclusively locked. + */ + if (zheap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid)) + { + uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer); + uint8 flags = 0; + + /* Set the VM all-visible bit to flag, if needed */ + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0) + flags |= VISIBILITYMAP_ALL_VISIBLE; + + Assert(BufferIsValid(*vmbuffer)); + if (flags != 0) + visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, + *vmbuffer, visibility_cutoff_xid, flags); + } + + return tupindex; +} + +/* + * lazy_vacuum_zpage_with_undo() -- free dead tuples on a page + * and repair its fragmentation. + * + * Caller must hold pin and buffer exclusive lock on the buffer. + */ +static int +lazy_vacuum_zpage_with_undo(Relation onerel, BlockNumber blkno, Buffer buffer, + int tupindex, LVRelStats *vacrelstats, + Buffer *vmbuffer, + TransactionId *global_visibility_cutoff_xid) +{ + TransactionId visibility_cutoff_xid; + TransactionId xid = GetTopTransactionId(); + uint32 epoch = GetEpochForXid(xid); + Page page = BufferGetPage(buffer); + Page tmppage; + UnpackedUndoRecord undorecord; + OffsetNumber unused[MaxOffsetNumber]; + UndoRecPtr urecptr, prev_urecptr; + int i, uncnt = 0; + int trans_slot_id; + xl_undolog_meta undometa; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + bool lock_reacquired; + bool pruned = false; + + for (; tupindex < vacrelstats->num_dead_tuples; tupindex++) + { + BlockNumber tblk PG_USED_FOR_ASSERTS_ONLY; + OffsetNumber toff; + + tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]); + + /* + * We should never pass the end of tuples for this block as we clean + * the tuples in the current block before moving to next block. + */ + Assert(tblk == blkno); + + toff = ItemPointerGetOffsetNumber(&vacrelstats->dead_tuples[tupindex]); + unused[uncnt++] = toff; + } + + if (uncnt <= 0) + return tupindex; + +reacquire_slot: + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. 
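+ *
+ * If the buffer lock had to be released and reacquired in the meantime, we
+ * retry immediately; if no slot is free, we sleep briefly (10 ms) and then
+ * retry from reacquire_slot.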
+ */ + trans_slot_id = PageReserveTransactionSlot(onerel, + buffer, + PageGetMaxOffsetNumber(page), + epoch, + xid, + &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + goto reacquire_slot; + + if (trans_slot_id == InvalidXactSlotId) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + goto reacquire_slot; + } + + /* prepare an undo record */ + undorecord.uur_type = UNDO_ITEMID_UNUSED; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = onerel->rd_id; + undorecord.uur_prevxid = xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = InvalidCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = blkno; + undorecord.uur_offset = 0; + undorecord.uur_tuple.len = 0; + undorecord.uur_payload.len = uncnt * sizeof(OffsetNumber); + undorecord.uur_payload.data = (char *) palloc(uncnt * sizeof(OffsetNumber)); + + /* + * XXX Unlike other undo records, we don't set the TPD slot number in undo + * record as this record is just skipped during processing of undo. + */ + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(onerel), + &undometa); + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + * See comments atop zheap_page_prune_guts. + */ + tmppage = PageGetTempPageCopy(page); + + /* + * Lock the TPD page before starting critical section. We might need + * to access it during page repair fragmentation. Note that if the + * transaction slot belongs to TPD entry, then the TPD page must be + * locked during slot reservation. + */ + if (trans_slot_id <= ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) page)) + TPDPageLock(onerel, buffer); + + START_CRIT_SECTION(); + + memcpy(undorecord.uur_payload.data, unused, uncnt * sizeof(OffsetNumber)); + InsertPreparedUndo(); + /* + * We're sending the undo record for debugging purpose. So, just send + * the last one. + */ + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + PageSetUNDO(undorecord, + buffer, + trans_slot_id, + true, + epoch, + xid, + urecptr, + unused, + uncnt); + } + else + { + PageSetUNDO(undorecord, + buffer, + trans_slot_id, + true, + epoch, + xid, + urecptr, + NULL, + 0); + } + + for (i = 0; i < uncnt; i++) + { + ItemId itemid; + + itemid = PageGetItemId(page, unused[i]); + ItemIdSetUnusedExtended(itemid, trans_slot_id); + } + + ZPageRepairFragmentation(buffer, tmppage, InvalidOffsetNumber, 0, false, + &pruned); + + /* + * Mark buffer dirty before we write WAL. + */ + MarkBufferDirty(buffer); + + /* XLOG stuff */ + if (RelationNeedsWAL(onerel)) + { + xl_zheap_unused xl_rec; + xl_undo_header xlundohdr; + XLogRecPtr recptr; + + /* + * Store the information required to generate undo record during + * replay. + */ + xlundohdr.reloid = undorecord.uur_reloid; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = prev_urecptr; + + xl_rec.latestRemovedXid = vacrelstats->latestRemovedXid; + xl_rec.nunused = uncnt; + xl_rec.trans_slot_id = trans_slot_id; + xl_rec.flags = 0; + if (pruned) + xl_rec.flags |= XLZ_UNUSED_ALLOW_PRUNING; + +prepare_xlog: + /* + * WAL-LOG undolog meta data if this is the fisrt WAL after the + * checkpoint. 
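+ *
+ * If XLogInsertExtended() below returns InvalidXLogRecPtr (for instance
+ * because the redo pointer it was computed against has become stale), we
+ * loop back to prepare_xlog and log the undo metadata again.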
+ */ + LogUndoMetaData(&undometa); + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + + XLogBeginInsert(); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xl_rec, SizeOfZHeapUnused); + + XLogRegisterData((char *) unused, uncnt * sizeof(OffsetNumber)); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + (void) RegisterTPDBuffer(page, 1); + + recptr = XLogInsertExtended(RM_ZHEAP2_ID, XLOG_ZHEAP_UNUSED, RedoRecPtr, + doPageWrites); + if (recptr == InvalidXLogRecPtr) + goto prepare_xlog; + + PageSetLSN(page, recptr); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + + /* be tidy */ + pfree(tmppage); + + /* + * Now that we have removed the dead tuples from the page, once again + * check if the page has become potentially all-visible. The page is + * already marked dirty, exclusively locked. We can't mark the page + * as all-visible here because we have yet to remove index entries + * corresponding dead tuples. So, we mark them potentially all-visible + * and later after removing index entries, if still the bit is set, we + * mark them as all-visible. + */ + if (zheap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid)) + { + uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer); + uint8 flags = 0; + + /* Set the VM to become potentially all-visible, if needed */ + if ((vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE) == 0) + flags |= VISIBILITYMAP_POTENTIAL_ALL_VISIBLE; + + if (TransactionIdFollows(visibility_cutoff_xid, + *global_visibility_cutoff_xid)) + *global_visibility_cutoff_xid = visibility_cutoff_xid; + + Assert(BufferIsValid(*vmbuffer)); + if (flags != 0) + visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, + *vmbuffer, InvalidTransactionId, flags); + } + + return tupindex; +} + +/* + * MarkPagesAsAllVisible() -- Mark all the pages corresponding to dead tuples + * as all-visible. + * + * We mark the page as all-visible, if it is already marked as potential + * all-visible. + */ +static void +MarkPagesAsAllVisible(Relation rel, LVRelStats *vacrelstats, + TransactionId visibility_cutoff_xid) +{ + int idx = 0; + + for (; idx < vacrelstats->num_dead_tuples; idx++) + { + BlockNumber tblk; + BlockNumber prev_tblk = InvalidBlockNumber; + Buffer vmbuffer = InvalidBuffer; + Buffer buf = InvalidBuffer; + uint8 vm_status; + + tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[idx]); + buf = ReadBufferExtended(rel, MAIN_FORKNUM, tblk, + RBM_NORMAL, NULL); + + /* Avoid processing same block again and again. 
*/ + if (tblk == prev_tblk) + continue; + + visibilitymap_pin(rel, tblk, &vmbuffer); + vm_status = visibilitymap_get_status(rel, tblk, &vmbuffer); + + /* Set the VM all-visible bit, if needed */ + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0 && + (vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + visibilitymap_clear(rel, tblk, vmbuffer, + VISIBILITYMAP_VALID_BITS); + + Assert(BufferIsValid(buf)); + LockBuffer(buf, BUFFER_LOCK_SHARE); + + visibilitymap_set(rel, tblk, buf, InvalidXLogRecPtr, vmbuffer, + visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE); + + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + } + + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + + if (BufferIsValid(buf)) + { + ReleaseBuffer(buf); + buf = InvalidBuffer; + } + + prev_tblk = tblk; + } +} + +/* + * lazy_scan_zheap() -- scan an open heap relation + * + * This routine prunes each page in the zheap, which will among other + * things truncate dead tuples to dead line pointers, truncate recently + * dead tuples to deleted line pointers and defragment the page + * (see zheap_page_prune). It also builds lists of dead tuples and pages + * with free space, calculates statistics on the number of live tuples in + * the zheap. It then reclaim all dead line pointers and write undo for + * each of them, so that if there is any error later, we can rollback the + * operation. When done, or when we run low on space for dead-tuple + * TIDs, invoke vacuuming of indexes. + * + * We also need to ensure that the heap-TIDs won't get reused till the + * transaction that has performed this vacuum is committed. To achieve + * that, we need to store transaction slot information in the line + * pointers that are marked unused in the first-pass of heap. + * + * If there are no indexes then we can reclaim line pointers without + * writting any undo; + */ +static void +lazy_scan_zheap(Relation onerel, int options, LVRelStats *vacrelstats, + Relation *Irel, int nindexes, + BufferAccessStrategy vac_strategy, bool aggressive) +{ + BlockNumber nblocks, + blkno; + ZHeapTupleData tuple; + char *relname; + BlockNumber empty_pages, + vacuumed_pages, + next_fsm_block_to_vacuum; + double num_tuples, + tups_vacuumed, + nkeep, + nunused; + IndexBulkDeleteResult **indstats; + StringInfoData infobuf; + int i; + int tupindex = 0; + PGRUsage ru0; + BlockNumber next_unskippable_block; + bool skipping_blocks; + Buffer vmbuffer = InvalidBuffer; + TransactionId visibility_cutoff_xid = InvalidTransactionId; + + pg_rusage_init(&ru0); + + relname = RelationGetRelationName(onerel); + if (aggressive) + ereport(elevel, + (errmsg("aggressively vacuuming \"%s.%s\"", + get_namespace_name(RelationGetNamespace(onerel)), + relname))); + else + ereport(elevel, + (errmsg("vacuuming \"%s.%s\"", + get_namespace_name(RelationGetNamespace(onerel)), + relname))); + + empty_pages = vacuumed_pages = 0; + next_fsm_block_to_vacuum = (BlockNumber) 0; + num_tuples = tups_vacuumed = nkeep = nunused = 0; + + indstats = (IndexBulkDeleteResult **) + palloc0(nindexes * sizeof(IndexBulkDeleteResult *)); + + nblocks = RelationGetNumberOfBlocks(onerel); + vacrelstats->rel_pages = nblocks; + vacrelstats->scanned_pages = 0; + vacrelstats->tupcount_pages = 0; + vacrelstats->nonempty_pages = 0; + vacrelstats->latestRemovedXid = InvalidTransactionId; + + lazy_space_zalloc(vacrelstats, nblocks); + next_unskippable_block = ZHEAP_METAPAGE + 1; + if (!aggressive) + { + + Assert((options & VACOPT_DISABLE_PAGE_SKIPPING) == 0); + while (next_unskippable_block < 
nblocks) + { + uint8 vmstatus; + + vmstatus = visibilitymap_get_status(onerel, next_unskippable_block, + &vmbuffer); + + if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0) + break; + + vacuum_delay_point(); + next_unskippable_block++; + } + } + + if (next_unskippable_block >= SKIP_PAGES_THRESHOLD) + skipping_blocks = true; + else + skipping_blocks = false; + + for (blkno = ZHEAP_METAPAGE + 1; blkno < nblocks; blkno++) + { + Buffer buf; + Page page; + TransactionId xid; + OffsetNumber offnum, + maxoff; + Size freespace; + bool tupgone, + hastup; + bool all_visible_according_to_vm = false; + bool all_visible; + bool has_dead_tuples; + + if (blkno == next_unskippable_block) + { + /* Time to advance next_unskippable_block */ + next_unskippable_block++; + if (!aggressive) + { + while (next_unskippable_block < nblocks) + { + uint8 vmskipflags; + + vmskipflags = visibilitymap_get_status(onerel, + next_unskippable_block, + &vmbuffer); + if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0) + break; + + vacuum_delay_point(); + next_unskippable_block++; + } + } + + /* + * We know we can't skip the current block. But set up + * skipping_blocks to do the right thing at the following blocks. + */ + if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD) + skipping_blocks = true; + else + skipping_blocks = false; + } + else + { + /* + * The current block is potentially skippable; if we've seen a + * long enough run of skippable blocks to justify skipping it. + */ + if (skipping_blocks) + continue; + all_visible_according_to_vm = true; + } + + vacuum_delay_point(); + + /* + * If we are close to overrunning the available space for dead-tuple + * TIDs, pause and do a cycle of vacuuming before we tackle this page. + */ + if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxZHeapTuplesPerPage && + vacrelstats->num_dead_tuples > 0) + { + /* + * Before beginning index vacuuming, we release any pin we may + * hold on the visibility map page. This isn't necessary for + * correctness, but we do it anyway to avoid holding the pin + * across a lengthy, unrelated operation. + */ + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + + /* + * Remove index entries. Unlike, heap we don't need to log special + * cleanup info which includes latest latestRemovedXid for standby. + * This is because we have covered all the dead tuples in the first + * pass itself and we don't need another pass on heap after index. + */ + for (i = 0; i < nindexes; i++) + lazy_vacuum_index(Irel[i], + &indstats[i], + vacrelstats, + vac_strategy); + /* + * XXX - The cutoff xid used here is the highest xmin of all the heap + * pages scanned. This can lead to more query cancellations on + * standby. However, alternative is that we track cutoff_xid for + * each page in first-pass of vacuum and then use it after removing + * index entries. We didn't pursue the alternative because it would + * require more work memory which means it can lead to more index + * passes. + */ + MarkPagesAsAllVisible(onerel, vacrelstats, visibility_cutoff_xid); + + /* + * Forget the now-vacuumed tuples, and press on, but be careful + * not to reset latestRemovedXid since we want that value to be + * valid. + */ + tupindex = 0; + vacrelstats->num_dead_tuples = 0; + vacrelstats->num_index_scans++; + + /* + * Vacuum the Free Space Map to make newly-freed space visible on + * upper-level FSM pages. Note we have not yet processed blkno. 
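+ *
+ * The range vacuumed below is [next_fsm_block_to_vacuum, blkno); we then
+ * advance next_fsm_block_to_vacuum to blkno so that the same range is not
+ * scanned again on the next cycle.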
+ */ + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno); + next_fsm_block_to_vacuum = blkno; + } + + /* + * Pin the visibility map page in case we need to mark the page + * all-visible. In most cases this will be very cheap, because we'll + * already have the correct page pinned anyway. However, it's + * possible that (a) next_unskippable_block is covered by a different + * VM page than the current block or (b) we released our pin and did a + * cycle of index vacuuming. + * + */ + visibilitymap_pin(onerel, blkno, &vmbuffer); + + buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno, + RBM_NORMAL, vac_strategy); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + vacrelstats->scanned_pages++; + vacrelstats->tupcount_pages++; + + page = BufferGetPage(buf); + + if (PageIsNew(page)) + { + /* + * An all-zeroes page could be left over if a backend extends the + * relation but crashes before initializing the page. Reclaim such + * pages for use. See the similar code in lazy_scan_heap to know + * why we have used relation extension lock. + */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + LockRelationForExtension(onerel, ExclusiveLock); + UnlockRelationForExtension(onerel, ExclusiveLock); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + if (PageIsNew(page)) + { + ereport(WARNING, + (errmsg("relation \"%s\" page %u is uninitialized --- fixing", + relname, blkno))); + Assert(BufferGetBlockNumber(buf) != ZHEAP_METAPAGE); + ZheapInitPage(page, BufferGetPageSize(buf)); + empty_pages++; + } + freespace = PageGetZHeapFreeSpace(page); + MarkBufferDirty(buf); + UnlockReleaseBuffer(buf); + + RecordPageWithFreeSpace(onerel, blkno, freespace); + continue; + } + + /* + * Skip TPD pages. This needs to be checked before PageIsEmpty as TPD + * pages can also be empty, but we don't want to deal with it like a + * heap page. + */ + /* + * Prune the TPD pages and if all the entries are removed, then record + * it in FSM, so that it can be reused as a zheap page. + */ + if (PageGetSpecialSize(page) == sizeof(TPDPageOpaqueData)) + { + /* If the page is already pruned, skip it. */ + if (!PageIsEmpty(page)) + TPDPagePrune(onerel, buf, vac_strategy, InvalidOffsetNumber, 0, + true, NULL, NULL); + UnlockReleaseBuffer(buf); + continue; + } + + if (PageIsEmpty(page)) + { + uint8 vmstatus; + empty_pages++; + freespace = PageGetZHeapFreeSpace(page); + + vmstatus = visibilitymap_get_status(onerel, + blkno, + &vmbuffer); + if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0) + { + START_CRIT_SECTION(); + + /* mark buffer dirty before writing a WAL record */ + MarkBufferDirty(buf); + + /* + * It's possible that another backend has extended the heap, + * initialized the page, and then failed to WAL-log the page + * due to an ERROR. Since heap extension is not WAL-logged, + * recovery might try to replay our record setting the page + * all-visible and find that the page isn't initialized, which + * will cause a PANIC. To prevent that, check whether the + * page has been previously WAL-logged, and if not, do that + * now. + */ + if (RelationNeedsWAL(onerel) && + PageGetLSN(page) == InvalidXLogRecPtr) + log_newpage_buffer(buf, true); + + visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, + vmbuffer, InvalidTransactionId, + VISIBILITYMAP_ALL_VISIBLE); + + END_CRIT_SECTION(); + } + + UnlockReleaseBuffer(buf); + RecordPageWithFreeSpace(onerel, blkno, freespace); + continue; + } + + /* + * We count tuples removed by the pruning step as removed by VACUUM. 
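+ *
+ * zheap_page_prune_guts() returns the number of tuples it removed, and
+ * that count is added straight into tups_vacuumed below.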
+ */ + tups_vacuumed += zheap_page_prune_guts(onerel, buf, OldestXmin, + InvalidOffsetNumber, 0, false, + false, + &vacrelstats->latestRemovedXid, + NULL); + + /* Now scan the page to collect vacuumable items. */ + hastup = false; + freespace = 0; + maxoff = PageGetMaxOffsetNumber(page); + all_visible = true; + has_dead_tuples = false; + + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemid; + + itemid = PageGetItemId(page, offnum); + + /* Unused items require no processing, but we count 'em */ + if (!ItemIdIsUsed(itemid)) + { + nunused += 1; + continue; + } + + /* Deleted items mustn't be touched */ + if (ItemIdIsDeleted(itemid)) + { + hastup = true; /* this page won't be truncatable */ + all_visible = false; + continue; + } + + ItemPointerSet(&(tuple.t_self), blkno, offnum); + + /* + * DEAD item pointers are to be vacuumed normally; but we don't + * count them in tups_vacuumed, else we'd be double-counting (at + * least in the common case where zheap_page_prune_guts() just + * freed up a tuple). + */ + if (ItemIdIsDead(itemid)) + { + all_visible = false; + lazy_record_dead_tuple(vacrelstats, &(tuple.t_self)); + continue; + } + + Assert(ItemIdIsNormal(itemid)); + + tuple.t_data = (ZHeapTupleHeader) PageGetItem(page, itemid); + tuple.t_len = ItemIdGetLength(itemid); + tuple.t_tableOid = RelationGetRelid(onerel); + + tupgone = false; + + switch (ZHeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf, &xid)) + { + case ZHEAPTUPLE_DEAD: + + /* + * Ordinarily, DEAD tuples would have been removed by + * zheap_page_prune_guts(), but it's possible that the + * tuple state changed since heap_page_prune() looked. + * In particular an INSERT_IN_PROGRESS tuple could have + * changed to DEAD if the inserter aborted. So this + * cannot be considered an error condition. + */ + tupgone = true; /* we can delete the tuple */ + all_visible = false; + break; + case ZHEAPTUPLE_LIVE: + if (all_visible) + { + if (!TransactionIdPrecedes(xid, OldestXmin)) + { + all_visible = false; + break; + } + } + + /* Track newest xmin on page. */ + if (TransactionIdFollows(xid, visibility_cutoff_xid)) + visibility_cutoff_xid = xid; + break; + case ZHEAPTUPLE_RECENTLY_DEAD: + + /* + * If tuple is recently deleted then we must not remove it + * from relation. + */ + nkeep += 1; + all_visible = false; + break; + case ZHEAPTUPLE_INSERT_IN_PROGRESS: + case ZHEAPTUPLE_DELETE_IN_PROGRESS: + /* This is an expected case during concurrent vacuum */ + all_visible = false; + break; + case ZHEAPTUPLE_ABORT_IN_PROGRESS: + /* + * We can simply skip the tuple if it has inserted/operated by + * some aborted transaction and its rollback is still pending. It'll + * be taken care of by future vacuum calls. + */ + all_visible = false; + break; + default: + elog(ERROR, "unexpected ZHeapTupleSatisfiesVacuum result"); + break; + } + + if (tupgone) + { + lazy_record_dead_tuple(vacrelstats, &(tuple.t_self)); + ZHeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data, xid, + &vacrelstats->latestRemovedXid); + tups_vacuumed += 1; + has_dead_tuples = true; + } + else + { + num_tuples += 1; + hastup = true; + } + } /* scan along page */ + + /* + * If there are no indexes then we can vacuum the page right now + * instead of doing a second scan. 
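+ *
+ * With indexes present we instead go through lazy_vacuum_zpage_with_undo(),
+ * which also writes undo so that the reclaimed item ids cannot be reused
+ * before this transaction commits (see the comments at the top of this
+ * file).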
+ */ + if (vacrelstats->num_dead_tuples > 0) + { + if (nindexes == 0) + { + /* Remove tuples from zheap */ + tupindex = lazy_vacuum_zpage(onerel, blkno, buf, tupindex, + vacrelstats, &vmbuffer); + has_dead_tuples = false; + + /* + * Forget the now-vacuumed tuples, and press on, but be careful + * not to reset latestRemovedXid since we want that value to be + * valid. + */ + vacrelstats->num_dead_tuples = 0; + vacuumed_pages++; + /* + * Periodically do incremental FSM vacuuming to make newly-freed + * space visible on upper FSM pages. Note: although we've cleaned + * the current block, we haven't yet updated its FSM entry (that + * happens further down), so passing end == blkno is correct. + */ + if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES) + { + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, + blkno); + next_fsm_block_to_vacuum = blkno; + } + } + else + { + Assert(nindexes > 0); + + /* Remove tuples from zheap and write the undo for it. */ + tupindex = lazy_vacuum_zpage_with_undo(onerel, blkno, buf, + tupindex, vacrelstats, + &vmbuffer, + &visibility_cutoff_xid); + } + } + + /* Now that we are done with the page, get its available space */ + freespace = PageGetZHeapFreeSpace(page); + + /* mark page all-visible, if appropriate */ + if (all_visible && !all_visible_according_to_vm) + { + uint8 flags = VISIBILITYMAP_ALL_VISIBLE; + + visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, + vmbuffer, visibility_cutoff_xid, flags); + } + else if (has_dead_tuples && all_visible_according_to_vm) + { + visibilitymap_clear(onerel, blkno, vmbuffer, + VISIBILITYMAP_VALID_BITS); + } + + UnlockReleaseBuffer(buf); + + /* Remember the location of the last page with nonremovable tuples */ + if (hastup) + vacrelstats->nonempty_pages = blkno + 1; + + /* We're done with this page, so remember its free space as-is. */ + if (freespace) + RecordPageWithFreeSpace(onerel, blkno, freespace); + } + + /* save stats for use later */ + vacrelstats->tuples_deleted = tups_vacuumed; + vacrelstats->new_dead_tuples = nkeep; + + /* + * Now we can compute the new value for pg_class.reltuples. To compensate + * for metapage pass one less than the actual nblocks. + */ + vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, + nblocks - 1, + vacrelstats->tupcount_pages, + num_tuples); + + /* + * Release any remaining pin on visibility map page. + */ + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + + if (vacrelstats->num_dead_tuples > 0) + { + /* + * Remove index entries. Unlike, heap we don't need to log special + * cleanup info which includes latest latestRemovedXid for standby. + * This is because we have covered all the dead tuples in the first + * pass itself and we don't need another pass on heap after index. + */ + for (i = 0; i < nindexes; i++) + lazy_vacuum_index(Irel[i], + &indstats[i], + vacrelstats, + vac_strategy); + + /* + * XXX - The cutoff xid used here is the highest xmin of all the heap + * pages scanned. This can lead to more query cancellations on + * standby. However, alternative is that we track cutoff_xid for + * each page in first-pass of vacuum and then use it after removing + * index entries. We didn't pursue the alternative because it would + * require more work memory which means it can lead to more index + * passes. 
+ */ + MarkPagesAsAllVisible(onerel, vacrelstats, visibility_cutoff_xid); + + vacrelstats->num_index_scans++; + + /* + * Vacuum the Free Space Map to make newly-freed space visible on + * upper-level FSM pages. + */ + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno); + next_fsm_block_to_vacuum = blkno; + } + + /* + * Vacuum the remainder of the Free Space Map. We must do this whether or + * not there were indexes. + */ + if (blkno > next_fsm_block_to_vacuum) + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno); + + /* Do post-vacuum cleanup and statistics update for each index */ + for (i = 0; i < nindexes; i++) + lazy_cleanup_index(Irel[i], indstats[i], vacrelstats, vac_strategy); + + /* + * This is pretty messy, but we split it up so that we can skip emitting + * individual parts of the message when not applicable. + */ + initStringInfo(&infobuf); + appendStringInfo(&infobuf, + _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"), + nkeep, OldestXmin); + appendStringInfo(&infobuf, _("There were %.0f unused item pointers.\n"), + nunused); + appendStringInfo(&infobuf, ngettext("%u page is entirely empty.\n", + "%u pages are entirely empty.\n", + empty_pages), + empty_pages); + appendStringInfo(&infobuf, _("%s."), pg_rusage_show(&ru0)); + + ereport(elevel, + (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages", + RelationGetRelationName(onerel), + tups_vacuumed, num_tuples, + vacrelstats->scanned_pages, nblocks), + errdetail_internal("%s", infobuf.data))); + pfree(infobuf.data); +} + +/* + * lazy_vacuum_zheap_rel() -- perform LAZY VACUUM for one zheap relation + */ +void +lazy_vacuum_zheap_rel(Relation onerel, int options, VacuumParams *params, + BufferAccessStrategy bstrategy) +{ + LVRelStats *vacrelstats; + Relation *Irel; + int nindexes; + PGRUsage ru0; + TimestampTz starttime = 0; + long secs; + int usecs; + double read_rate, + write_rate; + bool aggressive = false; /* should we scan all unfrozen pages? */ + BlockNumber new_rel_pages; + double new_rel_tuples; + double new_live_tuples; + + Assert(params != NULL); + + /* + * For zheap, since vacuum process also reserves transaction slot + * in page, other backend can't ignore this while calculating + * OldestXmin/RecentXmin. See GetSnapshotData for details. + */ + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + MyPgXact->vacuumFlags &= ~PROC_IN_VACUUM; + LWLockRelease(ProcArrayLock); + + /* measure elapsed time iff autovacuum logging requires it */ + if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0) + { + pg_rusage_init(&ru0); + starttime = GetCurrentTimestamp(); + } + + if (options & VACOPT_VERBOSE) + elevel = INFO; + else + elevel = DEBUG2; + + vac_strategy = bstrategy; + + /* + * We can't ignore processes running lazy vacuum on zheap relations because + * like other backends operating on zheap, lazy vacuum also reserves a + * transaction slot in the page for pruning purpose. + */ + OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_DEFAULT); + + Assert(TransactionIdIsNormal(OldestXmin)); + + /* + * We request an aggressive scan if DISABLE_PAGE_SKIPPING was specified. 
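+ *
+ * (There is no relfrozenxid-driven aggressiveness for zheap, so at present
+ * this appears to be the only way a zheap vacuum becomes aggressive.)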
+ */
+ if (options & VACOPT_DISABLE_PAGE_SKIPPING)
+ aggressive = true;
+
+ vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
+
+ vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
+ vacrelstats->old_live_tuples = onerel->rd_rel->reltuples;
+ vacrelstats->num_index_scans = 0;
+ vacrelstats->pages_removed = 0;
+ vacrelstats->lock_waiter_detected = false;
+
+ /* Open all indexes of the relation */
+ vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
+ vacrelstats->hasindex = (nindexes > 0);
+
+ /* Do the vacuuming */
+ lazy_scan_zheap(onerel, options, vacrelstats, Irel, nindexes,
+ vac_strategy, aggressive);
+
+ /* Done with indexes */
+ vac_close_indexes(nindexes, Irel, NoLock);
+
+ /*
+ * Optionally truncate the relation.
+ */
+ if (should_attempt_truncation(vacrelstats))
+ lazy_truncate_heap(onerel, vacrelstats, vac_strategy);
+
+ /*
+ * Update statistics in pg_class.
+ *
+ * A corner case here is that if we scanned no pages at all because every
+ * page is all-visible, we should not update relpages/reltuples, because
+ * we have no new information to contribute. In particular this keeps us
+ * from replacing relpages=reltuples=0 (which means "unknown tuple
+ * density") with nonzero relpages and reltuples=0 (which means "zero
+ * tuple density") unless there's some actual evidence for the latter.
+ *
+ * We can use either tupcount_pages or scanned_pages for the check
+ * described above as both the values should be the same. However, we use
+ * the former so as to be consistent with heap.
+ *
+ * Fixme: We do need to update relallvisible as in heap once we start
+ * using visibilitymap or something equivalent to it.
+ *
+ * relfrozenxid/relminmxid are invalid as we don't perform the freeze
+ * operation in zheap.
+ */
+ new_rel_pages = vacrelstats->rel_pages;
+ new_rel_tuples = vacrelstats->new_rel_tuples;
+ if (vacrelstats->tupcount_pages == 0 && new_rel_pages > 0)
+ {
+ new_rel_pages = vacrelstats->old_rel_pages;
+ new_rel_tuples = vacrelstats->old_live_tuples;
+ }
+
+ vac_update_relstats(onerel,
+ new_rel_pages,
+ new_rel_tuples,
+ new_rel_pages,
+ vacrelstats->hasindex,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ false);
+
+ /* report results to the stats collector, too */
+ new_live_tuples = new_rel_tuples - vacrelstats->new_dead_tuples;
+ if (new_live_tuples < 0)
+ new_live_tuples = 0; /* just in case */
+
+ pgstat_report_vacuum(RelationGetRelid(onerel),
+ onerel->rd_rel->relisshared,
+ new_live_tuples,
+ vacrelstats->new_dead_tuples);
+
+ /* and log the action if appropriate */
+ if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
+ {
+ TimestampTz endtime = GetCurrentTimestamp();
+
+ if (params->log_min_duration == 0 ||
+ TimestampDifferenceExceeds(starttime, endtime,
+ params->log_min_duration))
+ {
+ StringInfoData buf;
+ char *msgfmt;
+
+ TimestampDifference(starttime, endtime, &secs, &usecs);
+
+ read_rate = 0;
+ write_rate = 0;
+ if ((secs > 0) || (usecs > 0))
+ {
+ read_rate = (double) BLCKSZ * VacuumPageMiss / (1024 * 1024) /
+ (secs + usecs / 1000000.0);
+ write_rate = (double) BLCKSZ * VacuumPageDirty / (1024 * 1024) /
+ (secs + usecs / 1000000.0);
+ }
+
+ /*
+ * This is pretty messy, but we split it up so that we can skip
+ * emitting individual parts of the message when not applicable.
+ */ + initStringInfo(&buf); + if (aggressive) + msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n"); + else + msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n"); + appendStringInfo(&buf, msgfmt, + get_database_name(MyDatabaseId), + get_namespace_name(RelationGetNamespace(onerel)), + RelationGetRelationName(onerel), + vacrelstats->num_index_scans); + appendStringInfo(&buf, _("pages: %u removed, %u remain\n"), + vacrelstats->pages_removed, + vacrelstats->rel_pages); + appendStringInfo(&buf, + _("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable, oldest xmin: %u\n"), + vacrelstats->tuples_deleted, + vacrelstats->new_rel_tuples, + vacrelstats->new_dead_tuples, + OldestXmin); + appendStringInfo(&buf, + _("buffer usage: %d hits, %d misses, %d dirtied\n"), + VacuumPageHit, + VacuumPageMiss, + VacuumPageDirty); + appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"), + read_rate, write_rate); + appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0)); + + ereport(LOG, + (errmsg_internal("%s", buf.data))); + pfree(buf.data); + } + } +} + +/* + * lazy_space_zalloc - space allocation decisions for lazy vacuum + * + * See the comments at the head of this file for rationale. + */ +static void +lazy_space_zalloc(LVRelStats *vacrelstats, BlockNumber relblocks) +{ + long maxtuples; + int vac_work_mem = IsAutoVacuumWorkerProcess() && + autovacuum_work_mem != -1 ? + autovacuum_work_mem : maintenance_work_mem; + + if (vacrelstats->hasindex) + { + maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData); + maxtuples = Min(maxtuples, INT_MAX); + maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData)); + + /* curious coding here to ensure the multiplication can't overflow */ + if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks) + maxtuples = relblocks * LAZY_ALLOC_TUPLES; + + /* stay sane if small maintenance_work_mem */ + maxtuples = Max(maxtuples, MaxZHeapTuplesPerPage); + } + else + { + maxtuples = MaxZHeapTuplesPerPage; + } + + vacrelstats->num_dead_tuples = 0; + vacrelstats->max_dead_tuples = (int) maxtuples; + vacrelstats->dead_tuples = (ItemPointer) + palloc(maxtuples * sizeof(ItemPointerData)); +} + +/* + * Check if every tuple in the given page is visible to all current and future + * transactions. Also return the visibility_cutoff_xid which is the highest + * xmin amongst the visible tuples. + */ +static bool +zheap_page_is_all_visible(Relation rel, Buffer buf, + TransactionId *visibility_cutoff_xid) +{ + Page page = BufferGetPage(buf); + BlockNumber blockno = BufferGetBlockNumber(buf); + OffsetNumber offnum, + maxoff; + bool all_visible = true; + + *visibility_cutoff_xid = InvalidTransactionId; + + /* + * This is a stripped down version of the line pointer scan in + * lazy_scan_zheap(). So if you change anything here, also check that code. + */ + maxoff = PageGetMaxOffsetNumber(page); + for (offnum = FirstOffsetNumber; + offnum <= maxoff && all_visible; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemid; + TransactionId xid; + ZHeapTupleData tuple; + + itemid = PageGetItemId(page, offnum); + + /* Unused or redirect line pointers are of no interest */ + if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid)) + continue; + + ItemPointerSet(&(tuple.t_self), blockno, offnum); + + /* + * Dead line pointers can have index pointers pointing to them. 
So + * they can't be treated as visible + */ + if (ItemIdIsDead(itemid)) + { + all_visible = false; + break; + } + + Assert(ItemIdIsNormal(itemid)); + + tuple.t_data = (ZHeapTupleHeader) PageGetItem(page, itemid); + tuple.t_len = ItemIdGetLength(itemid); + tuple.t_tableOid = RelationGetRelid(rel); + + switch (ZHeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf, &xid)) + { + case ZHEAPTUPLE_LIVE: + { + /* + * The inserter definitely committed. But is it old enough + * that everyone sees it as committed? + */ + if (!TransactionIdPrecedes(xid, OldestXmin)) + { + all_visible = false; + break; + } + + /* Track newest xmin on page. */ + if (TransactionIdFollows(xid, *visibility_cutoff_xid)) + *visibility_cutoff_xid = xid; + } + break; + + case ZHEAPTUPLE_DEAD: + case ZHEAPTUPLE_RECENTLY_DEAD: + case ZHEAPTUPLE_INSERT_IN_PROGRESS: + case ZHEAPTUPLE_DELETE_IN_PROGRESS: + case ZHEAPTUPLE_ABORT_IN_PROGRESS: + { + all_visible = false; + break; + } + default: + elog(ERROR, "unexpected ZHeapTupleSatisfiesVacuum result"); + break; + } + } /* scan along page */ + + return all_visible; +} diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index a3015333f3..3d06ba3013 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -31,11 +31,13 @@ #include "access/htup_details.h" #include "access/multixact.h" +#include "access/reloptions.h" #include "access/sysattr.h" #include "access/tableam.h" #include "access/transam.h" #include "access/xact.h" #include "access/xlog.h" +#include "access/zheap.h" #include "catalog/binary_upgrade.h" #include "catalog/catalog.h" #include "catalog/dependency.h" @@ -918,10 +920,14 @@ AddNewRelationTuple(Relation pg_class_desc, break; } - /* Initialize relfrozenxid and relminmxid */ - if (relkind == RELKIND_RELATION || - relkind == RELKIND_MATVIEW || - relkind == RELKIND_TOASTVALUE) + /* + * Initialize relfrozenxid and relminmxid. The relations stored in zheap + * doesn't need to perform freeze. + */ + if ((relkind == RELKIND_RELATION || + relkind == RELKIND_MATVIEW || + relkind == RELKIND_TOASTVALUE) && + new_rel_desc->rd_rel->relam != ZHEAP_TABLE_AM_OID) { /* * Initialize to the minimum XID that could put tuples in the table. @@ -1391,6 +1397,15 @@ heap_create_with_catalog(const char *relname, if (oncommit != ONCOMMIT_NOOP) register_on_commit_action(relid, oncommit); + /* + * Initialize the metapage for zheap, except for partitioned relations as + * they do not have any storage + * PBORKED: abstract + */ + if (accessmtd == ZHEAP_TABLE_AM_OID && + relkind != 'p') + ZheapInitMetaPage(new_rel_desc, MAIN_FORKNUM); + /* * Unlogged objects need an init fork, except for partitioned tables which * have no storage at all. diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 5df4382b7e..37062f39cb 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -24,6 +24,8 @@ #include "access/xlog.h" #include "access/xloginsert.h" #include "access/xlogutils.h" +#include "access/tableam.h" +#include "access/zheap.h" #include "catalog/storage.h" #include "catalog/storage_xlog.h" #include "storage/freespace.h" @@ -288,6 +290,17 @@ RelationTruncate(Relation rel, BlockNumber nblocks) /* Do the real work */ smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks); + + /* + * Re-Initialize the meta page for zheap when the relation is completely + * truncated. + * + * ZBORKED: Is this really sufficient / necessary? Why can't we just stop + * truncating so far in the smgrtruncate above? 
And if we can't do that, + * why isn't the metapage outdated regardless? + */ + if (RelationStorageIsZHeap(rel) && nblocks <= 0) + ZheapInitMetaPage(rel, MAIN_FORKNUM); } /* diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 8630542bb3..8f83ce196c 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -940,6 +940,10 @@ GRANT SELECT (subdbid, subname, subowner, subenabled, subslotname, subpublicatio ON pg_subscription TO public; +CREATE VIEW pg_stat_undo_logs AS + SELECT * + FROM pg_stat_get_undo_logs(); + -- -- We have a few function definitions in here, too. -- At some point there might be enough to justify breaking them out into diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index 1b8d03642c..0acd9c885e 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -858,6 +858,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose, if (MultiXactIdPrecedes(MultiXactCutoff, OldHeap->rd_rel->relminmxid)) MultiXactCutoff = OldHeap->rd_rel->relminmxid; + // ZBORKED / PBORKED: change API so table_copy_for_cluster can set + /* return selected values to caller */ *pFreezeXid = FreezeXid; *pCutoffMulti = MultiXactCutoff; @@ -1096,9 +1098,24 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class, /* set rel1's frozen Xid and minimum MultiXid */ if (relform1->relkind != RELKIND_INDEX) { - Assert(TransactionIdIsNormal(frozenXid)); + Relation rel; + + /* + * ZBORKED: + * + * We don't have multixact or frozenXid concept for zheap. This is a + * hack to keep Asserts, probably we need some pluggable API here to + * set frozen and multixact cutoff xid's. + */ + rel = heap_open(r1, NoLock); + if (!RelationStorageIsZHeap(rel)) + { + Assert(TransactionIdIsNormal(frozenXid)); + Assert(MultiXactIdIsValid(cutoffMulti)); + } + heap_close(rel, NoLock); + relform1->relfrozenxid = frozenXid; - Assert(MultiXactIdIsValid(cutoffMulti)); relform1->relminmxid = cutoffMulti; } diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 9851695fca..c36f3379ff 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2395,7 +2395,12 @@ CopyFrom(CopyState cstate) cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)) { hi_options |= HEAP_INSERT_SKIP_FSM; - if (!XLogIsNeeded()) + /* + * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL. + * See zheap_prepare_insert for details. + * PBORKED / ZBORKED: abstract + */ + if (!RelationStorageIsZHeap(cstate->rel) && !XLogIsNeeded()) hi_options |= HEAP_INSERT_SKIP_WAL; } diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c index d346bf0749..503a319808 100644 --- a/src/backend/commands/createas.c +++ b/src/backend/commands/createas.c @@ -560,9 +560,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) /* * We can skip WAL-logging the insertions, unless PITR or streaming * replication is in use. We can skip the FSM in any case. + * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL. + * See zheap_prepare_insert for details. + * PBORKED / ZBORKED: Move logic into table_getbulkinsertstate, somehow? */ myState->hi_options = HEAP_INSERT_SKIP_FSM | - (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL); + (RelationStorageIsZHeap(myState->rel) || XLogIsNeeded() ? 
0 + : HEAP_INSERT_SKIP_WAL); myState->bistate = GetBulkInsertState(); /* Not using WAL requires smgr_targblock be initially invalid */ diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 03a9d22162..4ece73ca3f 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -23,8 +23,10 @@ #include "access/tableam.h" #include "access/sysattr.h" #include "access/tupconvert.h" +#include "access/tpd.h" #include "access/xact.h" #include "access/xlog.h" +#include "access/zheap.h" #include "catalog/catalog.h" #include "catalog/dependency.h" #include "catalog/heap.h" @@ -470,6 +472,7 @@ static void ATExecForceNoForceRowSecurity(Relation rel, bool force_rls); static void copy_relation_data(SMgrRelation rel, SMgrRelation dst, ForkNumber forkNum, char relpersistence); +static void copy_zrelation_data(Relation srcRel, SMgrRelation dst, ForkNumber forkNum); static const char *storage_name(char c); static void RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid, @@ -1621,8 +1624,12 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged, * * PBORKED: needs to be a callback */ - RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence, - RecentXmin, minmulti); + if (RelationStorageIsZHeap(rel)) + RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence, + InvalidTransactionId, InvalidMultiXactId); + else + RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence, + RecentXmin, minmulti); if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED) table_create_init_fork(rel); @@ -1637,9 +1644,14 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged, Relation toastrel = relation_open(toast_relid, AccessExclusiveLock); - RelationSetNewRelfilenode(toastrel, - toastrel->rd_rel->relpersistence, - RecentXmin, minmulti); + if (RelationStorageIsZHeap(toastrel)) + RelationSetNewRelfilenode(toastrel, + toastrel->rd_rel->relpersistence, + InvalidTransactionId, InvalidMultiXactId); + else + RelationSetNewRelfilenode(toastrel, + toastrel->rd_rel->relpersistence, + RecentXmin, minmulti); if (toastrel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED) table_create_init_fork(toastrel); heap_close(toastrel, NoLock); @@ -4588,7 +4600,13 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) bistate = GetBulkInsertState(); hi_options = HEAP_INSERT_SKIP_FSM; - if (!XLogIsNeeded()) + /* + * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL. + * See zheap_prepare_insert for details. + * + * ZBORKED / PBORKED: We probably need a different abstraction for this. 
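+ *
+ * CopyFrom() and intorel_startup() apply the same RelationStorageIsZHeap()
+ * test before requesting HEAP_INSERT_SKIP_WAL, so none of these bulk-insert
+ * paths use that optimization for a zheap relation.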
+ */ + if (!RelationStorageIsZHeap(newrel) && !XLogIsNeeded()) hi_options |= HEAP_INSERT_SKIP_WAL; } else @@ -10920,8 +10938,11 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode) RelationCreateStorage(newrnode, rel->rd_rel->relpersistence); /* copy main fork */ - copy_relation_data(rel->rd_smgr, dstrel, MAIN_FORKNUM, - rel->rd_rel->relpersistence); + if (RelationStorageIsZHeap(rel)) + copy_zrelation_data(rel, dstrel, MAIN_FORKNUM); + else + copy_relation_data(rel->rd_smgr, dstrel, MAIN_FORKNUM, + rel->rd_rel->relpersistence); /* copy those extra forks that exist */ for (forkNum = MAIN_FORKNUM + 1; forkNum <= MAX_FORKNUM; forkNum++) @@ -11287,6 +11308,137 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst, smgrimmedsync(dst, forkNum); } +/* + * ZBORKED: this breaks abstraction + * + * copy_zrelation_data - same as copy_relation_data but for zheap + * + * In this method, we copy a zheap relation block by block. Here is the algorithm + * for the same: + * For each zheap page, + * a. If it's a meta page, copy it as it is. + * b. If it's a TPD page, copy it as it is. + * c. If it's a zheap data page, apply pending aborts, copy the page and + * the corresponding TPD page (if any). + * + * Please note that we may copy a tpd page multiple times. The reason is one + * tpd page can be referred by multiple zheap pages. While applying pending + * aborts on a zheap page, we also need to modify the transaction and undo + * information in the corresponding TPD page, hence, we need to copy it again + * to reflect the changes. + */ +static void +copy_zrelation_data(Relation srcRel, SMgrRelation dst, ForkNumber forkNum) +{ + Page page; + bool use_wal; + bool copying_initfork; + BlockNumber nblocks; + BlockNumber blkno; + SMgrRelation src = srcRel->rd_smgr; + char relpersistence = srcRel->rd_rel->relpersistence; + + /* + * The init fork for an unlogged relation in many respects has to be + * treated the same as normal relation, changes need to be WAL logged and + * it needs to be synced to disk. + */ + copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED && + forkNum == INIT_FORKNUM; + + /* + * We need to log the copied data in WAL iff WAL archiving/streaming is + * enabled AND it's a permanent relation. + */ + use_wal = XLogIsNeeded() && + (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork); + + nblocks = smgrnblocks(src, forkNum); + + for (blkno = 0; blkno < nblocks; blkno++) + { + BlockNumber target_blkno = InvalidBlockNumber; + BlockNumber tpd_blkno = InvalidBlockNumber; + Buffer buffer = InvalidBuffer; + + /* If we got a cancel signal during the copy of the data, quit */ + CHECK_FOR_INTERRUPTS(); + + if (blkno != ZHEAP_METAPAGE) + { + buffer = ReadBuffer(srcRel, blkno); + + /* If it's a zheap page, apply the pending undo actions */ + if (PageGetSpecialSize(BufferGetPage(buffer)) != + MAXALIGN(sizeof(TPDPageOpaqueData))) + zbuffer_exec_pending_rollback(srcRel, buffer, &tpd_blkno); + } + + target_blkno = blkno; + +copy_buffer: + /* Read the buffer if not already done. */ + if (!BufferIsValid(buffer)) + buffer = ReadBuffer(srcRel, target_blkno); + page = (Page) BufferGetPage(buffer); + + if (!PageIsVerified(page, target_blkno)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("invalid page in block %u of relation %s", + target_blkno, + relpathbackend(src->smgr_rnode.node, + src->smgr_rnode.backend, + forkNum)))); + + /* + * WAL-log the copied page. 
Unfortunately we don't know what kind of a + * page this is, so we have to log the full page including any unused + * space. + */ + if (use_wal) + log_newpage(&dst->smgr_rnode.node, forkNum, target_blkno, page, false); + + PageSetChecksumInplace(page, target_blkno); + + /* + * Now write the page. We say isTemp = true even if it's not a temp + * rel, because there's no need for smgr to schedule an fsync for this + * write; we'll do it ourselves below. + */ + smgrextend(dst, forkNum, target_blkno, page, true); + + ReleaseBuffer(buffer); + + /* If there is a TPD page corresponding to the current page, copy it. */ + if (BlockNumberIsValid(tpd_blkno)) + { + target_blkno = tpd_blkno; + tpd_blkno = InvalidBlockNumber; + buffer = InvalidBuffer; + goto copy_buffer; + } + } + + /* + * If the rel is WAL-logged, must fsync before commit. We use heap_sync + * to ensure that the toast table gets fsync'd too. (For a temp or + * unlogged rel we don't care since the data will be gone after a crash + * anyway.) + * + * It's obvious that we must do this when not WAL-logging the copy. It's + * less obvious that we have to do it even if we did WAL-log the copied + * pages. The reason is that since we're copying outside shared buffers, a + * CHECKPOINT occurring during the copy has no way to flush the previously + * written data to disk (indeed it won't know the new rel even exists). A + * crash later on would replay WAL from the checkpoint, therefore it + * wouldn't replay our earlier WAL entries. If we do not fsync those pages + * here, they might still not be on disk when the crash occurs. + */ + if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork) + smgrimmedsync(dst, forkNum); +} + /* * ALTER TABLE ENABLE/DISABLE TRIGGER * @@ -13163,6 +13315,10 @@ PreCommit_on_commit_actions(void) case ONCOMMIT_DROP: oids_to_drop = lappend_oid(oids_to_drop, oc->relid); break; + case ONCOMMIT_TEMP_DISCARD: + /* Discard temp table undo logs for temp tables. */ + TempUndoDiscard(oc->relid); + break; } } diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c index ca429731d4..73bdd539fe 100644 --- a/src/backend/commands/tablespace.c +++ b/src/backend/commands/tablespace.c @@ -488,6 +488,20 @@ DropTableSpace(DropTableSpaceStmt *stmt) */ LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE); + /* + * Drop the undo logs in this tablespace. This will fail (without + * dropping anything) if there are undo logs that we can't afford to drop + * because they contain non-discarded data or a transaction is in + * progress. Since we hold TablespaceCreateLock, no other session will be + * able to attach to an undo log in this tablespace (or any tablespace + * except default) concurrently. + */ + if (!DropUndoLogsInTablespace(tablespaceoid)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs", + tablespacename))); + /* * Try to remove the physical infrastructure. */ @@ -1488,6 +1502,14 @@ tblspc_redo(XLogReaderState *record) { xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record); + /* This shouldn't be able to fail in recovery. 
*/ + LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE); + if (!DropUndoLogsInTablespace(xlrec->ts_id)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("tablespace cannot be dropped because it contains non-empty undo logs"))); + LWLockRelease(TablespaceCreateLock); + /* * If we issued a WAL record for a drop tablespace it implies that * there were no files in it at all when the DROP was done. That means diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index fcae282044..bd15d04cf0 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -1262,12 +1262,15 @@ vac_update_datfrozenxid(void) Form_pg_class classForm = (Form_pg_class) GETSTRUCT(classTup); /* - * Only consider relations able to hold unfrozen XIDs (anything else - * should have InvalidTransactionId in relfrozenxid anyway.) + * Only consider relations able to hold unfrozen XIDs */ - if (classForm->relkind != RELKIND_RELATION && - classForm->relkind != RELKIND_MATVIEW && - classForm->relkind != RELKIND_TOASTVALUE) + if ((classForm->relkind != RELKIND_RELATION && + classForm->relkind != RELKIND_MATVIEW && + classForm->relkind != RELKIND_TOASTVALUE)) + continue; + + /* some AMs might not use frozen xids */ + if (!TransactionIdIsValid(classForm->relfrozenxid)) continue; Assert(TransactionIdIsNormal(classForm->relfrozenxid)); @@ -1382,6 +1385,7 @@ vac_truncate_clog(TransactionId frozenXID, MultiXactId lastSaneMinMulti) { TransactionId nextXID = ReadNewTransactionId(); + TransactionId oldestXidHavingUndo; Relation relation; TableScanDesc scan; HeapTuple tuple; @@ -1475,6 +1479,16 @@ vac_truncate_clog(TransactionId frozenXID, if (bogus) return; + /* + * We can't truncate the clog for transactions that still have undo. The + * oldestXidHavingUndo will be only valid for zheap storage engine, so it + * won't impact any other storage engine. + */ + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + if (TransactionIdIsValid(oldestXidHavingUndo)) + frozenXID = Min(frozenXID, oldestXidHavingUndo); + /* * Advance the oldest value for commit timestamps before truncating, so * that if a user requests a timestamp for a transaction we're truncating diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c index 66d838dbce..154271a464 100644 --- a/src/backend/executor/execIndexing.c +++ b/src/backend/executor/execIndexing.c @@ -726,7 +726,7 @@ retry: while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot)) { - TransactionId xwait; + TransactionId xwait, xid; XLTW_Oper reason_wait; Datum existing_values[INDEX_MAX_KEYS]; bool existing_isnull[INDEX_MAX_KEYS]; @@ -778,11 +778,22 @@ retry: xwait = TransactionIdIsValid(DirtySnapshot.xmin) ? DirtySnapshot.xmin : DirtySnapshot.xmax; + /* For zheap, we always use Top Transaction Id. */ + // ZBORKED: What does this even mean? + if (RelationStorageIsZHeap(heap)) + { + xid = GetTopTransactionId(); + } + else + { + xid = GetCurrentTransactionId(); + } + if (TransactionIdIsValid(xwait) && (waitMode == CEOUC_WAIT || (waitMode == CEOUC_LIVELOCK_PREVENTING_WAIT && DirtySnapshot.speculativeToken && - TransactionIdPrecedes(GetCurrentTransactionId(), xwait)))) + TransactionIdPrecedes(xid, xwait)))) { /* * PBORKED? 
When waiting, we used to use t_ctid, rather than @@ -794,6 +805,9 @@ retry: if (DirtySnapshot.speculativeToken) SpeculativeInsertionWait(DirtySnapshot.xmin, DirtySnapshot.speculativeToken); + else if (DirtySnapshot.subxid != InvalidSubTransactionId) + SubXactLockTableWait(xwait, DirtySnapshot.subxid, heap, + &existing_slot->tts_tid, reason_wait); else XactLockTableWait(xwait, heap, &existing_slot->tts_tid, reason_wait); diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c index d91a71a7c1..1085fb4e9b 100644 --- a/src/backend/executor/execTuples.c +++ b/src/backend/executor/execTuples.c @@ -60,6 +60,9 @@ #include "access/htup_details.h" #include "access/tupdesc_details.h" #include "access/tuptoaster.h" +#include "access/zheap.h" +#include "access/zheaputils.h" +#include "access/zhtup.h" #include "funcapi.h" #include "catalog/pg_type.h" #include "nodes/nodeFuncs.h" @@ -1011,6 +1014,176 @@ slot_deform_heap_tuple(TupleTableSlot *slot, HeapTuple tuple, uint32 *offp, } + +/* + * TupleTableSlotOps implementation for ZheapHeapTupleTableSlot. + */ + +static void +tts_zheap_init(TupleTableSlot *slot) +{ +} + +static void +tts_zheap_release(TupleTableSlot *slot) +{ +} + +static void +tts_zheap_clear(TupleTableSlot *slot) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + + /* + * Free the memory for heap tuple if allowed. A tuple coming from zheap + * can never be freed. But we may have materialized a tuple from zheap. + * Such a tuple can be freed. + */ + if (TTS_SHOULDFREE(slot)) + { + zheap_freetuple(zslot->tuple); + slot->tts_flags &= ~TTS_FLAG_SHOULDFREE; + } + +#if 0 + if (ZheapIsValid(bslot->zheap)) + ReleaseZheap(bslot->zheap); +#endif + + slot->tts_nvalid = 0; + slot->tts_flags |= TTS_FLAG_EMPTY; + zslot->tuple = NULL; +} + +static void +tts_zheap_getsomeattrs(TupleTableSlot *slot, int natts) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + + Assert(!TTS_EMPTY(slot)); + slot_deform_ztuple(slot, zslot->tuple, &zslot->off, natts); +} + +static Datum +tts_zheap_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + + return zheap_getsysattr(zslot->tuple, InvalidBuffer, attnum, + slot->tts_tupleDescriptor, isnull); +} + +/* + * Materialize the heap tuple contained in the given slot into its own memory + * context. + */ +static void +tts_zheap_materialize(TupleTableSlot *slot) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + MemoryContext oldContext; + + Assert(!TTS_EMPTY(slot)); + + /* If already materialized nothing to do. */ + if (TTS_SHOULDFREE(slot)) + return; + + slot->tts_flags |= TTS_FLAG_SHOULDFREE; + + oldContext = MemoryContextSwitchTo(slot->tts_mcxt); + + if (zslot->tuple) + zslot->tuple = zheap_copytuple(zslot->tuple); + else + { + /* + * The tuple contained in this slot is not allocated in the memory + * context of the given slot (else it would have TTS_SHOULDFREE set). + * Copy the tuple into the given slot's memory context. + */ + zslot->tuple = zheap_form_tuple(slot->tts_tupleDescriptor, + slot->tts_values, + slot->tts_isnull); + } + MemoryContextSwitchTo(oldContext); + +#if 0 + /* + * TODO: I expect a ZheapHeapTupleTableSlot to always have a zheap to be + * associated with it OR the tuple is materialized. In the later case we + * won't come here. So, we should always see a valid zheap here to be + * unpinned. 
+ */ + if (zslot->tuple) + { + ReleaseZheap(bslot->zheap); + bslot->zheap = InvalidZheap; + } +#endif + + /* + * Have to deform from scratch, otherwise tts_values[] entries could point + * into the non-materialized tuple (which might be gone when accessed). + */ + slot->tts_nvalid = 0; + zslot->off = 0; +} + +static void +tts_zheap_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot) +{ + HeapTuple tuple; + MemoryContext oldcontext; + + // PBORKED: This is a horrible implementation + + oldcontext = MemoryContextSwitchTo(dstslot->tts_mcxt); + tuple = ExecCopySlotHeapTuple(srcslot); + MemoryContextSwitchTo(oldcontext); + + ExecForceStoreHeapTuple(tuple, dstslot); + ExecMaterializeSlot(dstslot); + + pfree(tuple); +} + +static HeapTuple +tts_zheap_copy_heap_tuple(TupleTableSlot *slot) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + + Assert(!TTS_EMPTY(slot)); + + if (!zslot->tuple) + tts_zheap_materialize(slot); + + return zheap_to_heap(zslot->tuple, slot->tts_tupleDescriptor); +} + +/* + * Return a minimal tuple constructed from the contents of the slot. + * + * We always return a new minimal tuple, so no copy, per se, is needed. + * + * TODO: + * This function is an exact copy of tts_zheap_get_minimal_tuple() and thus the + * callback should point to that one instead of a new implementation. But + * there's one TODO there which might change tts_heap_get_minimal_tuple(). + */ +static MinimalTuple +tts_zheap_copy_minimal_tuple(TupleTableSlot *slot) +{ + slot_getallattrs(slot); + + return heap_form_minimal_tuple(slot->tts_tupleDescriptor, + slot->tts_values, slot->tts_isnull); +} + +/* + * TupleTableSlotOps for each of TupleTableSlotTypes. These are used to + * identify the type of slot. + */ const TupleTableSlotOps TTSOpsVirtual = { .base_slot_size = sizeof(VirtualTupleTableSlot), .init = tts_virtual_init, @@ -1082,6 +1255,22 @@ const TupleTableSlotOps TTSOpsBufferHeapTuple = { .copy_minimal_tuple = tts_buffer_heap_copy_minimal_tuple }; +const TupleTableSlotOps TTSOpsZHeapTuple = { + .base_slot_size = sizeof(ZHeapTupleTableSlot), + .init = tts_zheap_init, + .release = tts_zheap_release, + .clear = tts_zheap_clear, + .getsomeattrs = tts_zheap_getsomeattrs, + .getsysattr = tts_zheap_getsysattr, + .materialize = tts_zheap_materialize, + .copyslot = tts_zheap_copyslot, + + .get_heap_tuple = NULL, + .get_minimal_tuple = NULL, + + .copy_heap_tuple = tts_zheap_copy_heap_tuple, + .copy_minimal_tuple = tts_zheap_copy_minimal_tuple +}; /* ---------------------------------------------------------------- * tuple table create/delete functions @@ -1476,6 +1665,43 @@ ExecStoreMinimalTuple(MinimalTuple mtup, return slot; } +/* -------------------------------- + * ExecStoreZTuple + * + * This function is the same as ExecStoreTuple except that it is used to store a + * physical zheap tuple into a specified slot in the tuple table. + * + * NOTE: Unlike ExecStoreTuple, it's possible that buffer is valid and + * should_free is true. That is because slot->tts_ztuple may be a copy of the + * tuple allocated locally, so we want to free the tuple even while + * keeping a pin/lock on the previously valid buffer.
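A minimal usage sketch for ExecStoreZTuple (defined just below), assuming tupdesc, values, isnull and slot are locals already set up by the caller; the zheap_form_tuple call is purely illustrative:

    ZHeapTuple  ztup;

    /* Build a local, palloc'd zheap tuple. */
    ztup = zheap_form_tuple(tupdesc, values, isnull);

    /* shouldFree = true: the slot takes ownership and frees it on clear. */
    ExecStoreZTuple(ztup, slot, InvalidBuffer, true);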
+ */ +TupleTableSlot * +ExecStoreZTuple(ZHeapTuple tuple, + TupleTableSlot *slot, + Buffer buffer, + bool shouldFree) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + /* + * sanity checks + */ + Assert(slot != NULL); + Assert(TTS_IS_ZHEAP(slot)); + tts_zheap_clear(slot); + + slot->tts_nvalid = 0; + zslot->tuple = tuple; + zslot->off = 0; + slot->tts_flags &= ~TTS_FLAG_EMPTY; + slot->tts_tid = tuple->t_self; + + if (shouldFree) + slot->tts_flags |= TTS_FLAG_SHOULDFREE; + + return slot; +} + /* * Store a HeapTuple into any kind of slot, performing conversion if * necessary. @@ -1559,6 +1785,23 @@ ExecForceStoreHeapTupleDatum(Datum data, TupleTableSlot *slot) ExecMaterializeSlot(slot); } +ZHeapTuple +ExecGetZHeapTupleFromSlot(TupleTableSlot *slot) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + + if (!TTS_IS_ZHEAP(slot)) + elog(ERROR, "unsupported"); + + if (TTS_EMPTY(slot)) + return NULL; + + if (!zslot->tuple) + slot->tts_ops->materialize(slot); + + return zslot->tuple; +} + /* -------------------------------- * ExecStoreVirtualTuple * Mark a slot as containing a virtual tuple. diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index d1ac9fc2e9..78b0607406 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -680,7 +680,7 @@ ldelete:; estate->es_snapshot, slot, estate->es_output_cid, LockTupleExclusive, LockWaitBlock, - TUPLE_LOCK_FLAG_FIND_LAST_VERSION, + TUPLE_LOCK_FLAG_FIND_LAST_VERSION | TUPLE_LOCK_FLAG_WEIRD, &hufd); /*hari FIXME*/ /*Assert(result != HeapTupleUpdated && hufd.traversed);*/ @@ -1174,7 +1174,7 @@ lreplace:; estate->es_snapshot, inputslot, estate->es_output_cid, lockmode, LockWaitBlock, - TUPLE_LOCK_FLAG_FIND_LAST_VERSION, + TUPLE_LOCK_FLAG_FIND_LAST_VERSION | TUPLE_LOCK_FLAG_WEIRD, &hufd); /* hari FIXME*/ /*Assert(result != HeapTupleUpdated && hufd.traversed);*/ @@ -1360,7 +1360,8 @@ ExecOnConflictUpdate(ModifyTableState *mtstate, test = table_lock_tuple(relation, conflictTid, estate->es_snapshot, mtstate->mt_existing, estate->es_output_cid, - lockmode, LockWaitBlock, TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS, + lockmode, LockWaitBlock, + TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS | TUPLE_LOCK_FLAG_WEIRD, &hufd); switch (test) { @@ -1403,6 +1404,15 @@ ExecOnConflictUpdate(ModifyTableState *mtstate, break; case HeapTupleSelfUpdated: +#ifdef ZBORKED + /* + * ZHEAP accepts this, but this isn't ok from a layering POV (and + * I'm doubtful about the correctness). See 1e9d17cc240. + * + * Unlike heap, we expect HeapTupleSelfUpdated in the same scenario + * as the new tuple could have been in-place updated. + */ +#endif /* * This state should never be reached. As a dirty snapshot is used diff --git a/src/backend/lib/stringinfo.c b/src/backend/lib/stringinfo.c index df7e01f76d..73e57d4d47 100644 --- a/src/backend/lib/stringinfo.c +++ b/src/backend/lib/stringinfo.c @@ -238,15 +238,47 @@ appendBinaryStringInfo(StringInfo str, const char *data, int datalen) */ void appendBinaryStringInfoNT(StringInfo str, const char *data, int datalen) +{ + Assert(str != NULL); + + /* Make more room if needed */ + enlargeStringInfo(str, datalen); + + /* OK, append the data */ + memcpy(str->data + str->len, data, datalen); + str->len += datalen; +} + +/* appendBinaryStringInfoNoExtend + * + * Append arbitrary binary data to a StringInfo. + * + * Returns false, if more space is required to append the string, true + * otherwise. + * + * This can be used in critical section. 
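A minimal sketch of the calling pattern the no-extend variant below is meant for, assuming the caller knows an upper bound on the data it will append (the bound name and the PANIC reaction are illustrative, not part of this patch):

    StringInfoData buf;

    initStringInfo(&buf);
    /* Reserve space while allocation is still allowed. */
    enlargeStringInfo(&buf, max_record_size);

    START_CRIT_SECTION();
    /* No palloc may happen here; the append must fit in the reserved space. */
    if (!appendBinaryStringInfoNoExtend(&buf, data, datalen))
        elog(PANIC, "reserved StringInfo space too small");
    END_CRIT_SECTION();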
+ */ +bool +appendBinaryStringInfoNoExtend(StringInfo str, const char *data, int datalen) { Assert(str != NULL); - /* Make more room if needed */ - enlargeStringInfo(str, datalen); + /* fail, if more space is required */ + if (datalen > str->maxlen) + return false; /* OK, append the data */ memcpy(str->data + str->len, data, datalen); str->len += datalen; + + /* + * Keep a trailing null in place, even though it's probably useless for + * binary data. (Some callers are dealing with text but call this because + * their input isn't null-terminated.) + */ + str->data[str->len] = '\0'; + + return true; } /* diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c index 17dc53898f..c7b98edf7a 100644 --- a/src/backend/nodes/tidbitmap.c +++ b/src/backend/nodes/tidbitmap.c @@ -41,6 +41,7 @@ #include #include "access/htup_details.h" +#include "access/zheap.h" #include "nodes/bitmapset.h" #include "nodes/tidbitmap.h" #include "storage/lwlock.h" @@ -53,7 +54,7 @@ * the per-page bitmaps variable size. We just legislate that the size * is this: */ -#define MAX_TUPLES_PER_PAGE MaxHeapTuplesPerPage +#define MAX_TUPLES_PER_PAGE Max(MaxHeapTuplesPerPage, MaxZHeapTuplesPerPage) /* * When we have to switch over to lossy storage, we use a data structure diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index 5e1b27be75..6b33814f39 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -1000,7 +1000,9 @@ estimate_rel_size(Relation rel, int32 *attr_widths, * minimum to indexes. */ if (curpages < 10 && - rel->rd_rel->relpages == 0 && + (rel->rd_rel->relpages == 0 || + (RelationStorageIsZHeap(rel) && + rel->rd_rel->relpages == ZHEAP_METAPAGE + 1)) && !rel->rd_rel->relhassubclass && rel->rd_rel->relkind != RELKIND_INDEX) curpages = 10; @@ -1008,7 +1010,8 @@ estimate_rel_size(Relation rel, int32 *attr_widths, /* report estimated # pages */ *pages = curpages; /* quick exit if rel is clearly empty */ - if (curpages == 0) + if (curpages == 0 || (RelationStorageIsZHeap(rel) && + curpages == ZHEAP_METAPAGE + 1)) { *tuples = 0; *allvisfrac = 0; @@ -1019,6 +1022,15 @@ estimate_rel_size(Relation rel, int32 *attr_widths, reltuples = (double) rel->rd_rel->reltuples; relallvisible = (BlockNumber) rel->rd_rel->relallvisible; + /* + * If it's a zheap relation, then subtract the pages + * to account for the metapage. + */ + if (relpages > 0 && RelationStorageIsZHeap(rel)) + { + curpages--; + relpages--; + } /* * If it's an index, discount the metapage while estimating the * number of tuples. This is a kluge because it assumes more than diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile index 71c23211b2..e44c563b45 100644 --- a/src/backend/postmaster/Makefile +++ b/src/backend/postmaster/Makefile @@ -13,6 +13,7 @@ top_builddir = ../../.. 
include $(top_builddir)/src/Makefile.global OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \ - pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o + discardworker.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \ + undoworker.o walwriter.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c index d2b695e146..49df516957 100644 --- a/src/backend/postmaster/bgworker.c +++ b/src/backend/postmaster/bgworker.c @@ -20,7 +20,9 @@ #include "pgstat.h" #include "port/atomics.h" #include "postmaster/bgworker_internals.h" +#include "postmaster/discardworker.h" #include "postmaster/postmaster.h" +#include "postmaster/undoworker.h" #include "replication/logicallauncher.h" #include "replication/logicalworker.h" #include "storage/dsm.h" @@ -129,6 +131,15 @@ static const struct }, { "ApplyWorkerMain", ApplyWorkerMain + }, + { + "UndoLauncherMain", UndoLauncherMain + }, + { + "UndoWorkerMain", UndoWorkerMain + }, + { + "DiscardWorkerMain", DiscardWorkerMain } }; diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c index b9c118e156..b2505c8f23 100644 --- a/src/backend/postmaster/checkpointer.c +++ b/src/backend/postmaster/checkpointer.c @@ -1314,7 +1314,7 @@ AbsorbFsyncRequests(void) LWLockRelease(CheckpointerCommLock); for (request = requests; n > 0; request++, n--) - RememberFsyncRequest(request->rnode, request->forknum, request->segno); + smgrrequestsync(request->rnode, request->forknum, request->segno); END_CRIT_SECTION(); diff --git a/src/backend/postmaster/discardworker.c b/src/backend/postmaster/discardworker.c new file mode 100644 index 0000000000..121779009d --- /dev/null +++ b/src/backend/postmaster/discardworker.c @@ -0,0 +1,169 @@ +/*------------------------------------------------------------------------- + * + * discardworker.c + * The undo discard worker for asynchronous undo management. + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/postmaster/discardworker.c + *------------------------------------------------------------------------- + */ + +#include "postgres.h" +#include + +/* These are always necessary for a bgworker. */ +#include "miscadmin.h" +#include "postmaster/bgworker.h" +#include "storage/ipc.h" +#include "storage/latch.h" +#include "storage/lwlock.h" +#include "storage/proc.h" +#include "storage/shmem.h" + +#include "access/undodiscard.h" +#include "pgstat.h" +#include "postmaster/discardworker.h" +#include "storage/procarray.h" +#include "tcop/tcopprot.h" +#include "utils/guc.h" +#include "utils/resowner.h" + +static void undoworker_sigterm_handler(SIGNAL_ARGS); + +/* max sleep time between cycles (100 milliseconds) */ +#define MIN_NAPTIME_PER_CYCLE 100L +#define DELAYED_NAPTIME 10 * MIN_NAPTIME_PER_CYCLE +#define MAX_NAPTIME_PER_CYCLE 100 * MIN_NAPTIME_PER_CYCLE + +static bool got_SIGTERM = false; +static bool hibernate = false; +static long wait_time = MIN_NAPTIME_PER_CYCLE; + +/* SIGTERM: set flag to exit at next convenient time */ +static void +undoworker_sigterm_handler(SIGNAL_ARGS) +{ + got_SIGTERM = true; + + /* Waken anything waiting on the process latch */ + SetLatch(MyLatch); +} + +/* + * DiscardWorkerRegister -- Register a undo discard worker. + */ +void +DiscardWorkerRegister(void) +{ + BackgroundWorker bgw; + + /* TODO: This should be configurable. 
*/ + + memset(&bgw, 0, sizeof(bgw)); + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | + BGWORKER_BACKEND_DATABASE_CONNECTION; + bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; + snprintf(bgw.bgw_name, BGW_MAXLEN, "discard worker"); + sprintf(bgw.bgw_library_name, "postgres"); + sprintf(bgw.bgw_function_name, "DiscardWorkerMain"); + bgw.bgw_restart_time = 5; + bgw.bgw_notify_pid = 0; + bgw.bgw_main_arg = (Datum) 0; + + RegisterBackgroundWorker(&bgw); +} + +/* + * DiscardWorkerMain -- Main loop for the undo discard worker. + */ +void +DiscardWorkerMain(Datum main_arg) +{ + ereport(LOG, + (errmsg("discard worker started"))); + + /* Establish signal handlers. */ + pqsignal(SIGTERM, undoworker_sigterm_handler); + BackgroundWorkerUnblockSignals(); + + /* Make it easy to identify our processes. */ + SetConfigOption("application_name", MyBgworkerEntry->bgw_name, + PGC_USERSET, PGC_S_SESSION); + + /* + * Create a resource owner for the discard worker, as it needs to read undo + * records outside of transaction blocks, which in turn accesses the buffer + * read routines. + */ + CreateAuxProcessResourceOwner(); + + /* Enter main loop */ + while (!got_SIGTERM) + { + int rc; + TransactionId OldestXmin, oldestXidHavingUndo; + + OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT); + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + + /* + * Call the discard routine if oldestXidHavingUndo is lagging + * behind OldestXmin. + */ + if (OldestXmin != InvalidTransactionId && + TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin)) + { + UndoDiscard(OldestXmin, &hibernate); + + /* + * If we found undo logs to discard or discarded something, + * then reset the wait_time, as we have work to do. + * Note that if there are undo logs that cannot be discarded, + * the above condition will remain unsatisfied as long as oldestXmin + * remains unchanged, and the wait_time will not be reset in that case. + */ + if (!hibernate) + wait_time = MIN_NAPTIME_PER_CYCLE; + } + + /* Wait for more work. */ + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + wait_time, + WAIT_EVENT_UNDO_DISCARD_WORKER_MAIN); + + ResetLatch(&MyProc->procLatch); + + /* + * Increase the wait_time based on the length of inactivity. While wait_time + * is below one second, increment it by 100 ms at a time. Thereafter, + * increment it one second at a time, until it reaches ten seconds. Never + * increase the wait_time beyond ten seconds; that would be too long a + * wait otherwise. + */ + if (rc & WL_TIMEOUT && hibernate) + { + wait_time += (wait_time < DELAYED_NAPTIME ?
+ MIN_NAPTIME_PER_CYCLE : DELAYED_NAPTIME); + if (wait_time > MAX_NAPTIME_PER_CYCLE) + wait_time = MAX_NAPTIME_PER_CYCLE; + } + + /* emergency bailout if postmaster has died */ + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); + } + + ReleaseAuxProcessResources(true); + + /* we're done */ + ereport(LOG, + (errmsg("discard worker shutting down"))); + + proc_exit(0); +} diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 7762dbc44b..b72795825f 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -1926,6 +1926,28 @@ pgstat_count_heap_insert(Relation rel, PgStat_Counter n) } } +/* + * pgstat_count_zheap_update - count a inplace tuple update + */ +void +pgstat_count_zheap_update(Relation rel) +{ + PgStat_TableStatus *pgstat_info = rel->pgstat_info; + + if (pgstat_info != NULL) + { + /* We have to log the effect at the proper transactional level */ + int nest_level = GetCurrentTransactionNestLevel(); + + if (pgstat_info->trans == NULL || + pgstat_info->trans->nest_level != nest_level) + add_tabstat_xact_level(pgstat_info, nest_level); + + /* t_tuples_hot_updated is nontransactional, so just advance it */ + pgstat_info->t_counts.t_tuples_hot_updated++; + } +} + /* * pgstat_count_heap_update - count a tuple update */ @@ -3376,6 +3398,9 @@ pgstat_get_wait_event_type(uint32 wait_event_info) case PG_WAIT_IO: event_type = "IO"; break; + case PG_WAIT_PAGE_TRANS_SLOT: + event_type = "TransSlot"; + break; default: event_type = "???"; break; @@ -3453,6 +3478,9 @@ pgstat_get_wait_event(uint32 wait_event_info) event_name = pgstat_get_wait_io(w); break; } + case PG_WAIT_PAGE_TRANS_SLOT: + event_name = "TransSlot"; + break; default: event_name = "unknown wait event"; break; @@ -3516,7 +3544,13 @@ pgstat_get_wait_activity(WaitEventActivity w) case WAIT_EVENT_WAL_WRITER_MAIN: event_name = "WalWriterMain"; break; - /* no default case, so that compiler will warn */ + case WAIT_EVENT_UNDO_DISCARD_WORKER_MAIN: + event_name = "UndoDiscardWorkerMain"; + break; + case WAIT_EVENT_UNDO_LAUNCHER_MAIN: + event_name = "UndoLauncherMain"; + break; + /* no default case, so that compiler will warn */ } return event_name; @@ -3898,6 +3932,28 @@ pgstat_get_wait_io(WaitEventIO w) case WAIT_EVENT_TWOPHASE_FILE_WRITE: event_name = "TwophaseFileWrite"; break; + case WAIT_EVENT_UNDO_CHECKPOINT_READ: + event_name = "UndoCheckpointRead"; + break; + case WAIT_EVENT_UNDO_CHECKPOINT_WRITE: + event_name = "UndoCheckpointWrite"; + break; + case WAIT_EVENT_UNDO_CHECKPOINT_SYNC: + event_name = "UndoCheckpointSync"; + break; + case WAIT_EVENT_UNDO_FILE_READ: + event_name = "UndoFileRead"; + break; + case WAIT_EVENT_UNDO_FILE_WRITE: + event_name = "UndoFileWrite"; + break; + case WAIT_EVENT_UNDO_FILE_FLUSH: + event_name = "UndoFileFlush"; + break; + case WAIT_EVENT_UNDO_FILE_SYNC: + event_name = "UndoFileSync"; + break; + case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ: event_name = "WALSenderTimelineHistoryRead"; break; diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index a33a131182..9f6ba1a65f 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -111,10 +111,12 @@ #include "port/pg_bswap.h" #include "postmaster/autovacuum.h" #include "postmaster/bgworker_internals.h" +#include "postmaster/discardworker.h" #include "postmaster/fork_process.h" #include "postmaster/pgarch.h" #include "postmaster/postmaster.h" #include "postmaster/syslogger.h" +#include "postmaster/undoworker.h" #include 
"replication/logicallauncher.h" #include "replication/walsender.h" #include "storage/fd.h" @@ -246,6 +248,8 @@ bool enable_bonjour = false; char *bonjour_name; bool restart_after_crash = true; +bool disable_undo_launcher; + /* PIDs of special child processes; 0 when not running */ static pid_t StartupPID = 0, BgWriterPID = 0, @@ -991,6 +995,13 @@ PostmasterMain(int argc, char *argv[]) */ ApplyLauncherRegister(); + /* Register the Undo worker launcher. */ + if (!disable_undo_launcher) + UndoLauncherRegister(); + + /* Register the Undo Discard worker. */ + DiscardWorkerRegister(); + /* * process any libraries that should be preloaded at postmaster start */ diff --git a/src/backend/postmaster/undoworker.c b/src/backend/postmaster/undoworker.c new file mode 100644 index 0000000000..55ac139a91 --- /dev/null +++ b/src/backend/postmaster/undoworker.c @@ -0,0 +1,665 @@ +/*------------------------------------------------------------------------- + * + * undoworker.c + * undo launcher and undo worker process. + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/postmaster/undoworker.c + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "funcapi.h" +#include "miscadmin.h" +#include "pgstat.h" + +#include "access/heapam.h" +#include "access/htup.h" +#include "access/htup_details.h" +#include "access/sysattr.h" +#include "access/xact.h" + +#include "catalog/indexing.h" +#include "catalog/pg_database.h" + +#include "libpq/pqsignal.h" + +#include "postmaster/bgworker.h" +#include "postmaster/fork_process.h" +#include "postmaster/postmaster.h" +#include "postmaster/undoloop.h" +#include "postmaster/undoworker.h" + +#include "replication/slot.h" +#include "replication/worker_internal.h" + +#include "storage/ipc.h" +#include "storage/lmgr.h" +#include "storage/proc.h" +#include "storage/procarray.h" +#include "storage/procsignal.h" + +#include "tcop/tcopprot.h" + +#include "utils/fmgroids.h" +#include "utils/hsearch.h" +#include "utils/memutils.h" +#include "utils/resowner.h" + +/* max sleep time between cycles (100 milliseconds) */ +#define DEFAULT_NAPTIME_PER_CYCLE 100L +#define DEFAULT_RETRY_NAPTIME 50L + +int max_undo_workers = 5; + +typedef struct UndoApplyWorker +{ + /* Indicates if this slot is used or free. */ + bool in_use; + + /* Increased everytime the slot is taken by new worker. */ + uint16 generation; + + /* Pointer to proc array. NULL if not running. */ + PGPROC *proc; + + /* Database id to connect to. */ + Oid dbid; +} UndoApplyWorker; + +UndoApplyWorker *MyUndoWorker = NULL; + +typedef struct UndoApplyCtxStruct +{ + /* Supervisor process. */ + pid_t launcher_pid; + + /* Background workers. */ + UndoApplyWorker workers[FLEXIBLE_ARRAY_MEMBER]; +} UndoApplyCtxStruct; + +UndoApplyCtxStruct *UndoApplyCtx; + +static void undo_worker_onexit(int code, Datum arg); +static void undo_worker_cleanup(UndoApplyWorker *worker); + +static volatile sig_atomic_t got_SIGHUP = false; + +/* + * Wait for a background worker to start up and attach to the shmem context. + * + * This is only needed for cleaning up the shared memory in case the worker + * fails to attach. 
+ */ +static void +WaitForUndoWorkerAttach(UndoApplyWorker *worker, + uint16 generation, + BackgroundWorkerHandle *handle) +{ + BgwHandleStatus status; + int rc; + + for (;;) + { + pid_t pid; + + CHECK_FOR_INTERRUPTS(); + + LWLockAcquire(UndoWorkerLock, LW_SHARED); + + /* Worker either died or has started; no need to do anything. */ + if (!worker->in_use || worker->proc) + { + LWLockRelease(UndoWorkerLock); + return; + } + + LWLockRelease(UndoWorkerLock); + + /* Check if worker has died before attaching, and clean up after it. */ + status = GetBackgroundWorkerPid(handle, &pid); + + if (status == BGWH_STOPPED) + { + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + /* Ensure that this was indeed the worker we waited for. */ + if (generation == worker->generation) + undo_worker_cleanup(worker); + LWLockRelease(UndoWorkerLock); + return; + } + + /* + * We need timeout because we generally don't get notified via latch + * about the worker attach. But we don't expect to have to wait long. + */ + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + 10L, WAIT_EVENT_BGWORKER_STARTUP); + + /* emergency bailout if postmaster has died */ + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); + + if (rc & WL_LATCH_SET) + { + ResetLatch(MyLatch); + CHECK_FOR_INTERRUPTS(); + } + } + + return; +} + +/* + * Get dbid from the worker slot. + */ +static Oid +slot_get_dbid(int slot) +{ + Oid dbid; + + /* Block concurrent access. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + + MyUndoWorker = &UndoApplyCtx->workers[slot]; + + if (!MyUndoWorker->in_use) + { + LWLockRelease(UndoWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("undo worker slot %d is empty,", + slot))); + } + + dbid = MyUndoWorker->dbid; + + LWLockRelease(UndoWorkerLock); + + return dbid; +} + +/* + * Attach to a slot. + */ +static void +undo_worker_attach(int slot) +{ + /* Block concurrent access. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + + MyUndoWorker = &UndoApplyCtx->workers[slot]; + + if (!MyUndoWorker->in_use) + { + LWLockRelease(UndoWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("undo worker slot %d is empty, cannot attach", + slot))); + } + + if (MyUndoWorker->proc) + { + LWLockRelease(UndoWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("undo worker slot %d is already used by " + "another worker, cannot attach", slot))); + } + + MyUndoWorker->proc = MyProc; + before_shmem_exit(undo_worker_onexit, (Datum) 0); + + LWLockRelease(UndoWorkerLock); +} + +/* + * Walks the workers array and searches for one that matches given + * dbid. + */ +static UndoApplyWorker * +undo_worker_find(Oid dbid) +{ + int i; + UndoApplyWorker *res = NULL; + + Assert(LWLockHeldByMe(UndoWorkerLock)); + + /* Search for attached worker for a given db id. */ + for (i = 0; i < max_undo_workers; i++) + { + UndoApplyWorker *w = &UndoApplyCtx->workers[i]; + + if (w->in_use && w->dbid == dbid) + { + res = w; + break; + } + } + + return res; +} + +/* + * Check whether the dbid exist or not. + * + * Refer comments from GetDatabaseTupleByOid. + * FIXME: Should we expose GetDatabaseTupleByOid and directly use it. 
+ */ +static bool +dbid_exist(Oid dboid) +{ + HeapTuple tuple; + Relation relation; + SysScanDesc scan; + ScanKeyData key[1]; + bool result = false; + + /* + * form a scan key + */ + ScanKeyInit(&key[0], + Anum_pg_database_oid, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(dboid)); + + relation = heap_open(DatabaseRelationId, AccessShareLock); + scan = systable_beginscan(relation, DatabaseOidIndexId, + criticalSharedRelcachesBuilt, + NULL, + 1, key); + + tuple = systable_getnext(scan); + + if (HeapTupleIsValid(tuple)) + result = true; + + /* all done */ + systable_endscan(scan); + heap_close(relation, AccessShareLock); + + return result; +} + +/* + * Start new undo apply background worker, if possible otherwise return false. + */ +static bool +undo_worker_launch(Oid dbid) +{ + BackgroundWorker bgw; + BackgroundWorkerHandle *bgw_handle; + uint16 generation; + int i; + int slot = 0; + UndoApplyWorker *worker = NULL; + + /* + * We need to do the modification of the shared memory under lock so that + * we have consistent view. + */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + + /* Find unused worker slot. */ + for (i = 0; i < max_undo_workers; i++) + { + UndoApplyWorker *w = &UndoApplyCtx->workers[i]; + + if (!w->in_use) + { + worker = w; + slot = i; + break; + } + } + + /* There are no more free worker slots */ + if (worker == NULL) + return false; + + /* Prepare the worker slot. */ + worker->in_use = true; + worker->proc = NULL; + worker->dbid = dbid; + worker->generation++; + + generation = worker->generation; + LWLockRelease(UndoWorkerLock); + + /* Register the new dynamic worker. */ + memset(&bgw, 0, sizeof(bgw)); + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | + BGWORKER_BACKEND_DATABASE_CONNECTION; + bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; + snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres"); + snprintf(bgw.bgw_function_name, BGW_MAXLEN, "UndoWorkerMain"); + snprintf(bgw.bgw_type, BGW_MAXLEN, "undo apply worker"); + snprintf(bgw.bgw_name, BGW_MAXLEN, "undo apply worker"); + + bgw.bgw_restart_time = BGW_NEVER_RESTART; + bgw.bgw_notify_pid = MyProcPid; + bgw.bgw_main_arg = Int32GetDatum(slot); + + StartTransactionCommand(); + /* Check the database exists or not. */ + if (!dbid_exist(dbid)) + { + CommitTransactionCommand(); + + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + undo_worker_cleanup(worker); + LWLockRelease(UndoWorkerLock); + return true; + } + + /* + * Acquire database object lock before launching the worker so that it + * doesn't get dropped while worker is connecting to the database. + */ + LockSharedObject(DatabaseRelationId, dbid, 0, RowExclusiveLock); + + /* Recheck whether database still exists or not. */ + if (!dbid_exist(dbid)) + { + CommitTransactionCommand(); + + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + undo_worker_cleanup(worker); + LWLockRelease(UndoWorkerLock); + return true; + } + + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle)) + { + /* Failed to start worker, so clean up the worker slot. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + undo_worker_cleanup(worker); + LWLockRelease(UndoWorkerLock); + + UnlockSharedObject(DatabaseRelationId, dbid, 0, RowExclusiveLock); + CommitTransactionCommand(); + + return false; + } + + /* Now wait until it attaches. */ + WaitForUndoWorkerAttach(worker, generation, bgw_handle); + + /* + * By this point the undo-worker has already connected to the database so we + * can release the database lock. 
+ */ + UnlockSharedObject(DatabaseRelationId, dbid, 0, RowExclusiveLock); + CommitTransactionCommand(); + + return true; +} + +/* + * Detach the worker (cleans up the worker info). + */ +static void +undo_worker_detach(void) +{ + /* Block concurrent access. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + + undo_worker_cleanup(MyUndoWorker); + + LWLockRelease(UndoWorkerLock); +} + +/* + * Clean up worker info. + */ +static void +undo_worker_cleanup(UndoApplyWorker *worker) +{ + Assert(LWLockHeldByMeInMode(UndoWorkerLock, LW_EXCLUSIVE)); + + worker->in_use = false; + worker->proc = NULL; + worker->dbid = InvalidOid; +} + +/* + * Cleanup function for undo worker launcher. + * + * Called on undo worker launcher exit. + */ +static void +undo_launcher_onexit(int code, Datum arg) +{ + UndoApplyCtx->launcher_pid = 0; +} + +/* SIGHUP: set flag to reload configuration at next convenient time */ +static void +undo_launcher_sighup(SIGNAL_ARGS) +{ + int save_errno = errno; + + got_SIGHUP = true; + + /* Waken anything waiting on the process latch */ + SetLatch(MyLatch); + + errno = save_errno; +} + +/* + * Cleanup function. + * + * Called on logical replication worker exit. + */ +static void +undo_worker_onexit(int code, Datum arg) +{ + undo_worker_detach(); +} + +/* + * UndoLauncherShmemSize + * Compute space needed for undo launcher shared memory + */ +Size +UndoLauncherShmemSize(void) +{ + Size size; + + /* + * Need the fixed struct and the array of LogicalRepWorker. + */ + size = sizeof(UndoApplyCtxStruct); + size = MAXALIGN(size); + size = add_size(size, mul_size(max_undo_workers, + sizeof(UndoApplyWorker))); + return size; +} + +/* + * UndoLauncherRegister + * Register a background worker running the undo worker launcher. + */ +void +UndoLauncherRegister(void) +{ + BackgroundWorker bgw; + + if (max_undo_workers == 0) + return; + + memset(&bgw, 0, sizeof(bgw)); + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | + BGWORKER_BACKEND_DATABASE_CONNECTION; + bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; + snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres"); + snprintf(bgw.bgw_function_name, BGW_MAXLEN, "UndoLauncherMain"); + snprintf(bgw.bgw_name, BGW_MAXLEN, + "undo worker launcher"); + snprintf(bgw.bgw_type, BGW_MAXLEN, + "undo worker launcher"); + bgw.bgw_restart_time = 5; + bgw.bgw_notify_pid = 0; + bgw.bgw_main_arg = (Datum) 0; + + RegisterBackgroundWorker(&bgw); +} + +/* + * UndoLauncherShmemInit + * Allocate and initialize undo worker launcher shared memory + */ +void +UndoLauncherShmemInit(void) +{ + bool found; + + UndoApplyCtx = (UndoApplyCtxStruct *) + ShmemInitStruct("Undo Worker Launcher Data", + UndoLauncherShmemSize(), + &found); + + if (!found) + memset(UndoApplyCtx, 0, UndoLauncherShmemSize()); +} + +/* + * Main loop for the undo worker launcher process. + */ +void +UndoLauncherMain(Datum main_arg) +{ + MemoryContext tmpctx; + MemoryContext oldctx; + + ereport(DEBUG1, + (errmsg("undo launcher started"))); + + before_shmem_exit(undo_launcher_onexit, (Datum) 0); + + Assert(UndoApplyCtx->launcher_pid == 0); + UndoApplyCtx->launcher_pid = MyProcPid; + + /* Establish signal handlers. */ + pqsignal(SIGHUP, undo_launcher_sighup); + pqsignal(SIGTERM, die); + BackgroundWorkerUnblockSignals(); + + /* + * Establish connection to nailed catalogs (we only ever access + * pg_subscription). + */ + BackgroundWorkerInitializeConnection(NULL, NULL, 0); + + /* Use temporary context for the database list and worker info. 
*/ + tmpctx = AllocSetContextCreate(TopMemoryContext, + "Undo worker Launcher context", + ALLOCSET_DEFAULT_SIZES); + /* Enter main loop */ + for (;;) + { + int rc; + List *dblist; + ListCell *l; + + CHECK_FOR_INTERRUPTS(); + + /* Switch to the temp context. */ + oldctx = MemoryContextSwitchTo(tmpctx); + dblist = RollbackHTGetDBList(); + + foreach(l, dblist) + { + UndoApplyWorker *w; + Oid dbid = lfirst_oid(l); + + LWLockAcquire(UndoWorkerLock, LW_SHARED); + w = undo_worker_find(dbid); + LWLockRelease(UndoWorkerLock); + + if (w == NULL) + { +retry: + if (!undo_worker_launch(dbid)) + { + /* Could not launch the worker; retry after some time. */ + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + DEFAULT_RETRY_NAPTIME, + WAIT_EVENT_UNDO_LAUNCHER_MAIN); + goto retry; + } + } + } + + /* Switch back to original memory context. */ + MemoryContextSwitchTo(oldctx); + + /* Clean the temporary memory. */ + MemoryContextReset(tmpctx); + + /* Wait for more work. */ + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + DEFAULT_NAPTIME_PER_CYCLE, + WAIT_EVENT_UNDO_LAUNCHER_MAIN); + + /* emergency bailout if postmaster has died */ + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); + + if (rc & WL_LATCH_SET) + { + ResetLatch(MyLatch); + CHECK_FOR_INTERRUPTS(); + } + + if (got_SIGHUP) + { + got_SIGHUP = false; + ProcessConfigFile(PGC_SIGHUP); + } + } +} + +/* + * UndoWorkerMain -- Main loop for the undo apply worker. + */ +void +UndoWorkerMain(Datum main_arg) +{ + int worker_slot = DatumGetInt32(main_arg); + Oid dbid; + + dbid = slot_get_dbid(worker_slot); + + /* Setup signal handling */ + pqsignal(SIGTERM, die); + BackgroundWorkerUnblockSignals(); + + /* Connect to the database. */ + BackgroundWorkerInitializeConnectionByOid(dbid, 0, 0); + + /* Attach to slot */ + undo_worker_attach(worker_slot); + + /* + * Create a resource owner for the undo worker. The undo worker needs this as + * it has to read undo records outside of transaction blocks, which in turn + * accesses the buffer read routines. + */ + CreateAuxProcessResourceOwner(); + + RollbackFromHT(dbid); + + ReleaseAuxProcessResources(true); + + proc_exit(0); +} diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index e3b05657f8..95153f4e29 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -154,10 +154,27 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor case RM_COMMIT_TS_ID: case RM_REPLORIGIN_ID: case RM_GENERIC_ID: + case RM_UNDOLOG_ID: /* just deal with xid, and done */ ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record), buf.origptr); break; + case RM_ZHEAP_ID: + /* Logical decoding is not yet implemented for zheap. */ + Assert(0); + break; + case RM_ZHEAP2_ID: + /* Logical decoding is not yet implemented for zheap. */ + Assert(0); + break; + case RM_UNDOACTION_ID: + /* Logical decoding is not yet implemented for undoactions. */ + Assert(0); + break; + case RM_TPD_ID: + /* Logical decoding is not yet implemented for TPD.
*/ + Assert(0); + break; case RM_NEXT_ID: elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record)); } diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 9817770aff..4c8088eb2d 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -176,6 +176,7 @@ static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move); static inline int32 GetPrivateRefCount(Buffer buffer); static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref); +static void InvalidateBuffer(BufferDesc *buf); /* * Ensure that the PrivateRefCountArray has sufficient space to store one more @@ -618,10 +619,12 @@ ReadBuffer(Relation reln, BlockNumber blockNum) * valid, the page is zeroed instead of throwing an error. This is intended * for non-critical data, where the caller is prepared to repair errors. * - * In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's + * In RBM_ZERO mode, if the page isn't in buffer cache already, it's * filled with zeros instead of reading it from disk. Useful when the caller * is going to fill the page from scratch, since this saves I/O and avoids * unnecessary failure if the page-on-disk has corrupt page headers. + * + * In RBM_ZERO_AND_LOCK mode, the page is zeroed and also locked. * The page is returned locked to ensure that the caller has a chance to * initialize the page before it's made visible to others. * Caution: do not use this mode to read a page that is beyond the relation's @@ -672,24 +675,20 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum, /* * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require * a relcache entry for the relation. - * - * NB: At present, this function may only be used on permanent relations, which - * is OK, because we only use it during XLOG replay. If in the future we - * want to use it on temporary or unlogged relations, we could pass additional - * parameters. */ Buffer ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum, ReadBufferMode mode, - BufferAccessStrategy strategy) + BufferAccessStrategy strategy, + char relpersistence) { bool hit; - SMgrRelation smgr = smgropen(rnode, InvalidBackendId); - - Assert(InRecovery); + SMgrRelation smgr = smgropen(rnode, + relpersistence == RELPERSISTENCE_TEMP + ? MyBackendId : InvalidBackendId); - return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum, + return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum, mode, strategy, &hit); } @@ -883,7 +882,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, * Read in the page, unless the caller intends to overwrite it and * just wants us to allocate a buffer. */ - if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) + if (mode == RBM_ZERO || + mode == RBM_ZERO_AND_LOCK || + mode == RBM_ZERO_AND_CLEANUP_LOCK) MemSet((char *) bufBlock, 0, BLCKSZ); else { @@ -1337,6 +1338,61 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, return buf; } +/* + * ForgetBuffer -- drop a buffer from shared buffers + * + * If the buffer isn't present in shared buffers, nothing happens. If it is + * present, it is discarded without making any attempt to write it back out to + * the operating system. 
The caller must therefore somehow be sure that the + * data won't be needed for anything now or in the future. It assumes that + * there is no concurrent access to the block, except that it might be being + * concurrently written. + */ +void +ForgetBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum) +{ + SMgrRelation smgr = smgropen(rnode, InvalidBackendId); + BufferTag tag; /* identity of target block */ + uint32 hash; /* hash value for tag */ + LWLock *partitionLock; /* buffer partition lock for it */ + int buf_id; + BufferDesc *bufHdr; + uint32 buf_state; + + /* create a tag so we can look up the buffer */ + INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum); + + /* determine its hash code and partition lock ID */ + hash = BufTableHashCode(&tag); + partitionLock = BufMappingPartitionLock(hash); + + /* see if the block is in the buffer pool */ + LWLockAcquire(partitionLock, LW_SHARED); + buf_id = BufTableLookup(&tag, hash); + LWLockRelease(partitionLock); + + /* didn't find it, so nothing to do */ + if (buf_id < 0) + return; + + /* take the buffer header lock */ + bufHdr = GetBufferDescriptor(buf_id); + buf_state = LockBufHdr(bufHdr); + + /* + * The buffer might have been evicted after we released the partition lock and + * before we acquired the buffer header lock. If so, the buffer we've + * locked might contain some other data which we shouldn't touch. If the + * buffer hasn't been recycled, we proceed to invalidate it. + */ + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && + bufHdr->tag.blockNum == blockNum && + bufHdr->tag.forkNum == forkNum) + InvalidateBuffer(bufHdr); /* releases spinlock */ + else + UnlockBufHdr(bufHdr, buf_state); +} + /* * InvalidateBuffer -- mark a shared buffer invalid and return it to the * freelist. diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index e4146a260a..553a416ab4 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -272,6 +272,49 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, return bufHdr; } +/* + * ForgetLocalBuffer - drop a buffer from local buffers + * + * This is similar to bufmgr.c's ForgetBuffer, except that we do not need + * to do any locking since this is all local. As with that function, this + * must be used very carefully, since we'll cheerfully throw away dirty + * buffers without any attempt to write them. + */ +void +ForgetLocalBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum) +{ + SMgrRelation smgr = smgropen(rnode, BackendIdForTempRelations()); + BufferTag tag; /* identity of target block */ + LocalBufferLookupEnt *hresult; + BufferDesc *bufHdr; + uint32 buf_state; + + /* + * If somehow this is the first request in the session, there's nothing to + * do. (This probably shouldn't happen, though.)
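Taken together, ForgetBuffer and ForgetLocalBuffer give callers a persistence-aware way to drop a block whose contents are known to be irrelevant; a sketch, where rnode, forknum, blkno and the persistence test are assumptions of this example:

    if (relpersistence == RELPERSISTENCE_TEMP)
        ForgetLocalBuffer(rnode, forknum, blkno);   /* backend-local buffers */
    else
        ForgetBuffer(rnode, forknum, blkno);        /* shared buffers */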
+ */ + if (LocalBufHash == NULL) + return; + + /* create a tag so we can lookup the buffer */ + INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum); + + /* see if the block is in the local buffer pool */ + hresult = (LocalBufferLookupEnt *) + hash_search(LocalBufHash, (void *) &tag, HASH_REMOVE, NULL); + + /* didn't find it, so nothing to do */ + if (!hresult) + return; + + /* mark buffer invalid */ + bufHdr = GetLocalBufferDescriptor(hresult->id); + CLEAR_BUFFERTAG(bufHdr->tag); + buf_state = pg_atomic_read_u32(&bufHdr->state); + buf_state &= ~(BM_VALID | BM_TAG_VALID | BM_DIRTY); + pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state); +} + /* * MarkLocalBufferDirty - * mark a local buffer dirty diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 0c86a581c0..8f6122ae17 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -21,6 +21,8 @@ #include "access/nbtree.h" #include "access/subtrans.h" #include "access/twophase.h" +#include "access/undolog.h" +#include "access/undodiscard.h" #include "commands/async.h" #include "miscadmin.h" #include "pgstat.h" @@ -28,6 +30,7 @@ #include "postmaster/bgworker_internals.h" #include "postmaster/bgwriter.h" #include "postmaster/postmaster.h" +#include "postmaster/undoworker.h" #include "replication/logicallauncher.h" #include "replication/slot.h" #include "replication/walreceiver.h" @@ -127,6 +130,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) size = add_size(size, ProcGlobalShmemSize()); size = add_size(size, XLOGShmemSize()); size = add_size(size, CLOGShmemSize()); + size = add_size(size, UndoLogShmemSize()); size = add_size(size, CommitTsShmemSize()); size = add_size(size, SUBTRANSShmemSize()); size = add_size(size, TwoPhaseShmemSize()); @@ -150,6 +154,8 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) size = add_size(size, SyncScanShmemSize()); size = add_size(size, AsyncShmemSize()); size = add_size(size, BackendRandomShmemSize()); + size = add_size(size, RollbackHTSize()); + size = add_size(size, UndoLauncherShmemSize()); #ifdef EXEC_BACKEND size = add_size(size, ShmemBackendArraySize()); #endif @@ -219,10 +225,12 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) */ XLOGShmemInit(); CLOGShmemInit(); + UndoLogShmemInit(); CommitTsShmemInit(); SUBTRANSShmemInit(); MultiXactShmemInit(); InitBufferPool(); + InitRollbackHashTable(); /* * Set up lock manager @@ -261,6 +269,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) WalSndShmemInit(); WalRcvShmemInit(); ApplyLauncherShmemInit(); + UndoLauncherShmemInit(); /* * Set up other modules that need some shared memory space diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c index 3f57507bce..5d2cb3eaef 100644 --- a/src/backend/storage/lmgr/lmgr.c +++ b/src/backend/storage/lmgr/lmgr.c @@ -721,6 +721,143 @@ ConditionalXactLockTableWait(TransactionId xid) return true; } +/* + * SubXactLockTableInsert + * + * Insert a lock showing that the current subtransaction is running --- + * this is done when a subtransaction performs the operation. The lock can + * then be used to wait for the subtransaction to finish. + */ +void +SubXactLockTableInsert(SubTransactionId subxid) +{ + LOCKTAG tag; + TransactionId xid; + ResourceOwner currentOwner; + + /* Acquire lock only if we doesn't already hold that lock. */ + if (HasCurrentSubTransactionLock()) + return; + + xid = GetTopTransactionId(); + + /* + * Acquire lock on the transaction XID. 
(We assume this cannot block.) We + * have to ensure that the lock is assigned to the transaction's own + * ResourceOwner. + */ + currentOwner = CurrentResourceOwner; + CurrentResourceOwner = GetCurrentTransactionResOwner(); + + SET_LOCKTAG_SUBTRANSACTION(tag, xid, subxid); + (void) LockAcquire(&tag, ExclusiveLock, false, false); + + CurrentResourceOwner = currentOwner; + + SetCurrentSubTransactionLocked(); +} + +/* + * SubXactLockTableDelete + * + * Delete the lock showing that the given subtransaction is running. + * (This is never used for main transaction IDs; those locks are only + * released implicitly at transaction end. But we do use it for + * subtransactions in zheap.) + */ +void +SubXactLockTableDelete(SubTransactionId subxid) +{ + LOCKTAG tag; + TransactionId xid = GetTopTransactionId(); + + SET_LOCKTAG_SUBTRANSACTION(tag, xid, subxid); + + LockRelease(&tag, ExclusiveLock, false); +} + +/* + * SubXactLockTableWait + * + * Wait for the specified subtransaction to commit or abort. Here, instead of + * waiting on xid, we wait on xid + subTransactionId. Whenever a concurrent + * transaction finds a conflict, it creates a lock tag from (slot xid + + * subtransaction id from the undo) and waits on that. + * + * Unlike XactLockTableWait, we don't need to wait for the topmost transaction to + * finish, as we release the lock only when the transaction (committed/aborted) + * is recorded in clog. This has some overhead in terms of maintaining unique + * xid locks for subtransactions during commit, but that shouldn't be much, as + * we release the locks immediately after the transaction is recorded in clog. + * This function is designed for zheap, where we don't have xids assigned to + * subtransactions, so we can't really figure out whether the subtransaction is + * still in progress. + */ +void +SubXactLockTableWait(TransactionId xid, SubTransactionId subxid, Relation rel, + ItemPointer ctid, XLTW_Oper oper) +{ + LOCKTAG tag; + XactLockTableWaitInfo info; + ErrorContextCallback callback; + + /* + * If an operation is specified, set up our verbose error context + * callback. + */ + if (oper != XLTW_None) + { + Assert(RelationIsValid(rel)); + Assert(ItemPointerIsValid(ctid)); + + info.rel = rel; + info.ctid = ctid; + info.oper = oper; + + callback.callback = XactLockTableWaitErrorCb; + callback.arg = &info; + callback.previous = error_context_stack; + error_context_stack = &callback; + } + + Assert(TransactionIdIsValid(xid)); + Assert(!TransactionIdEquals(xid, GetTopTransactionIdIfAny())); + Assert(subxid != InvalidSubTransactionId); + + SET_LOCKTAG_SUBTRANSACTION(tag, xid, subxid); + + (void) LockAcquire(&tag, ShareLock, false, false); + + LockRelease(&tag, ShareLock, false); + + if (oper != XLTW_None) + error_context_stack = callback.previous; +} + +/* + * ConditionalSubXactLockTableWait + * + * As above, but only lock if we can get the lock without blocking. + * Returns true if the lock was acquired.
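A sketch of the overall protocol these routines implement; xid, subxid, rel and the tuple are assumed to come from the conflicting undo record, and the names are illustrative:

    /* In the backend running the subtransaction, before it modifies tuples: */
    SubXactLockTableInsert(GetCurrentSubTransactionId());

    /* In a backend that finds that in-progress subtransaction in its way: */
    SubXactLockTableWait(xid, subxid, rel, &tuple->t_self, XLTW_Update);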
+ */ +bool +ConditionalSubXactLockTableWait(TransactionId xid, SubTransactionId subxid) +{ + LOCKTAG tag; + + Assert(TransactionIdIsValid(xid)); + Assert(!TransactionIdEquals(xid, GetTopTransactionIdIfAny())); + + SET_LOCKTAG_SUBTRANSACTION(tag, xid, subxid); + + if (LockAcquire(&tag, ShareLock, false, true) == LOCKACQUIRE_NOT_AVAIL) + return false; + + LockRelease(&tag, ShareLock, false); + + return true; +} + /* * SpeculativeInsertionLockAcquire * @@ -768,6 +905,17 @@ SpeculativeInsertionLockRelease(TransactionId xid) LockRelease(&tag, ExclusiveLock, false); } +/* + * GetSpeculativeInsertionToken + * + * Return the value of speculative insertion token. + */ +uint32 +GetSpeculativeInsertionToken(void) +{ + return speculativeInsertionToken; +} + /* * SpeculativeInsertionWait * diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index a6fda81feb..b6c0b00ed0 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -521,6 +521,8 @@ RegisterLWLockTranches(void) LWLockRegisterTranche(LWTRANCHE_TBM, "tbm"); LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append"); LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join"); + LWLockRegisterTranche(LWTRANCHE_UNDOLOG, "undo_log"); + LWLockRegisterTranche(LWTRANCHE_UNDODISCARD, "undo_discard"); /* Register named tranches. */ for (i = 0; i < NamedLWLockTrancheRequests; i++) diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt index e6025ecedb..cde0daef7b 100644 --- a/src/backend/storage/lmgr/lwlocknames.txt +++ b/src/backend/storage/lmgr/lwlocknames.txt @@ -50,3 +50,6 @@ OldSnapshotTimeMapLock 42 BackendRandomLock 43 LogicalRepWorkerLock 44 CLogTruncationLock 45 +UndoLogLock 46 +RollbackHTLock 47 +UndoWorkerLock 48 \ No newline at end of file diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c index 2960e21340..a5730a9bba 100644 --- a/src/backend/storage/lmgr/predicate.c +++ b/src/backend/storage/lmgr/predicate.c @@ -461,7 +461,6 @@ static void SetNewSxactGlobalXmin(void); static void ClearOldPredicateLocks(void); static void ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial, bool summarize); -static bool XidIsConcurrent(TransactionId xid); static void CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag); static void FlagRWConflict(SERIALIZABLEXACT *reader, SERIALIZABLEXACT *writer); static void OnConflict_CheckForSerializationFailure(const SERIALIZABLEXACT *reader, @@ -1049,6 +1048,12 @@ CheckPointPredicate(void) /*------------------------------------------------------------------------*/ +bool +IsSerializableXact() +{ + return (MySerializableXact != InvalidSerializableXact); +} + /* * InitPredicateLocks -- Initialize the predicate locking data structures. * @@ -2495,11 +2500,10 @@ PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot) * Skip if this is a temporary table. 
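With the signature change shown just below, heap callers pass the TID and the tuple's xmin explicitly instead of handing over the whole HeapTuple; a sketch of a converted call site (tuple, relation and snapshot are assumed locals):

    PredicateLockTid(relation, &(tuple->t_self), snapshot,
                     HeapTupleHeaderGetXmin(tuple->t_data));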
*/ void -PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot) +PredicateLockTid(Relation relation, ItemPointer tid, Snapshot snapshot, + TransactionId targetxmin) { PREDICATELOCKTARGETTAG tag; - ItemPointer tid; - TransactionId targetxmin; if (!SerializationNeededForRead(relation, snapshot)) return; @@ -2511,8 +2515,6 @@ PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot) { TransactionId myxid; - targetxmin = HeapTupleHeaderGetXmin(tuple->t_data); - myxid = GetTopTransactionIdIfAny(); if (TransactionIdIsValid(myxid)) { @@ -2541,7 +2543,6 @@ PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot) if (PredicateLockExists(&tag)) return; - tid = &(tuple->t_self); SET_PREDICATELOCKTARGETTAG_TUPLE(tag, relation->rd_node.dbNode, relation->rd_id, @@ -3853,7 +3854,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial, * that to this function to save the overhead of checking the snapshot's * subxip array. */ -static bool +bool XidIsConcurrent(TransactionId xid) { Snapshot snap; @@ -3898,14 +3899,13 @@ XidIsConcurrent(TransactionId xid) */ void CheckForSerializableConflictOut(bool visible, Relation relation, - HeapTuple tuple, Buffer buffer, + void *stup, Buffer buffer, Snapshot snapshot) { TransactionId xid; SERIALIZABLEXIDTAG sxidtag; SERIALIZABLEXID *sxid; SERIALIZABLEXACT *sxact; - HTSV_Result htsvResult; if (!SerializationNeededForRead(relation, snapshot)) return; @@ -3920,65 +3920,17 @@ CheckForSerializableConflictOut(bool visible, Relation relation, errhint("The transaction might succeed if retried."))); } - /* - * Check to see whether the tuple has been written to by a concurrent - * transaction, either to create it not visible to us, or to delete it - * while it is visible to us. The "visible" bool indicates whether the - * tuple is visible to us, while HeapTupleSatisfiesVacuum checks what else - * is going on with it. - */ - htsvResult = HeapTupleSatisfiesVacuum(tuple, TransactionXmin, buffer); - switch (htsvResult) + if (RelationStorageIsZHeap(relation)) { - case HEAPTUPLE_LIVE: - if (visible) - return; - xid = HeapTupleHeaderGetXmin(tuple->t_data); - break; - case HEAPTUPLE_RECENTLY_DEAD: - if (!visible) - return; - xid = HeapTupleHeaderGetUpdateXid(tuple->t_data); - break; - case HEAPTUPLE_DELETE_IN_PROGRESS: - xid = HeapTupleHeaderGetUpdateXid(tuple->t_data); - break; - case HEAPTUPLE_INSERT_IN_PROGRESS: - xid = HeapTupleHeaderGetXmin(tuple->t_data); - break; - case HEAPTUPLE_DEAD: + if (!ZHeapTupleHasSerializableConflictOut(visible, relation, + (ItemPointer) stup, buffer, &xid)) + return; + } + else + { + if (!HeapTupleHasSerializableConflictOut(visible, (HeapTuple) stup, buffer, &xid)) return; - default: - - /* - * The only way to get to this default clause is if a new value is - * added to the enum type without adding it to this switch - * statement. That's a bug, so elog. - */ - elog(ERROR, "unrecognized return value from HeapTupleSatisfiesVacuum: %u", htsvResult); - - /* - * In spite of having all enum values covered and calling elog on - * this default, some compilers think this is a code path which - * allows xid to be used below without initialization. Silence - * that warning. - */ - xid = InvalidTransactionId; } - Assert(TransactionIdIsValid(xid)); - Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin)); - - /* - * Find top level xid. Bail out if xid is too early to be a conflict, or - * if it's our own xid. 
- */ - if (TransactionIdEquals(xid, GetTopTransactionIdIfAny())) - return; - xid = SubTransGetTopmostTransaction(xid); - if (TransactionIdPrecedes(xid, TransactionXmin)) - return; - if (TransactionIdEquals(xid, GetTopTransactionIdIfAny())) - return; /* * Find sxact or summarized info for the top level xid. @@ -4278,7 +4230,7 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag) * tuple itself. */ void -CheckForSerializableConflictIn(Relation relation, HeapTuple tuple, +CheckForSerializableConflictIn(Relation relation, ItemPointer tid, Buffer buffer) { PREDICATELOCKTARGETTAG targettag; @@ -4309,13 +4261,13 @@ CheckForSerializableConflictIn(Relation relation, HeapTuple tuple, * It is not possible to take and hold a lock across the checks for all * granularities because each target could be in a separate partition. */ - if (tuple != NULL) + if (ItemPointerIsValid(tid)) { SET_PREDICATELOCKTARGETTAG_TUPLE(targettag, relation->rd_node.dbNode, relation->rd_id, - ItemPointerGetBlockNumber(&(tuple->t_self)), - ItemPointerGetOffsetNumber(&(tuple->t_self))); + ItemPointerGetBlockNumber(tid), + ItemPointerGetOffsetNumber(tid)); CheckTargetForConflictsIn(&targettag); } diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index 33387fb71b..69c7b6a781 100644 --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -286,6 +286,8 @@ InitProcGlobal(void) /* Create ProcStructLock spinlock, too */ ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t)); SpinLockInit(ProcStructLock); + + pg_atomic_init_u64(&ProcGlobal->oldestXidWithEpochHavingUndo, 0); } /* diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index dfbda5458f..890c7a337e 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -17,6 +17,9 @@ #include "access/htup_details.h" #include "access/itup.h" #include "access/xlog.h" +#include "access/zhtup.h" +#include "access/zheap.h" +#include "storage/bufmgr.h" #include "storage/checksum.h" #include "utils/memdebug.h" #include "utils/memutils.h" @@ -107,7 +110,8 @@ PageIsVerified(Page page, BlockNumber blkno) * the block can still reveal problems, which is why we offer the * checksum option. */ - if ((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 && + if (((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 || + (p->pd_flags & ~PD_ZHEAP_VALID_FLAG_BITS) == 0) && p->pd_lower <= p->pd_upper && p->pd_upper <= p->pd_special && p->pd_special <= BLCKSZ && @@ -414,17 +418,6 @@ PageRestoreTempPage(Page tempPage, Page oldPage) pfree(tempPage); } -/* - * sorting support for PageRepairFragmentation and PageIndexMultiDelete - */ -typedef struct itemIdSortData -{ - uint16 offsetindex; /* linp array index */ - int16 itemoff; /* page offset of item data */ - uint16 alignedlen; /* MAXALIGN(item data len) */ -} itemIdSortData; -typedef itemIdSortData *itemIdSort; - static int itemoffcompare(const void *itemidp1, const void *itemidp2) { @@ -437,7 +430,7 @@ itemoffcompare(const void *itemidp1, const void *itemidp2) * After removing or marking some line pointers unused, move the tuples to * remove the gaps caused by the removed items. 
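For illustration, with compactify_tuples() exported below (and the itemIdSortData definition moved out of bufpage.c), an outside caller such as zheap/TPD page compaction could drive it roughly as in this sketch. It is modelled on PageRepairFragmentation(); the function name compact_page_sketch() and the use of MaxHeapTuplesPerPage as the array bound are illustrative assumptions, not part of the patch.

	/* Sketch: collect the line pointers that still have storage and let
	 * compactify_tuples() squeeze out the holes left by removed items. */
	static void
	compact_page_sketch(Page page)
	{
		itemIdSortData itemidbase[MaxHeapTuplesPerPage];
		itemIdSort	itemidptr = itemidbase;
		OffsetNumber offnum;
		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

		for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
		{
			ItemId		lp = PageGetItemId(page, offnum);

			if (!ItemIdIsUsed(lp) || !ItemIdHasStorage(lp))
				continue;		/* nothing stored for this line pointer */

			itemidptr->offsetindex = offnum - 1;		/* linp array index */
			itemidptr->itemoff = ItemIdGetOffset(lp);	/* current data offset */
			itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
			itemidptr++;
		}

		/* Moves the surviving tuple data and resets pd_upper accordingly. */
		compactify_tuples(itemidbase, itemidptr - itemidbase, page);
	}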
*/ -static void +void compactify_tuples(itemIdSort itemidbase, int nitems, Page page) { PageHeader phdr = (PageHeader) page; diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile index 2b95cb0df1..b657eb275f 100644 --- a/src/backend/storage/smgr/Makefile +++ b/src/backend/storage/smgr/Makefile @@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global -OBJS = md.o smgr.o smgrtype.o +OBJS = md.o smgr.o smgrtype.o undofile.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/storage/smgr/README b/src/backend/storage/smgr/README index 37ed40b645..641926f876 100644 --- a/src/backend/storage/smgr/README +++ b/src/backend/storage/smgr/README @@ -10,16 +10,14 @@ memory, but these were never supported in any externally released Postgres, nor in any version of PostgreSQL.) The "magnetic disk" manager is itself seriously misnamed, because actually it supports any kind of device for which the operating system provides standard filesystem operations; which -these days is pretty much everything of interest. However, we retain the -notion of a storage manager switch in case anyone ever wants to reintroduce -other kinds of storage managers. Removing the switch layer would save -nothing noticeable anyway, since storage-access operations are surely far -more expensive than one extra layer of C function calls. +these days is pretty much everything of interest. However, we retained the +notion of a storage manager switch, and it has turned out to be useful for +plugging in a new storage manager to support buffered undo logs. In Berkeley Postgres each relation was tagged with the ID of the storage -manager to use for it. This is gone. It would be probably more reasonable -to associate storage managers with tablespaces, should we ever re-introduce -multiple storage managers into the system catalogs. +manager to use for it. This is gone. While earlier PostgreSQL releases were +hard-coded to use md.c unconditionally, PostgreSQL 12 routes I/O for the undo +pseudo-database to undofile.c. The files in this directory, and their contents, are @@ -31,6 +29,12 @@ The files in this directory, and their contents, are md.c The "magnetic disk" storage manager, which is really just an interface to the kernel's filesystem operations. + undofile.c The undo log storage manager. This provides + buffer-pool-based access to the contents of undo log + segment files. It implements only a limited subset of + the smgr interface: it can only read and write blocks + of existing files. + smgrtype.c Storage manager type -- maps string names to storage manager IDs and provides simple comparison operators. This is the regproc support for type "smgr" in the system catalogs. @@ -38,6 +42,9 @@ The files in this directory, and their contents, are in the catalogs anymore.) Note that md.c in turn relies on src/backend/storage/file/fd.c. +undofile.c also uses fd.c to read and write blocks, but it expects +src/backend/access/undo/undolog.c to manage the files holding those +blocks. Relation Forks diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 4c6a50509f..4c489a2e59 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -45,7 +45,7 @@ #define UNLINKS_PER_ABSORB 10 /* - * Special values for the segno arg to RememberFsyncRequest. + * Special values for the segno arg to mdrequestsync.
* * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an * fsync request from the queue if an identical, subsequent request is found. @@ -1420,7 +1420,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg) if (pendingOpsTable) { /* push it into local pending-ops table */ - RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno); + mdrequestsync(reln->smgr_rnode.node, forknum, seg->mdfd_segno); } else { @@ -1456,8 +1456,7 @@ register_unlink(RelFileNodeBackend rnode) if (pendingOpsTable) { /* push it into local pending-ops table */ - RememberFsyncRequest(rnode.node, MAIN_FORKNUM, - UNLINK_RELATION_REQUEST); + mdrequestsync(rnode.node, MAIN_FORKNUM, UNLINK_RELATION_REQUEST); } else { @@ -1476,7 +1475,7 @@ register_unlink(RelFileNodeBackend rnode) } /* - * RememberFsyncRequest() -- callback from checkpointer side of fsync request + * mdrequestsync() -- callback from checkpointer side of fsync request * * We stuff fsync requests into the local hash table for execution * during the checkpointer's next checkpoint. UNLINK requests go into a @@ -1497,7 +1496,7 @@ register_unlink(RelFileNodeBackend rnode) * heavyweight operation anyhow, so we'll live with it.) */ void -RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno) +mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno) { Assert(pendingOpsTable); @@ -1640,7 +1639,7 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum) if (pendingOpsTable) { /* standalone backend or startup process: fsync state is local */ - RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC); + mdrequestsync(rnode, forknum, FORGET_RELATION_FSYNC); } else if (IsUnderPostmaster) { @@ -1679,7 +1678,7 @@ ForgetDatabaseFsyncRequests(Oid dbid) if (pendingOpsTable) { /* standalone backend or startup process: fsync state is local */ - RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC); + mdrequestsync(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC); } else if (IsUnderPostmaster) { diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 189342ef86..57e1668b5d 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -58,6 +58,8 @@ typedef struct f_smgr BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum); void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); + void (*smgr_requestsync) (RelFileNode rnode, ForkNumber forknum, + int segno); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); void (*smgr_pre_ckpt) (void); /* may be NULL */ void (*smgr_sync) (void); /* may be NULL */ @@ -81,15 +83,33 @@ static const f_smgr smgrsw[] = { .smgr_writeback = mdwriteback, .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, + .smgr_requestsync = mdrequestsync, .smgr_immedsync = mdimmedsync, .smgr_pre_ckpt = mdpreckpt, .smgr_sync = mdsync, .smgr_post_ckpt = mdpostckpt + }, + /* undo logs */ + {undofile_init, undofile_shutdown, undofile_close, undofile_create, + undofile_exists, undofile_unlink, undofile_extend, undofile_prefetch, + undofile_read, undofile_write, undofile_writeback, undofile_nblocks, + undofile_truncate, + undofile_requestsync, + undofile_immedsync, undofile_preckpt, undofile_sync, + undofile_postckpt } }; static const int NSmgr = lengthof(smgrsw); +/* + * In ancient Postgres the catalog entry for each relation controlled the + * choice of storage manager implementation. 
Now we have only md.c for + * regular relations, and undofile.c for undo log storage in the undolog + * pseudo-database. + */ +#define SmgrWhichForRelFileNode(rfn) \ + ((rfn).dbNode == 9 ? 1 : 0) /* * Each backend has a hashtable that stores all extant SMgrRelation objects. @@ -185,11 +205,18 @@ smgropen(RelFileNode rnode, BackendId backend) reln->smgr_targblock = InvalidBlockNumber; reln->smgr_fsm_nblocks = InvalidBlockNumber; reln->smgr_vm_nblocks = InvalidBlockNumber; - reln->smgr_which = 0; /* we only have md.c at present */ + + /* Which storage manager implementation? */ + reln->smgr_which = SmgrWhichForRelFileNode(rnode); /* mark it not open */ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) + { reln->md_num_open_segs[forknum] = 0; + reln->md_seg_fds[forknum] = NULL; + } + + reln->private_data = NULL; /* it has no owner yet */ add_to_unowned_list(reln); @@ -722,6 +749,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks) smgrsw[reln->smgr_which].smgr_truncate(reln, forknum, nblocks); } +/* + * smgrrequestsync() -- Enqueue a request for smgrsync() to flush data. + */ +void smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno) +{ + smgrsw[SmgrWhichForRelFileNode(rnode)].smgr_requestsync(rnode, forknum, segno); +} + /* * smgrimmedsync() -- Force the specified relation to stable storage. * diff --git a/src/backend/storage/smgr/undofile.c b/src/backend/storage/smgr/undofile.c new file mode 100644 index 0000000000..afba64eb9b --- /dev/null +++ b/src/backend/storage/smgr/undofile.c @@ -0,0 +1,546 @@ +/* + * undofile.h + * + * PostgreSQL undo file manager. This module provides SMGR-compatible + * interface to the files that back undo logs on the filesystem, so that undo + * log data can use the shared buffer pool. Other aspects of undo log + * management are provided by undolog.c, so the SMGR interfaces not directly + * concerned with reading, writing and flushing data are unimplemented. + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/storage/smgr/undofile.c + */ + +#include "postgres.h" + +#include "access/undolog.h" +#include "access/xlog.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "postmaster/bgwriter.h" +#include "storage/fd.h" +#include "storage/undofile.h" +#include "utils/memutils.h" + +/* intervals for calling AbsorbFsyncRequests in undofile_sync */ +#define FSYNCS_PER_ABSORB 10 + +/* + * Special values for the fork arg to undofile_requestsync. + */ +#define FORGET_UNDO_SEGMENT_FSYNC (InvalidBlockNumber) + +/* + * While md.c expects random access and has a small number of huge + * segments, undofile.c manages a potentially very large number of smaller + * segments and has a less random access pattern. Therefore, instead of + * keeping a potentially huge array of vfds we'll just keep the most + * recently accessed N. + * + * For now, N == 1, so we just need to hold onto one 'File' handle. + */ +typedef struct UndoFileState +{ + int mru_segno; + File mru_file; +} UndoFileState; + +static MemoryContext UndoFileCxt; + +typedef uint16 CycleCtr; + +/* + * An entry recording the segments that need to be fsynced by undofile_sync(). + * This is a bit simpler than md.c's version, though it could perhaps be + * merged into a common struct. One difference is that we can have much + * larger segment numbers, so we'll adjust for that to avoid having a lot of + * leading zero bits. 
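In other words, each per-RelFileNode entry accumulates plain segment numbers in a Bitmapset and later drains them, as in the sketch below. Only the Bitmapset API from nodes/bitmapset.h is assumed; the helper names remember_segment()/drain_segments() and sync_one_segment() are illustrative, not from the patch.

	/* Accumulate: remember that segment 'segno' of this undo log needs fsync. */
	static void
	remember_segment(PendingOperationEntry *entry, int segno)
	{
		if (entry->requests == NULL)
			entry->requests = bms_make_singleton(segno);
		else
			entry->requests = bms_add_member(entry->requests, segno);
	}

	/* Drain: visit the remembered segment numbers in ascending order. */
	static void
	drain_segments(PendingOperationEntry *entry)
	{
		Bitmapset  *requests = entry->requests;
		int			segno = -1;

		entry->requests = NULL;
		while ((segno = bms_next_member(requests, segno)) >= 0)
			sync_one_segment(entry->rnode, segno);	/* illustrative helper */
		bms_free(requests);
	}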
+ */ +typedef struct +{ + RelFileNode rnode; + Bitmapset *requests; + CycleCtr cycle_ctr; +} PendingOperationEntry; + +static HTAB *pendingOpsTable = NULL; +static MemoryContext pendingOpsCxt; + +static CycleCtr undofile_sync_cycle_ctr = 0; + +static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno, + bool missing_ok); +static File undofile_get_segment_file(SMgrRelation reln, int segno); + +void +undofile_init(void) +{ + UndoFileCxt = AllocSetContextCreate(TopMemoryContext, + "UndoFileSmgr", + ALLOCSET_DEFAULT_SIZES); + + if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess()) + { + HASHCTL hash_ctl; + + pendingOpsCxt = AllocSetContextCreate(UndoFileCxt, + "Pending ops context", + ALLOCSET_DEFAULT_SIZES); + MemoryContextAllowInCriticalSection(pendingOpsCxt, true); + + MemSet(&hash_ctl, 0, sizeof(hash_ctl)); + hash_ctl.keysize = sizeof(RelFileNode); + hash_ctl.entrysize = sizeof(PendingOperationEntry); + hash_ctl.hcxt = pendingOpsCxt; + pendingOpsTable = hash_create("Pending Ops Table", + 100L, + &hash_ctl, + HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); + } +} + +void +undofile_shutdown(void) +{ +} + +void +undofile_close(SMgrRelation reln, ForkNumber forknum) +{ +} + +void +undofile_create(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + elog(ERROR, "undofile_create is not supported"); +} + +bool +undofile_exists(SMgrRelation reln, ForkNumber forknum) +{ + elog(ERROR, "undofile_exists is not supported"); +} + +void +undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo) +{ + elog(ERROR, "undofile_unlink is not supported"); +} + +void +undofile_extend(SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum, char *buffer, + bool skipFsync) +{ + elog(ERROR, "undofile_extend is not supported"); +} + +void +undofile_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) +{ + elog(ERROR, "undofile_prefetch is not supported"); +} + +void +undofile_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, + char *buffer) +{ + File file; + off_t seekpos; + int nbytes; + + Assert(forknum == MAIN_FORKNUM); + file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE); + seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE)); + Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE); + nbytes = FileRead(file, buffer, BLCKSZ, seekpos, WAIT_EVENT_UNDO_FILE_READ); + if (nbytes != BLCKSZ) + { + if (nbytes < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read block %u in file \"%s\": %m", + blocknum, FilePathName(file)))); + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("could not read block %u in file \"%s\": read only %d of %d bytes", + blocknum, FilePathName(file), + nbytes, BLCKSZ))); + } +} + +static void +register_dirty_segment(SMgrRelation reln, ForkNumber forknum, int segno, File file) +{ + /* Temp relations should never be fsync'd */ + Assert(!SmgrIsTemp(reln)); + + if (pendingOpsTable) + { + /* push it into local pending-ops table */ + undofile_requestsync(reln->smgr_rnode.node, forknum, segno); + } + else + { + if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, segno)) + return; /* passed it off successfully */ + + ereport(DEBUG1, + (errmsg("could not forward fsync request because request queue is full"))); + + if (FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not fsync file \"%s\": %m", + FilePathName(file)))); + } +} + +void +undofile_write(SMgrRelation reln, ForkNumber forknum, + BlockNumber 
blocknum, char *buffer, + bool skipFsync) +{ + File file; + off_t seekpos; + int nbytes; + + Assert(forknum == MAIN_FORKNUM); + file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE); + seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE)); + Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE); + nbytes = FileWrite(file, buffer, BLCKSZ, seekpos, WAIT_EVENT_UNDO_FILE_WRITE); + if (nbytes != BLCKSZ) + { + if (nbytes < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write block %u in file \"%s\": %m", + blocknum, FilePathName(file)))); + /* + * short write: unexpected, because this should be overwriting an + * entirely pre-allocated segment file + */ + ereport(ERROR, + (errcode(ERRCODE_DISK_FULL), + errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes", + blocknum, FilePathName(file), + nbytes, BLCKSZ))); + } + + if (!skipFsync && !SmgrIsTemp(reln)) + register_dirty_segment(reln, forknum, blocknum / UNDOSEG_SIZE, file); +} + +void +undofile_writeback(SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum, BlockNumber nblocks) +{ + while (nblocks > 0) + { + File file; + int nflush; + + file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE); + + /* compute number of desired writes within the current segment */ + nflush = Min(nblocks, + 1 + UNDOSEG_SIZE - (blocknum % UNDOSEG_SIZE)); + + FileWriteback(file, + (blocknum % UNDOSEG_SIZE) * BLCKSZ, + nflush * BLCKSZ, WAIT_EVENT_UNDO_FILE_FLUSH); + + nblocks -= nflush; + blocknum += nflush; + } +} + +BlockNumber +undofile_nblocks(SMgrRelation reln, ForkNumber forknum) +{ + elog(ERROR, "undofile_nblocks is not supported"); + return 0; +} + +void +undofile_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks) +{ + elog(ERROR, "undofile_truncate is not supported"); +} + +void +undofile_immedsync(SMgrRelation reln, ForkNumber forknum) +{ + elog(ERROR, "undofile_immedsync is not supported"); +} + +void +undofile_preckpt(void) +{ +} + +void +undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno) +{ + MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt); + PendingOperationEntry *entry; + bool found; + + Assert(pendingOpsTable); + + if (forknum == FORGET_UNDO_SEGMENT_FSYNC) + { + entry = (PendingOperationEntry *) hash_search(pendingOpsTable, + &rnode, + HASH_FIND, + NULL); + if (entry) + entry->requests = bms_del_member(entry->requests, segno); + } + else + { + entry = (PendingOperationEntry *) hash_search(pendingOpsTable, + &rnode, + HASH_ENTER, + &found); + if (!found) + { + entry->cycle_ctr = undofile_sync_cycle_ctr; + entry->requests = bms_make_singleton(segno); + } + else + entry->requests = bms_add_member(entry->requests, segno); + } + + MemoryContextSwitchTo(oldcxt); +} + +void +undofile_forgetsync(Oid logno, Oid tablespace, int segno) +{ + RelFileNode rnode; + + rnode.dbNode = 9; + rnode.spcNode = tablespace; + rnode.relNode = logno; + + if (pendingOpsTable) + undofile_requestsync(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno); + else if (IsUnderPostmaster) + { + while (!ForwardFsyncRequest(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno)) + pg_usleep(10000L); + } +} + +void +undofile_sync(void) +{ + static bool undofile_sync_in_progress = false; + + HASH_SEQ_STATUS hstat; + PendingOperationEntry *entry; + int absorb_counter; + int segno; + + if (!pendingOpsTable) + elog(ERROR, "cannot sync without a pendingOpsTable"); + + AbsorbFsyncRequests(); + + if (undofile_sync_in_progress) + { + /* prior try failed, so update any stale cycle_ctr values */ + 
hash_seq_init(&hstat, pendingOpsTable); + while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) + entry->cycle_ctr = undofile_sync_cycle_ctr; + } + + undofile_sync_cycle_ctr++; + undofile_sync_in_progress = true; + + absorb_counter = FSYNCS_PER_ABSORB; + hash_seq_init(&hstat, pendingOpsTable); + while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) + { + Bitmapset *requests; + + /* Skip entries that were added after this sync cycle began. */ + if (entry->cycle_ctr == undofile_sync_cycle_ctr) + continue; + + Assert((CycleCtr) (entry->cycle_ctr + 1) == undofile_sync_cycle_ctr); + + if (!enableFsync) + continue; + + requests = entry->requests; + entry->requests = NULL; + + segno = -1; + while ((segno = bms_next_member(requests, segno)) >= 0) + { + File file; + + if (!enableFsync) + continue; + + file = undofile_open_segment_file(entry->rnode.relNode, + entry->rnode.spcNode, + segno, true /* missing_ok */); + + /* + * The file may be gone due to concurrent discard. We'll ignore + * that, but only if we find a cancel request for this segment in + * the queue. + * + * It's also possible that we succeed in opening a segment file + * that is subsequently recycled (renamed to represent a new range + * of undo log), in which case we'll fsync that later file + * instead. That is rare and harmless. + */ + if (file <= 0) + { + char name[MAXPGPATH]; + + /* + * Put the request back into the bitset in a way that can't + * fail due to memory allocation. + */ + entry->requests = bms_join(entry->requests, requests); + /* + * Check if a forgetsync request has arrived to delete that + * segment. + */ + AbsorbFsyncRequests(); + if (bms_is_member(segno, entry->requests)) + { + UndoLogSegmentPath(entry->rnode.relNode, + segno, + entry->rnode.spcNode, + name); + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not fsync file \"%s\": %m", name))); + } + /* It must have been removed, so we can safely skip it. */ + continue; + } + + elog(LOG, "fsync()ing %s", FilePathName(file)); /* TODO: remove me */ + if (FileSync(file, WAIT_EVENT_UNDO_FILE_SYNC) < 0) + { + char name[MAXPGPATH]; + + strcpy(name, FilePathName(file)); + FileClose(file); + + /* + * Keep the failed requests, but merge with any new ones. The + * requirement to be able to do this without risk of failure + * prevents us from using a smaller bitmap that doesn't bother + * tracking leading zeros. Perhaps another data structure + * would be better. + */ + entry->requests = bms_join(entry->requests, requests); + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not fsync file \"%s\": %m", name))); + } + requests = bms_del_member(requests, segno); + FileClose(file); + + if (--absorb_counter <= 0) + { + AbsorbFsyncRequests(); + absorb_counter = FSYNCS_PER_ABSORB; + } + } + + bms_free(requests); + } + + /* Flag successful completion of undofile_sync. */ + undofile_sync_in_progress = false; +} + +void undofile_postckpt(void) +{ +} + +static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno, + bool missing_ok) +{ + File file; + char path[MAXPGPATH]; + + UndoLogSegmentPath(relNode, segno, spcNode, path); + file = PathNameOpenFile(path, O_RDWR | PG_BINARY); + + if (file <= 0 && (!missing_ok || errno != ENOENT)) + elog(ERROR, "cannot open undo segment file '%s': %m", path); + + return file; +} + +/* + * Get a File for a particular segment of a SMgrRelation representing an undo + * log. + */ +static File undofile_get_segment_file(SMgrRelation reln, int segno) +{ + UndoFileState *state; + + + /* + * Create private state space on demand.
+ * + * XXX There should probably be a smgr 'open' or 'init' interface that + * would do this. smgr.c currently initializes reln->md_XXX stuff + * directly... + */ + state = (UndoFileState *) reln->private_data; + if (unlikely(state == NULL)) + { + state = MemoryContextAllocZero(UndoFileCxt, sizeof(UndoFileState)); + reln->private_data = state; + } + + /* If we have a file open already, check if we need to close it. */ + if (state->mru_file > 0 && state->mru_segno != segno) + { + /* These are not the blocks we're looking for. */ + FileClose(state->mru_file); + state->mru_file = 0; + } + + /* Check if we need to open a new file. */ + if (state->mru_file <= 0) + { + state->mru_file = + undofile_open_segment_file(reln->smgr_rnode.node.relNode, + reln->smgr_rnode.node.spcNode, + segno, InRecovery); + if (InRecovery && state->mru_file <= 0) + { + /* + * If in recovery, we may be trying to access a file that will + * later be unlinked. Tolerate missing files, creating a new + * zero-filled file as required. + */ + UndoLogNewSegment(reln->smgr_rnode.node.relNode, + reln->smgr_rnode.node.spcNode, + segno); + state->mru_file = + undofile_open_segment_file(reln->smgr_rnode.node.relNode, + reln->smgr_rnode.node.spcNode, + segno, false); + Assert(state->mru_file > 0); + } + state->mru_segno = segno; + } + + return state->mru_file; +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 5ab7d3cd8d..0054baa35a 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -4129,6 +4129,7 @@ PostgresMain(int argc, char *argv[], * not preventing advance of global xmin while we wait for the client. */ InvalidateCatalogSnapshotConditionally(); + XactPerformUndoActionsIfPending(); /* * (1) If we've reached idle state, tell the frontend we're ready for diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 970c94ee80..3b35c9d62d 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -1016,7 +1016,7 @@ ProcessUtilitySlow(ParseState *pstate, /* * parse and validate reloptions for the toast - * table + * table. */ toast_options = transformRelOptions((Datum) 0, ((CreateStmt *) stmt)->options, diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c index 525decb6f1..b5687fea8f 100644 --- a/src/backend/utils/adt/lockfuncs.c +++ b/src/backend/utils/adt/lockfuncs.c @@ -29,9 +29,11 @@ const char *const LockTagTypeNames[] = { "page", "tuple", "transactionid", + "subtransactionid", "virtualxid", "speculative token", "object", + "undoaction", "userlock", "advisory" }; diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index f955f1912a..219e98e1ad 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -28,6 +28,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/rel.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -137,12 +138,44 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS) Oid relid = PG_GETARG_OID(0); int64 result; PgStat_StatTabEntry *tabentry; + Relation rel = heap_open(relid, AccessShareLock); - if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL) + /* + * Counter tuples_hot_updated stores number of hot updates for heap table + * and the number of inplace updates for zheap table. 
+ */ + if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL || + RelationStorageIsZHeap(rel)) result = 0; else result = (int64) (tabentry->tuples_hot_updated); + heap_close(rel, AccessShareLock); + + PG_RETURN_INT64(result); +} + + +Datum +pg_stat_get_tuples_inplace_updated(PG_FUNCTION_ARGS) +{ + Oid relid = PG_GETARG_OID(0); + int64 result; + PgStat_StatTabEntry *tabentry; + Relation rel = heap_open(relid, AccessShareLock); + + /* + * Counter tuples_hot_updated stores number of hot updates for heap table + * and the number of inplace updates for zheap table. + */ + if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL || + !RelationStorageIsZHeap(rel)) + result = 0; + else + result = (int64) (tabentry->tuples_hot_updated); + + heap_close(rel, AccessShareLock); + PG_RETURN_INT64(result); } @@ -1685,12 +1718,43 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS) Oid relid = PG_GETARG_OID(0); int64 result; PgStat_TableStatus *tabentry; + Relation rel = heap_open(relid, AccessShareLock); - if ((tabentry = find_tabstat_entry(relid)) == NULL) + /* + * Counter t_tuples_hot_updated stores number of hot updates for heap + * table and the number of inplace updates for zheap table. + */ + if ((tabentry = find_tabstat_entry(relid)) == NULL || + RelationStorageIsZHeap(rel)) + result = 0; + else + result = (int64) (tabentry->t_counts.t_tuples_hot_updated); + + heap_close(rel, AccessShareLock); + + PG_RETURN_INT64(result); +} + +Datum +pg_stat_get_xact_tuples_inplace_updated(PG_FUNCTION_ARGS) +{ + Oid relid = PG_GETARG_OID(0); + int64 result; + PgStat_TableStatus *tabentry; + Relation rel = heap_open(relid, AccessShareLock); + + /* + * Counter t_tuples_hot_updated stores number of hot updates for heap table + * and the number of inplace updates for zheap table. + */ + if ((tabentry = find_tabstat_entry(relid)) == NULL || + !RelationStorageIsZHeap(rel)) result = 0; else result = (int64) (tabentry->t_counts.t_tuples_hot_updated); + heap_close(rel, AccessShareLock); + PG_RETURN_INT64(result); } diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 120550f526..71f22b0382 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -3432,9 +3432,13 @@ RelationSetNewRelfilenode(Relation relation, char persistence, } #endif - /* Indexes, sequences must have Invalid frozenxid; other rels must not */ + /* + * Indexes, sequences, zheap relations must have Invalid frozenxid; other + * rels must not + */ Assert((relation->rd_rel->relkind == RELKIND_INDEX || - relation->rd_rel->relkind == RELKIND_SEQUENCE) ? + relation->rd_rel->relkind == RELKIND_SEQUENCE || + RelationStorageIsZHeap(relation)) ? freezeXid == InvalidTransactionId : TransactionIdIsNormal(freezeXid)); Assert(TransactionIdIsNormal(freezeXid) == MultiXactIdIsValid(minmulti)); @@ -3517,6 +3521,10 @@ RelationSetNewRelfilenode(Relation relation, char persistence, /* Flag relation as needing eoxact cleanup (to remove the hint) */ EOXactListAdd(relation); + + /* Initialize the metapage for zheap relation. 
*/ + if (RelationStorageIsZHeap(relation)) + ZheapInitMetaPage(relation, MAIN_FORKNUM); } diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index c6939779b9..12e7704fda 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -121,6 +121,7 @@ bool allowSystemTableMods = false; int work_mem = 1024; int maintenance_work_mem = 16384; int max_parallel_maintenance_workers = 2; +int rollback_overflow_size = 64; /* * Primary determinants of sizes of shared-memory structures. diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index 1d57177cb5..2d90513714 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -557,6 +557,7 @@ BaseInit(void) InitFileAccess(); smgrinit(); InitBufferPoolAccess(); + UndoLogInit(); } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 11b6df209a..2cb7767fba 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -63,6 +63,7 @@ #include "postmaster/bgwriter.h" #include "postmaster/postmaster.h" #include "postmaster/syslogger.h" +#include "postmaster/undoworker.h" #include "postmaster/walwriter.h" #include "replication/logicallauncher.h" #include "replication/slot.h" @@ -120,6 +121,7 @@ extern int CommitDelay; extern int CommitSiblings; extern char *default_tablespace; extern char *temp_tablespaces; +extern char *undo_tablespaces; extern bool ignore_checksum_failure; extern bool synchronize_seqscans; @@ -1876,6 +1878,17 @@ static struct config_bool ConfigureNamesBool[] = NULL, NULL, NULL }, + { + {"disable_undo_launcher", PGC_POSTMASTER, DEVELOPER_OPTIONS, + gettext_noop("Decides whether to launch an undo worker."), + NULL, + GUC_NOT_IN_SAMPLE + }, + &disable_undo_launcher, + false, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL @@ -2860,6 +2873,16 @@ static struct config_int ConfigureNamesInt[] = 5000, 1, INT_MAX, NULL, NULL, NULL }, + { + {"rollback_overflow_size", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Rollbacks greater than this size are done lazily"), + NULL, + GUC_UNIT_MB + }, + &rollback_overflow_size, + 64, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, { {"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS, @@ -3545,6 +3568,17 @@ static struct config_string ConfigureNamesString[] = check_temp_tablespaces, assign_temp_tablespaces, NULL }, + { + {"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT, + gettext_noop("Sets the tablespace(s) to use for undo logs."), + NULL, + GUC_LIST_INPUT | GUC_LIST_QUOTE + }, + &undo_tablespaces, + "", + check_undo_tablespaces, assign_undo_tablespaces, NULL + }, + { {"dynamic_library_path", PGC_SUSET, CLIENT_CONN_OTHER, gettext_noop("Sets the path for dynamically loadable modules."), diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c index a376875269..9a5a0b17ba 100644 --- a/src/backend/utils/misc/pg_controldata.c +++ b/src/backend/utils/misc/pg_controldata.c @@ -78,8 +78,8 @@ pg_control_system(PG_FUNCTION_ARGS) Datum pg_control_checkpoint(PG_FUNCTION_ARGS) { - Datum values[19]; - bool nulls[19]; + Datum values[20]; + bool nulls[20]; TupleDesc tupdesc; HeapTuple htup; ControlFileData *ControlFile; @@ -91,7 +91,7 @@ pg_control_checkpoint(PG_FUNCTION_ARGS) * Construct a tuple descriptor for the result row. This must match this * function's pg_proc entry! 
*/ - tupdesc = CreateTemplateTupleDesc(18); + tupdesc = CreateTemplateTupleDesc(19); TupleDescInitEntry(tupdesc, (AttrNumber) 1, "checkpoint_lsn", LSNOID, -1, 0); TupleDescInitEntry(tupdesc, (AttrNumber) 2, "redo_lsn", @@ -128,6 +128,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS) XIDOID, -1, 0); TupleDescInitEntry(tupdesc, (AttrNumber) 18, "checkpoint_time", TIMESTAMPTZOID, -1, 0); + TupleDescInitEntry(tupdesc, (AttrNumber) 19, "oldest_xid_with_epoch_having_undo", + INT8OID, -1, 0); tupdesc = BlessTupleDesc(tupdesc); /* Read the control file. */ @@ -202,6 +204,9 @@ pg_control_checkpoint(PG_FUNCTION_ARGS) time_t_to_timestamptz(ControlFile->checkPointCopy.time)); nulls[17] = false; + values[18] = Int64GetDatum(ControlFile->checkPointCopy.oldestXidWithEpochHavingUndo); + nulls[18] = false; + htup = heap_form_tuple(tupdesc, values, nulls); PG_RETURN_DATUM(HeapTupleGetDatum(htup)); diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 1fa02d2c93..9190c3f9b2 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -737,4 +737,10 @@ # CUSTOMIZED OPTIONS #------------------------------------------------------------------------------ +# If large transactions requiring rollbacks are frequent, we can push them +# to undo workers for better performance. The size specified by the parameter +# below determines the minimum size of a rollback request that is sent to an +# undo worker. + +#rollback_overflow_size = 64 # Add settings for extensions here diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 211a96380e..ea0221060b 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -209,11 +209,13 @@ static const char *const subdirs[] = { "pg_snapshots", "pg_subtrans", "pg_twophase", + "pg_undo", "pg_multixact", "pg_multixact/members", "pg_multixact/offsets", "base", "base/1", + "base/undo", "pg_replslot", "pg_tblspc", "pg_stat", diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c index 895a51f89d..55ddbb29ac 100644 --- a/src/bin/pg_controldata/pg_controldata.c +++ b/src/bin/pg_controldata/pg_controldata.c @@ -278,6 +278,8 @@ main(int argc, char *argv[]) ControlFile->checkPointCopy.oldestCommitTsXid); printf(_("Latest checkpoint's newestCommitTsXid:%u\n"), ControlFile->checkPointCopy.newestCommitTsXid); + printf(_("Latest checkpoint's oldestXidWithEpochHavingUndo:" UINT64_FORMAT "\n"), + ControlFile->checkPointCopy.oldestXidWithEpochHavingUndo); printf(_("Time of latest checkpoint: %s\n"), ckpttime_str); printf(_("Fake LSN counter for unlogged rels: %X/%X\n"), @@ -329,6 +331,8 @@ main(int argc, char *argv[]) ControlFile->toast_max_chunk_size); printf(_("Size of a large-object chunk: %u\n"), ControlFile->loblksize); + printf(_("Transaction slots per zheap page: %u\n"), + ControlFile->zheap_page_trans_slots); /* This is no longer configurable, but users may still expect to see it: */ printf(_("Date/time type storage: %s\n"), _("64-bit integers")); diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c index 6fb403a5a8..673ac29785 100644 --- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -58,6 +58,11 @@ #include "pg_getopt.h" #include "getopt_long.h" +#ifndef WIN32 +#define pg_mv_file rename +#else +#define pg_mv_file pgrename +#endif static ControlFileData ControlFile; /* pg_control values */ static XLogSegNo newXlogSegNo; /* new XLOG segment # */ @@
-85,6 +90,7 @@ static void FindEndOfXLOG(void); static void KillExistingXLOG(void); static void KillExistingArchiveStatus(void); static void WriteEmptyXLOG(void); +static bool FindLatestUndoCheckPointFile(char *latest_undo_checkpoint_file); static void usage(void); @@ -115,6 +121,9 @@ main(int argc, char *argv[]) char *DataDir = NULL; char *log_fname = NULL; int fd; + char latest_undo_checkpoint_file[MAXPGPATH]; + char new_undo_checkpoint_file[MAXPGPATH]; + bool found = false; set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_resetwal")); @@ -448,6 +457,7 @@ main(int argc, char *argv[]) if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId) ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId; ControlFile.checkPointCopy.oldestXidDB = InvalidOid; + ControlFile.checkPointCopy.oldestXidWithEpochHavingUndo = 0; } if (set_oldest_commit_ts_xid != 0) @@ -514,6 +524,25 @@ main(int argc, char *argv[]) * Else, do the dirty deed. */ RewriteControlFile(); + + /* + * Find the newest undo checkpoint file under pg_undo directory and rename + * it as per the latest checkpoint redo location in control file. + */ + found = FindLatestUndoCheckPointFile(latest_undo_checkpoint_file); + if (!found) + fprintf(stderr, _("Could not find the latest undo checkpoint file.\n")); + + snprintf(new_undo_checkpoint_file, sizeof(new_undo_checkpoint_file), + "pg_undo/%016" INT64_MODIFIER "X", ControlFile.checkPointCopy.redo); + + if (pg_mv_file(latest_undo_checkpoint_file, new_undo_checkpoint_file) != 0) + { + fprintf(stderr, _("Unable to rename %s to %s.\n"), latest_undo_checkpoint_file, + new_undo_checkpoint_file); + exit(1); + } + KillExistingXLOG(); KillExistingArchiveStatus(); WriteEmptyXLOG(); @@ -716,6 +745,8 @@ GuessControlValues(void) ControlFile.checkPointCopy.oldestMultiDB = InvalidOid; ControlFile.checkPointCopy.time = (pg_time_t) time(NULL); ControlFile.checkPointCopy.oldestActiveXid = InvalidTransactionId; + ControlFile.checkPointCopy.nextXid = 0; + ControlFile.checkPointCopy.oldestXidWithEpochHavingUndo = 0; ControlFile.state = DB_SHUTDOWNED; ControlFile.time = (pg_time_t) time(NULL); @@ -808,6 +839,8 @@ PrintControlValues(bool guessed) ControlFile.checkPointCopy.oldestCommitTsXid); printf(_("Latest checkpoint's newestCommitTsXid:%u\n"), ControlFile.checkPointCopy.newestCommitTsXid); + printf(_("Latest checkpoint's oldestXidWithEpochHavingUndo:" UINT64_FORMAT "\n"), + ControlFile.checkPointCopy.oldestXidWithEpochHavingUndo); printf(_("Maximum data alignment: %u\n"), ControlFile.maxAlign); /* we don't print floatFormat since can't say much useful about it */ @@ -884,6 +917,8 @@ PrintNewControlValues(void) ControlFile.checkPointCopy.oldestXid); printf(_("OldestXID's DB: %u\n"), ControlFile.checkPointCopy.oldestXidDB); + printf(_("OldestXidWithEpochHavingUndo:" UINT64_FORMAT "\n"), + ControlFile.checkPointCopy.oldestXidWithEpochHavingUndo); } if (set_xid_epoch != -1) @@ -1303,6 +1338,55 @@ WriteEmptyXLOG(void) close(fd); } +/* + * Find the latest modified undo checkpoint file under pg_undo directory and + * delete all other files. 
+ */ +static bool +FindLatestUndoCheckPointFile(char *latest_undo_checkpoint_file) +{ + char **filenames; + char **filename; + char latest[UNDO_CHECKPOINT_FILENAME_LENGTH + 1]; + bool result = false; + + memset(latest, 0, sizeof(latest)); + + /* Copy all the files from pg_undo directory into filenames */ + filenames = pgfnames("pg_undo"); + + /* + * Start reading each file under pg_undo to identify the latest + * modified file and remove the older files that are not required. + */ + for (filename = filenames; *filename; filename++) + { + if (!(strlen(*filename) == UNDO_CHECKPOINT_FILENAME_LENGTH)) + continue; + + if (UndoCheckPointFilenamePrecedes(latest, *filename)) + { + if (latest[0] != '\0') + { + snprintf(latest_undo_checkpoint_file, MAXPGPATH, "pg_undo/%s", + latest); + if (unlink(latest_undo_checkpoint_file) != 0) + fprintf(stderr, _("could not unlink file \"%s\": %s\n"), + *filename, strerror(errno)); + } + memcpy(latest, *filename, UNDO_CHECKPOINT_FILENAME_LENGTH); + latest[UNDO_CHECKPOINT_FILENAME_LENGTH] = '\0'; + result = true; + } + } + + if (result) + snprintf(latest_undo_checkpoint_file, MAXPGPATH, "pg_undo/%s", latest); + + pgfnames_cleanup(filenames); + + return result; +} static void usage(void) diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c index 47119dc42d..036f094904 100644 --- a/src/bin/pg_upgrade/pg_upgrade.c +++ b/src/bin/pg_upgrade/pg_upgrade.c @@ -468,6 +468,15 @@ copy_xact_xlog_xid(void) GET_MAJOR_VERSION(new_cluster.major_version) < 1000 ? "pg_clog" : "pg_xact"); + /* copy old undo checkpoint files to new data dir */ + copy_subdir_files("pg_undo", "pg_undo"); + + /* + * copy old undo logs to new data dir assuming that the + * undo logs exist in default location i.e. 'base/undo'. + */ + copy_subdir_files("base/undo", "base/undo"); + /* set the next transaction id and epoch of the new cluster */ prep_status("Setting next transaction ID and epoch for new cluster"); exec_prog(UTILITY_LOG_FILE, NULL, true, true, diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c index 852d8ca4b1..d29e76a637 100644 --- a/src/bin/pg_waldump/rmgrdesc.c +++ b/src/bin/pg_waldump/rmgrdesc.c @@ -20,8 +20,12 @@ #include "access/nbtxlog.h" #include "access/rmgr.h" #include "access/spgxlog.h" +#include "access/tpd_xlog.h" +#include "access/undoaction_xlog.h" +#include "access/undolog_xlog.h" #include "access/xact.h" #include "access/xlog_internal.h" +#include "access/zheapam_xlog.h" #include "catalog/storage_xlog.h" #include "commands/dbcommands_xlog.h" #include "commands/sequence.h" diff --git a/src/include/access/genham.h b/src/include/access/genham.h new file mode 100644 index 0000000000..92122ea5c5 --- /dev/null +++ b/src/include/access/genham.h @@ -0,0 +1,143 @@ +/*------------------------------------------------------------------------- + * + * genham.h + * POSTGRES generalized heap access method definitions. 
+ * + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/genham.h + * + *------------------------------------------------------------------------- + */ +#ifndef GENHAM_H +#define GENHAM_H + +#include "access/multixact.h" +#include "access/sdir.h" +#include "access/skey.h" +#include "nodes/lockoptions.h" +#include "storage/buf.h" +#include "storage/itemptr.h" +#include "storage/lockdefs.h" +#include "utils/relcache.h" + +typedef struct BulkInsertStateData *BulkInsertState; + +/* struct definitions appear in relscan.h */ +typedef struct HeapScanDescData *HeapScanDesc; +typedef struct ParallelTableScanDescData *ParallelTableScanDesc; + +/* + * When heap_update, heap_delete, or heap_lock_tuple fail because the target + * tuple is already outdated, they fill in this struct to provide information + * to the caller about what happened. + * ctid is the target's ctid link: it is the same as the target's TID if the + * target was deleted, or the location of the replacement tuple if the target + * was updated. + * xmax is the outdating transaction's XID. If the caller wants to visit the + * replacement tuple, it must check that this matches before believing the + * replacement is really a match. + * cmax is the outdating command's CID, but only when the failure code is + * HeapTupleSelfUpdated (i.e., something in the current transaction outdated + * the tuple); otherwise cmax is zero. (We make this restriction because + * HeapTupleHeaderGetCmax doesn't work for tuples outdated in other + * transactions.) + * in_place_updated_or_locked indicates whether the tuple is updated or locked. + * We need to re-verify the tuple even if it is just marked as locked, because + * previously someone could have updated it in place. + */ +typedef struct HeapUpdateFailureData +{ + ItemPointerData ctid; + TransactionId xmax; + CommandId cmax; + bool traversed; + bool in_place_updated_or_locked; +} HeapUpdateFailureData; + +/* Result codes for HeapTupleSatisfiesVacuum */ +typedef enum +{ + HEAPTUPLE_DEAD, /* tuple is dead and deletable */ + HEAPTUPLE_LIVE, /* tuple is live (committed, no deleter) */ + HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */ + HEAPTUPLE_INSERT_IN_PROGRESS, /* inserting xact is still in progress */ + HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */ +} HTSV_Result; + +/* Result codes for ZHeapTupleSatisfiesVacuum */ +typedef enum +{ + ZHEAPTUPLE_DEAD, /* tuple is dead and deletable */ + ZHEAPTUPLE_LIVE, /* tuple is live (committed, no deleter) */ + ZHEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */ + ZHEAPTUPLE_INSERT_IN_PROGRESS, /* inserting xact is still in progress */ + ZHEAPTUPLE_DELETE_IN_PROGRESS, /* deleting xact is still in progress */ + ZHEAPTUPLE_ABORT_IN_PROGRESS /* rollback is still pending */ +} ZHTSV_Result; + +/* + * Possible lock modes for a tuple. 
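From the caller's side, the HeapUpdateFailureData fields documented above are consumed roughly as in the sketch below. This is a hedged illustration of the existing heap_update()/EvalPlanQual convention; the function report_update_conflict() is not part of the patch.

	/* Sketch: interpret a HeapUpdateFailureData filled in by a failed update.
	 * 'hufd' is assumed to have been populated by an update that returned
	 * HeapTupleUpdated for the tuple at 'tid'. */
	static void
	report_update_conflict(ItemPointer tid, HeapUpdateFailureData *hufd)
	{
		if (ItemPointerEquals(tid, &hufd->ctid))
		{
			/* Target was deleted: ctid still points at the original tuple. */
			elog(DEBUG1, "tuple was concurrently deleted by xid %u", hufd->xmax);
		}
		else
		{
			/*
			 * Target was updated: ctid points at the replacement version.  A
			 * caller that wants to retry must re-fetch that version and check
			 * that its inserting XID matches hufd->xmax before believing it
			 * really is the successor.
			 */
			elog(DEBUG1, "tuple was concurrently updated; successor at (%u,%u)",
				 ItemPointerGetBlockNumber(&hufd->ctid),
				 ItemPointerGetOffsetNumber(&hufd->ctid));
		}
	}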
+ */ +typedef enum LockTupleMode +{ + /* SELECT FOR KEY SHARE */ + LockTupleKeyShare, + /* SELECT FOR SHARE */ + LockTupleShare, + /* SELECT FOR NO KEY UPDATE, and UPDATEs that don't modify key columns */ + LockTupleNoKeyExclusive, + /* SELECT FOR UPDATE, UPDATEs that modify key columns, and DELETE */ + LockTupleExclusive +} LockTupleMode; + +#define MaxLockTupleMode LockTupleExclusive + + +static const struct +{ + LOCKMODE hwlock; + int lockstatus; + int updstatus; +} + + tupleLockExtraInfo[MaxLockTupleMode + 1] = +{ + { /* LockTupleKeyShare */ + AccessShareLock, + MultiXactStatusForKeyShare, + -1 /* KeyShare does not allow updating tuples */ + }, + { /* LockTupleShare */ + RowShareLock, + MultiXactStatusForShare, + -1 /* Share does not allow updating tuples */ + }, + { /* LockTupleNoKeyExclusive */ + ExclusiveLock, + MultiXactStatusForNoKeyUpdate, + MultiXactStatusNoKeyUpdate + }, + { /* LockTupleExclusive */ + AccessExclusiveLock, + MultiXactStatusForUpdate, + MultiXactStatusUpdate + } +}; + +#define UnlockTupleTuplock(rel, tup, mode) \ + UnlockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) + +extern bool heap_acquire_tuplock(Relation relation, ItemPointer tid, + LockTupleMode mode, LockWaitPolicy wait_policy, + bool *have_tuple_lock); +extern void GetVisibilityMapPins(Relation relation, Buffer buffer1, + Buffer buffer2, BlockNumber block1, BlockNumber block2, + Buffer *vmbuffer1, Buffer *vmbuffer2); +extern void RelationAddExtraBlocks(Relation relation, BulkInsertState bistate); +extern Buffer ReadBufferBI(Relation relation, BlockNumber targetBlock, + BulkInsertState bistate); + +#endif /* GENHAM_H */ diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h index 527138440b..7ca5058555 100644 --- a/src/include/access/hash_xlog.h +++ b/src/include/access/hash_xlog.h @@ -260,17 +260,24 @@ typedef struct xl_hash_init_bitmap_page * * Backup Blk 0: bucket page * Backup Blk 1: meta page + * + * In Hot Standby, we need to scan the entire relation to verify whether any + * hash delete index item conflicts with any standby query. For that, we need to + * know the relation type which is stored in xlog record. */ +#define XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP 0x0001 + typedef struct xl_hash_vacuum_one_page { RelFileNode hnode; int ntuples; + uint8 flags; /* See XLOG_HASH_VACUUM_* flags for details */ /* TARGET OFFSET NUMBERS FOLLOW AT THE END */ } xl_hash_vacuum_one_page; #define SizeOfHashVacuumOnePage \ - (offsetof(xl_hash_vacuum_one_page, ntuples) + sizeof(int)) + (offsetof(xl_hash_vacuum_one_page, flags) + sizeof(uint8)) extern void hash_redo(XLogReaderState *record); extern void hash_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index a309db1a1c..c16a3526a7 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -14,8 +14,7 @@ #ifndef HEAPAM_H #define HEAPAM_H -#include "access/sdir.h" -#include "access/skey.h" +#include "access/genham.h" #include "nodes/lockoptions.h" #include "nodes/primnodes.h" #include "storage/bufpage.h" @@ -35,57 +34,6 @@ typedef struct BulkInsertStateData *BulkInsertState; struct TupleTableSlot; -/* - * Possible lock modes for a tuple. 
- */ -typedef enum LockTupleMode -{ - /* SELECT FOR KEY SHARE */ - LockTupleKeyShare, - /* SELECT FOR SHARE */ - LockTupleShare, - /* SELECT FOR NO KEY UPDATE, and UPDATEs that don't modify key columns */ - LockTupleNoKeyExclusive, - /* SELECT FOR UPDATE, UPDATEs that modify key columns, and DELETE */ - LockTupleExclusive -} LockTupleMode; - -#define MaxLockTupleMode LockTupleExclusive - -/* - * When heap_update, heap_delete, or heap_lock_tuple fail because the target - * tuple is already outdated, they fill in this struct to provide information - * to the caller about what happened. - * ctid is the target's ctid link: it is the same as the target's TID if the - * target was deleted, or the location of the replacement tuple if the target - * was updated. - * xmax is the outdating transaction's XID. If the caller wants to visit the - * replacement tuple, it must check that this matches before believing the - * replacement is really a match. - * cmax is the outdating command's CID, but only when the failure code is - * HeapTupleSelfUpdated (i.e., something in the current transaction outdated - * the tuple); otherwise cmax is zero. (We make this restriction because - * HeapTupleHeaderGetCmax doesn't work for tuples outdated in other - * transactions.) - */ -typedef struct HeapUpdateFailureData -{ - ItemPointerData ctid; - TransactionId xmax; - CommandId cmax; - bool traversed; -} HeapUpdateFailureData; - -/* Result codes for HeapTupleSatisfiesVacuum */ -typedef enum -{ - HEAPTUPLE_DEAD, /* tuple is dead and deletable */ - HEAPTUPLE_LIVE, /* tuple is live (committed, no deleter) */ - HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */ - HEAPTUPLE_INSERT_IN_PROGRESS, /* inserting xact is still in progress */ - HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */ -} HTSV_Result; - /* struct definition is private to rewriteheap.c */ typedef struct RewriteStateData *RewriteState; @@ -139,6 +87,7 @@ extern void heap_rescan(TableScanDesc scan, ScanKey key, bool set_params, bool allow_strat, bool allow_sync, bool allow_pagemode); extern void heap_rescan_set_params(TableScanDesc scan, ScanKey key, bool allow_strat, bool allow_sync, bool allow_pagemode); + extern void heap_endscan(TableScanDesc scan); extern HeapTuple heap_getnext(TableScanDesc scan, ScanDirection direction); extern struct TupleTableSlot *heap_getnextslot(TableScanDesc sscan, ScanDirection direction, diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h index 708f73f0ea..2c91378b14 100644 --- a/src/include/access/htup_details.h +++ b/src/include/access/htup_details.h @@ -816,5 +816,6 @@ extern MinimalTuple minimal_tuple_from_heap_tuple(HeapTuple htup); extern size_t varsize_any(void *p); extern HeapTuple heap_expand_tuple(HeapTuple sourceTuple, TupleDesc tupleDesc); extern MinimalTuple minimal_expand_tuple(HeapTuple sourceTuple, TupleDesc tupleDesc); +extern Datum getmissingattr(TupleDesc tupleDesc, int attnum, bool *isnull); #endif /* HTUP_DETAILS_H */ diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h index 819373031c..67d83d1e37 100644 --- a/src/include/access/nbtxlog.h +++ b/src/include/access/nbtxlog.h @@ -120,17 +120,24 @@ typedef struct xl_btree_split * single index page when *not* executed by VACUUM. * * Backup Blk 0: index page + * + * In Hot Standby, we need to scan the entire relation to verify whether any + * btree delete record conflicts with any standby query. For that, we need to + * know the relation type which is stored in xlog record. 
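A sketch of how hot-standby redo is expected to consume the flag defined just below: the helper zheap_compute_conflict_horizon() is hypothetical, while btree_xlog_delete_get_latestRemovedXid() and ResolveRecoveryConflictWithSnapshot() are the existing heap-based paths.

	static void
	btree_xlog_delete_sketch(XLogReaderState *record)
	{
		xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);

		if (InHotStandby)
		{
			TransactionId latestRemovedXid;

			if (xlrec->flags & XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP)
			{
				/*
				 * zheap tuples do not carry xmin/xmax in the page itself, so
				 * the conflict horizon cannot be read off the deleted index
				 * entries; it must be derived from the zheap relation.
				 */
				latestRemovedXid = zheap_compute_conflict_horizon(record);	/* hypothetical */
			}
			else
				latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);

			ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->hnode);
		}

		/* ... then apply the page changes as before ... */
	}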
*/ +#define XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP 0x0001 + typedef struct xl_btree_delete { RelFileNode hnode; /* RelFileNode of the heap the index currently * points at */ int nitems; + uint8 flags; /* See XLOG_BTREE_DELETE_* flags for details */ /* TARGET OFFSET NUMBERS FOLLOW AT THE END */ } xl_btree_delete; -#define SizeOfBtreeDelete (offsetof(xl_btree_delete, nitems) + sizeof(int)) +#define SizeOfBtreeDelete (offsetof(xl_btree_delete, flags) + sizeof(uint8)) /* * This is what we need to know about page reuse within btree. diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index 51a3ad74fa..fa26ffa16b 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -105,6 +105,14 @@ typedef struct IndexFetchHeapData /* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */ } IndexFetchHeapData; +typedef struct IndexFetchZHeapData +{ + IndexFetchTableData xs_base; + + Buffer xs_cbuf; /* current heap buffer in scan, if any */ + /* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */ +} IndexFetchZHeapData; + /* * We use the same IndexScanDescData structure for both amgettuple-based * and amgetbitmap-based index scans. Some fields are only relevant in diff --git a/src/include/access/rewritezheap.h b/src/include/access/rewritezheap.h new file mode 100644 index 0000000000..5e9e243336 --- /dev/null +++ b/src/include/access/rewritezheap.h @@ -0,0 +1,32 @@ +/*------------------------------------------------------------------------- + * + * rewritezheap.h + * Declarations for zheap rewrite support functions + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994-5, Regents of the University of California + * + * src/include/access/rewritezheap.h + * + *------------------------------------------------------------------------- + */ +#ifndef REWRITE_ZHEAP_H +#define REWRITE_ZHEAP_H + +#include "access/zhtup.h" +#include "utils/relcache.h" + +/* struct definition is private to rewritezheap.c */ +typedef struct RewriteZheapStateData *RewriteZheapState; + +extern RewriteZheapState begin_zheap_rewrite(Relation OldHeap, Relation NewHeap, + TransactionId OldestXmin, TransactionId FreezeXid, + MultiXactId MultiXactCutoff, bool use_wal); +extern void end_zheap_rewrite(RewriteZheapState state); +extern void reform_and_rewrite_ztuple(ZHeapTuple tuple, TupleDesc oldTupDesc, + TupleDesc newTupDesc, Datum *values, bool *isnull, + RewriteZheapState rwstate); +extern void rewrite_zheap_tuple(RewriteZheapState state, ZHeapTuple oldTuple, + ZHeapTuple newTuple); + +#endif /* REWRITE_ZHEAP_H */ diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h index 0bbe9879ca..2328a1cc48 100644 --- a/src/include/access/rmgrlist.h +++ b/src/include/access/rmgrlist.h @@ -47,3 +47,8 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL) PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask) PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL) +PG_RMGR(RM_UNDOLOG_ID, "UndoLog", undolog_redo, undolog_desc, undolog_identify, NULL, NULL, NULL) +PG_RMGR(RM_ZHEAP_ID, "Zheap", zheap_redo, zheap_desc, zheap_identify, NULL, NULL, zheap_mask) +PG_RMGR(RM_ZHEAP2_ID, "Zheap2", zheap2_redo, zheap2_desc, zheap2_identify, NULL, NULL, 
zheap_mask) +PG_RMGR(RM_UNDOACTION_ID, "UndoAction", undoaction_redo, undoaction_desc, undoaction_identify, NULL, NULL, NULL) +PG_RMGR(RM_TPD_ID, "TPD", tpd_redo, tpd_desc, tpd_identify, NULL, NULL, zheap_mask) diff --git a/src/include/access/tpd.h b/src/include/access/tpd.h new file mode 100644 index 0000000000..e5de47a0e4 --- /dev/null +++ b/src/include/access/tpd.h @@ -0,0 +1,135 @@ +/*------------------------------------------------------------------------- + * + * tpd.h + * POSTGRES TPD definitions. + * + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/tpd.h + * + *------------------------------------------------------------------------- + */ +#ifndef TPD_H +#define TPD_H + +#include "postgres.h" + +#include "access/xlogutils.h" +#include "access/zheap.h" +#include "storage/block.h" +#include "utils/rel.h" + +/* TPD page information */ +typedef struct TPDPageOpaqueData +{ + BlockNumber tpd_prevblkno; + BlockNumber tpd_nextblkno; + uint32 tpd_latest_xid_epoch; + TransactionId tpd_latest_xid; +} TPDPageOpaqueData; + +typedef TPDPageOpaqueData *TPDPageOpaque; + +#define SizeofTPDPageOpaque (offsetof(TPDPageOpaqueData, tpd_latest_xid) + sizeof(TransactionId)) + +/* TPD entry information */ +#define INITIAL_TRANS_SLOTS_IN_TPD_ENTRY 8 +/* + * Number of item-to-transaction-slot mapping entries in addition to the + * maximum number of itemids on a heap page. This is required to support new + * inserts on the page; otherwise, we might immediately need to allocate a + * new, bigger TPD entry. + */ +#define ADDITIONAL_MAP_ELEM_IN_TPD_ENTRY 8 + +typedef struct TPDEntryHeaderData +{ + BlockNumber blkno; /* Heap block number to which this TPD entry + * belongs. */ + uint16 tpe_num_map_entries; + uint16 tpe_num_slots; + uint16 tpe_flags; +} TPDEntryHeaderData; + +typedef TPDEntryHeaderData *TPDEntryHeader; + +#define SizeofTPDEntryHeader (offsetof(TPDEntryHeaderData, tpe_flags) + sizeof(uint16)) + +#define TPE_ONE_BYTE 0x0001 +#define TPE_FOUR_BYTE 0x0002 +#define TPE_DELETED 0x0004 + +#define OFFSET_MASK 0x3FFFFF + +#define TPDEntryIsDeleted(tpd_e_hdr) \ +( \ + (tpd_e_hdr.tpe_flags & TPE_DELETED) != 0 \ +) + +/* Maximum size of one TPD entry. */ +#define MaxTPDEntrySize \ + ((int) (BLCKSZ - SizeOfPageHeaderData - SizeofTPDPageOpaque - sizeof(ItemIdData))) + +/* + * MaxTPDTuplesPerPage is an upper bound on the number of tuples that can + * fit on one zheap page.
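+ *
+ * As a rough worked example, with the standard 8 kB BLCKSZ, a 24-byte page
+ * header, and no padding beyond the offsets used above, the two limits come
+ * out to approximately:
+ *
+ *     MaxTPDEntrySize     = 8192 - 24 - 16 - 4          = 8148 bytes
+ *     MaxTPDTuplesPerPage = (8192 - 24 - 16) / (10 + 4) = 582
+ *
+ * The exact numbers depend on the platform's alignment rules, so treat these
+ * figures as illustrative only.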
+ */ +#define MaxTPDTuplesPerPage \ + ((int) ((BLCKSZ - SizeOfPageHeaderData - SizeofTPDPageOpaque) / \ + (SizeofTPDEntryHeader + sizeof(ItemIdData)))) + +extern OffsetNumber TPDPageAddEntry(Page tpdpage, char *tpd_entry, Size size, + OffsetNumber offset); +extern void SetTPDLocation(Buffer heapbuffer, Buffer tpdbuffer, uint16 offset); +extern void ClearTPDLocation(Buffer heapbuf); +extern void TPDInitPage(Page page, Size pageSize); +extern bool TPDFreePage(Relation rel, Buffer buf, BufferAccessStrategy bstrategy); +extern int TPDAllocateAndReserveTransSlot(Relation relation, Buffer buf, + OffsetNumber offnum, UndoRecPtr *urec_ptr); +extern TransInfo *TPDPageGetTransactionSlots(Relation relation, Buffer heapbuf, + OffsetNumber offnum, bool keepTPDBufLock, + bool checkOffset, int *num_map_entries, + int *num_trans_slots, int *tpd_buf_id, + bool *tpd_e_pruned, bool *alloc_bigger_map); +extern int TPDPageReserveTransSlot(Relation relation, Buffer heapbuf, + OffsetNumber offset, UndoRecPtr *urec_ptr, bool *lock_reacquired); +extern int TPDPageGetSlotIfExists(Relation relation, Buffer heapbuf, OffsetNumber offnum, + uint32 epoch, TransactionId xid, UndoRecPtr *urec_ptr, + bool keepTPDBufLock, bool checkOffset); +extern int TPDPageGetTransactionSlotInfo(Buffer heapbuf, int trans_slot, + OffsetNumber offset, uint32 *epoch, TransactionId *xid, + UndoRecPtr *urec_ptr, bool NoTPDBufLock, bool keepTPDBufLock); +extern void TPDPageSetTransactionSlotInfo(Buffer heapbuf, int trans_slot_id, + uint32 epoch, TransactionId xid, UndoRecPtr urec_ptr); +extern void TPDPageSetUndo(Buffer heapbuf, int trans_slot_id, + bool set_tpd_map_slot, uint32 epoch, TransactionId xid, + UndoRecPtr urec_ptr, OffsetNumber *usedoff, int ucnt); +extern void TPDPageSetOffsetMapSlot(Buffer heapbuf, int trans_slot_id, + OffsetNumber offset); +extern void TPDPageGetOffsetMap(Buffer heapbuf, char *tpd_entry_data, + int map_size); +extern int TPDPageGetOffsetMapSize(Buffer heapbuf); +extern void TPDPageSetOffsetMap(Buffer heapbuf, char *tpd_offset_map); +extern bool TPDPageLock(Relation relation, Buffer heapbuf); +extern XLogRedoAction XLogReadTPDBuffer(XLogReaderState *record, + uint8 block_id); +extern uint8 RegisterTPDBuffer(Page heappage, uint8 block_id); +extern void TPDPageSetLSN(Page heappage, XLogRecPtr recptr); +extern void UnlockReleaseTPDBuffers(void); +extern Size PageGetTPDFreeSpace(Page page); +extern void ResetRegisteredTPDBuffers(void); + +/* interfaces exposed via prunetpd.c */ +extern int TPDPagePrune(Relation rel, Buffer tpdbuf, BufferAccessStrategy strategy, + OffsetNumber target_offnum, Size space_required, bool can_free, + bool *update_tpd_inplace, bool *tpd_e_pruned); +extern void TPDPagePruneExecute(Buffer tpdbuf, OffsetNumber *nowunused, + int nunused); +extern void TPDPageRepairFragmentation(Page page, Page tmppage, + OffsetNumber target_offnum, Size space_required); + +/* Reset globals related to TPD buffers. */ +extern void ResetTPDBuffers(void); +#endif /* TPD_H */ diff --git a/src/include/access/tpd_xlog.h b/src/include/access/tpd_xlog.h new file mode 100644 index 0000000000..f16cd63c9a --- /dev/null +++ b/src/include/access/tpd_xlog.h @@ -0,0 +1,81 @@ +/*------------------------------------------------------------------------- + * + * tpd_xlog.h + * POSTGRES tpd XLOG definitions. 
+ * + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/tpd_xlog.h + * + *------------------------------------------------------------------------- + */ +#ifndef TPD_XLOG_H +#define TPD_XLOG_H + +#include "postgres.h" + +#include "access/xlogreader.h" +#include "lib/stringinfo.h" +#include "storage/off.h" + +/* + * WAL record definitions for tpd.c's WAL operations + */ +#define XLOG_ALLOCATE_TPD_ENTRY 0x00 +#define XLOG_TPD_CLEAN 0x10 +#define XLOG_TPD_CLEAR_LOCATION 0x20 +#define XLOG_INPLACE_UPDATE_TPD_ENTRY 0x30 +#define XLOG_TPD_FREE_PAGE 0x40 +#define XLOG_TPD_CLEAN_ALL_ENTRIES 0x50 + +#define XLOG_TPD_OPMASK 0x70 + +/* + * When we insert the first TPD entry on a new page while reserving a slot, + * we can (and do) restore the entire page during redo. + */ +#define XLOG_TPD_INIT_PAGE 0x80 + +#define XLOG_OLD_TPD_BUF_EQ_LAST_TPD_BUF 0x01 + +/* This is what we need to know about tpd entry allocation */ +typedef struct xl_tpd_allocate_entry +{ + /* tpd entry related info */ + BlockNumber prevblk; + BlockNumber nextblk; + OffsetNumber offnum; /* inserted entry's offset */ + + uint8 flags; + /* TPD entry data in backup block 0 */ +} xl_tpd_allocate_entry; + +#define SizeOfTPDAllocateEntry (offsetof(xl_tpd_allocate_entry, flags) + sizeof(uint8)) + +/* This is what we need to know about tpd entry cleanup */ +#define XL_TPD_CONTAINS_OFFSET (1<<0) + +typedef struct xl_tpd_clean +{ + uint8 flags; +} xl_tpd_clean; + +#define SizeOfTPDClean (offsetof(xl_tpd_clean, flags) + sizeof(uint8)) + +/* This is what we need to know about tpd free page */ + +typedef struct xl_tpd_free_page +{ + BlockNumber prevblkno; + BlockNumber nextblkno; +} xl_tpd_free_page; + +#define SizeOfTPDFreePage (offsetof(xl_tpd_free_page, nextblkno) + sizeof(BlockNumber)) + +extern void tpd_redo(XLogReaderState *record); +extern void tpd_desc(StringInfo buf, XLogReaderState *record); +extern const char *tpd_identify(uint8 info); + +#endif /* TPD_XLOG_H */ diff --git a/src/include/access/transam.h b/src/include/access/transam.h index 83ec3f1979..7b983efba4 100644 --- a/src/include/access/transam.h +++ b/src/include/access/transam.h @@ -68,6 +68,10 @@ (AssertMacro(TransactionIdIsNormal(id1) && TransactionIdIsNormal(id2)), \ (int32) ((id1) - (id2)) > 0) +/* Extract the xid from a value composed of epoch and xid */ +#define GetXidFromEpochXid(epochxid) \ + ((uint32) (epochxid) & 0xFFFFFFFF) + /* ---------- * Object ID (OID) zero is InvalidOid. * diff --git a/src/include/access/tupmacs.h b/src/include/access/tupmacs.h index 1c3741da65..70ce407d04 100644 --- a/src/include/access/tupmacs.h +++ b/src/include/access/tupmacs.h @@ -14,6 +14,7 @@ #ifndef TUPMACS_H #define TUPMACS_H +#include "access/genham.h" /* * check to see if the ATT'th bit of an array of 8-bit bytes is set. diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h index f99291e30d..7c0bc4f1e6 100644 --- a/src/include/access/tuptoaster.h +++ b/src/include/access/tuptoaster.h @@ -16,6 +16,8 @@ #include "access/htup_details.h" #include "storage/lockdefs.h" #include "utils/relcache.h" +#include "access/zheap.h" +#include "access/zhtup.h" /* * This enables de-toasting of index entries. Needed until VACUUM is @@ -136,6 +138,25 @@ extern HeapTuple toast_insert_or_update(Relation rel, HeapTuple newtup, HeapTuple oldtup, int options); +/* ---------- + * ztoast_insert_or_update - + * + * Called by zheap_insert() and zheap_update().
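+ *
+ * A sketch of the intended call pattern from zheap_update(), mirroring what
+ * heap_update() does with toast_insert_or_update(); ZHeapTupleHasExternal()
+ * is assumed here for illustration and is not declared in this header:
+ *
+ *     if (ZHeapTupleHasExternal(oldtup) ||
+ *         ZHeapTupleHasExternal(newtup) ||
+ *         newtup->t_len > TOAST_TUPLE_THRESHOLD)
+ *         newtup = ztoast_insert_or_update(relation, newtup, oldtup, options);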
+ * ---------- + */ + +extern ZHeapTuple ztoast_insert_or_update(Relation rel, + ZHeapTuple newtup, ZHeapTuple oldtup, + int options); + +/* ---------- + * ztoast_delete - + * + * Called by zheap_delete(). + * ---------- + */ +extern void ztoast_delete(Relation rel, ZHeapTuple oldtup, bool is_speculative); + /* ---------- * toast_delete - * @@ -236,4 +257,14 @@ extern Size toast_datum_size(Datum value); */ extern Oid toast_get_valid_index(Oid toastoid, LOCKMODE lock); +extern int toast_open_indexes(Relation toastrel, + LOCKMODE lock, + Relation **toastidxs, + int *num_indexes); +extern bool toastrel_valueid_exists(Relation toastrel, Oid valueid); +extern bool toastid_valueid_exists(Oid toastrelid, Oid valueid); +extern void toast_close_indexes(Relation *toastidxs, int num_indexes, + LOCKMODE lock); +extern void init_toast_snapshot(Snapshot toast_snapshot); + #endif /* TUPTOASTER_H */ diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h index 0e932daa48..d43d873442 100644 --- a/src/include/access/twophase.h +++ b/src/include/access/twophase.h @@ -18,6 +18,8 @@ #include "access/xact.h" #include "datatype/timestamp.h" #include "storage/lock.h" +#include "postmaster/undoloop.h" +#include "access/undolog.h" /* * GlobalTransactionData is defined in twophase.c; other places have no @@ -41,7 +43,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid, TimestampTz prepared_at, Oid owner, Oid databaseid); -extern void StartPrepare(GlobalTransaction gxact); +extern void StartPrepare(GlobalTransaction gxact, UndoRecPtr *, UndoRecPtr *); extern void EndPrepare(GlobalTransaction gxact); extern bool StandbyTransactionIdIsPrepared(TransactionId xid); diff --git a/src/include/access/undoaction_xlog.h b/src/include/access/undoaction_xlog.h new file mode 100644 index 0000000000..bfc64182eb --- /dev/null +++ b/src/include/access/undoaction_xlog.h @@ -0,0 +1,74 @@ +/*------------------------------------------------------------------------- + * + * undoaction_xlog.h + * undo action XLOG definitions + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/undoaction_xlog.h + * + *------------------------------------------------------------------------- + */ +#ifndef UNDOACTION_XLOG_H +#define UNDOACTION_XLOG_H + +#include "access/undolog.h" +#include "access/xlogreader.h" +#include "lib/stringinfo.h" +#include "storage/off.h" + +/* + * WAL record definitions for undoactions.c's WAL operations + */ +#define XLOG_UNDO_PAGE 0x00 +#define XLOG_UNDO_RESET_SLOT 0x10 +#define XLOG_UNDO_APPLY_PROGRESS 0x20 + +/* + * xl_undoaction_page flag values, 8 bits are available. 
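+ *
+ * A writer-side sketch of how these flags are meant to be combined before
+ * being attached to the WAL record (the slot-count constant and the
+ * visibility-map condition are placeholders, not taken from this patch):
+ *
+ *     uint8 flags = 0;
+ *
+ *     if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS)   -- placeholder constant
+ *         flags |= XLU_PAGE_CONTAINS_TPD_SLOT;
+ *     if (vm_bit_needs_clearing)                    -- placeholder condition
+ *         flags |= XLU_PAGE_CLEAR_VISIBILITY_MAP;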
+ */ +#define XLU_PAGE_CONTAINS_TPD_SLOT (1<<0) +#define XLU_PAGE_CLEAR_VISIBILITY_MAP (1<<1) +#define XLU_CONTAINS_TPD_OFFSET_MAP (1<<2) +#define XLU_INIT_PAGE (1<<3) + +/* This is what we need to know about applying undo actions to a page */ +typedef struct xl_undoaction_page +{ + UndoRecPtr urec_ptr; + TransactionId xid; + int trans_slot_id; /* transaction slot id */ +} xl_undoaction_page; + +#define SizeOfUndoActionPage (offsetof(xl_undoaction_page, trans_slot_id) + sizeof(int)) + +/* This is what we need to know about undo apply progress */ +typedef struct xl_undoapply_progress +{ + UndoRecPtr urec_ptr; + uint32 progress; +} xl_undoapply_progress; + +#define SizeOfUndoActionProgress (offsetof(xl_undoapply_progress, progress) + sizeof(uint32)) + +/* + * xl_undoaction_reset_slot flag values, 8 bits are available. + */ +#define XLU_RESET_CONTAINS_TPD_SLOT (1<<0) + +/* This is what we need to know about resetting a transaction slot */ +typedef struct xl_undoaction_reset_slot +{ + UndoRecPtr urec_ptr; + int trans_slot_id; /* transaction slot id */ + uint8 flags; +} xl_undoaction_reset_slot; + +#define SizeOfUndoActionResetSlot (offsetof(xl_undoaction_reset_slot, flags) + sizeof(uint8)) + +extern void undoaction_redo(XLogReaderState *record); +extern void undoaction_desc(StringInfo buf, XLogReaderState *record); +extern const char *undoaction_identify(uint8 info); + +#endif /* UNDOACTION_XLOG_H */ diff --git a/src/include/access/undodiscard.h b/src/include/access/undodiscard.h new file mode 100644 index 0000000000..4234c0cb54 --- /dev/null +++ b/src/include/access/undodiscard.h @@ -0,0 +1,38 @@ +/*------------------------------------------------------------------------- + * + * undodiscard.h + * undo discard definitions + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/undodiscard.h + * + *------------------------------------------------------------------------- + */ +#ifndef UNDODISCARD_H +#define UNDODISCARD_H + +#include "access/undolog.h" +#include "access/xlogdefs.h" +#include "catalog/pg_class.h" +#include "storage/lwlock.h" + +/* + * Discard the undo for all transactions whose xid is smaller than xmin. + * + * Check the DiscardInfo memory array for each slot (every undo log), and + * process the undo log for every slot that has an xid smaller than xmin or an + * invalid xid. Fetch records from the undo log transaction by transaction + * until we find an xid that is not smaller than xmin. + */ +extern void UndoDiscard(TransactionId xmin, bool *hibernate); + +/* To calculate the size of the hash table for rollbacks. */ +extern int RollbackHTSize(void); + +/* To initialize the hash table in shared memory for rollbacks.
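+ *
+ * A sketch of the usual shared-memory wiring for RollbackHTSize() and
+ * InitRollbackHashTable(); the exact call sites are an assumption here, not
+ * taken from this patch:
+ *
+ *     size = add_size(size, RollbackHTSize());   -- while sizing shared memory
+ *     ...
+ *     InitRollbackHashTable();                   -- while initializing it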
*/ +extern void InitRollbackHashTable(void); + +#endif /* UNDODISCARD_H */ + diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h new file mode 100644 index 0000000000..f0b5a24099 --- /dev/null +++ b/src/include/access/undoinsert.h @@ -0,0 +1,110 @@ +/*------------------------------------------------------------------------- + * + * undoinsert.h + * entry points for inserting undo records + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/undoinsert.h + * + *------------------------------------------------------------------------- + */ +#ifndef UNDOINSERT_H +#define UNDOINSERT_H + +#include "access/undolog.h" +#include "access/undorecord.h" +#include "access/xlogdefs.h" +#include "catalog/pg_class.h" + +/* + * Typedef for callback function for UndoFetchRecord. + * + * This checks whether an undo record satisfies the given conditions. + */ +typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec, + BlockNumber blkno, + OffsetNumber offset, + TransactionId xid); + +/* + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you + * intend to insert. Upon return, the necessary undo buffers are pinned and + * locked. + * This should be done before any critical section is established, since it + * can fail. + * + * If not in recovery, 'xid' should refer to the top transaction id because + * the undo log only stores mappings for top-level transactions. + * If in recovery, 'xid' refers to the transaction id stored in WAL. + */ +extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid, + UndoPersistence, xl_undolog_meta *); + +/* + * Insert a previously-prepared undo record. This will write the actual undo + * record into the buffers already pinned and locked by PrepareUndoInsert, + * and mark them dirty. For persistent undo, this step should be performed + * after entering a critical section; it should never fail. + */ +extern void InsertPreparedUndo(void); + +/* + * Unlock and release undo buffers. This step is performed after exiting any + * critical section where we have prepared the undo record. + */ +extern void UnlockReleaseUndoBuffers(void); + +/* + * Forget about any previously-prepared undo record. Error recovery calls + * this, but it can also be used by other code that changes its mind about + * inserting undo after having prepared a record for insertion. + */ +extern void CancelPreparedUndo(void); + +/* + * Fetch the next undo record for the given blkno and offset. Start the + * search from urp. The caller needs to call UndoRecordRelease to release the + * resources allocated by this function. + */ +extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, + BlockNumber blkno, + OffsetNumber offset, + TransactionId xid, + UndoRecPtr *urec_ptr_out, + SatisfyUndoRecordCallback callback); + +/* + * Release the resources allocated by UndoFetchRecord. + */ +extern void UndoRecordRelease(UnpackedUndoRecord *urec); + +/* + * Set the value of PrevUndoLen. + */ +extern void UndoRecordSetPrevUndoLen(uint16 len); + +/* + * Call UndoSetPrepareSize to set the maximum number of undo records that can + * be prepared before they are inserted. If the size is greater than + * MAX_PREPARED_UNDO, extra memory is allocated to hold the additional + * prepared undo records.
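+ *
+ * Putting the pieces above together, a caller that emits undo records
+ * typically follows this pattern (a sketch only; WAL logging, error handling
+ * and the page modifications themselves are omitted):
+ *
+ *     UndoSetPrepareSize(undorecords, nrecords, xid, persistence, &undometa);
+ *     for (i = 0; i < nrecords; i++)
+ *         urecptr = PrepareUndoInsert(&undorecords[i], xid, persistence,
+ *                                     &undometa);
+ *
+ *     START_CRIT_SECTION();
+ *     InsertPreparedUndo();
+ *     ... apply the corresponding page changes, XLogInsert(), etc. ...
+ *     END_CRIT_SECTION();
+ *
+ *     UnlockReleaseUndoBuffers();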
+ */ +extern void UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords, + TransactionId xid, UndoPersistence upersistence, + xl_undolog_meta *undometa); + +/* + * return the previous undo record pointer. + */ +extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen); + +extern void UndoRecordOnUndoLogChange(UndoPersistence persistence); + +extern void PrepareUpdateUndoActionProgress(UndoRecPtr urecptr, int progress); +extern void UndoRecordUpdateTransInfo(void); + +/* Reset globals related to undo buffers */ +extern void ResetUndoBuffers(void); + +#endif /* UNDOINSERT_H */ diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h new file mode 100644 index 0000000000..e36faaa75d --- /dev/null +++ b/src/include/access/undolog.h @@ -0,0 +1,332 @@ +/*------------------------------------------------------------------------- + * + * undolog.h + * + * PostgreSQL undo log manager. This module is responsible for lifecycle + * management of undo logs and backing files, associating undo logs with + * backends, allocating and managing space within undo logs. + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/undolog.h + * + *------------------------------------------------------------------------- + */ +#ifndef UNDOLOG_H +#define UNDOLOG_H + +#include "access/xlogreader.h" +#include "catalog/pg_class.h" +#include "common/relpath.h" +#include "storage/bufpage.h" + +#ifndef FRONTEND +#include "storage/lwlock.h" +#endif + +/* The type used to identify an undo log and position within it. */ +typedef uint64 UndoRecPtr; + +/* The type used for undo record lengths. */ +typedef uint16 UndoRecordSize; + +/* Undo log statuses. */ +typedef enum +{ + UNDO_LOG_STATUS_UNUSED = 0, + UNDO_LOG_STATUS_ACTIVE, + UNDO_LOG_STATUS_EXHAUSTED, + UNDO_LOG_STATUS_DISCARDED +} UndoLogStatus; + +/* + * Undo log persistence levels. These have a one-to-one correspondence with + * relpersistence values, but are small integers so that we can use them as an + * index into the "logs" and "lognos" arrays. + */ +typedef enum +{ + UNDO_PERMANENT = 0, + UNDO_UNLOGGED = 1, + UNDO_TEMP = 2 +} UndoPersistence; + +#define UndoPersistenceLevels 3 + +/* + * Convert from relpersistence ('p', 'u', 't') to an UndoPersistence + * enumerator. + */ +#define UndoPersistenceForRelPersistence(rp) \ + ((rp) == RELPERSISTENCE_PERMANENT ? UNDO_PERMANENT : \ + (rp) == RELPERSISTENCE_UNLOGGED ? UNDO_UNLOGGED : UNDO_TEMP) + +/* + * Convert from UndoPersistence to a relpersistence value. + */ +#define RelPersistenceForUndoPersistence(up) \ + ((up) == UNDO_PERMANENT ? RELPERSISTENCE_PERMANENT : \ + (up) == UNDO_UNLOGGED ? RELPERSISTENCE_UNLOGGED : \ + RELPERSISTENCE_TEMP) + +/* + * Get the appropriate UndoPersistence value from a Relation. + */ +#define UndoPersistenceForRelation(rel) \ + (UndoPersistenceForRelPersistence((rel)->rd_rel->relpersistence)) + +/* Type for offsets within undo logs */ +typedef uint64 UndoLogOffset; + +/* printf-family format string for UndoRecPtr. */ +#define UndoRecPtrFormat "%016" INT64_MODIFIER "X" + +/* printf-family format string for UndoLogOffset. */ +#define UndoLogOffsetFormat UINT64_FORMAT + +/* Number of blocks of BLCKSZ in an undo log segment file. 128 = 1MB. */ +#define UNDOSEG_SIZE 128 + +/* Size of an undo log segment file in bytes. */ +#define UndoLogSegmentSize ((size_t) BLCKSZ * UNDOSEG_SIZE) + +/* The width of an undo log number in bits. 
24 allows for 16.7m logs. */ +#define UndoLogNumberBits 24 + +/* The width of an undo log offset in bits. 40 allows for 1TB per log.*/ +#define UndoLogOffsetBits (64 - UndoLogNumberBits) + +/* Special value for undo record pointer which indicates that it is invalid. */ +#define InvalidUndoRecPtr ((UndoRecPtr) 0) + +/* End-of-list value when building linked lists of undo logs. */ +#define InvalidUndoLogNumber -1 + +/* + * The maximum amount of data that can be stored in an undo log. Can be set + * artificially low to test full log behavior. + */ +#define UndoLogMaxSize ((UndoLogOffset) 1 << UndoLogOffsetBits) + +/* Type for numbering undo logs. */ +typedef int UndoLogNumber; + +/* Extract the undo log number from an UndoRecPtr. */ +#define UndoRecPtrGetLogNo(urp) \ + ((urp) >> UndoLogOffsetBits) + +/* Extract the offset from an UndoRecPtr. */ +#define UndoRecPtrGetOffset(urp) \ + ((urp) & ((UINT64CONST(1) << UndoLogOffsetBits) - 1)) + +/* Make an UndoRecPtr from an log number and offset. */ +#define MakeUndoRecPtr(logno, offset) \ + (((uint64) (logno) << UndoLogOffsetBits) | (offset)) + +/* The number of unusable bytes in the header of each block. */ +#define UndoLogBlockHeaderSize SizeOfPageHeaderData + +/* The number of usable bytes we can store per block. */ +#define UndoLogUsableBytesPerPage (BLCKSZ - UndoLogBlockHeaderSize) + +/* The pseudo-database OID used for undo logs. */ +#define UndoLogDatabaseOid 9 + +/* Length of undo checkpoint filename */ +#define UNDO_CHECKPOINT_FILENAME_LENGTH 16 + +/* + * UndoRecPtrIsValid + * True iff undoRecPtr is valid. + */ +#define UndoRecPtrIsValid(undoRecPtr) \ + ((bool) ((UndoRecPtr) (undoRecPtr) != InvalidUndoRecPtr)) + +/* Extract the relnode for an undo log. */ +#define UndoRecPtrGetRelNode(urp) \ + UndoRecPtrGetLogNo(urp) + +/* The only valid fork number for undo log buffers. */ +#define UndoLogForkNum MAIN_FORKNUM + +/* Compute the block number that holds a given UndoRecPtr. */ +#define UndoRecPtrGetBlockNum(urp) \ + (UndoRecPtrGetOffset(urp) / BLCKSZ) + +/* Compute the offset of a given UndoRecPtr in the page that holds it. */ +#define UndoRecPtrGetPageOffset(urp) \ + (UndoRecPtrGetOffset(urp) % BLCKSZ) + +/* Compare two undo checkpoint files to find the oldest file. */ +#define UndoCheckPointFilenamePrecedes(file1, file2) \ + (strcmp(file1, file2) < 0) + +/* What is the offset of the i'th non-header byte? */ +#define UndoLogOffsetFromUsableByteNo(i) \ + (((i) / UndoLogUsableBytesPerPage) * BLCKSZ + \ + UndoLogBlockHeaderSize + \ + ((i) % UndoLogUsableBytesPerPage)) + +/* How many non-header bytes are there before a given offset? */ +#define UndoLogOffsetToUsableByteNo(offset) \ + (((offset) % BLCKSZ - UndoLogBlockHeaderSize) + \ + ((offset) / BLCKSZ) * UndoLogUsableBytesPerPage) + +/* Add 'n' usable bytes to offset stepping over headers to find new offset. */ +#define UndoLogOffsetPlusUsableBytes(offset, n) \ + UndoLogOffsetFromUsableByteNo(UndoLogOffsetToUsableByteNo(offset) + (n)) + +/* Find out which tablespace the given undo log location is backed by. */ +extern Oid UndoRecPtrGetTablespace(UndoRecPtr insertion_point); + +/* Populate a RelFileNode from an UndoRecPtr. */ +#define UndoRecPtrAssignRelFileNode(rfn, urp) \ + do \ + { \ + (rfn).spcNode = UndoRecPtrGetTablespace(urp); \ + (rfn).dbNode = UndoLogDatabaseOid; \ + (rfn).relNode = UndoRecPtrGetRelNode(urp); \ + } while (false); + +/* + * Control metadata for an active undo log. Lives in shared memory inside an + * UndoLogControl object, but also written to disk during checkpoints. 
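+ *
+ * For orientation, the offsets stored below combine with the log number into
+ * the UndoRecPtr values used throughout the system; such a pointer can be
+ * mapped back to a physical location with the macros above, e.g. (sketch):
+ *
+ *     UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+ *     BlockNumber blkno = UndoRecPtrGetBlockNum(urp);
+ *     int pageoff = UndoRecPtrGetPageOffset(urp);
+ *     RelFileNode rnode;
+ *
+ *     UndoRecPtrAssignRelFileNode(rnode, urp);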
+ */ +typedef struct UndoLogMetaData +{ + UndoLogStatus status; + Oid tablespace; + UndoPersistence persistence; /* permanent, unlogged, temp? */ + UndoLogOffset insert; /* next insertion point (head) */ + UndoLogOffset end; /* one past end of highest segment */ + UndoLogOffset discard; /* oldest data needed (tail) */ + UndoLogOffset last_xact_start; /* last transaction's start undo offset */ + + /* + * If the same transaction is split over two undo logs, then this stores + * the previous log number; see the file header comments of undorecord.c + * for its usage. + * + * Fixme: See if we can find another way to handle this instead of keeping + * the previous log number. + */ + UndoLogNumber prevlogno; /* Previous undo log number */ + bool is_first_rec; + + /* + * Length of the last undo record. We need to save this in the undo + * meta-data and WAL-log it so that the value is preserved across a + * restart and the first undo record written after the restart can locate + * its predecessor. It is used to step to a transaction's previous record + * during rollback; if a transaction did some work before a checkpoint and + * the rest after it, we could not roll back properly without the + * pre-checkpoint prevlen. The undo worker also fetches this value when + * rolling back the last transaction in an undo log, to locate that + * transaction's last undo record. + */ + uint16 prevlen; +} UndoLogMetaData; + +/* Record the undo log number used for a transaction. */ +typedef struct xl_undolog_meta +{ + UndoLogMetaData meta; + UndoLogNumber logno; + TransactionId xid; +} xl_undolog_meta; + +#ifndef FRONTEND + +/* + * The in-memory control object for an undo log. In addition to the current + * meta-data for the undo log, we also lazily maintain a snapshot of the + * meta-data as it was at the redo point of a checkpoint that is in progress. + * + * Conceptually the set of UndoLogControl objects is arranged into a very + * large array for access by log number, but because we typically need only a + * smallish number of adjacent undo logs to be active at a time we arrange + * them into smaller fragments called 'banks'. + */ +typedef struct UndoLogControl +{ + UndoLogNumber logno; + UndoLogMetaData meta; /* current meta-data */ + XLogRecPtr lsn; + bool need_attach_wal_record; /* need to WAL-log an attach record? */ + pid_t pid; /* InvalidPid for unattached */ + LWLock mutex; /* protects the above */ + TransactionId xid; + /* State used by undo workers. */ + TransactionId oldest_xid; /* cache of oldest transaction's xid */ + uint32 oldest_xidepoch; + UndoRecPtr oldest_data; + LWLock discard_lock; /* prevents discarding while reading */ + LWLock rewind_lock; /* prevents rewinding while reading */ + + UndoLogNumber next_free; /* protected by UndoLogLock */ +} UndoLogControl; + +#endif + +/* Space management. */ +extern UndoRecPtr UndoLogAllocate(size_t size, + UndoPersistence level); +extern UndoRecPtr UndoLogAllocateInRecovery(TransactionId xid, + size_t size, + UndoPersistence persistence); +extern void UndoLogAdvance(UndoRecPtr insertion_point, + size_t size, + UndoPersistence persistence); +extern void UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid); +extern bool UndoLogIsDiscarded(UndoRecPtr point); + +/* Initialization interfaces.
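+ *
+ * A sketch of the expected startup ordering for the functions declared
+ * below; the exact call sites (shared-memory sizing, creation, and startup)
+ * are an assumption here:
+ *
+ *     size = add_size(size, UndoLogShmemSize());   -- while sizing shared memory
+ *     UndoLogShmemInit();                          -- while creating it
+ *     StartupUndoLogs(checkPointRedo);             -- during startup/recovery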
*/ +extern void StartupUndoLogs(XLogRecPtr checkPointRedo); +extern void UndoLogShmemInit(void); +extern Size UndoLogShmemSize(void); +extern void UndoLogInit(void); +extern void UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace, + char *path); +extern void ResetUndoLogs(UndoPersistence persistence); + +/* Interface use by tablespace.c. */ +extern bool DropUndoLogsInTablespace(Oid tablespace); + +/* GUC interfaces. */ +extern void assign_undo_tablespaces(const char *newval, void *extra); + +/* Checkpointing interfaces. */ +extern void CheckPointUndoLogs(XLogRecPtr checkPointRedo, + XLogRecPtr priorCheckPointRedo); + +#ifndef FRONTEND + +extern UndoLogControl *UndoLogGet(UndoLogNumber logno); +extern UndoLogControl *UndoLogNext(UndoLogControl *log); +extern bool AmAttachedToUndoLog(UndoLogControl *log); + +#endif + +extern void UndoLogSetLastXactStartPoint(UndoRecPtr point); +extern UndoRecPtr UndoLogGetLastXactStartPoint(UndoLogNumber logno); +extern UndoRecPtr UndoLogGetCurrentLocation(UndoPersistence persistence); +extern UndoRecPtr UndoLogGetFirstValidRecord(UndoLogNumber logno); +extern UndoRecPtr UndoLogGetNextInsertPtr(UndoLogNumber logno, + TransactionId xid); +extern void UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen); +extern bool IsTransactionFirstRec(TransactionId xid); +extern void UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen); +extern uint16 UndoLogGetPrevLen(UndoLogNumber logno); +extern bool NeedUndoMetaLog(XLogRecPtr redo_point); +extern void UndoLogSetLSN(XLogRecPtr lsn); +extern void LogUndoMetaData(xl_undolog_meta *xlrec); +void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno); +/* Redo interface. */ +extern void undolog_redo(XLogReaderState *record); +/* Discard the undo logs for temp tables */ +extern void TempUndoDiscard(UndoLogNumber); +extern Oid UndoLogStateGetDatabaseId(void); + +#endif diff --git a/src/include/access/undolog_xlog.h b/src/include/access/undolog_xlog.h new file mode 100644 index 0000000000..c6354bc7b5 --- /dev/null +++ b/src/include/access/undolog_xlog.h @@ -0,0 +1,71 @@ +/*------------------------------------------------------------------------- + * + * undolog_xlog.h + * undo log access XLOG definitions. + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/undolog_xlog.h + * + *------------------------------------------------------------------------