From: Andres Freund
Date: Thu, 6 Dec 2018 04:28:24 +0000 (-0800)
Subject: ZHEAP on Pluggable V1.
X-Git-Url: http://git.postgresql.org/gitweb/static/gitweb.js?a=commitdiff_plain;h=4a9c21d57f7177b932bc76490990029ffffe7046;p=users%2Fandresfreund%2Fpostgres.git

ZHEAP on Pluggable V1.

Squashed commit of the following:

Pluggable Storage Integration +

commit 309145fb8ab59c993e26855cf0bf51d37902773f (zheap/master)
Author: mahendra
Date: 2018-12-10 11:14:44 +0530

Fix WAL logging for tpd map update during Rollback

We were not WAL logging the tpd map update in a few cases; fix that.

By Mahendra Singh Thalor and Amit Kapila

commit deda887629ca8bba859c0fa2a2e459e2c922c2bd
Author: mahendra
Date: 2018-12-10 10:45:04 +0530

BugFix: Correcting block number read in tpd_xlog_free_page

In the tpd_xlog_free_page function, we were trying to read from block 3 the data written in block 2. Hence correcting the block number.

Patch by me, reviewed by Amit Kapila

commit 79738e7038e267c2c685cc5a5221153269317eeb
Author: Dilip Kumar
Date: 2018-12-07 10:54:33 +0530

Revert "For slots marked as invalid xact, skip fetching undo for some cases"

This reverts commit 796299af2ca14800f8d69560d33f89ac293c1d15.

commit 796299af2ca14800f8d69560d33f89ac293c1d15
Author: Kuntal Ghosh
Date: 2018-12-05 17:27:18 +0530

For slots marked as invalid xact, skip fetching undo for some cases

For a slot marked as invalid xact, if the corresponding xid precedes the oldest xid having undo, or the corresponding undo record is already discarded, we can consider the slot as frozen.

Patch by me. Investigated by Amit Kapila, Dilip Kumar and me.

commit e26c804789fdcf24fc588a734b2196a81a51eef4
Author: Dilip Kumar
Date: 2018-12-05 13:09:06 +0530

Merged review comment fixes from pg hackers patch

Dilip Kumar and Amit Kapila

commit 1fee88b7ec5d9aed7e1c0429292906e30f457f0d
Author: Mithun CY
Date: 2018-12-04 02:11:08 -0800

Fix CLANG's warnings.

By Mithun C Y, review by Amit Kapila.

commit 5b48c0c175dabfaf34099f0cc7feefe0b2367a0c
Author: Mithun CY
Date: 2018-12-04 01:35:40 -0800

Fix thinkos, caught by CLANG compilers.

By Mithun C Y, review by Dilip Kumar.

commit a1e1859d6910f49bb6eec8d04113dbba78deee3a
Author: Dilip Kumar
Date: 2018-12-03 14:54:38 +0530

Improve comments and code for previous commit.

commit 132fefaf953e632e07ddc53f8e0198e655cf0896
Author: Dilip Kumar
Date: 2018-12-02 16:35:02 +0530

In ZGetMultiLockMembers we fetch the undo record without a buffer lock, so it's possible that a transaction in the slot can roll back and rewind the undo record pointer. To prevent that, we acquire the rewind lock before rewinding the undo record pointer, and the same lock will be acquired by ZGetMultiLockMembers in shared mode. In other places where we fetch the undo record we don't need this lock, as we are doing that under the buffer lock. So remember to acquire the rewind lock in shared mode wherever we are fetching the undo record of a non-committed transaction without a buffer lock.

Mahendra Thalor
Reviewed and modified by Dilip Kumar

commit c200352aa6917b8c3d4a0e771a1f19d6f4eec6d9
Author: Rafia Sabih
Date: 2018-11-30 14:06:07 +0530

Bugfix in TPD recovery

Patch by me, reviewed by Dilip Kumar

commit fd9152fa3ecfe1d0d8e6d06571d91f082d0f4ee9
Author: Dilip Kumar
Date: 2018-11-29 07:17:27 -0800

Remove multilocker flags for unwanted cases

Currently, we are setting the multilocker flag whenever a lock is acquired on the tuple, which has some performance penalty.
As part of this patch, we have avoided setting the multilocker flag for the case when an updater is taking a lock through the EvalPlanQual mechanism.

Amit Kapila and Dilip Kumar

commit 79882e67fe4572081fc65a0dc28c20a773ef0533
Author: Kuntal Ghosh
Date: 2018-11-29 18:26:12 +0530

Fix issue in TPD allocate entry

While allocating an entry for TPD, we traverse all the tuples in the page, check for tuples that correspond to the last slot, and update the corresponding offset map in TPD with the actual transaction slot.

commit 8a6d050f121d2e588defee48d4a3ad9e43ac6136
Author: Rafia Sabih
Date: 2018-11-29 16:57:02 +0530

Bugfix in recovery of TPD free page

Patch by me, reviewed by Amit Kapila

commit 47b86637a3d1ae71102c1a7a7704aaf7e95feffc
Author: Kuntal Ghosh
Date: 2018-11-28 14:10:10 +0530

Fix issues in TPD

1. If the previous and next block numbers are invalid for a TPD page, then that is the only TPD page in the relation. We should handle the case correctly in TPDFreePage.
2. While allocating a new TPD page, we ask for a new page from the FSM. It's possible that the FSM returns a zheap page on which the current backend already holds a lock in exclusive mode. Hence, try for a conditional lock. If it can't get the lock immediately, extend the relation and allocate a new TPD block.
3. In TPDPageLock, if a TPD buffer is already pruned, we don't take a lock on the same. Here, we should take the opportunity to clear the TPD location from the corresponding zheap buffer.
4. In lazy vacuum, we don't vacuum a TPD page. Instead, we try to prune the page. But if it's already pruned, we should skip it.
5. There are several places in the code where we lock a TPD page before entering a critical section. For non-inplace updates on different buffers, it is possible that both the old and new zheap buffers correspond to the same TPD buffer. Hence, we should be careful that we don't try to lock the TPD page two times. Else, we'll be waiting on ourselves.

Patch by me. Reviewed by Amit Kapila.

commit 8c31fef78166e7df53470f43c6bd1d53d18341e7
Author: Kuntal Ghosh
Date: 2018-11-28 14:31:09 +0530

Skip using tuple if corresponding item is deleted

If the item id is marked as deleted, we can't check the tuple. Else, it'll lead to a segmentation fault.

Patch by me. Reviewed by Amit Kapila.

commit c66b03f50c074236882fb0370a87dad8df4a597c
Author: Amit Kapila
Date: 2018-11-27 09:10:44 +0530

Fix Windows build.

commit 35e2e1b5907d85ee02d74b728a23ab7865405e64
Author: Rafia Sabih
Date: 2018-11-26 14:34:16 +0530

Forget Local buffers

For zheap, forget the local buffers whenever used.

Patch by me, reviewed by Amit Kapila

commit 3a1f846c48cea443478478622c12b6df28fe15a6
Author: Amit Kapila
Date: 2018-11-25 10:12:13 +0530

Fetch CID from undo only when required.

If the current command doesn't need to modify any tuple and the snapshot used is not of any previous command, then it can see all the modifications made by the current transaction till now. So, we don't even attempt to fetch the CID from undo in such cases.

Patch by Amit Kapila, reviewed and verified by Dilip Kumar

commit e48be86489daf6bc78071d7fe7fe5e1eb5b2a705
Author: Mithun CY
Date: 2018-11-23 07:58:01 -0800

Bug Fix, correct an invalid Assert.

slot_xid is uninitialized and the Assert condition is invalid.

Patch by Mahendra Thalor
Review by Amit Kapila.
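A minimal standalone illustration (names and values are hypothetical, not taken from the patch) of the epoch-qualified XID comparison described in the "For slots marked as invalid xact, skip fetching undo for some cases" commit above: a slot can be treated as frozen once its (epoch, xid) pair precedes the oldest xid that still has undo.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Combine a 32-bit epoch and a 32-bit xid into one orderable 64-bit value. */
    static uint64_t
    make_epoch_xid(uint32_t epoch, uint32_t xid)
    {
        return ((uint64_t) epoch << 32) | xid;
    }

    /*
     * A slot whose (epoch, xid) precedes the oldest xid that still has undo can
     * be treated as frozen: its undo has already been discarded.
     */
    static bool
    slot_is_effectively_frozen(uint32_t slot_epoch, uint32_t slot_xid,
                               uint64_t oldest_xid_with_epoch_having_undo)
    {
        return make_epoch_xid(slot_epoch, slot_xid) <
               oldest_xid_with_epoch_having_undo;
    }

    int
    main(void)
    {
        uint64_t oldest = make_epoch_xid(2, 1000);

        printf("%d\n", slot_is_effectively_frozen(1, 5000, oldest));  /* 1: older epoch */
        printf("%d\n", slot_is_effectively_frozen(2, 1500, oldest));  /* 0: still has undo */
        return 0;
    }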
commit 12bf9650d247c64d96868ada87fcda07a2ff46b5
Author: Dilip Kumar
Date: 2018-11-23 03:31:06 -0800

Merge review comment fixes on the community patch

Dilip Kumar

commit d06ab949c8774f6989ffefc65a100b7208516a06
Author: Amit Kapila
Date: 2018-11-22 15:53:22 +0530

Fix the modification of TPD map during rollback.

If the previous transaction slot points to a TPD slot, then we need to update the slot in the offset map of the TPD entry. This is the case where, during the DO operation, the previous updater belongs to a non-TPD slot whereas now the same slot has become a TPD slot. In such cases, we need to update the offset-map.

Patch by Mahendra Singh, reviewed by Amit Kapila

commit 9f1fd917e954b6b430b2575e84bd669166ee8cdf
Author: Mithun CY
Date: 2018-11-21 22:09:16 -0800

Make the expected file to be generated by a source file.

Patch by Mahendra Thalor
Review by Mithun C Y.

commit 3e255b05d87a65183adc8cac204529937c978e13
Author: Beena Emerson
Date: 2018-11-20 11:47:49 +0530

Fix crash in pg_stat_* functions

Commit b04aeb0a05 added Asserts to ensure that we hold some relevant lock during relation_open. This commit corrects the locks used in relation_open for the zheap pg_stat functions to avoid the Assertion failure.

Patch by me, reviewed by Dilip Kumar

commit 0e2f8eb102744121d7203213c87c124f75578665
Author: Rafia Sabih
Date: 2018-11-19 13:07:25 +0530

Reset undo buffers in case of abort

In case of errors, when the transaction aborts, the undo buffers and the buffer index are now reset.

Patch by me, reviewed by Dilip Kumar and Amit Kapila

commit 88f9e6e42d0c769158b283b378c4c0d19af91139
Author: Dilip Kumar
Date: 2018-11-18 04:43:36 -0800

Fix trailing space in previous commit

commit b925c7ab4df48f42951b9690216d9c9e78f25ba3
Author: Dilip Kumar
Date: 2018-11-18 04:18:04 -0800

Only try to lock the TPD page if the zheap page has TPD slots in it.

Patch by Mahendra Thalor, reviewed by Dilip Kumar

commit 257e8e7990e9d85a373a41d0dce17f7f56ed5c61
Author: Kuntal Ghosh
Date: 2018-11-15 17:46:14 +0530

Add expected file for multiple-row-versions isolation test

In this test, we get a serialization failure due to in-place updates in zheap. But this is expected behavior.

Discussions: https://postgr.es/m/CAGz5QCJzreUqJqHeXrbEs6xb0zCNKBHhOj6D9Tjd3btJTzydxg@mail.gmail.com

commit 1b44ba6c5154fa6eed3e56bfb97285a78aa58003
Author: Kuntal Ghosh
Date: 2018-11-02 12:00:41 +0530

Implement ALTER TABLE SET TABLESPACE for zheap

To alter the tablespace for a zheap table, we copy the pages one by one to the new tablespace. Following is the algorithm to perform the same. For each zheap page:
a. If it's a meta page, copy it as it is.
b. If it's a TPD page, copy it as it is.
c. If it's a zheap data page, apply pending aborts, then copy the page and the corresponding TPD page if we've rolled back any transaction from the TPD.

Patch by me. Reviewed by Amit Kapila.

commit 2b5f6aa07e8d705f238bf8b1a308e6f761e05a98
Author: Kuntal Ghosh
Date: 2018-11-14 11:35:00 +0530

Create a wrapper function for fetching transaction slots for a page

We've created a wrapper function GetTransactionsSlotsForPage for fetching all transaction information for a zheap page and its corresponding TPD page.

Patch by me. Reviewed by Amit Kapila.
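A small standalone sketch (the page kinds and helpers are stand-ins, not the patch's code) of the per-page dispatch described in the "Implement ALTER TABLE SET TABLESPACE for zheap" commit above:

    #include <stdio.h>

    typedef enum { META_PAGE, TPD_PAGE, DATA_PAGE } PageKind;

    static void
    copy_one_page(int blkno, PageKind kind)
    {
        switch (kind)
        {
            case META_PAGE:
            case TPD_PAGE:
                /* meta and TPD pages are copied verbatim */
                printf("block %d: copy as is\n", blkno);
                break;
            case DATA_PAGE:
                /*
                 * Apply pending aborts first, then copy the page; if a rollback
                 * touched the corresponding TPD page, copy that TPD page again.
                 */
                printf("block %d: apply pending aborts, then copy\n", blkno);
                break;
        }
    }

    int
    main(void)
    {
        PageKind rel[] = {META_PAGE, DATA_PAGE, TPD_PAGE, DATA_PAGE};

        for (int blkno = 0; blkno < 4; blkno++)
            copy_one_page(blkno, rel[blkno]);
        return 0;
    }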
commit 0931d56b34aaf5c9c2b665da751aec051aa7d90a
Author: Dilip Kumar
Date: 2018-11-15 01:50:11 -0800

Fix assert

It must be in a critical section only if it's not in recovery.

commit 96ede480d229e874e204895856bcded0fb33f68b
Author: Dilip Kumar
Date: 2018-11-15 01:41:43 -0800

Remove relfilenode and tablespace id from the undo record and store reloid

There was no reason why we stored the relfilenode and tablespace id in the undo record; instead, we can just store the reloid. Earlier, we might have kept it thinking that we would perform rollback without a database connection, but that's not the case now. This saves space, and it will also be useful when we need to transfer the undo records across relfilenodes, e.g. ALTER TABLE SET TABLESPACE.

Dilip Kumar, reviewed by Amit Kapila

commit 842e9e36821e2ad1ce2f557631a2af8bb96f2d23
Author: Dilip Kumar
Date: 2018-11-15 01:28:07 -0800

Fix cosmetic review comments in undoinsert

Dilip Kumar, review by Amit Kapila

commit 01f981fdeddad6eff83f1a12c8751a373df56a1c
Author: Rafia Sabih
Date: 2018-11-14 20:48:14 +0530

Discard undo logs in single-user mode

For zheap, discard the undo logs at commit time when in single-user mode.

Patch by me, reviewed by Dilip Kumar and Amit Kapila

commit 1bcef1bf35738b060e5c412428aac502245abe2b
Author: Rafia Sabih
Date: 2018-11-14 17:03:47 +0530

Bugfix for visibility map buffer

In case the page is newly extended, the vmbuffer will not be valid. Hence, avoid checking its status in zheap insert and update.

Reported by Neha, patch by me, reviewed by Amit Kapila

commit 1b1cace0bb7195025ea5bd19848953f10fe5cd30
Author: Dilip Kumar
Date: 2018-11-13 05:14:26 -0800

Removed invalid assert in extend_undo_log

It is possible that while we try to extend the undo log it has already been extended by discard (recycling an old undo log), so this situation is valid.

Patch by Mahendra, reviewed by Dilip Kumar and Amit Kapila

commit aa25f816c637f9123f67a4c7e1fa14f64c15d72c
Author: Dilip Kumar
Date: 2018-11-11 19:35:32 -0800

Fix compiler error

commit 48d8236385a8894f17e642bb182e6be01b5fcfb7
Author: Amit Kapila
Date: 2018-11-12 08:33:05 +0530

Fix warning.

commit 0495698cb6e1e6d29ddb146332ddc61299c99aeb
Author: Amit Kapila
Date: 2018-11-10 16:15:45 +0530

Prune entire TPD page in one shot if possible.

We have used the tpd_latest_xid_epoch stored in the page to prune the entire TPD page. Basically, if tpd_latest_xid_epoch precedes the oldest xid having undo, then we can assume all the entries in the page can be pruned. Apart from that, I have changed the logic so that the page is freed during pruning if all the entries are removed from it. This will ensure that pages are reclaimed whenever they are empty, not only during vacuum. In passing, I have fixed another related issue: after we get a new page from the FSM, we need to ensure that it is a TPD page before using it.

Patch by Amit Kapila, reviewed and verified by Dilip Kumar

commit 2c0af615df41867e26fd80900f0e38f032750752
Author: Kuntal Ghosh
Date: 2018-11-09 15:22:26 +0530

While computing infomask for updating a tuple, don't copy update flags

When we compute the infomask during updating a tuple, we don't have to copy the update flags. We compute it later in zheap_update.

commit 48f1a2d2973ae25b33adfeac4129769fadfea134
Author: Kuntal Ghosh
Date: 2018-11-09 13:27:35 +0530

Small bugfix in UndoDiscardOneLog

We can't compare log->meta.insert directly with the undo_ptr. We need to create an undo pointer first.
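An illustrative sketch of why the "Small bugfix in UndoDiscardOneLog" commit above has to build a full undo record pointer before comparing; the bit layout here is an assumption for demonstration only, not the patch's encoding.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical split: high bits hold the undo log number, low bits the offset. */
    #define UNDO_LOG_OFFSET_BITS 48

    static uint64_t
    make_undo_rec_ptr(uint64_t logno, uint64_t offset)
    {
        return (logno << UNDO_LOG_OFFSET_BITS) | offset;
    }

    int
    main(void)
    {
        uint64_t discard_ptr = make_undo_rec_ptr(3, 0x2000);
        uint64_t insert_offset = 0x1000;    /* a raw insert location within log 3 */

        /*
         * Comparing discard_ptr with the bare offset would be meaningless; the
         * offset must first be turned into a full undo record pointer.
         */
        uint64_t insert_ptr = make_undo_rec_ptr(3, insert_offset);

        printf("insert %s the discard point\n",
               insert_ptr < discard_ptr ? "precedes" : "is at or after");
        return 0;
    }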
commit 0b374e91021b49795d2ec05243e3c9c24363d7d5
Author: Mithun CY
Date: 2018-11-08 22:09:05 -0800

Bugfix in TPDPageGetTransactionSlots to handle truncated pages.

Block numbers start from zero, so when comparing with the total number of blocks in the relation, we need to adjust the block number accordingly.

Fix by Mithun C Y, review by Amit Kapila.

commit 38e5a0cee902dae0c18961d91699ba254ea0fcb3
Author: Kuntal Ghosh
Date: 2018-11-08 15:16:02 +0530

In zheap_update, clear the in-place update flag for non-inplace updates

Reported by Gunjan Kumar.

commit 05c36ee9f886c9feb274360d9b523ea0d9e4dac7
Author: Kuntal Ghosh
Date: 2018-11-08 11:51:20 +0530

Fix a testcase in expected file for trigger.sql regression test

commit f52a281728a425df027d4a201c09730c5a8f0c42
Author: Kuntal Ghosh
Date: 2018-11-08 11:14:55 +0530

While locking the tuple, set cid as FirstCommandId

While locking the tuple, we set the command id as FirstCommandId since locking doesn't modify the tuple, it just updates the infomask.

commit 2595ad4d099abf21d59017f8cebfa416fe454d2d
Author: Kuntal Ghosh
Date: 2018-11-07 15:07:49 +0530

Fix a bug for projecting table oid

There are certain places in the code where we form a heap tuple using ExecCopyTuple and store it in the slot along with the zheap tuple. In that case, we don't copy the tableOid to the heap tuple. Hence, for fetching the tableOid, we should prefer a zheap tuple instead of a heap tuple.

commit a3d175ba1b4a089b74a9f68be4b035695d57d698
Author: Mithun CY
Date: 2018-11-07 21:11:32 -0800

Post push compilation error fix for 903ce21849

By Mithun C Y

commit fdd2e03c972269b4a0fe95b13433ce55ac1f8bd7
Author: Dilip Kumar
Date: 2018-11-06 02:11:35 -0800

Bugfix in recovery of invalid_xact and freeze

TPD page slot information was not set if BLK_NEED_REDO is false for the heap page. Also, during recovery the relation descriptor was used to take the pruning decision, which is wrong because during recovery it should be done by a separate WAL record.

Dilip Kumar and Amit Kapila

commit 0fc2cb13d810d7979b51c2b6ea44fa4f9119f7a4
Author: Kuntal Ghosh
Date: 2018-11-05 16:31:38 +0530

Fix shared memory size for rollback hash table

While creating the shared memory segment for the rollback hash table, we should set the size of the segment correctly.

Patch by me. Investigated by Dilip Kumar and me.

commit 62c4099af465e7f739aa0f391724eb4efb02ba08
Author: Kuntal Ghosh
Date: 2018-11-05 13:52:25 +0530

Fix warnings in tpd.c

commit 9e203e16ffd03867bda789c33f0bdeece04df7a0
Author: Kuntal Ghosh
Date: 2018-11-04 21:57:23 +0530

Assign OID correctly while converting tuple

commit b8bcb9891a1fc24e1c02933ee84eba1a017af692
Author: Rafia Sabih
Date: 2018-11-05 13:46:47 +0530

Handling rollbacks in single user mode for zheap

Never push rollback requests to a worker when in single-user mode.

Patch by me, reviewed by Amit Kapila

commit a95fc28e9073d6e7815f3772e1255efe2967c847
Author: Amit Kapila
Date: 2018-11-05 11:03:54 +0530

Add empty TPD pages to FSM.

The empty TPD pages are added to the FSM by vacuum, which prunes such pages as well. The empty pages from the FSM can be used either by zheap or TPD when required. We need to be aware that when we access TPD pages, such a page can be pruned or truncated away by vacuum. After pruning the TPD page, it can be freed by removing it from the chain of TPD pages. As of now, only vacuum can add TPD pages into the FSM, but we might want to improve this someday so that backends can also do the same.

Patch by me with help from Dilip Kumar, who also tested and verified the patch.
commit 49a60ee0c3bbb844621bc8167f8e04ca0c032509
Author: Rafia Sabih
Date: 2018-11-05 10:14:13 +0530

Adding new output files for regress-suite

Some system attributes are not supported for zheap tuples; hence, new .out files are added for zheap for combocid and transactions.

Patch by me, reviewed by Kuntal Ghosh

commit 3d8d5f45b4b928fbcc9d1ff29c3f8fc822009a4f
Author: Amit Kapila
Date: 2018-11-05 08:09:29 +0530

Fix valgrind error.

Reported by Tomas Vondra, patch by me, verified by Mithun

commit 8e3a8679dfb0248699155eec8f07098df7d7f3ad
Author: Rafia Sabih
Date: 2018-11-02 16:44:26 +0530

Change case in one error message

Pointed out by Kuntal Ghosh, patch by me.

commit 1bcb9489f3077ef03616ff19ff46b0e3a2980aa6
Author: Amit Kapila
Date: 2018-11-02 16:36:32 +0530

Fix the errors reported by valgrind.

In passing, I noticed the xidepoch was not assigned properly, so fixed that as well.

Reported by Tomas Vondra, patch by me, verified by Mithun C Y.

commit f69fa9edc361c374cb7ce8780b5e63b6412c9d08
Author: Rafia Sabih
Date: 2018-11-02 16:26:04 +0530

Bugfix in heap_truncate for zheap relations

Initialize the meta page for zheap relations only when the complete relation is truncated.

Patch by Amit Kapila, tested by me

commit 704e3b777c385e319ded0b0fa317409fc41dc7a1
Author: Kuntal Ghosh
Date: 2018-11-02 15:40:19 +0530

Add expected file for rowsecurity regression test

In the rowsecurity test, there is a test case that uses a table sample scan with bernoulli distribution. The output of a tablesample scan partially depends on the block number of the relation page from which the tuples are fetched. For zheap, block number 0 is the meta page; hence, tuples are stored from block 1 and the output of the table sample scan can be different.

commit 153ea5c17e65144382c8c68a22d46527f0f7de2d
Author: Mithun CY
Date: 2018-11-01 08:09:35 -0700

Release undo buffer locks after WAL replay of XLOG_UNDO_APPLY_PROGRESS

By Mithun C Y and Mahendra Thalor.

commit 5fed5fe9eaa00c1196dc47d18b91c9e2bb336aab
Author: Kuntal Ghosh
Date: 2018-11-01 15:33:47 +0530

Add expected file for stats.sql regression test

In zheap, we increase pgstat_info->trans->tuples_updated only for non-inplace updates; otherwise, vacuum would be triggered for in-place updates as well. But for heap, we always increase tuples_updated since heap always creates a new version of the tuple during updates. Hence, we need to fix the output in the expected file for zheap.

commit 987049f5b11351ee6b985ddc76e978e2d77fa403
Author: Kuntal Ghosh
Date: 2018-11-01 14:52:39 +0530

Add expected file for strings.sql regression test

There is a test case in strings.sql that counts the number of pages for a toast table. In zheap, toast tables are also created in zheap format, which always includes a zheap meta page. We should count that in the expected file.

commit 3ef3b9b67ae27d9568ce9380c9f646bd894b8bd0
Author: Kuntal Ghosh
Date: 2018-11-01 12:00:50 +0530

Add storage_engine in reloptions regression test

commit 94777e98221cf49e42f9a7bddf5947072424e368
Author: Mithun CY
Date: 2018-10-31 22:23:28 -0700

Add alternative expected files for zheap regression.

The order of zheap tuples in pages will be different from heap. Hence, zheap-specific expected files are needed.

By Mithun C Y

commit 7570b3bed887531f31dcc25ea602233c8d56ee6d
Author: Dilip Kumar
Date: 2018-10-31 22:06:56 -0700

Fix warning of hash seq scan leak and also remove an unwanted assert from the zheap mask.
Dilip Kumar, reviewed by Amit Kapila and Kuntal Ghosh

commit 14fdcc4b9eda81577d81ba3c7b040f86850e78eb
Author: Rafia Sabih
Date: 2018-11-01 09:54:26 +0530

System attributes for Zheap

For zheap tuples, Xmin is given as the xid which last modified the tuple. The other system attributes Xmax, Cmin, and Cmax are not supported for zheap, for now.

Patch by me, reviewed by Amit Kapila

commit 794b8f4df739d3565663f51ca7117dbe855980c2
Author: Amit Kapila
Date: 2018-11-01 09:11:31 +0530

Update README.md.

commit 6ce786252d5ddaac6fdb07506f6cde3d66e093cd
Author: Amit Kapila
Date: 2018-10-31 19:11:40 +0530

Updated readme to match latest status.

commit 8c780e8474aed4c38ae482d94683cba1be0e9a29
Author: Dilip Kumar
Date: 2018-10-31 06:34:45 -0700

Fix assert in RollbackFromHT

Ideally, the number of entries should be <= ROLLBACK_HT_SIZE.

commit 3017c86cdaba643d61f32eab351b63acfa381720
Author: Dilip Kumar
Date: 2018-10-31 06:18:52 -0700

Merge additional test in zheap expected file

commit 8bf58ba522c27429d0da0260a35abc5d519feece
Author: Dilip Kumar
Date: 2018-10-31 05:58:00 -0700

Fix defect in undo worker connection

Take a database object lock before connecting to the database so that the database does not get dropped concurrently.

Dilip Kumar, reviewed by Amit Kapila

commit 007af507d94b992f0a37caa10015d134a0782b07
Author: Kuntal Ghosh
Date: 2018-10-31 14:35:02 +0530

Skip aborting rewound undo records

Before discarding undo records, the undo discard worker checks whether it has to issue a rollback request for the corresponding aborted transaction. It's possible that the transaction got aborted by some other backend at the same time and the undo records got rewound. Hence, the undo worker should recheck whether the undo records got rewound; in that case, there is no need to issue a rollback request.

Reviewed by Amit Kapila

commit 28bae62a6745f223e99e1af27328ed6218752bb4
Author: Amit Kapila
Date: 2018-10-31 16:01:02 +0530

Eliminate alignment padding wherever possible.

We omit all alignment padding for pass-by-value types. Even in the current heap, we never point directly to such values, so the alignment padding doesn't help much; it lets us fetch the value using a single instruction, but that is all. Pass-by-reference types will work as they do in the heap. Many pass-by-reference data types will be varlena data types (typlen = -1) with short varlena headers, so no alignment padding will be introduced in that case anyway, but if we have varlenas with 4-byte headers or fixed-length pass-by-reference types (e.g. interval, box), then we'll still end up with padding. We can't directly access unaligned values; instead, we need to use memcpy. We believe that the space savings will more than pay for the additional CPU costs.

We don't need alignment padding between the tuple header and the tuple data as we always make a copy of the tuple to support in-place updates. Likewise, we ideally don't need any alignment padding between tuples. However, there are places in the zheap code where we access the tuple header directly from the page (e.g. zheap_delete, zheap_update), for which we need them to be aligned at a two-byte boundary.

Amit Kapila and Kuntal Ghosh

commit 580dcc4fc69e57dac919d1ac1068f876505f6c17
Author: Rafia Sabih
Date: 2018-10-31 14:59:31 +0530

Fix for a compiler warning

Reported by Neha Sharma

commit 7f2397b0b87a9941a34998ffc63990eab32f09f1
Author: Mithun CY
Date: 2018-10-31 00:43:06 -0700

WAL-log page extension if it was not done previously.

Patch by Amit Kapila, review by Mithun C Y.
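A minimal standalone example of the memcpy-based access to unaligned values mentioned in the "Eliminate alignment padding wherever possible" commit above (plain C for illustration, not zheap code):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        char    buf[16];
        int32_t value = 42;
        int32_t fetched;

        /* store a pass-by-value attribute at an odd, unaligned offset */
        memcpy(buf + 1, &value, sizeof(value));

        /*
         * A direct dereference such as *(int32_t *) (buf + 1) is not portable
         * for unaligned data; memcpy into a local variable is the safe way to
         * read it, at the cost of a few extra instructions.
         */
        memcpy(&fetched, buf + 1, sizeof(fetched));

        printf("%d\n", fetched);
        return 0;
    }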
commit 9ffbe108474a898e76302bb834d040f288b78e67
Author: Kuntal Ghosh
Date: 2018-10-30 13:17:34 +0530

Fix bug in registering TPD buffers for WAL

While creating a WAL record, if a concurrent checkpoint occurs, WAL insertion may fail. In that case, we need to prepare the WAL record again. At the same time, we should clear the registered TPD buffer array; else, it'll not be registered in WAL on the next try.

Patch by me. Investigated by Amit Kapila, Dilip Kumar and me. Reviewed by Amit Kapila and Dilip Kumar.

commit b30e67a310db0bd75a44094c8acd9e87c1f5c51e
Author: Kuntal Ghosh
Date: 2018-10-31 11:51:01 +0530

Fix eval-plan-qual isolation test

After rebasing the zheap branch with the latest PG HEAD, there are some additional testcases in the eval-plan-qual isolation test. Add those changes in the regression out file targeted for zheap.

commit 799ca622b624f0b13916ea8ce4ea1afad51d5f1c
Author: Amit Kapila
Date: 2018-10-31 11:50:49 +0530

Return error on trying to update a row moved to another partition.

The approach used here is to set the ZHEAP_MOVED (ZHEAP_DELETED | ZHEAP_UPDATED) flag when the tuple is moved to a different partition. All the cases that are handled in heap need to be handled for zheap as well.

Amit Kapila, based on an earlier patch by Amit Khandekar which used a different approach.

commit d9918b8ec386619a3cd2ac28d8f2ea8698aeec99
Author: Dilip Kumar
Date: 2018-10-31 11:22:29 +0530

Defect fix post rebase

commit 10492266c5789bbec1fdbfd13220dcacbea633c6
Author: Kuntal Ghosh
Date: 2018-10-30 10:16:43 +0530

TPD fix by Dilip RM43761

commit 7524cd8cf9f7a20bfb44aefe32f0085bec1ce385
Author: Dilip Kumar
Date: 2018-10-30 18:36:32 +0530

During recovery restore dbid from WAL

Patch by Mahendra Thalor
Reviewed by Dilip Kumar and Kuntal Ghosh

commit e73d2bf79db041177d35c9f3333e126318460f8c
Author: Dilip Kumar
Date: 2018-10-30 17:57:00 +0530

Bug fix after rebase

Dilip Kumar and Kuntal Ghosh

commit 5340d395c5d63e56f0398fee816393a613e385fb
Author: Mithun CY
Date: 2018-10-25 05:32:50 -0700

Avoid page compaction if the tuple is marked updated, deleted or inplace updated by an open transaction.

By Mithun C Y

commit ed9356efb9f5ad9622beef61ce75cfb9b8c5ea39
Author: Kuntal Ghosh
Date: 2018-10-24 10:35:11 +0530

Fix an isolation test case for zheap

In zheap, when a transaction rolls back, it needs to perform undo actions on the modified relation. For that, it needs a lock on that relation. If any concurrent transaction holds an exclusive lock on the relation, the rollback operation will wait until the lock is available.

commit 5711aac11e6443d16e6b0aeb2219c9ce0ac6f3b7
Author: Amit Kapila
Date: 2018-10-23 21:24:07 +0530

Fix compiler warning.

Reported by Thomas Munro.

commit 407f6350335597bb2029109df3c0a4d090410a4f
Author: Amit Kapila
Date: 2018-10-23 14:12:58 +0530

Move functions that can allocate memory or allow locking outside critical section.

At a few places in the code, TPDPageGetTransactionSlotInfo was being called from a critical section, which leads to an assertion failure. Lock the TPD page wherever required before entering the critical section.

commit 390416aa17bc215e541c3cf83acd5e336bb6e350
Author: Kuntal Ghosh
Date: 2018-10-23 12:13:23 +0530

Bugfix in tpdxlog.c

commit a68d7ef43e07849bd75bd5e39634e3ab29efbe0b
Author: Kuntal Ghosh
Date: 2018-10-23 12:11:07 +0530

Fix a variable initialization

commit 09a0518389ccce88b381b3f44b4d88769feb7634
Author: Mithun CY
Date: 2018-10-22 18:10:00 -0700

Fix ordering issues in regression tests.

With zheap we have inplace updates, so the order of tuples in a zheap page will be different from heap.
Adding an order by clause to the tests to get consistent results for both heap and zheap.

By Mithun C Y

commit b5f1649f55aa366f239fe37dc2e06c9b2cecdf17
Author: Rafia Sabih
Date: 2018-10-18 16:37:00 +0530

Bug fix in zheap_lock_tuple

Added a code path in zheap_lock_tuple to check for the latest copy of the tuple in case it is modified by some aborted transaction.

Issue reported by Neha Sharma, patch by me, and reviewed by Amit Kapila

commit f9a00ee12997bc0ac823ac4419b8aa3f7024aee6
Author: Kuntal Ghosh
Date: 2018-10-18 11:06:04 +0530

Add expected file for eval-plan-qual isolation test

Include metapages in ctid for zheap tuples. Also, updated a test case related to self joins. Basically, when performing a self join, if it needs to pass through the EvalPlanQualFetch path, it's possible that both sides of the join see the same value due to in-place update. This behaviour is different from heap, but similar to other undo-based storage.

commit 804d6edc03cc8aa9b5bb0428bccda27a48c1c8fb
Author: Kuntal Ghosh
Date: 2018-10-18 10:09:54 +0530

Include meta page in isolation test results

In the vacuum-reltuples isolation test, include the zheap metapage in relpages.

commit c0f90396dd27d28c27d0b59e682680289fcba669
Author: Amit Kapila
Date: 2018-10-17 18:28:05 +0530

Bump the number of initial TPD entries.

By mistake, commit 1400d8f84b changed the number; it was changed just for testing purposes, but it should have been removed before commit.

commit 14d67f52665110d64666e1addc255df15015dc6b
Author: Amit Kapila
Date: 2018-10-17 16:49:10 +0530

Support inplace update of TPD entries.

Whenever a TPD entry needs to be extended, we call TPDPagePrune to ensure that it will create space adjacent to the current offset for the new (bigger) TPD entry, if possible. We use compactify_ztuples to ensure that space adjacent to the existing entry can be created. For an inplace update of a TPD entry, we just replace the old entry with the new entry at the same location, similar to inplace updates of tuples. We also write a WAL record for this operation.

Amit Kapila and Kuntal Ghosh, reviewed and verified by Dilip Kumar

commit 83c27f9fae800fb73eff68f1987136adaf7965ab
Author: Rafia Sabih
Date: 2018-10-17 06:26:24 +0530

Bugfixes in zheap update, delete, and lock

- Modify the infomask only after writing the corresponding undo.
- Always be inside a critical section when modifying a tuple.
- If the current member is the only locker with multilockers, then there is no need to lock the tuple again.

Issue reported by Neha Sharma. Patch by me, reviewed by Amit Kapila and Kuntal Ghosh

commit 69b6b791ffaa0d04cc10bc0603c8bbc39f58dfa8
Author: Amit Khandekar
Date: 2018-10-15 14:31:43 +0530

Improve test coverage of update.sql.

Commit 521461cfd91b9b44f0ecc392b3470922e78246b0 had removed some redundant RETURNING clauses in some statements, in order to make the output consistent. This commit reverts those changes and instead uses a WITH clause over the update statements, so that we can use ORDER BY for consistent ordering. This helps retain the RETURNING clauses, which also makes sure we don't reduce the test coverage that we may have got for the update-partition-key feature.

Patch by me; idea suggested and changes reviewed by Mithun Cy.
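A tiny illustration of the combined flag described in the "Return error on trying to update a row moved to another partition" commit above; the bit values are hypothetical, and only the ZHEAP_MOVED = ZHEAP_DELETED | ZHEAP_UPDATED combination mirrors the commit message.

    #include <stdio.h>

    /* Bit values are made up for illustration; only the combination matters here. */
    #define ZHEAP_DELETED   0x0010
    #define ZHEAP_UPDATED   0x0020
    #define ZHEAP_MOVED     (ZHEAP_DELETED | ZHEAP_UPDATED)

    int
    main(void)
    {
        unsigned int infomask = ZHEAP_DELETED | ZHEAP_UPDATED;

        /* both bits set together signal "row moved to another partition" */
        if ((infomask & ZHEAP_MOVED) == ZHEAP_MOVED)
            printf("tuple was moved to another partition\n");
        return 0;
    }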
commit 0a261c3bae670e9b3748a9eeb82706a96a2b4f32
Author: Kuntal Ghosh
Date: 2018-10-15 11:27:51 +0530

Fix payload length in standby

commit dac8323bfc43f7c0dec9c02dd53ca761b3c098b8
Author: Kuntal Ghosh
Date: 2018-10-15 10:15:32 +0530

After fetching xid, recheck for frozen xid

commit abdd14af03e32e9abfe7b7ed0c6bb88a0a6184af
Author: Kuntal Ghosh
Date: 2018-10-15 09:53:14 +0530

Clear TPD entry only under exclusive lock

When we clear a TPD entry from a TPD page, we should have an exclusive lock on the TPD buffer. Hence, perform a check for the same before clearing the entry.

commit 52b84f9fb21468080c8ea252336a1994f915ca8b
Author: Dilip Kumar
Date: 2018-10-15 10:55:07 +0530

Bug fix in recovery

Multiple bug fixes in recovery of the TPD and undo action WAL.

Dilip Kumar, reviewed by Amit Kapila

commit 6ebf7d2e1f12fcd92bb5c7ebe11e83eda3824c32
Author: Dilip Kumar
Date: 2018-10-15 10:51:17 +0530

Bug fix in tpd

If the TPD entry is pruned, then we can directly use the last slot for the transaction; there is no need to try to extend the TPD entry.

Amit Kapila, reviewed by Dilip Kumar

commit aa09e789b0b40f9593f97135c6c9af4c74464023
Author: Kuntal Ghosh
Date: 2018-10-10 13:47:25 +0530

Skip calling zheap_fetchinsertxid for non-serializable xacts

For non-serializable xacts, we can skip calling zheap_fetchinsertxid to fetch the targetxmin that is passed to PredicateLockTid. A call to zheap_fetchinsertxid is costly since it has to fetch undo records.

commit 2b2963b37cb1316de21ed65e8c859b76467616ca
Author: Kuntal Ghosh
Date: 2018-10-10 11:07:35 +0530

Fix compiler warnings

commit 0a82317f0378b35122df6bef1d473826f149c121
Author: Dilip Kumar
Date: 2018-10-08 11:58:38 +0530

Bugfix in undoworker commit.

The shared memory size was not calculated for the undo launcher; also remove one unwanted hunk.

commit c77321a0364149c3831844f02082579a070d47f0
Author: Mithun CY
Date: 2018-10-07 14:47:06 -0700

Fix regression tests and output for row order changes.

With the introduction of inplace updates in zheap, the order of rows within a page has changed. This fix changes the queries whose output depends on stored row order. Those queries are rewritten to have an order by clause to get consistent results for both heap and zheap.

By Mithun C Y

commit e5a37a151fc52b0bc10b2dd6af13af3b71136360
Author: Amit Khandekar
Date: 2018-10-05 22:20:44 +0530

Fix a concurrency issue with UPDATEs and triggers.

In GetZTupleForTrigger(), after it calls zheap_lock_tuple(), a check for hufd.in_place_updated_or_locked was missing. So if there was a concurrent in-place update, it used to conclude that the tuple was deleted. Add the missing check.

Patch by me, reviewed by Dilip Kumar.

commit f5c0d3dd3cbab42628d311fbb50614f1bce93ec1
Author: Amit Khandekar
Date: 2018-10-05 22:01:36 +0530

Regression test changes for update.sql

update.sql had to undergo some modifications because the RETURNING clause was returning rows in a different order. Also, some UPDATE statements that updated multiple rows had to be modified such that they update only a single row, because the erroring row was different in the case of zheap. There were two erroring rows, and whichever came first in the update scan result got displayed in the error message. This test also has some more ORDER BY clauses.

commit bbd7a64c3495b77bf77406511afa4f1183a7fcc4
Author: Amit Khandekar
Date: 2018-10-05 21:37:18 +0530

Fix tuple handling with trigger functions and transition capture.

In ExecUpdate(), for zheap tables, ExecARUpdateTriggers() was called only if resultRelInfo->ri_TrigDesc is true, which is incorrect.
This function also needs to be called for transition capture even when trigdesc is false. Due to this, transition table rows were not getting generated. Furthermore, in ExecInsert(), for zheap tables, ExecARUpdateTriggers() was getting called using tuple when it should be called using ztuple. Fixed both of these by making the newtuple parameter of ExecARUpdateTriggers() a void * type and renaming it to newtuple_abstract, so that we can pass tuple or ztuple according to the table storage. And then in ExecARUpdateTriggers(), do the conversion from zheap to heap if it's a zheap tuple. On similar lines, do changes for ExecARInsertTriggers(), where trigtuple is made trigtuple_abstract. Now that this function accepts an abstract tuple, fix another pending issue in copy.c: in CopyFrom() and CopyFromInsertBatch(), pass either zheap or heap tuples to ExecARInsertTriggers().

Reviewed by Dilip Kumar and Amit Kapila.

commit 51da00138eeceda6ea57e59760e49b9d60518274
Author: Rafia Sabih
Date: 2018-10-05 17:04:22 +0530

This commit completes the work of adding the new lock type for sub-transactions in zheap that was started in commit 13b72c94cc77f6471e46f6e9da29423994753240. This commit handles the cases of waiting on sub-transactions when using a dirty snapshot, when inserting index tuples in btree, and while checking for constraint violations.

Patch by me, reviewed by Amit Kapila

commit 0cb8b6c800d4437c0e413c8bd2473181d843f1e3
Author: Mithun CY
Date: 2018-10-05 02:22:45 -0700

Fix condition for page pruning.

In zheap_page_prune_guts, the condition for pruning was wrongly set. This patch fixes the same. Along with it, the patch also fixes code that caused a NULL pointer dereference.

By Amit Kapila and Mithun C Y

commit 1662919e20b45e515849304f68e310b28939aae6
Author: Amit Kapila
Date: 2018-10-04 19:36:10 +0530

Support Cluster/Vacuum Full in zheap.

As of now we only rewrite LIVE tuples and we freeze them before storing them in the new heap. This is not a good idea as we lose all the visibility information of tuples, but OTOH, the same can't be copied from the original tuple as that is maintained in undo and we don't have a facility to modify undo records. We have some ideas on how to do that, and those are documented in rewritezheap.c.

Patch by Amit Kapila with help from Mithun C Y, who also reviewed the patch.

commit 62606342d267ad671d7084798ac9febd1230919c
Author: Kuntal Ghosh
Date: 2018-10-04 11:22:11 +0530

Fix compiler warnings

commit bfe41c9db8b397d9c230b36f5c7f6f7826b7fa64
Author: Kuntal Ghosh
Date: 2018-10-04 10:49:57 +0530

Fix assert for insertion of frozen tuple

If we're freezing a tuple during insertion, we can use the HEAP_INSERT_SKIP_WAL optimization since we don't write undo for the same. Hence, adjust the assert in zheap_prepare_insert accordingly.

commit 8defd8aa1852f24ca81552efbf158d3abeda7662
Author: Kuntal Ghosh
Date: 2018-10-03 19:07:57 +0530

In zheap parallel scan, use SnapshotAny if required

commit 268ab31eebddb41e4f5c9bfb026b9e88f24d4416
Author: Kuntal Ghosh
Date: 2018-10-03 19:06:48 +0530

During in-place updates, update t_hoff correctly

commit 14dd056736a4cdf4949f798bf1386b80671ada85
Author: Kuntal Ghosh
Date: 2018-10-03 19:01:57 +0530

Initialize startblock for zheap scan

Since the zheap scan unconditionally uses scan->rs_startblock in zheap_getnext, we should initialize it by default. Otherwise, valgrind barks loudly. This may also result in undefined behaviour in release mode.
commit bf1b20a81832f1c007690cda95088aef857bd34d
Author: Kuntal Ghosh
Date: 2018-10-03 19:00:25 +0530

Fix scan initialization for zheap tables

Use zheap_beginscan_parallel in _bt_parallel_scan_and_sort for scanning zheap tables.

commit e090a7c5b3f82578d38860cf0e0244aee5e585e7
Author: Dilip Kumar
Date: 2018-09-28 15:03:14 +0530

undoworker for handling the rollback

This patch introduces two types of workers: the discard-worker and the undo-launcher. The discard worker's main responsibility is to discard the older undo, and the undo-launcher will process the rollback hash table and launch undo-workers, one for each dbid. The undoworker will take the database connection and start processing all the requests for that db. Once the undo-action is applied, it will mark the transaction header in the undo as processed and remove the entry from the rollback hash table.

Dilip Kumar
Reviewed by Amit Kapila

commit 93ed5d44e3f9c62b78fa651d7a9082cfc57791f6
Author: Kuntal Ghosh
Date: 2018-09-25 13:28:47 +0530

Fix some TPD issues

1. Introduced Asserts for ZHEAP_METAPAGE before calling ZheapInitPage. The metapage should not be initialized using this method.
2. Introduced Asserts to identify the zheap metapage after reading the same.
3. Fixed tpd_desc.
4. In TPDPageAddEntry, we shouldn't shuffle the item pointers.
5. In TPDAllocatePageAndAddEntry, when a new page is not added, we shouldn't overwrite the previous and next block numbers in the tpd opaque space.
6. Set LSN in the page header for the meta page.
7. For PageSetUNDO while locking a tuple, send set_tpd_map_slot=false.
8. Properly initialize the first and last tpd page in the meta page.
9. In zheap_xlog_update, send the proper undo pointer in TPDSetUNDO.
10. Fixed the storage of the old tpd slot in xlrec.
11. Fixed a compactify_ztuple memory overrun issue.

Patch by Dilip Kumar and me. Reviewed by Amit Kapila.

commit 8683fff396da98ad00ae3b853714b0788bff99b6
Author: Kuntal Ghosh
Date: 2018-09-27 16:54:01 +0530

Fix compiler warnings in xact.c

commit 3f86bb91ece478ab9ab562cc1972bb85d9f2a965
Author: Rafia Sabih
Date: 2018-09-26 15:32:12 +0530

Introducing sub-transactions lock type in zheap

With this patch, the sub-transactions will have a new type of lock for them. Now, instead of waiting for the top-level transaction lock, the tuples are free once the sub-transaction is committed. This way, we neither waste our transaction-slots by assigning transaction ids to sub-transactions, nor do we suffer in performance by waiting on top-level transactions.

Patch by Amit Kapila and me.

commit f0a2c8247aad9fe1569709599c5b267f473a6024
Author: Dilip Kumar
Date: 2018-09-25 17:22:41 +0530

Bugfix in inserting the undo record when the previous transaction and the current transaction are in different undo logs.

Earlier it was just comparing the block number to find whether we had already read the buffer or not, but that's not enough when we are comparing block numbers that are in two different undo logs.

Dilip Kumar
Reported by Thomas and reviewed by Amit Kapila

commit 4060bd90a3a47dfc9e5b396b3071c9c4d6bd0dcd
Author: Dilip Kumar
Date: 2018-09-25 17:17:48 +0530

Add comment for specific handling of getting CID for an inserted tuple with lockers.
Dilip Kumar, reviewed by Amit Kapila

commit e5eee5471d739b1abf084feed6635c5a526e43b4
Author: Dilip Kumar
Date: 2018-09-25 14:15:21 +0530

Fix compilation error

commit 0fe57bf56a54fecf31d7b810c47853ba8e00925b
Author: Kuntal Ghosh
Date: 2018-09-25 13:26:05 +0530

Fix zheap_mask declaration

commit 61226ffd40ad0841d6c1cd4bbf792ec9e85349ba
Author: Kuntal Ghosh
Date: 2018-09-20 11:44:55 +0530

Don't call GetTransactionSlotInfo for all-visible tuples

For an all-visible tuple, the corresponding slot can be re-used or pruned. Hence, we shouldn't call GetTransactionSlotInfo to retrieve the transaction information for that slot.

commit 20c3244baba9a72530fc4747da962190a15623f2
Author: Kuntal Ghosh
Date: 2018-09-25 12:04:00 +0530

Check WAL consistency for TPD and zheap meta pages

commit 5f8604946763e6b537f8f68f382b5da62a7d5a86
Author: Mithun CY
Date: 2018-09-24 06:24:18 -0700

Test fix, Allow inplace updates if the row can fit on the same page.

By Mithun C Y, review by Amit Kapila.

commit 0b83a48a01eb57ffd8a247c8adbe3c85de1d8d98
Author: Amit Kapila
Date: 2018-09-24 18:46:55 +0530

Allow inplace updates if the row can fit on the same page.

Currently, we perform inplace updates only when the new row is smaller than the old row, or if the old row is the last row on the page and it has space after it. However, there are more cases where we can perform inplace updates: (a) if there's no free space immediately following the tuple, but there is space in the page to accommodate the entire tuple; (b) if there's no free space immediately following the tuple, but there is space in the page to accommodate the delta (new_tuple_size - old_tuple_size).

We allow the pruning function to rearrange the page such that it can make space adjacent to the tuple being updated. This is only possible if the page has at least enough space for (newtupsize - oldtupsize). Otherwise, we still try to prune the dead/deleted tuples to see if the new tuple can be accommodated on the same page, which will allow inplace updates.

To perform pruning, we make a copy of the page. We don't scribble on that copy; rather, it is only used during repair fragmentation to copy the tuples. So, we need to ensure that after making the copy we don't operate on the tuples; otherwise, the temporary copy will become useless. It is okay to scribble on the itemids or the special space of the page. While rearranging the page, tuples will be placed in itemid order, which will help speed up future sequential scans. Note that we use the temporary copy of the page to copy the tuples, as writing in itemid order will overwrite some tuples. We have also changed the patch such that REDO will perform repair fragmentation only if it has been done during the DO operation.

Amit Kapila, Mithun C Y.

commit 85262536130fbf40afb96b4d7a64692d31171cc5
Author: edb
Date: 2018-09-19 09:57:29 +0530

Fix compiler warning in tpd.c

Reported by Mithun

commit de73a4d3c89af83d08f3e2822f0ad07090d89f77
Author: Mithun CY
Date: 2018-09-18 19:46:36 -0700

Bugfix in vacuuming zheap table.

The function count_nondeletable_pages was using the vac_strategy variable defined for heap even when it is called for a zheap relation. This patch fixes it to use zheap's vac_strategy variable.

Reported by Neha Sharma, fix by Mithun C Y, review by Amit Kapila.

commit 92fdaf6736bd975fcf6fba3486e2578781c16404
Author: Mithun CY
Date: 2018-09-18 19:36:25 -0700

Remove no longer needed fixme comments on visibility map.

Reported by Amit Kapila.
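A simplified standalone model (not the actual zheap logic; the struct and function names are made up) of the space accounting described in the "Allow inplace updates if the row can fit on the same page" commit above:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct
    {
        int space_after_tuple;  /* free bytes immediately following the old tuple */
        int page_free_space;    /* total free bytes on the page after pruning */
    } PageState;

    static bool
    can_update_inplace(const PageState *p, int oldlen, int newlen)
    {
        if (newlen <= oldlen)
            return true;                /* shrinking or same size always fits */
        if (newlen - oldlen <= p->space_after_tuple)
            return true;                /* grows into the space right after it */
        /* otherwise pruning may rearrange the page if the delta fits somewhere */
        return newlen - oldlen <= p->page_free_space;
    }

    int
    main(void)
    {
        PageState p = {0, 64};

        printf("%d\n", can_update_inplace(&p, 100, 120));  /* 1: delta fits on the page */
        printf("%d\n", can_update_inplace(&p, 100, 200));  /* 0: needs a non-inplace update */
        return 0;
    }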
commit 2509bd2e93e2cc16f4555cd2112895badc304471
Author: Mithun CY
Date: 2018-09-18 19:35:15 -0700

Bug fix in execute_undo_actions_page.

Move ZheapInitPage after XLogInsert, as TPD-related information in the page is needed by some of the functions called before it.

By Mithun C Y and Amit Kapila.

commit c5817d6ade8c1ba423bb0d04f983a65e718138d0
Author: Amit Kapila
Date: 2018-09-18 08:10:04 +0530

Support variable sized TPD entries.

We extend a TPD entry when (a) while reserving a slot, we find that there aren't enough slots in the TPD entry or the offset-map doesn't have enough space, or (b) while getting the existing TPD entry, we find that the offset-map doesn't have enough space. If we find space in the same TPD page, then we perform an inplace update of the TPD entry; otherwise, a non-inplace update is performed. In a non-inplace update, we mark the old entry as deleted, and later during pruning, if we encounter any deleted entry, we directly prune it. Currently, the implementation of in-place updates is not complete, so we always perform a non-in-place update.

Patch by Amit Kapila, Dilip Kumar and Kuntal Ghosh.

commit f17c96014308f8f01385a82e41be8601073d147e
Author: Kuntal Ghosh
Date: 2018-09-14 16:43:14 +0530

In replay of freeze_xact, read TPD buffer before using it

commit 40ce6e610d138de1ebb2944ce619db9f53df61f1
Author: Kuntal Ghosh
Date: 2018-09-14 12:39:18 +0530

Fix a flag in xl_zheap_lock

In the xl_zheap_lock flags, we've skipped the second bit for no good reason.

Reported by Dilip Kumar.

commit 58000e7691eda261ddc99566f24b03692bf2b19a
Author: Kuntal Ghosh
Date: 2018-09-13 17:51:11 +0530

Implement serializable isolation APIs for zheap

For PredicateLockTid, ConflictIn and ConflictOut, we pass the tid to identify the tuple. For conflict out, we've introduced a function ZHeapTupleHasSerializableConflictOut that performs all the zheap-related work to figure out whether the reader conflicts out with any other writers. In ZHeapTupleHasSerializableConflictOut, we refetch the tuple and check the recent status of the tuple. Using that, we decide whether we conflict out. We have special handling for a tuple which is in-place updated or whose latest modifying transaction got aborted. In that case, we check whether the latest committed transaction that modified that tuple is a concurrent transaction. Based on that, we take a decision whether we have any serialization conflict.

Patch by me with help from Amit Kapila. Reviewed by Amit Kapila and Thomas Munro.

commit b401a9c4da9e9038a0a72e4c31f31ba87bd43dbb
Author: Kuntal Ghosh
Date: 2018-08-07 19:49:32 +0530

Make serializable code independent of storage

This code aims to make PredicateLockTuple, CheckForSerializableConflictIn and CheckForSerializableConflictOut independent of the storage tuple. The PredicateLockTuple and CheckForSerializableConflictIn methods can work with the tid only. However, CheckForSerializableConflictOut requires the storage tuple to check the latest visibility status of the tuple. Hence, I've separated *SatisfiesVacuum and its usage towards conflict resolution into a separate storage-specific function. I've also renamed PredicateLockTuple to PredicateLockTid.

Patch by me. Reviewed by Amit Kapila and Thomas Munro.

commit 44d776147cdafc55ea3fe95569bb039352e580ac
Author: edb
Date: 2018-09-10 21:27:38 +0530

Change behaviour of zheap_fetchinsertxid

Earlier, this function was returning the xid which inserted the tuple. But the tuple can be inserted by multi_insert and inplace-update operations as well. Hence, we should handle that.
commit e5b821e0cfc905ea32e11a9619cf797de2342104
Author: Rafia Sabih
Date: 2018-09-06 18:28:49 +0530

Modifying the size of rs_visztuples in HeapScanDescData

The size of rs_visztuples was kept at a fixed value, which was causing failures in the regression suite when running with an increased block size. Now, it is modified to the value of MaxZHeapTuplesPerPageAlign0.

Patch by me, reviewed by Amit Kapila.

commit 0136ec40cb54df56f52af626df680ae5981e8fd3
Author: Rafia Sabih
Date: 2018-09-06 17:07:53 +0530

Bugfix in toast table updation

Patch by me and reviewed by Amit Kapila

commit e12bb90c6751d60564535a7effc8e08994d61f3c
Author: dilip kumar
Date: 2018-09-03 11:32:30 +0530

Fix compilation warning in test_undorecord.c

Reported by Amit Kapila.

commit c11609baa42bd0332bc90a97a2d27814a66346b9
Author: Thomas Munro
Date: 2018-09-03 15:43:14 +1200

Reorder handling of ONCOMMIT_TEMP_DISCARD.

The previous coding accidentally caused ONCOMMIT_NOOP to enter the new ONCOMMIT_TEMP_DISCARD case.

Patch from Rafia.

commit 16a7bf70c67dae185d74e5cd988af8bb6d2bb4fd
Author: Thomas Munro
Date: 2018-09-02 05:02:57 +1200

Fix test_undo undo_append_file() procedure.

commit 6f94ac34370ffd305bf702fa514136eab98e43f7
Author: Thomas Munro
Date: 2018-04-23 22:46:34 +1200

Delete undo log files in dropped tablespaces in recovery.

When a tablespace is dropped, we clear out any remaining undo log files.

commit c7e00a820e5daaa3c7a268cdae195a859c5476af
Author: Thomas Munro
Date: 2018-04-23 18:43:50 +1200

Handle missing undo log segment files during WAL replay.

In recovery, segment files may not be present because they will be deleted by later WAL records. Following the example of regular relation files, we'll supply empty files. Previously, I created unexpectedly missing files during startup, which didn't work correctly if we crashed after dropping a tablespace (I couldn't create the files in the tablespace directory if it had already been deleted). This new approach does what other PostgreSQL code does, and creates a new tablespace directory as required (expecting it to be deleted by a later WAL record).

Thomas Munro

commit 3bbf152f4d330e582e950a45790d0e375d93bc5c
Author: dilip kumar
Date: 2018-08-26 14:11:59 +0530

Fix thinko in assert.

commit 1e9d17cc2406683b014f872ce873c845adafe3af
Author: dilip kumar
Date: 2018-08-26 14:10:21 +0530

Fix discrepancy between ZHeapTupleSatisfiesUpdate and HeapTupleSatisfiesUpdate

Also, for the non-inplace update it was trying to fetch the cid from the prev_undoptr, but in the case of a non-inplace update it will not find the previous version in the undo, so get the cid while copying the tuple.

Dilip Kumar, Amit Kapila
Reviewed and tested by Ashutosh Sharma

commit 4cc08c3ee0c31346af8c8a02a45a7bb759c3ed68
Author: Kuntal Ghosh
Date: 2018-08-10 22:24:22 +0530

Refetch tuple after reserving slots in the page

When we reserve a slot in the page, we sometimes freeze some tuples in the same page. Hence, we should re-fetch the tuple to update the slot information.

Patch by me. Reviewed by Amit Kapila.

commit 056a37b464058d6c386b6337ad65589f15c1a9b3
Author: Kuntal Ghosh
Date: 2018-08-10 16:49:11 +0530

An optimization for temp tables

For temp relations, we don't have to check all the slots since no other backend can access the same relation. If a slot is available, we return it from here. Else, we freeze the slot in PageFreezeTransSlots. Note that for temp tables, oldestXidWithEpochHavingUndo is not relevant as the undo for them can be discarded on commit.
Hence, comparing the xid with oldestXidWithEpochHavingUndo during visibility checks can lead to incorrect behavior. To avoid that, we mark the tuple as frozen for any previous transaction id. That way, we don't have to compare the previous xid of the tuple with oldestXidWithEpochHavingUndo.

Patch by me. Reviewed by Amit Kapila.

commit 10c225fbd9a3aa8239efaf8cf6d00e8b7e9d589d
Author: Kuntal Ghosh
Date: 2018-08-10 13:50:15 +0530

GUC to enable/disable undo launcher

We've introduced a new GUC variable called disable_undo_launcher to enable/disable the undo launcher process. If true, the postmaster won't register the undo launcher. By default, it's set to false. This is a postmaster option; changing the value requires a restart.

Patch by me. Reviewed by Amit Kapila.

commit e613fa25d714ddba99f2b3675894f3d0d141df4f
Author: dilip kumar
Date: 2018-08-23 16:40:35 +0530

Undorecord cleanup.

Created two files, undorecord.c and undoinsert.c, where undorecord.c mainly deals with how to insert and read the undo record, and undoinsert.c provides external interfaces to prepare, insert and fetch undo and deals with the buffer management required for undo records.

Dilip Kumar
Review by Amit Kapila

commit dccab1c216d8677bb6ad716e13adc08da44f79a9
Author: dilip kumar
Date: 2018-08-23 10:39:53 +0530

Bug fix in pg_control_checkpoint

Reported by Andres

commit 6d3f5716cb4257e8226212fbe002f05cc0d60145
Author: Kuntal Ghosh
Date: 2018-08-17 18:54:13 +0530

Fix issue in vmbuffer WAL replay

While logging visibility map changes for zheap, we don't register the zheap buffer. That's intentional, since we don't set any hint bits in the zheap buffer for tracking visibility. But we have to store the block number in the WAL so that we can track the block number for setting the corresponding visibility map bit while replaying the WAL for the same.

commit 9fc32f608a0900f8ed051cc010e72a40abe318ac
Author: Kuntal Ghosh
Date: 2018-08-20 14:39:05 +0530

Fix prevxid for insert/multiinsert WAL replay

We should set the prev xid as frozen during WAL replay of insert and multiinsert.

commit 9fcccee5d272c437a898171a4a97317ba1430ee0
Author: Kuntal Ghosh
Date: 2018-08-20 16:25:02 +0530

Fix slot index in undo_xlog_reset_xid

commit 0047fe43fe0fa5a10833d7f864d81b6a2e4ac4ed
Author: Kuntal Ghosh
Date: 2018-08-17 17:59:59 +0530

In zheap_update, fix WAL record for lock tuple

This is a leftover of commit 9b1f493a6335d07024. In zheap_update, we also lock the tuple. Hence, we need to fix the transaction slot related issue that was fixed as part of the above-mentioned commit.

commit e5f5c83c3120ac90c0a8501a584935a0ceb782f4
Author: Kuntal Ghosh
Date: 2018-08-16 16:13:10 +0530

Lock undo buffer while preparing an undorecord

Currently, we're locking the undo buffers in InsertPreparedUndo. This is called under a critical section, and if we encounter any error while taking the lock under the critical section, it'll lead to a server crash. We can easily avoid this situation by taking the lock in PrepareUndoInsert.

commit 25593d3c8351df93c9d5d57ec5b88ba94f10df7a
Author: Rafia Sabih
Date: 2018-08-21 11:37:09 +0530

Bugfix in temp table rollbacks

If rollback_oversize is set to zero, then rollbacks of temp tables were pushed to RollbackHT. This is corrected so that the backend itself performs the rollbacks.

Reported by Neha Sharma, patch reviewed by Amit Kapila

commit 420810a9cc500cdc539d34e8e4e49296337453e8
Author: Rafia Sabih
Date: 2018-08-20 12:00:03 +0530

Bugfixes in UserAbortTransactionBlock

- If the rollback request could not be pushed, then the backend executes the undo actions.
- Corrected arguments for PushRollbackRequest.

Reported by Dilip Kumar.

commit 771f43b086060f2b1bba25a857693886aa0ef4e9
Author: Mithun CY
Date: 2018-08-19 19:21:41 -0700

Bug fix: in lazy_cleanup_index and lazy_vacuum_index we shall use the vac_strategy passed to it by either the zheap or heap relation.

By Mithun C Y, reviewed by Amit Kapila.

commit bd0b43b93ed204749347d3c885ee258942cc8221
Author: Rafia Sabih
Date: 2018-08-17 15:55:22 +0530

Bugfix for speculative insert in toast table

Reported by Kuntal Ghosh, reviewed by Amit Kapila

commit cb36f7e7e848cfb11a5062684451788a41f048f9
Author: Rafia Sabih
Date: 2018-08-17 15:53:50 +0530

Bugfix in ZMultilockers

Found as part of a bug reported by Neha, reviewed by Amit Kapila

commit 22e1c1c33e1e5ed129de8fbfa539d943ae42c844
Author: Kuntal Ghosh
Date: 2018-08-16 13:46:54 +0530

For insert/multi-insert, set previous xid as frozen

When we modify a tuple, we set the previous xid for this tuple as frozen if its previous modifier xid is older than the discarded xid. For insert/multi-insert, we don't have any previous modifier. Hence, we can set it as frozen unconditionally.

commit 9668bfac227acb13c87dcdf8a54e440aab768e07
Author: Rafia Sabih
Date: 2018-08-16 09:32:43 +0530

Indentation fix.

Reported by Andres.

commit 45e405946ff54b5461fdeb7f2a4826c7a87a890a
Author: Kuntal Ghosh
Date: 2018-08-14 17:14:23 +0530

Use ZHeapTupleHasInvalidXact wrapper consistently

commit 043603d58f4dfc4bd8b1f7ddaa4f2c73e5713f24
Author: Kuntal Ghosh
Date: 2018-08-14 17:12:46 +0530

Update regression tests file in pageinspect

commit 4d0a2ebf9e8bc40f44430174d9990b26b473405c
Author: Kuntal Ghosh
Date: 2018-08-14 16:19:35 +0530

Bugfix in zheap_update for non-inplace updates

For non-inplace updates, we should always mark the tuple as locker-only if we're propagating the key-share lock to the new tuple.

Reported by Rafia Sabih. Reviewed by Amit Kapila. Patch by me.

commit 1b8c7eba990341809096a245937c48e2fdb875d2
Author: Kuntal Ghosh
Date: 2018-08-14 14:49:36 +0530

Fix bug in regression tests for pageinspect

The previous commit of pageinspect forgot to attach the zheap.sql and zheap.out files required for the regression tests.

commit 02d983e5a5b4a6e2b76b59023668a98457465060
Author: Amit Kapila
Date: 2018-08-14 13:43:31 +0530

Propagate lockers information.

We were not propagating the lockers information when the tuple has the multi-lockers bit set. Fix it.

Reported by Ashutosh Sharma, patch by Amit Kapila.

commit b16f0fec8868482773d5a7bc37ec9369dcfaa476
Author: Kuntal Ghosh
Date: 2018-08-14 10:52:08 +0530

In bitmap scan, don't keep the pin on the buffer

For a bitmap scan, we always use pagemode to scan the tuples. Hence, there is no need to keep the lock on the tuple.

commit 191d6f03c143097d63e17e057935a30f3a4db6bc
Author: Rafia Sabih
Date: 2018-08-13 15:25:56 +0530

Bug fixes

- Fixes for incorrect retrieval of the TPD buffer.
- Fixes for incorrect size calculation of the undo-header at recovery time.
- Fix for accessing an uninitialised variable in TPDPageGetTransactionSlots.
- Fixed one space issue in zheapamxlog.c.

Came across these while working on an issue reported by Neha Sharma.

Reviewed by Dilip Kumar

commit 5fcf8c59430fe4ae105d544d9f406c1e7e3e3959
Author: dilip kumar
Date: 2018-08-10 17:34:05 +0530

Bug fix in undo action

When the previous version of the tuple has a TPD slot, we need to pass a flag to the function so that it can set the TPD slot in the offset map. Mistakenly, that option was always passed as 0.
Dilip Kumar Reviewed by Kuntal Ghosh commit ef717922bb7f357a772e2f274766e1d1049beb37 Author: dilip kumar Date: 2018-08-10 15:27:29 +0530 Fix Typo Reported by Andres. commit f68d2c8862ee2f0981ccbc767c1435cca647b804 Author: Amit Khandekar Date: 2018-08-10 09:35:12 +0530 Allow in-place updates for some expression indexes. This is a zheap port of commit c203d6cf81b4d: "Allow HOT updates for some expression indexes." Since we can do HOT updates for such expressions, allow in-place updates for the same expressions. Add a new regression test zheap_func_index.sql derived from the existing test func_index.sql. The new test uses pg_stat_get_xact_tuples_inplace_updated(). Amit Khandekar, reviewed by Amit Kapila. commit c5e306aa3afc6b8ac2975be699772c76734d0ff0 Author: Amit Khandekar Date: 2018-08-10 09:24:10 +0530 Add a pg_stat function for inplace updates in a transaction. There was already a pg_stat_get_tuples_inplace_updated() function for getting in-place updates in a session. But the _xact_ version was missing. Amit Khandekar, reviewed by Amit Kapila. commit 24c5086eb332d3e0549238f25641887c9c9d03fb Author: Kuntal Ghosh Date: 2018-06-25 20:22:29 +0530 Add pageinspect functions to analyze zheap page and zheap tuples It adds two functions: zheap_page_items to inspect the tuples in the page. It also adds another function zheap_page_slots to inspect the transaction slots in the special space. If the page contains a TPD slot, then zheap_page_slots doesn't show it since it doesn't contain any transaction information. TPD slot information can be shown in the future once the structure of the TPD page is stabilized. Patch by me, reviewed by Ashutosh Sharma commit a8442adccf916a5de5c79f4c8a155cc6d4fe9919 Author: Kuntal Ghosh Date: 2018-08-08 17:28:47 +0530 Initialize some local tuples in zheap scan APIs It also fixes a bug in zheapgetpage. We should not copy the tuple for deleted item pointers. commit abb680be862a99fc23b3abfa28f159b545eba5e6 Author: Kuntal Ghosh Date: 2018-08-09 14:22:37 +0530 Fix some compiler warnings commit 11b199c7d779a2beb6ba9217303d4b73359665c2 Author: dilip kumar Date: 2018-08-09 11:39:09 +0530 Allow freezing and reusing of the TPD slots Currently, for zheap pages we allow freezing slots of all-visible xids and also allow reusing the slots of committed xids. This patch implements the same for the TPD slots. The mechanism of freezing and reusing the TPD slots is the same as for the zheap page slots. Dilip Kumar reviewed by Amit Kapila commit 67ca8248109f52c10f06347ac82ab20693547b7f Author: Amit Kapila Date: 2018-08-09 12:04:50 +0530 Remove unnecessary changes related to zheap from RelationGetBufferForTuple. Now that we have a corresponding separate function for zheap, we can remove the zheap related changes from the function RelationGetBufferForTuple. commit e18858cffcbf964d7cf3aab7a087673bd492652b Author: dilip kumar Date: 2018-08-09 11:22:40 +0530 Bug fix in execution of undo action Set the proper slot on the TPD offset map while replaying undo actions if the prior version of the tuple is pointing to the TPD slot. Prior to this, we were simply overwriting the older tuple version but the slot was not updated in the TPD offset map. Dilip Kumar review by Amit Kapila commit db2d9ae8a2c609e2535932b1bbcc9bb9181ca7b8 Author: Mithun CY Date: 2018-08-08 22:10:29 -0700 Post push fix for 2c9d6d9216ad28be2 which by mistake reused a flag value.
By Mithun C Y commit 2c9d6d9216ad28be25b5e4d60c07f5fb2db6096b Author: Mithun CY Date: 2018-08-08 21:49:15 -0700 Bug fix, move calls of visibilitymap_pin or visibilitymap_status outside the critical section at execute_undo_actions_page. commit 7b799efee32450a93b5200774d3550893d4049e0 Author: Kuntal Ghosh Date: 2018-08-07 10:44:00 +0530 Handle whether a tuple is self-locked before modifying it While deleting/updating a tuple, we should check whether the tuple is already locked in desirable lockmode by the current transaction. We've missed this check in zheap_delete/zheap_update when the tuple is marked with multilocker flag. Patch by me and Dilip Kumar. Reported by Thomas Munro. commit b51357225493b0436bedece52075de58334d3abd Author: Kuntal Ghosh Date: 2018-08-07 14:14:24 +0530 Fix a bug while fetching a pruned tuple from all-visible page In zheapgetpage, when a pruned tuple is fetched from an all-visible page, we return NULL (and we should return NULL). But, in page-scan mode, we increase rs_ntuples by mistake. commit 2b9bb36f1a7bdcf1e2792e31d943fad3bbc9fa8b Author: dilip kumar Date: 2018-08-07 08:54:58 +0530 Fix various bugs in TPD with multi-lockers - Locker is setting wrong slot in TPD offset map - Locker is not calculating proper TPD slot for members. Dilip Kumar Reviewed by Amit Kpila commit 3967b9019e051842c2947aa0333c4938facdb636 Author: dilip kumar Date: 2018-08-06 16:02:02 +0530 A minor bug fix in TPD code, missing break statement in switch case Reported by Andres. commit 7949de7ff8bd4cd9d99f5cd81ad1211af5a55c7e Author: Mithun CY Date: 2018-08-05 05:58:36 -0700 Support visibility map for zheap. With the support of visibility map for zheap relation, vacuum task and Index Only scan can skip looking into all visible pages. Also, on page flag PD_ALL_VISIBLE is no more in use for zheap. Index Only scan for zheap is enabled with this patch. By Amit Kapila, Mithun C Y Review by Amit Kapila. commit 585226697f91c61926361ad565a10f7a78144ad5 Author: Amit Khandekar Date: 2018-08-03 12:44:03 +0530 Pass the right priorXMax to ValidateTuplesXact() In zheap_get_latest_tid(), ZHeapTupleGetTransInfo() was getting called with 'resulttup'. But because 'resulttup' is returned from ZHeapTupleSatisfiesVisibility(), it can be a tuple generated from an undo record. And if we use ZHeapTupleGetTransInfo() to get the xid that modified resulttup, it returns the xid that created the original tuple instead of the xid that modified the tuple. This results in the wrong xid being passed as priorXMax to ValidateTuplesXact(), which in turn leads to assertion failures. So move the ZHeapTupleGetTransInfo() call before ZHeapTupleSatisfiesVisibility() call, so that we could use 'tp' rather than resulttup. 'tp' gets freed by ZHeapTupleSatisfiesVisibility(), so we could not use tp after ZHeapTupleSatisfiesVisibility() call. Added a new isolation test. commit 7ad7a9aedaf883e02caf5b172374344faa4507b8 Author: Kuntal Ghosh Date: 2018-08-02 19:44:07 +0530 Fix some compiler warnings commit db8af0bf6388182415d0cbcde55882a1bf8de84c Author: Kuntal Ghosh Date: 2018-08-02 14:18:00 +0530 Report the scan location for zheap meta page During scan, we report our scan position for synchronization purposes. We do this before checking for end of scan so that the final state of the position hint is back at the start of the rel. But, if we skip metapage, the scan location may point to a block at the end of the relation. 
Patch by me, reviewed by Amit Kapila commit caf343183b8940670f9f953014f6815883f9f66c Author: Kuntal Ghosh Date: 2018-08-02 14:14:50 +0530 For parallel scan, find and set the scan's startblock Probably, we've missed this change during earlier rebase. commit ec6c8b936c6de6af49816e29dd04ebc363aad5e6 Author: Thomas Munro Date: 2018-07-26 16:43:44 +1200 Fix silly bug in undolog_xlog_discard(). When an XLOG_UNDOLOG_DISCARD record is replayed, we need to tell the checkpointer to forget about any files that we are about to unlink. I was using the wrong variable, so if a single XLOG_UNDOLOG_DISCARD record caused segment files 1, 2, 3 to be unlinked, I was telling it to forget about fsyncing 3, 3, 3. Then it would eventually try to fsync 1 and 2 and try to raise an error. Repair that. Additionally, in the error path resulting from the above bug, I was also calling FilePathName() on a File that I had failed to open. That causes an assertion failure. Repair that too. Thomas Munro, reported by Neha Sharma, RM43571 commit 9d8be03646a5775000e1008badcabf0a5ff828bc Author: Thomas Munro Date: 2018-07-26 17:56:27 +1200 The startup process shouldn't attach to undo logs. When replaying XLOG_UNDOLOG_META records, the startup process was recording that it was attached to the referenced undo log. That caused corruption of the freelists when it tried to detach on exit. During recovery we shouldn't attach at all; instead we use the xid->undo log mapping. Thomas Munro, RM43614 commit adaaac4518c68b378ec2c60a2a8b29353de6b190 Author: Rafia Sabih Date: 2018-08-01 11:52:41 +0530 Toast table support for zheap Now, toast tables for zheap tables are created of zheap type. This imporves the performance in terms of memory by reducing the bloat in toast tables. With zheap type toast tables, as soon as a transaction deletes a tuple and commits, the space can be utilised for the next insertion. Since, toast tables are larger in size compared to ordinary tables and their updation is handled by insertion + deletion, zheap storage is likely to benefit significantly. Reviwed by Amit Khandekar and Amit Kapila. commit b2509b38661f97c1e1308a5b043a1f34d4d9ebeb Author: Rafia Sabih Date: 2018-08-01 11:46:12 +0530 Bugfix in GetLockerTransInfo The initialization of trans_slots was missing. commit b455083f2a902df76ba43e38fb16260fd26fb4f0 Author: Kuntal Ghosh Date: 2018-07-31 16:21:47 +0530 In zheap, we cannot ignore trans status of backends executing vacuum For zheap, since vacuum process also reserves transaction slot in page, other backend can't ignore this while calculating OldestXmin/RecentXmin. commit df3e6a10a3ba9c5fe81dd80fc1bfd105b7b42478 Author: Kuntal Ghosh Date: 2018-07-31 16:15:23 +0530 During vacuum, reserve sufficient offsets in tpd page In lazy_vacuum_zpage_with_undo, we should allocate sufficient space in tpd page to store the highest unused offset from zheap page. Since, we've to reserve space before determining the unused offsets, we reserve space for maximum used offset in the zheap page. commit 58e36c8929684198c40bb326cfb455e719b7a11d Author: Amit Kapila Date: 2018-07-31 14:51:17 +0530 Support pruning in TPD pages. The basic idea is process all the TPD entries in the page and remove the old entries which are all-visible. We attempt pruning when there is no space in the existing TPD page. Also, while accessing TPD entry, we can consider the entry as pruned, if we find that the ItemIdIsUnused or the block number in TPD entry is different from the heap block number for which we are accessing the TPD entry. 
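The pruned-entry rule just described amounts to a two-part check. A minimal C sketch, using invented stand-in types (TPDItemId, TPDEntryHeader and tpe_blkno are assumptions for illustration, not the actual zheap definitions):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    /* Hypothetical stand-ins for the on-page structures involved. */
    typedef struct TPDItemId
    {
        bool        unused;      /* corresponds to ItemIdIsUnused() on the TPD page */
    } TPDItemId;

    typedef struct TPDEntryHeader
    {
        BlockNumber tpe_blkno;   /* heap block this TPD entry was allocated for */
        /* ... offset map and extended transaction slots would follow ... */
    } TPDEntryHeader;

    /*
     * Treat a TPD entry as pruned when its item id is unused, or when the
     * entry no longer refers to the heap block we are looking it up for
     * (its slot was reused for another block after pruning).
     */
    static bool
    TPDEntryIsPruned(const TPDItemId *itemId, const TPDEntryHeader *entry,
                     BlockNumber heapBlk)
    {
        return itemId->unused || entry->tpe_blkno != heapBlk;
    }
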
Amit Kapila with help from Dilip Kumar. commit 1c37ab83d083241d8738444980ca5510e8e5deb0 Author: Amit Kapila Date: 2018-07-31 14:23:08 +0530 Store Itemids in TPD page. Earlier heappage always point directly to the actual offset of TPD entry in a TPD page. This won't work after pruning as even if the particular page's TPD entry is not pruned, we might not be able to directly access the TPD entry as offset might have moved. Now, we can reach our TPD entry if we can traverse all the TPD entries and TPD entry has block number in it, but that is quite inefficient. To overcome this problem, it is better to store Itemids in the TPD page. Amit Kapila, reviewed and verified by Dilip Kumar commit ed01319cc1d9541b3293b051d3d91eb35b7c448d Author: Thomas Munro Date: 2018-07-31 18:43:17 +1200 Improve smgr README. Wordsmithing. Thomas Munro commit 99f54c695641304061847f6124f04fc9f82a5317 Author: Thomas Munro Date: 2018-07-31 18:33:22 +1200 Add missing case to undolog_identify(). commit 9a956e391937f685fd3284620c0c6796362cfdf0 Author: Thomas Munro Date: 2018-07-31 18:30:04 +1200 Add basic undo log storage tests. Exercise basic undo log storage code under make check-world. Thomas Munro commit 576f60e34994d4e5777c3e1e17c9987199c46a1f Author: Rafia Sabih Date: 2018-07-27 17:55:32 +0530 Discard temp table undo logs for zheap At the commit of a transaction, backend discards temp table undo logs. Reviewed by Dilip Kumar and Amit Kapila commit c61887386927699c1f40c8ad2da0bb49948d38eb Author: Kuntal Ghosh Date: 2018-07-27 17:21:48 +0530 In zheap_update, fix condition for reserving slots. commit 142f2a014312c668eddc20d9a43d5c81baa48675 Author: Kuntal Ghosh Date: 2018-07-27 12:37:02 +0530 An optimization to improve COPY FREEZE in zheap In COPY FREEZE, when we insert a tuple, we always mark it as frozen. Hence, there is a possible performance optimization for the same scenario: 1. We can skip inserting undo records for the tuples to be inserted. 2. There is no need to reserve a transaction slot. Here is the implementation details: 1. Set skip_undo = true if HEAP_INSERT_FROZEN is mentioned. 2. If skip_undo is true, we don't have to reserve a transaction slot in the page. Also, we skip preparing and inserting undo records for the to-be-inserted tuples. 3. For recovery, a new WAL flag XLZ_INSERT_IS_FROZEN is introduced. It's true if HEAP_INSERT_FROZEN is mentioned. During WAL replay, we skip preparing and inserting undo records if XLZ_INSERT_IS_FROZEN is set in WAL records. Patch by me. Reviewed by Amit Kapila. commit 3dd753713bfdedebd5d38cdd709d61f4cf97b60e Author: Kuntal Ghosh Date: 2018-06-22 12:46:15 +0530 Cosmetic changes in zheap_multi_insert and zheap_xlog_multi_insert This commit removes the usage of undo record information at other places in the same function. This makes the coding easy when we intend to skip undo insertions. Patch by me. Reviewed by Amit Kapila. commit bf8288d70bad7da53ea6ab3457ac234ca08ff4f9 Author: Kuntal Ghosh Date: 2018-07-27 16:29:44 +0530 In ZHeapGetVisibleTuple, fetch transaction slot correctly When we call GetTransactionSlotInfo to fetch the transaction info from undo, we should pass TPDSlot=false since we fetch the slot from the item pointer. Item pointer never stores TPD slots. Reported by Neha Sharma. Patch by Mithun CY. Reviewed by Amit Kapila. commit 0f4f8ad4b402ab3163c78691c70ebcf38cc54384 Author: Kuntal Ghosh Date: 2018-07-27 12:15:32 +0530 In ZHeapGetVisibleTuple, handle non-mvcc snapshots Reported by Neha Sharma. Patch by me. Reviewed by Amit Kapila. 
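As a rough illustration of the skip_undo decision described in the COPY FREEZE commit above, here is a hedged C sketch. HEAP_INSERT_FROZEN, skip_undo and XLZ_INSERT_IS_FROZEN are names taken from the commit text, but the flag values and the function shape are simplified assumptions, not the real zheap_insert:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative values only; the real option/flag bits may differ. */
    #define HEAP_INSERT_FROZEN      0x0004   /* caller wants tuples frozen at insert */
    #define XLZ_INSERT_IS_FROZEN    0x08     /* WAL flag so redo also skips undo */

    static void
    zheap_insert_sketch(int options, uint8_t *wal_flags)
    {
        /*
         * Frozen inserts never need to be rolled back to an older visible
         * version, so no undo record and no transaction slot are required.
         */
        bool skip_undo = (options & HEAP_INSERT_FROZEN) != 0;

        if (!skip_undo)
        {
            /* reserve a transaction slot in the page, then prepare and insert undo */
        }

        /* Record the decision in WAL so replay makes the same choice. */
        if (skip_undo)
            *wal_flags |= XLZ_INSERT_IS_FROZEN;
    }
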
commit 8d395035cc9a711b6761006b5b2433aa4077ea12 Author: Kuntal Ghosh Date: 2018-07-27 11:12:05 +0530 Fix some compiler warnings commit 2a932c0c1580a699740b5628562f014381dd48c5 Author: Kuntal Ghosh Date: 2018-07-25 16:50:58 +0530 Handle locked-only tuple in ZHeapTupleSatisfiesOldestXmin In ZHeapTupleSatisfiesOldestXmin, we can't take any decision if the tuple is marked as locked-only. It's possible that inserted transaction took a lock on the tuple. Later, if it rolled back, we should return HEAPTUPLE_DEAD, or if it's still in progress, we should return HEAPTUPLE_INSERT_IN_PROGRESS. Similarly, if the inserted transaction got committed, we should return HEAPTUPLE_LIVE. The subsequent checks in the function already takes care of all these possible scenarios, so we don't need any extra checks for locked-only tuple. commit bca6d63cdc78fd8e8d5804b0296b2d1c436af7d5 Author: dilip kumar Date: 2018-07-26 17:52:14 +0530 Fix bug in prepared transaction. StartPrepare is only copying the first member of the array instead of copying whole array. Dilip Kumar Reviewed by Amit Kapila commit 199b463c81cb253b4bb77645629dc9b621ab865c Author: dilip kumar Date: 2018-07-26 17:51:43 +0530 fix the regression test for the zheap. commit 85b792b7a12c91e59abe0189e79587fd516ea1c2 Author: dilip kumar Date: 2018-07-26 17:49:08 +0530 Create an auxiliary resource owner for the undoworker because now we need to have valid resource owner for accessing the buffer. Dilip Kumar Reviewed by Amit Kapila commit f41d0055f5129b99530f1fc0f612b21627e816b2 Author: Kuntal Ghosh Date: 2018-07-25 12:09:10 +0530 Handle single in-progress locker in ZHeapTupleSatisfiesUpdate For single in-progress locker, ZHeapTupleSatisfiesUpdate returns locker's xid and transaction slot along with latest modifier/inserter's xid and transaction slot. We also send the single locker's trans info and latest modifier/inserter's to compute_new_xid_infomask. Kuntal Ghosh and Amit Kapila commit afe785f273dfbb393c0026677e00c32296e7406e Author: Rafia Sabih Date: 2018-07-25 09:06:26 +0530 Rollbacks of zheap temp tables We diligently take care to never push the rollback requests to undo worker for temp tables and the backend itself performs the required undo actions. Reviewed by Dilip Kumar and Amit Kapila commit 9282d623ae17a9ff3f15e5e8e000150a8aee6011 Author: dilip kumar Date: 2018-07-23 18:41:22 +0530 Undo actions is not executed in many cases when failure occurs during commit/abort path. This commit fixes the same. Dilip Kumar Reviewed by Amit Kapila commit 3ac8b197c0f12f91f0f8f93426e818f902036c1e Author: Rafia Sabih Date: 2018-07-23 17:02:02 +0530 Bug fix in PushRollbackHT In PushRollbackHT, if the start_urec_ptr is not given then get it from the log, as done in execute_undo_actions. Reported by Neha Sharma. commit 5c599edf74c44a02332ed50ee939bf974bde6fef Author: Kuntal Ghosh Date: 2018-07-19 16:43:19 +0530 Remove unnecessary GetTransactionSlotInfo() call commit e68f8d97ecbdf730fdd910b0e683542684c6c9d8 Author: Amit Kapila Date: 2018-07-19 08:34:50 +0530 Fix Rollback of multilockers. Uptill now, on rollback, we never change the slot of tuple if the multilockers flag is set on the tuple. This is because we can't find the next highest locker (there could be multiple lockers with same lock level) even by traversing undo chains. To overcome this problem, we come up with a new design where we ensure that the tuple always point to the transaction slot of latest inserter/updater. 
For example, say after a committed insert/update, a new request arrives to lock the tuple in key share mode, we will keep the inserter's/updater's slot on the tuple and set the multi-locker and key-share bit. If the inserter/updater is already known to be having a frozen slot (visible to every one), we will set the key-share locker bit and the tuple will indicate a frozen slot. Similarly, for a new updater, if the tuple has a single locker, then the undo will have a frozen tuple and for multi-lockers, the undo of updater will have previous inserter/updater slot; in both cases the new tuple will point to the updaters slot. Now, the rollback of a single locker will set the frozen slot on tuple and the rollback of multi-locker won't change slot information on tuple. We don't want to keep the slot of locker on the tuple as after rollback, we will lose track of last updater/inserter. Amit Kapila, Kuntal Ghosh and Dilip Kumar commit 143adfee5ad287dedc744e06eaae0c0ed390005e Author: Kuntal Ghosh Date: 2018-07-16 17:14:54 +0530 Small fix in GetTupleFromUndoForAbortedXact commit dd2f4010834f885af0c2baffa16bc4f6cfeb0664 Author: Kuntal Ghosh Date: 2018-07-10 23:16:25 +0530 Implement ZHeapTupleSatisfiesVacuum for pruning/vacuum For pruning/vacuum, we can skip the tuples inserted/modified by an aborted transaction. It'll be handled by future pruning/vacuum calls once the pending rollback is applied on the tuple. This optimization allows to avoid fetching prior version of the tuple from undo. Patch by me. Reviewed by Amit Kapila. commit d5d9f72eca46bc01efea01f3236f1a3d279459d7 Author: Kuntal Ghosh Date: 2018-07-16 16:45:29 +0530 Handle aborted xact in ZHeapTupleSatisfiesOldestXmin If the latest transaction for the tuple aborted, we fetch a prior committed version of the tuple and return it along with prior comitted xid and status as HEAPTUPLE_LIVE. If the latest transaction for the tuple aborted and it also inserted the tuple, we return the aborted transaction id and status as HEAPTUPLE_DEAD. In this case, the caller *should* never mark the corresponding item id as dead. Because, when undo action for the same will be performed, we need the item pointer. Patch by Amit Kapila and me. Reviewed by Amit Kapila. commit 3cc15be2061d294fe2762da471c321394bf06ebb Author: Kuntal Ghosh Date: 2018-07-13 13:26:43 +0530 In zheap, don't support the optimization for HEAP_INSERT_SKIP_WAL If we skip writing/using WAL, we must force the relation down to disk (using heap_sync) before it's safe to commit the transaction. This requires writing out any dirty buffers of that relation and then doing a forced fsync. For zheap, we've to fsync the corresponding undo buffers as well. It is difficult to keep track of dirty undo buffers and fsync them at end of the operation in some function similar to heap_sync. This commit skips copy_relation_data and copy_heap_data. We need to revisit the same once we implement ALTER TABLE.. SET TABLESPACE and CREATE CLUSTER/VACUUM FULL feature for zheap. Reviewed by Dilip Kumar and Amit Kapila commit d261daaee572c0d6ea953948428a93eec95eee79 Author: Rafia Sabih Date: 2018-07-10 15:45:05 +0530 Modified output file for triggers.sql for zheap When storage_engine = zheap, the behavior for one test case is different than heap. The difference in behavior is because of inplace updates in zheap and non inplace updates in heap. This changed behavior is acceptable for zheap, hence, adding a new output file for zheap. 
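A compact way to read the aborted-transaction rules in the ZHeapTupleSatisfiesOldestXmin commit above is as a small decision function. This is only a sketch with invented names (SketchTupleStatus and the parameters are assumptions), not the actual visibility routine:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Invented names mirroring the HEAPTUPLE_* statuses mentioned above. */
    typedef enum { SKETCH_TUPLE_DEAD, SKETCH_TUPLE_LIVE } SketchTupleStatus;

    /*
     * If the aborted transaction also inserted the tuple, report DEAD and
     * hand back the aborted xid; the caller must not mark the item id dead
     * because the pending undo action still needs the item pointer.
     * Otherwise, fall back to the prior committed version and report LIVE.
     */
    static SketchTupleStatus
    aborted_xact_status(bool aborted_xact_was_inserter,
                        TransactionId aborted_xid,
                        TransactionId prior_committed_xid,
                        TransactionId *report_xid,
                        bool *fetch_prior_version)
    {
        if (aborted_xact_was_inserter)
        {
            *report_xid = aborted_xid;
            *fetch_prior_version = false;
            return SKETCH_TUPLE_DEAD;
        }

        *report_xid = prior_committed_xid;
        *fetch_prior_version = true;    /* prior version comes from undo */
        return SKETCH_TUPLE_LIVE;
    }
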
Reported by Neha commit 1ca9a92cde64fcc798c5a0a5e7dbc0b8aae7454f Author: Kuntal Ghosh Date: 2018-07-09 11:58:50 +0530 Bug fix in UndoRecordAllocateMulti Reported by Neha, reviewed by Amit Kapila and Dilip Kumar commit 2eb1df348fe34bd4d68b0892a5e63d6d8a9ddb47 Author: Kuntal Ghosh Date: 2018-07-09 11:57:45 +0530 In recovery, transaction id should be sent in UndoSetPrepareSize Reported by Neha, reviewed by Amit Kapila and Dilip Kumar commit c4b6cc166f11fbc22147e4f5567791ac9bbc572e Author: dilip kumar Date: 2018-07-05 17:39:50 +0530 Handling when a transaction span across undo logs and avoid single WAL logged operation to span across multiple log. The first part of the patch handle a case, when a single WAL logged operation which needs multiple undo records (e.g non-inplace update, multi-insert) we avoid it to go in multiple logs. For that we first allocate all the undo required for the operation in one allocate call. And the second part handle the discarding and rollback when a transaction span across undo logs. Path by Dilip Kumar Reviewed by Amit Kapila. commit 7ed998fd1d369ce8012cd69bed7b020a367c9294 Author: Amit Khandekar Date: 2018-07-03 12:14:47 +0530 Handle changed relfilenode while executing undo actions. While an undo worker executes undo actions for a relation, the same relation can be truncated. In that case, RelidByRelfilenode() when called using the relfilenode saved in the undo record returns invalid relation oid. Use this behaviour to figure out that the relation is truncated, and abort the undo actions. Because this was not handled earlier, the algorithm in execute_undo_actions() failed to identify when exactly the undo records switch to new pages, and kept on fetching records, without releasing them, thus leaving behind those many records and their buffers unreleased. This eventually leads to "no unpinned buffers available" error in the server log. This was reproducible easily when truncate is run immediately after interrupting a long insert, and with shared_buffers set to minimum value. Amit Khandekar, reviewed by Amit Kapila. Reported by Neha Sharma. commit a96820d722ca0244c8bd9a9749f18e5b2604c45d Author: Amit Khandekar Date: 2018-06-28 09:42:58 +0530 Prevent usage of uninitialized tuple during tuple routing. TransitionCaptureState.tcs_original_insert_tuple is of type HeapTuple, and it was assigned a tuple variable which remains unitinitialized in case of zheap table. Fix it so that the zheap tuple is converted to heap tuple and then assigned to tcs_original_insert_tuple. Fixed this in both ExecPrepareTupleRouting() and CopyFrom(). Due to this issue, trigger.sql regression test used to crash on some environments. Amit Khandekar, reviewed by Kuntal Ghosh. commit 304dfa7af0547db1f22d72b30b42baa92647b136 Author: dilip kumar Date: 2018-06-27 20:49:04 -0700 Fix compilation warning commit 89c80035015098c563c104e31406dd23d483fc0f Author: Kuntal Ghosh Date: 2018-06-13 16:49:04 +0530 Change minimum transaction slots per zheap page to one We distinguish a zheap page from a TPD page by comparing the special page size. TPD pages always have 1 slot in its special space. Hence, we've to set minimum slot as two in zheap pages. Tested on Windows by Ashutosh Sharma commit 1265603a63e90def6a1738f5cca962a13d59e43d Author: Amit Kapila Date: 2018-06-25 08:08:51 +0530 Free payload data only when it is allocated. commit 045032bdedff74d583bd54c68a1560f63b7cb967 Author: dilip kumar Date: 2018-06-21 06:24:13 -0700 Isolation test for tpd patch. 
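The "Handle changed relfilenode while executing undo actions" commit above hinges on one lookup: if the relfilenode saved in the undo record no longer maps to a relation, the table was truncated and there is nothing to roll back. A hedged sketch, assuming the backend's relfilenodemap.h of that era and omitting error handling:

    /* Sketch only: assumes PostgreSQL backend headers are available. */
    #include "postgres.h"
    #include "utils/relfilenodemap.h"

    /*
     * The undo record stores the relfilenode it applies to.  If the relation
     * has since been truncated (or dropped), the relfilenode maps to no
     * relation, so the undo worker can abandon the undo actions for it.
     */
    static bool
    undo_target_still_exists(Oid reltablespace, Oid relfilenode)
    {
        Oid reloid = RelidByRelfilenode(reltablespace, relfilenode);

        return OidIsValid(reloid);
    }
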
Patch by Rafia Sabih Reviewed by Dilip Kumar commit 6bbe1f3fde618cf35f64da2726f1100963ba5576 Author: dilip kumar Date: 2018-06-21 06:14:12 -0700 Post TPD commit fix. After Rereserve the slot, trans_slot is not set back to the new slot. Ideally we should get the same slot but if our slot moved to TPD than it can be old_slo +1. Also, removed the invalid Assert and converted to if condition. Patch by Dilip Kumar Reviewed by Amit Kapila commit 3d3697e196c2f44eceaa07b77d00bc9d90ff0c19 Author: Amit Kapila Date: 2018-06-21 16:32:10 +0530 Support TPD which allows transaction slots to be extended beyond page boundary. TPD is nothing but temporary data page consisting of extended transaction slots from heap pages. There are two primary reasons for having TPD (a) In the heap page, we have fixed number of transaction slots which can lead to deadlock, (b) To support cases where a large number of transactions acquire SHARE or KEY SHARE locks on a single page. The TPD overflow pages will be stored in the zheap itself, interleaved with regular pages. We have a meta page in zheap from which all overflow pages are tracked. TPD Entry acts like an extension of the transaction slot array in heap page. Tuple headers normally point to the transaction slot responsible for the last modification, but since there aren't enough bits available to do this in the case where a TPD is used, an offset -> slot mapping is stored in the TPD entry itself. This array can be used to get the slot for tuples in heap page, but for undo tuples we can't use it because we can't track multiple slots that have updated the same tuple. So for undo records, we record the TPD transaction slot number along with the undo record. This commit provides basic support of TPD entries where a fixed number of entries per page can be allocated in extended pages. There is more to do in order to complete the support of TPD. The remaining work consisits of 1. Reuse transaction slots in TPD entry. 2. Allocate bigger TPD entries once the initially configured slots or offsets are exhausted. 3. TPD page pruning and once the page is clean, we can add it to the FSM. 4. Find the free TPD page from FSM. 5. Chain of TPD entries when the single TPD entry can't fit on a page. Amit Kapila with the help of Dilip Kumar and Rafia Sabih commit 15536e74ed38ba970f30684850ad013d2174d6e8 Author: Mithun CY Date: 2018-06-20 04:31:51 -0700 Move Assert to right place. By Mithun C Y review By Dilip Kumar commit 9724bc889e4a99309e74912c739416c4345cd5d1 Author: Mithun CY Date: 2018-06-19 22:24:35 -0700 Release restriction on page compactification. Previously a30d278e8d we did not allow compactification of page if it ever have deleted uncommitted tuple in it. This is not necessary and hence removing same. By Mithun C Y, review by Amit Kapila. commit a07a3976029c6322cc3e3f954eae0cd606e776b2 Author: Mithun CY Date: 2018-06-18 05:29:17 -0700 Fix bugs related to concurrency in zheap update While updating the tuple we might need to release the buffer lock and reaquire same. During that period where we do not hold buffer lock a concurrent process might have moved the tuple to undo and/or pruned the page. So whenever we reacquire the buffer lock check if itemid is deleted and readjust the position of tuple in page buffer. Patch by Mithun C Y and Amit Kapila. commit 2448790b692e8f6581cce59b4d08abc977dbd20b Author: Kuntal Ghosh Date: 2018-06-12 11:13:34 +0530 In PrepareUndoInsert, don't call SubTransGetTopmostTransaction In PrepareUndoInsert, we always send the top transaction id. 
Hence, there is no need to fetch the parent xid for a transaction. commit 4abb8ba5305ef04f7164be58f4c0c94b11ab793c Author: dilip kumar Date: 2018-06-14 05:00:13 -0700 LogUndoMetaData was called inside XLogBeginInsert, and LogUndoMetaData itself calls XLogBeginInsert for inserting the meta WAL. Patch by Dilip Kumar commit 7c44b88155a14186ca91ca010cf84d5f7be5fd93 Author: dilip kumar Date: 2018-06-14 04:54:20 -0700 Conditions for pushing a rollback request to the worker were not consistent; it should only be pushed for the top transaction. Also, when there is an error in the top transaction, we will not have an xid for the top transaction, so we have removed the xid-based key for the rollback hashtable. Patch by Rafia Sabih Review by Amit Kapila tested by me. commit 2797c4bc90ad986804b698a4eaa39991ffcf2094 Author: Amit Kapila Date: 2018-06-14 15:37:44 +0530 Implement stats for in-place updates Reuse the existing hot-update stat variable for in-place updates instead of introducing a new variable as that would increase the size of the stats structure. It appears ugly to overload the unrelated variable, but it is better to discuss in the community before introducing a new variable. Beena Emerson, reviewed by Mithun C Y and me commit 0b8ef5ea24a55c23d54821fa4bc3667319e84801 Author: Kuntal Ghosh Date: 2018-06-08 17:12:59 +0530 Use high 4-bits of xl_info to store undoaction WAL operation type commit 80e9fe40e186dac8ffbd4b9bf2479cb1fca4f770 Author: ashu Date: 2018-06-11 13:04:33 +0530 Pass '/c:' option with findstr command on Windows. Commit 3e9f07a8fe22 uses the findstr command to remove the lines having the "Options: storage_engine='zheap'" pattern from the results/*.out files on Windows platform but it doesn't pass the correct option to findstr to ensure that only the lines having the given pattern get removed from the results/*.out file. Ashutosh Sharma, Reported by Amit Kapila. commit a600e4e055877dbb9071f40f215a8f8bb8662d79 Author: Rafia Sabih Date: 2018-06-08 17:00:07 +0530 Buffer leak fix in zheap_insert commit 303634f18969c53dc31b6668668232f327d7bed6 Author: dilip kumar Date: 2018-06-07 23:08:42 -0700 During rollback we rewind the insert location. Now, if the insert location is used by some other transaction, then it updates its own next pointer with its own insert location, resulting in a cycle, and due to that the undo worker gets stuck in this cycle. Patch by Dilip Kumar Review by Amit Kapila and Kuntal Ghosh commit 6fe4dfdda5db82e0e24eeaefa4cdc63a3277be49 Author: dilip kumar Date: 2018-06-07 23:06:34 -0700 ZHeapPageGetCid currently compares with RecentGlobalXmin to identify old undo; ideally it should compare with the oldest xid having undo. Due to this it fetches many extra undo records, hence the performance is low. Fix by Dilip Kumar review by Amit Kapila commit 363c7aa95eeb35f57908ed7c777817cfa38a0f1d Author: Kuntal Ghosh Date: 2018-06-04 22:09:13 +0530 While performing undo, ignore out of range block numbers If a table is truncated just before performing undo actions on the same, it's possible to encounter out of range block numbers. In that case, we can safely ignore those blocks since we don't have to perform rollback on the same. commit 7cd5b08da9380552376a69456ca11c6f97ce1a1c Author: ashu Date: 2018-06-07 17:32:52 +0530 Store the information about lockmode in undorecord and initialize the flags variable in xl_zheap_lock to zero.
Commit 0763c68645e9 introduced the support for different tuple locking modes and added the necessary changes in zheap_lock_tuple_guts() to store lockmode information in the undo record but missed making similar changes in zheap_update(), due to which the undo record pointers for non-inplace updates were not the same on master and standby nodes, thereby resulting in an assertion failure. Additionally, commit f45251a37fe2 introduced a flags variable in xl_zheap_lock to store transaction slot related information and did the necessary changes for it in zheap_lock_tuple_guts() but missed doing the same in zheap_update(). Ashutosh Sharma, reported by Neha Sharma, reviewed by Dilip Kumar. commit 643bf383e77acb5eca372380267d01ed01d3a545 Author: Mithun CY Date: 2018-05-31 03:17:44 -0700 Correct style issues in macro definition. By Mithun C Y, comments by Robert Haas. commit aaf09cb0491869b5d6e84e4cc68cfe28418ef6ad Author: Kuntal Ghosh Date: 2018-05-30 15:37:24 +0530 In zheap insert option, HEAP_INSERT_FROZEN should be used In zheap_insert/multi_insert, we can use the HEAP_INSERT_FROZEN flag, similar to heap, to indicate that the inserted tuple should be frozen after insertion. commit 409cf3957bee5894cacc603670587476171277fe Author: dilip kumar Date: 2018-05-30 01:42:43 -0700 Applying pending undo action before modifying the page. Currently, if a transaction wants to update a tuple and we find that the other modifier is aborted and its undo actions are not yet applied, we simply modify the page, which will create unpredictable behaviour as executing the undo actions may roll back the changes made by our transaction. This commit first applies the pending undo actions only for that page and then performs the changes. Patch by Dilip Kumar and Amit Kapila Reviewed by Amit Kapila Tested by Kuntal Ghosh commit 736ff4cb26355992800792dc64954e55694a82fb Author: dilip kumar Date: 2018-05-30 01:41:06 -0700 Bugfix in condition check while applying the undo action. Earlier, undo was not collected when the two undo pointers were in different logs. Patch by Dilip Kumar Reviewed By Amit Kapila commit c708efefa812ffe54d94841b3c54c906b263a3da Author: Kuntal Ghosh Date: 2018-05-30 12:00:01 +0530 In zheap_update, don't propagate lockers when no lockers are present When no lockers are present, we should not propagate any locker related information to the newly inserted tuple. Patch by Dilip Kumar, reviewed by Amit Kapila and me. commit 640c67490b76c096405fe38253ae222439707ba5 Author: ashu Date: 2018-05-25 13:20:53 +0530 Correct the insert and discard pointer of the undo log segment file when resetting the undo logs. Report by Neha Sharma, Initial Analysis by Ashutosh Sharma, Patch by Thomas Munro. commit dc28828a8aa5a54fd8a5aced67c4a69cad84faf7 Author: Kuntal Ghosh Date: 2018-05-24 14:39:34 +0530 Allow DML commands that create zheap tables to use parallel query This commit applies the required changes for zheap corresponding to the commit e9baa5e9fa147e00a2466. Reviewed by Amit Kapila commit be186ca7f02be4c8d88632bc9fff37d2a7680267 Author: Thomas Munro Date: 2018-05-25 15:30:31 +1200 Fix corruption of oldest_data. After commit 9ebe7511 it could be left pointing to space before log->discard, and we'd later try to read from there and possibly see a bunch of zeroes. Thomas Munro, RM43553, reviewed by Rafia Sabih commit ba2406238eb93e400ac223a3d8e78097984072c6 Author: Amit Kapila Date: 2018-05-25 08:31:46 +0530 Fix the freespace recording by vacuum The freespace was not being updated unless there were some deleted/dead tuples in the page.
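The "Applying pending undo action before modifying the page" commit above follows a simple pattern: finish the aborted transaction's per-page rollback first, then make our own change. A hedged sketch with invented state names (PageState and its fields are not real zheap structures):

    #include <stdbool.h>

    /* Hypothetical stand-ins; the real code works with buffers and slots. */
    typedef struct PageState
    {
        bool last_modifier_aborted;   /* did the previous modifier roll back?  */
        bool undo_already_applied;    /* were its undo actions applied yet?    */
    } PageState;

    static void
    apply_pending_undo_for_page(PageState *page)
    {
        /* roll back only this page's changes of the aborted transaction */
        page->undo_already_applied = true;
    }

    /*
     * Before modifying a page, make sure any aborted-but-not-yet-rolled-back
     * changes on it are undone, so that a later undo pass cannot wipe out
     * our own modification.
     */
    static void
    zheap_modify_page_sketch(PageState *page)
    {
        if (page->last_modifier_aborted && !page->undo_already_applied)
            apply_pending_undo_for_page(page);

        /* ... now it is safe to perform our own update/delete on the page ... */
    }
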
Report and initial analysis by Ashutosh Sharma, patch by me commit ab0049c4bb174c332052b187ce60b3f01466bb18 Author: Kuntal Ghosh Date: 2018-05-24 11:11:40 +0530 Revert last commit 853849138bf05bcac85 We're already releasing the lock in function IsPrevTxnUndoDiscarded. Although, we could've released the lock in the same function where it was taken, but let it be as it is for now. commit d7e1b3f1c30456b25fbbaa22ffa6bb9791f60a6e Author: Kuntal Ghosh Date: 2018-05-23 16:38:33 +0530 In PrepareUndoRecordUpdateTransInfo, release discard lock before leaving page commit e329d580a973fcb41dc37b6283fc0db67e8262f3 Author: Kuntal Ghosh Date: 2018-05-21 12:28:12 +0530 In execute_undo_actions, release undo records before exiting When undo record is discarded, we return from execute_undo_actions immediately. But, we should release the undo records and corresponding undo buffers collected earlier in the same function for rollback purpose. commit 9dcdb6468df34f45b2975e5c4cac69a05ed27ac2 Author: Mithun CY Date: 2018-05-22 23:14:02 -0700 Revert the restriction on page pruning. Revert code which disallowed pruning if there exist any open transaction on the page. Patch By Mithun C Y commit 7023615ca40a2041e79863f56c2c1fa6b7c58cda Author: Amit Khandekar Date: 2018-05-23 09:42:55 +0530 Pass nobuflock=false to ZHeapTupleGetTransInfo(). In zheap_get_latest_tid(), ZHeapTupleGetTransInfo() was called with nobuflock=true, which is wrong because the buffer is already locked. Discovered by Kuntal Ghosh, patch by Amit Khandekar. commit ad0dd279a096e5f9763f44799dfc78c18c4bbd26 Author: Amit Khandekar Date: 2018-05-23 09:12:09 +0530 Avoid using tuple freed by a visibility function. In zheap_get_latest_tid(), the tuple that is passed to ZHeapTupleSatisfiesVisibility() was being used subsequently, ignoring the fact that the visibility function frees the passed-in tuple. Rather than the passed-in tuple, use the tuple returned by the visibility function in the subsequent code. While at it, make sure that the returned tuple is also freed if it is different than the passed-in tuple. Discovered by Neha Sharma while testing TidScan implementation for zheap. Amit Khandekar, reviewed by Kuntal Ghosh. commit e47016cdfba942a07cc3c9fb98d698e22251cb77 Author: Amit Khandekar Date: 2018-05-22 09:38:16 +0530 Fix an issue with copying overlapping memory areas. For adjusting zheap tuple location, the tuple header was getting corrupted because memcpy() was used to copy, and the old and new tuple header areas may sometimes overlap, which memcpy does not handle. This was discovered when trigger.sql regression test used to crash on some environments. So use memmove() instead of memcpy(). memmove() is meant to handle this scenario of overlapping areas by first copying the data from source location into a temporary location. Amit Khandekar, reviewed by Dilip Kumar. commit ec2d2d7b51bd843b88c2668d548fb90da60fcc67 Author: Kuntal Ghosh Date: 2018-05-17 10:33:47 +0530 Fetch correct cid for undo tuples In ZHeapTupleGetCid, we should fetch the correct cid from undo. Patch by me, reviewed by Amit Kapila commit ef42a733e045b5624308ce37f4d745b0b9e7d221 Author: Kuntal Ghosh Date: 2018-05-21 13:09:17 +0530 Fix compiler warnings commit 804476bd9929532a83fd0b30ddacfa00b68c5028 Author: Thomas Munro Date: 2018-04-23 14:38:34 +1200 Add a smgrsync() implementation for undofile.c. Teach the checkpointer to go through an sgmr API call to mark a (RelFileNode, forknum, segment) as in need of flushing to disk at the next call to smgrsync(). 
Previously it called RememberFsyncRequest() in md.c directly, but undofile.c needs to be able to participate in this scheme too. So, add a new function smgrrequestsync(), and have it forward to mdrequestsync() or undofile_requestsync() as appropriate. For now, there is a LOG message when undo segment files are fsync'd, like the existing create/recycle/unlink messages. These will be removed in future. This contains a small amount of code that is copied from md.c, but the fsync queue machinery is being redesigned so this can be rebased later to use a common fsync queue. See commitfest entry 18/1639 (work independent of zheap). Thomas Munro, RM43460, reviewed by Amit Kapila commit 74c0db38b6469fd92bfca32b41ecbadafd99ed40 Author: Thomas Munro Date: 2018-05-15 01:22:41 +1200 Free up undo log DSM segments and trim pg_undo file size. To prevent the pg_undo file from getting gradually larger and the number of undo log DSM segments from gradually increasing in a long running system, update the lowest non-discarded undo log number at each checkpoint so that we can free up resources. In passing, change UNDO_LOG_STATUS_DROPPED to UNDO_LOG_STATUS_DISCARDED, a name that better describes the state. Thomas Munro, RM43532, reviewed by Amit Kapila commit 9b18132fe9ad889e0aaecb2d2019e3bcf9433018 Author: Thomas Munro Date: 2018-04-24 16:45:58 +1200 Recycle memory used in recovery for the xid->undo log map. At each checkpoint, free up memory used to hold information about which undo log old xids are attached to. Also rename associated variables and functions to make things clearer. Thomas Munro, RM43532, reviewed by Amit Kapila commit 8cb810f82a694d794429f0b8e28328baf6334bc5 Author: Rafia Sabih Date: 2018-05-17 15:06:51 +0530 Bug fix in discard mechanism When undo actions were applied by backend and the undo record pointer is rewound, discard the corresponding undo logs and skip performing undo actions. Reported by Neha Sharma, reviewed by Dilip Kumar and Amit Kapila commit 56762bc994dfa14e79c29f13231bd0f6e99e0782 Author: Kuntal Ghosh Date: 2018-05-15 17:16:55 +0530 Fix compiler warnings commit e40d26a76fb67a3b2f1cf144a2e76c8301be6a4b Author: Thomas Munro Date: 2018-05-15 05:12:47 +1200 Improve smgr.c and undofile.c modularity. Add a void pointer called "private_data" to SmgrRelationData where undofile.c can put its state, instead of the previous horrible hack where it was using relm->md_seg_fds. md.c should really use the new member too, but that'll be a patch for another day. Thomas Munro commit bc419ace343682505687cbb39f402d4e90cf228a Author: Thomas Munro Date: 2018-05-15 17:04:31 +1200 Remove obsolete comment from undolog.c. commit 62e17972070c1b994f2ae63abe07f951f59bdbc3 Author: Thomas Munro Date: 2018-05-15 13:40:29 +1200 Update copyright date to 2018. For all undolog-related files. commit b591268e289fba15aeb9f1160d7c164e241c6a11 Author: Thomas Munro Date: 2018-05-09 18:44:56 +1200 Add some user-facing documentation about undo logs. This commit documents the pg_stat_undo_logs view and the layout of files on disk. More wordsmithing will be needed. Thomas Munro commit 350dadc12b77a4efccafbb5f11663c07370bfd01 Author: Thomas Munro Date: 2018-05-09 19:16:24 +1200 Undo log README tweaks. Author: Thomas Munro commit 22560511be9b6d49034573736feb0c7a830e1a3a Author: Thomas Munro Date: 2018-04-16 16:15:54 +1200 Skip unnecessary reads of newly allocated undo log pages. Whenever we're inserting new undo data that happens to fall at the start of a page, we know that there can be no pre-existing data on the page. 
Therefore we can ask bufmgr.c to zero it out instead of reading it from the storage manager. To facilitate this, create a new BufferReadMode RBM_ZERO, just like RBM_ZERO_AND_LOCK except without the content lock. undorecord.c expects to acquire the content lock a bit later. Since each backend has sole write access to write to the undo log, it's not necessary for bufmgr.c to acquire the lock for us. Thomas Munro, RM43486, reviewed by Amit Kapila commit b88edc54636b2649101106ad08b95c16fbfc1d7f Author: Mithun CY Date: 2018-05-14 21:46:38 -0700 Move previous transaction's undo updation inside critical section Update the previous transaction's undo record inside InsertPreparedUndo after we have actually inserted the undo record. Patch by Mithun C Y Review by Dilip Kumar. commit 87075c70ffaeea5c570340df7aa144823d94a510 Author: Mithun CY Date: 2018-05-14 21:35:47 -0700 Get deleted rows from zheap_fetch. In zheap_fetch if ItemId is set as deleted we need to fetch the old version of rows from undo. Patch by Mithun C Y Review by Kuntal Ghosh commit 229aaef467f263bf779cab93c2bf05011a03f335 Author: Amit Khandekar Date: 2018-05-14 18:18:30 +0530 Fix an issue in parallel btree index build. Commit 9da0cc35284b added support for parallel btree index build. While imitating those changes for zheap in IndexBuildZHeapRangeScan() with commit a33e61f999f03, some changes got missed, due to which an already-unregistered snapshot is tried to be freed again. Added the missing changes. Reviewed by Mithun Cy and Amit Kapila. commit 43d9001c390852fae721b9351a8404f3e063b0d5 Author: Kuntal Ghosh Date: 2018-05-07 16:10:29 +0530 During ROLLBACK, set frozen/invalid xact flag correctly Reviewed by Amit Kapila commit e18393d52f04e6c5ade7529581b521d2f7387bdf Author: Kuntal Ghosh Date: 2018-05-14 12:05:48 +0530 In UndoLogAllocate, set is_first_rec to false by default commit db7d7ee6110a47b420d2be2dc88381e39a099f9c Author: Amit Kapila Date: 2018-05-13 09:29:14 +0530 Update README.md to reflect the current status of zheap. commit 19235c3d03f0505f91b5056871c79a70e79430c4 Author: Kuntal Ghosh Date: 2018-05-08 12:26:35 +0530 Add new expected .out files for zheap to compensate the failures only happening due to inplace updates. commit 4396609900c4bd5efdd7ecebc8d3966411304786 Author: Rafia Sabih Date: 2018-05-08 11:41:45 +0530 Fetch the tuple from undo for aborted transactions when checking the visibility for dirty snapshot Reviewed by Kuntal Ghosh commit 27814264865d23ab8cfae46cdbe9202117ae7c6c Author: Rafia Sabih Date: 2018-05-07 13:18:29 +0530 Thinko fixed for execute_undo_actions caller commit d8a3a11957e1e4647da0782308ca89b5ac931b1b Author: Rafia Sabih Date: 2018-05-07 13:11:26 +0530 Fixed a thinko in RollbackFromHT Reported by Dilip Kumar commit 35a9231220d0866847cd7bd3729c5d231b10e3b5 Author: Thomas Munro Date: 2018-05-04 15:45:25 +1200 Document the undo log IO wait events. Add the new undo log wait events to monitoring.sgml. Thomas Munro commit 7dc8a6862586586b48eadfe708044e679f86f153 Author: Thomas Munro Date: 2018-05-04 15:06:23 +1200 Report read/write/sync wait events for undo checkpoints. Like other file IO, let's make these show up in pg_stat_activity. Thomas Munro, RM43509, based on feedback from Amit Kapila commit ff7193ad4d4472014344e0983ce90adce9be3229 Author: Thomas Munro Date: 2018-04-23 23:19:49 +1200 CRC verification for pg_undo checkpoint files. Add CRC32C checksums to the per-checkpoint files stored under pg_undo. 
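For context, the CRC32C checksums mentioned just above would typically follow PostgreSQL's pg_crc32c pattern. A hedged sketch of stamping and verifying a pg_undo checkpoint payload; the data/len arguments stand in for the serialized undo-log metadata, whose actual layout is not shown here:

    /* Sketch only: assumes the PostgreSQL source tree (port/pg_crc32c.h). */
    #include "postgres.h"
    #include "port/pg_crc32c.h"

    /* Compute a CRC32C over the payload, in the style used for WAL records. */
    static pg_crc32c
    undo_checkpoint_crc(const char *data, size_t len)
    {
        pg_crc32c crc;

        INIT_CRC32C(crc);
        COMP_CRC32C(crc, data, len);
        FIN_CRC32C(crc);
        return crc;
    }

    /* On read, recompute and compare against the value stored in the file. */
    static bool
    undo_checkpoint_crc_ok(const char *data, size_t len, pg_crc32c stored)
    {
        pg_crc32c crc = undo_checkpoint_crc(data, len);

        return EQ_CRC32C(crc, stored);
    }
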
Thomas Munro, RM43509, reviewed by Amit Kapila commit 5ec98770f1e4a1b6fc515447a7d01e1139a8eb27 Author: Thomas Munro Date: 2018-05-01 17:48:01 +1200 Fix compiler error in test_undo.c on Windows. Per CI build report. commit e8ee49180be2391e87a0858d6c2b09eca5fbd716 Author: Amit Kapila Date: 2018-05-03 18:55:15 +0530 Avoid calling PageGetUNDO We need to avoid calling PageGetUndo as it re-access the transaction slots the second time. We can already get it via PageReserveTransactionSlot. This is okay till now, but with TPD, we need to again access the TPD page which will be costly. Patch by me, reviewed and edited by Dilip Kumar commit 83986b54c2b6445517502d1750f5523ed2690bba Author: Kuntal Ghosh Date: 2018-05-03 14:23:22 +0530 Small fix in CopyTupleFromUndoRecord We should free the memory for zheap tuple only after allocating memory for the copied undo tuple. commit c8ce97a9d55c0c3a3f537a389513ce10de1b6c0e Author: Kuntal Ghosh Date: 2018-05-03 11:51:11 +0530 Remove unnecessary undo type from CopyTupleFromUndoRecord Reported and reviewed by Amit Kapila commit a2a88babb3e190f526cdb3fae496c41fc9c1d2f2 Author: Amit Kapila Date: 2018-05-03 12:45:30 +0530 Fix WAL replay of XLOG_ZHEAP_UNUSED We forgot to call PageSetUNDO during wal replay of XLOG_ZHEAP_UNUSED. Patch by me, verified by Mithun C Y commit 9b1f493a6335d0702479bc56506dc34287b46b35 Author: Amit Kapila Date: 2018-05-02 16:54:50 +0530 WAL replay for lock tuple was not using correct transaction slot to update the transaction information This commit fixes the issue by logging and using the correct transaction slot during replay of lock tuple. Patch by me, reviewed by Dilip Kumar commit 6a74940a505bd5e302cc9ac7b3adbb3625c560a8 Author: dilip kumar Date: 2018-05-01 23:19:24 -0700 Bug fix in undo record start header update during recovery Patch by Dilip Kumar Reviewed by Rafia Sabih commit a5c5e1706570f379eb8bf6a12a9226af60912016 Author: Kuntal Ghosh Date: 2018-05-02 13:42:36 +0530 Handle non-exixtent/empty files in pg_regress Commit 3e9f07a8fe222b0e9 excluded storage_engine option from result files using grep/findstr command. But, these commands returns non-zero values in case of non-existent/empty files. Hence, we've to skip the checks for the same. commit fd4c197eca2a0846d08727361073cbc1f82d7cbf Author: dilip kumar Date: 2018-05-01 22:51:15 -0700 Fix tuple lock wal replay lock mode was not stored in lock tuple WAL and also not stored in undo during replay. This commit fixes the same. Patch by Dilip Kumar Reviewed by Amit Kapila commit 3775d97153d1cd63bc98d561816e9662cd03eb53 Author: Kuntal Ghosh Date: 2018-04-30 17:14:08 +0530 Fix copy undo payload Reported by Amit Kapila commit 02a89f973c2be94019d695fb0d917494db319473 Author: dilip kumar Date: 2018-05-01 21:45:18 -0700 Bug fix in update wal replay. Patch by Dilip Kumar Reviewed by Amit Kapila commit c5f63f27f73618b994bd561057d3f4e9c0da15ff Author: Amit Kapila Date: 2018-05-01 12:42:26 +0530 The check to ensure whether undo is discarded was missing at few places. This patch fixes two such occurrences. Patch by me, reported and reviewed by Rafia Sabih commit f336c7a16c72e0048f897f85b1e38b24adc8ff8e Author: Kuntal Ghosh Date: 2018-04-30 11:15:40 +0530 Add new expected .out files for zheap to compensate the failures only happening due to inplace updates. 
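The "check to ensure whether undo is discarded" mentioned above is a recurring guard in this series: never fetch an undo record that lies behind the discard horizon; treat such transactions as all-visible instead. A minimal sketch with hypothetical names (UndoRecPtr comparison and the helper names are assumptions for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t UndoRecPtr;    /* stand-in for the real undo pointer type */

    /* Anything before the oldest retained undo pointer has been discarded. */
    static bool
    undo_is_discarded(UndoRecPtr urec_ptr, UndoRecPtr oldest_undo_ptr)
    {
        return urec_ptr < oldest_undo_ptr;
    }

    static bool
    fetch_undo_record_if_present(UndoRecPtr urec_ptr, UndoRecPtr oldest_undo_ptr)
    {
        if (undo_is_discarded(urec_ptr, oldest_undo_ptr))
            return false;           /* caller treats the slot as frozen */

        /* ... safe to read the undo record here ... */
        return true;
    }
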
commit 1a8e463d48e1bbf5ea9abb1e07b1512a097fb963 Author: Rafia Sabih Date: 2018-04-30 10:49:06 +0530 Additional flag in execute_undo_actions for relation lock Callers of execute_undo_actions now provide a flag based on if they a lock on the relation. If the caller have relation lock already then no need to lock again in execute_undo_actions. This is required particularly in cases when rollbacking the prepared transactions or rollback to savepoints. In such cases the transaction have already held the relation locks and we need not take another. Reviewed by Dilip Kumar commit 5f229e04af2f187fc757e8f59ba8c075e56cf46c Author: dilip kumar Date: 2018-04-27 04:21:42 -0700 While vacuuming a large zheap table, update upper-level FSM data every so often. for detail refer commit 851a26e26637aac60d6e974acbadb31748b12f86 of PG. Patch by Dilip Kumar Reviewed by Amit kapila commit 90dbf29c4b815ea3b51c8736f412a67d59d5f4f1 Author: Thomas Munro Date: 2018-04-27 21:49:59 +1200 Log the same undo segment messages in REDO and in DO. During DO we currently output LOG messages when undo segment files are recycled etc. Output the same messages in recovery. All of these messages will later be removed, but it's helpful for testing to show them for now. Thomas Munro commit 292856c5c73f80f9089441a9079ce76ab17b9764 Author: ashu Date: 2018-04-27 12:53:10 +0530 Add new expected .out files for zheap to compensate the failures only happening due to inplace updates. Patch by Ashutosh Sharma, suggested by Amit Kapila, reviewed by Kuntal Ghosh. commit 0925b489735ab6ba01c3c57a85175947ec04a373 Author: ashu Date: 2018-04-27 12:51:00 +0530 Allow pg_regress module to exclude storage_engine option printed when viewing the definition of zheap table using \d command. Additionally, also add a new reloption_1.out file to compensate for the diffs generated due to storage_engine option in reltopions field of pg_class table. Patch by Ashutosh Sharma, reviewed by Kuntal Ghosh. commit c13e3a64be6fe571f48c1abc8a837c063594ce1d Author: Kuntal Ghosh Date: 2018-04-18 18:17:16 +0530 Implement Foreign Key Constraint for zheap For zheap, when we call the triggers we convert a zheap tuple to heap tuple. But, we don't store any transaction related information on the heap tuple. Hence, I've to make the following two assumptions/considerations for the foreign key implementation in zheap: 1. RI_FKey_fk_upd_check_required: In this function, we check if the original row was inserted by our own transaction, we must fire the trigger whether or not the keys are equal. For zheap, since we don't have the transaction related information, we always fire the trigger. So, even if the keys are equal, we cannot skip the trigger for zheap. 2. validateForeignKeyConstraint: During ALTER TABLE..ADD CONSTRAINT FOREIGN KEY, this function is called to validate whether we can add the foreign key constraint. First, it tries to fire a LEFT JOIN query to test the validity. If the user doesn't have proper access rights to pktable/fktable, we scan the table in pagescanmode and fire trigger for each tuple. Before firing the trigger, it takes buffer lock to check whether it can skip the trigger. For zheap, we don't retain the pin on the buffer. Hence, we've to read the buffer again before locking the same. Patch by me, Reviewed and tested by Ashutosh Sharma commit 4113663bfa26941ef3b5eec9589e70dfe3643e96 Author: ashu Date: 2018-04-27 11:34:49 +0530 Update t_infomask2 field of old tuple correctly during inplace updates. 
Patch by Amit Kapila, reviewed by Ashutosh Sharma commit 06fb3f80e349e0a329606702cd94cd2bdc740946 Author: Mithun CY Date: 2018-04-25 23:56:02 -0700 Mark all_dead if ItemId is already dead. By Mithun C Y review by Ashutosh Sharma commit 60731dbac9433dc640d3412504423e15260b1ed6 Author: Mithun CY Date: 2018-04-25 23:52:55 -0700 Fix rebase issue Prototype change of GetOldestXmin was not updated while rebasing to postgres main branch. Above patch corrects same. commit 7e10f40e60dcd2704ba98781ae98d658b3dcbb5a Author: Thomas Munro Date: 2018-04-25 15:07:34 +1200 Track undo logs' is_first_rec correctly in recovery. Previously we could get confused about whether an undo record is the first in a transaction during recovery. Use XLOG_UNDOLOG_ATTACH to set the flag, since that is always emitted before the first zheap WAL record for each transaction. If a checkpoint happens to come between that and the first zheap operation, it doesn't matter because then an XLOG_UNDOLOG_META will be inserted and that will restore the is_first_rec flag, assuming it was also tracked correctly during DO (that's a separate investigation). Thomas Munro, RM43459, reviewed by Dilip Kumar commit 3a82f2f014ac86b7587bc85f12a5df6fc4bb1514 Author: Thomas Munro Date: 2018-04-25 21:10:30 +1200 Make src/test/modules/test_undo compile. After many recent changes it wasn't building. Repair. It isn't useful for testing at the moment, because if you add test content to undo logs it causes the undo worker to crash. I need to figure out a way to prevent that from happening and then convert this into a useful set of tests of undo log machinery. Watch this space. Thomas Munro commit 0bf11c41400b26ec50a2edd4fb70777268993d56 Author: Thomas Munro Date: 2018-03-26 09:59:39 +1300 Change undo log segment size to 1MB. The previous size was 4MB. That size wasn't chosen with much thought, and it creates a fairly large disk footprint for systems with many concurrent backends. Let's try the smaller and nice, round number of 1MB and see how frequently we finish up doing filesystem operations. Early testing with pgbench on powerful machines seem acceptable, since the undo worker is very easily able to recycle segments fast enough so that foreground processes never have to create a new one. Thomas Munro commit 5b7fbed0c316ec15cdf8fc566a503e5428946dd7 Author: Thomas Munro Date: 2018-03-09 16:18:12 +1300 Add a README file describing the undo log storage subsystem. Add src/backend/access/undo/README, and update src/backend/storage/smgr/README to describe the new storage manager. Thomas Munro commit 2b620d2e8b3bc0708106e122d42464f417406071 Author: Thomas Munro Date: 2018-03-20 05:19:45 +1300 Make temporary undo logs use backend-local buffers. For now temporary undo data is not discarded, except at startup when it's all discarded at once. This will be addressed in later commits. Thomas Munro, RM43422 commit 1f176882f95411b5a4fd18740687143fd4cd2c0e Author: Thomas Munro Date: 2018-04-23 14:30:14 +1200 Basic tablespace and persistence support for undo (take II). You can now be attached to a separate undo log for each persistence level (permanent, unlogged, temporary). Undo logs can now be created in tablespaces other than pg_default by setting the new GUC "undo_tablespaces". Temporary undo logs are not yet backend-local; a separate commit will add that. A separate commit will also fix some details of crash recovery. 
Thomas Munro, RM43422, reviewed by Rafia Sabih, Dilip Kumar, Amit Kapila commit 25eb200b1aa3d05cb1d8365946fc03c3ca1c96fc Author: Thomas Munro Date: 2018-04-23 13:01:51 +1200 Introduce XLOG_UNDOLOG_META WAL records. During checkpoints, undo log meta-data is captured at an arbitrary time after the redo point is chosen. In the case of an online checkpoint, this means that we might capture incorrect meta-data. Correct that by inserting an XLOG_UNDOLOG_META record before the first WAL record that writes to each undo log after a checkpoint. Dilip Kumar, RM43459, reviewed by Thomas Munro commit 47c7e3b5199d561953f693e18c2f85dae7b56264 Author: Thomas Munro Date: 2018-04-23 12:32:00 +1200 Remove code for consistent pg_undo checkpoint files. Instead of trying to make pg_undo files consistent, we have decided to allow them to contain data from any arbitrary time after the redo point. A follow-up commit will introduce new WAL records that will be emitted to correct them. This is not a complete revert of commit d8a02edf as there were some refactorings and small fixes that seem worth keeping. Thomas Munro, RM43459 commit b43da384a36b13679f5f0ec12aa72e218656d0d1 Author: dilip kumar Date: 2018-04-24 07:09:41 -0700 Fix the assert. In recovery we can not ensure that we are attached to the undo log from which we are allocating. commit f306a1c5c10e63eb38c790be504e154413372a90 Author: Mithun CY Date: 2018-04-24 06:35:27 -0700 Force page init on Insert rollback In zheap_xlog_insert we see insert of first and only tuple on the page we re-initialize the page. Force page init on insert or multi insert rollabck so wal consistency check of page on standby still satisfy. This overrides previous commit 15179e57b124 Patch by me and review by Kuntal Ghosh. commit ffc5ea7e8577d3cce4b5e44fda55a4f7df6bed75 Author: Kuntal Ghosh Date: 2018-04-13 14:13:44 +0530 Fix ZHeapPageGetCtid for deleted tuples commit d8107861935ad0e202aeb14ad116722af5d22af5 Author: Kuntal Ghosh Date: 2018-04-24 18:26:23 +0530 Handle deleted item pointers for no-key-exclusive mode Reported by Amit Kapila commit 156e25415049a1d9a940bf709abc5fd5f550ee7e Author: Kuntal Ghosh Date: 2018-04-20 15:15:59 +0530 Fix assert in zheap lock,update and delete tuple For deleted item id, we don't retrieve the tuple. Hence, we should check for the same. commit af6f63e740cd904f234475bf530a9290b982b38b Author: dilip kumar Date: 2018-04-24 03:35:42 -0700 make zheap changes in slot_getsysattr function. This function is now called by execCurrentOf. commit 8b6041f28b8e2554da439da7701500bb2af2c712 Author: Thomas Munro Date: 2018-04-23 11:11:57 +1200 Fix warnings in non-assertion build. Clang complained about uninitialized variables when there was no assertion. commit 2b96b1be41fad5aa00e8d8b8db63aada925f7581 Author: Kuntal Ghosh Date: 2018-04-20 19:17:48 +0530 Fix a bug in ALTER DOMAIN .. ADD CONSTRAINT The previous commit doesn't use beginscan correctly for heap/zheap relation. commit 7cb019bbcec083052f8436f4489fe6e2957b223a Author: Beena Emerson Date: 2018-04-20 16:11:19 +0530 Correct the behaviour of ALTER DOMAIN .. ADD CONSTRAINT in zheap Add zheap support in functions validateDomainConstraint and AlterDomainNotNull which will check if any tuple violates the newly added constraint. 
Reviewed by Kuntal Ghosh commit dbcfe95943c7f5ddd5fbcb3b84daa62a32c3abf9 Author: Beena Emerson Date: 2018-04-20 14:22:16 +0530 Allow foreign tables to be added as partitions of a zheap table Since foreign tables do not support the storage_engine option, exempt them from the checks where we check if the partitions have the same storage_engine option as their ancestors. commit 439869f9fc77a8e6121a74fc433eef7d822951b8 Author: Kuntal Ghosh Date: 2018-04-20 00:38:59 +0530 Add isolation expected file of vacuum-reltuples test for zheap In zheap, we don't have to take a cleanup lock for vacuuming the relation. Hence, the vacuum command won't skip the buffer even when another backend holds a pin on the same. commit d50ed5e003d5104171540e3bcb45335bc941cd4a Author: Mithun CY Date: 2018-04-18 13:00:28 -0700 Force Refragmentation on Insert rollback In zheap_xlog_insert, when we see an insert of the first and only tuple on the page, we re-initialize the page. Force pruning on insert or multi-insert so that the WAL consistency check of the page on standby is still satisfied. Patch by me and review by Kuntal Ghosh. commit 240b960b6cf36a5611aa13b7f67e6cebedd15b63 Author: Rafia Sabih Date: 2018-04-18 14:26:43 +0530 Introducing a queue for passing rollback requests to undo worker To increase the efficiency of the rollback mechanism in zheap, we now have a rollback queue. Now, any rollback request that exceeds the threshold -- rollback_overflow_size -- is added to this queue. Whenever the undo worker is idle, it checks if the rollback queue has some entries and executes the required undo actions, if any. The rollbacks required for 'rollbacks to savepoint' are not added to this queue, rather the backend itself executes the required undo actions for them. The rollback queue is implemented as a shared hash table. Reviewed by Beena Emerson and Amit Kapila commit fee9b99b1ccf3dca9a822d4e41e6a3a801598eb1 Author: Mithun CY Date: 2018-04-17 03:37:04 -0700 Block VACUUM FULL on zheap table, which got unblocked by commit 0d27be592d82a44158d Reported by Thomas Munro and Kuntal Ghosh. commit 49743ef26167946e60292aa3251b6bbdfdacd015 Author: Thomas Munro Date: 2018-04-16 17:26:56 +1200 Fix uninitialized variable. Per compiler warning from clang. commit 31cd2baf750596b961c9838a61078f3c9f3eb70f Author: Mithun CY Date: 2018-04-13 05:12:13 -0700 Fix for warnings introduced by commit 0d27be592d82a44158d Reported by Kuntal Ghosh. commit 274b0cb5c7ad77da7f4f98db23619a30571f6ac0 Author: Kuntal Ghosh Date: 2018-04-12 12:53:26 +0530 Declare get_old_lock_mode as static inline Otherwise clang complains about the inline function declaration. Reported by Thomas Munro, reviewed by Amit Kapila commit ee296fd6b96d8294dba9437f30353797e756d6da Author: Amit Kapila Date: 2018-04-12 10:07:15 +0530 Two pass vacuum We need vacuum in zheap for non-delete-marked indexes; however, we can use undo to reduce three-pass to two-pass vacuum. When a row is deleted, the vacuum will directly mark the line pointer as unused, writing an undo record as it does, and then mark the corresponding index entries as dead. If vacuum fails midway through, the undo can ensure that changes to the heap page are rolled back. If the vacuum goes on to commit, we don't need to revisit the heap page after index cleanup. We must be careful about TID reuse: we will only allow a TID to be reused when the transaction that has marked it as unused has committed. At that point, we can be assured that all the index entries corresponding to dead tuples will be marked as dead. 
Currently, due to lack of visibility map for zheap, we scan all the pages during vacuum. A future patch which will introduce visibility map in zheap will remove that limitation. Patch by me, Mithun has fixed few bugs and added implementation for Rollback action, also he has done basic verification of the patch commit 4f3a0106661ab1f52be6a4b1c1de37216c710535 Author: Amit Kapila Date: 2018-04-11 16:51:11 +0530 Support different tuple locking modes The basic idea is that we maintain each lockers information in undo and the strongest lockers information on the tuple. If there is more than one locker, then we set multi_locker bit on tuple. Now, if the multi_locker bit is set and the new locker conflicts with the strongest locker, then we traverse all the undo chains in the page and wait for all the conflicting lockers to finish. As we have to wait for all the lockers by releasing the lock on the buffer and then reacquire the buffer lock after waiting for all the transactions is finished, in the meantime, a new locker (say key share) can take a lock on the tuple and we won't be able to detect it unless we do something special. Now, one might think that as before waiting for multiple lockers we have acquired a heavyweight lock on tuple by using heap_acquire_tuplock, no other transaction can acquire xid-based lock (something like key share), but that is not true, as that is allowed for both heap as well as for zheap (till now). Now, the heap can detect such a case because it always creates a new multixact whenever a new locker is added to the existing set of lockers and it puts the newly create multixact id in xmax of tuple, so in above case after reacquiring the buffer lock we can just check if the xmax has changed and if so, then we redo the *TupleSatisfies check and again wait for new lockers. For zheap, after reacquiring the buffer lock, check again if there is any new locker on the tuple and to find this we need to again traverse the undo chains. Although this doesn't sound the best design, it is not clear whether it can really create the problem. If we want we can optimize by checking whether LSN of the page is changed, then only go for chasing all the undo chains, sure that won't work for unlogged tables, but still it is a good optimization. Another thing is that this approach is quite simple, so we inclined to go with this approach. We clear the multi_locker bit lazily like when we are already traversing all the undo chains to verify if there is any new locker on the tuple after taking the buffer lock. The visibility routines don't need any special handling as we are already storing the strongest locker information on a tuple which can help us get the transaction information of updater which is what is required to check visibility of tuple. Patch by me with help from Kuntal Ghosh and Dilip Kumar, reviewed and tested by Kuntal Ghosh commit 5bf3a603cd575ec188e42b1e0b7b0978f0fc54ed Author: Amit Kapila Date: 2018-04-11 16:12:12 +0530 Retrieve the transaction slot of modified tuple This is required for the upcoming tuple locking patch. Patch by me, reviewed by Dilip Kumar commit f2922167ffb0134854ca1fceea7b80d1a657cc23 Author: Amit Kapila Date: 2018-04-05 11:03:14 +0530 Fix the incorrect update of infomask for inplace updates Ensure to copy everything from new tuple in infomask apart from visibility flags. 
Patch by me, reported by Kuntal and reviewed by Ashutosh Sharma commit 1dc99668b499b30a6fb7146e7c6156178143ced7 Author: dilip kumar Date: 2018-04-04 00:30:38 -0700 If full_page_writes is enabled and the buffer image is not included in the WAL, then we can rely on the tuple in the page to regenerate the undo tuple during recovery, as the tuple state must be the same as now; otherwise, we need to store it explicitly. But in the current code it is not stored externally even if the page image is included in the WAL. Introduced new API XLogInsertExtended. Unlike XLogInsert, this function will not retry the WAL insert if the page image inclusion decision got changed; instead it will return immediately. Also, it will not calculate the latest value of the RedoRecPtr like XLogInsert does; instead it will take it as input from the caller, so that if the caller has decided not to include the tuple info (because the page image is not present in the WAL) it can start over again (if the page image inclusion decision got changed during WAL insertion). And for zheap, wherever we need to include tuple info for generating the undo record, we will call this new function instead of XLogInsert. Patch by Dilip Kumar. Reviewed and modified by Ashutosh Sharma and Amit Kapila commit 9b3f81dd68751cd8fc61a588e07c0299c52b1e7b Author: dilip kumar Date: 2018-04-03 01:44:20 -0700 Remove undo for INVALID_XACT_SLOT Previously we needed this undo to identify the exact xid which was there in the slot before reusing it. And, that was stored as uur_prevxid of the undo record. Now, we are already including uur_xid (the transaction id which has inserted the undo record), so we can find the actual slot xid just by traversing the undo chain. This will reduce the complexity of the code. And, this will also remove the limitation that one transaction is writing the undo in other slots. So, by removing this limitation we can rewind the insert location during rollback from the backend. Patch by Dilip Kumar. Review and defect fixes by Amit Kapila commit a354f4e7ad32176b57a47251fb8ac2366d0bff3c Author: ashu Date: 2018-04-03 13:44:33 +0530 Include undolog.h in undorecord.h to fix compilation error on MSVC. Ashutosh Sharma commit 44f5d0336add411302e9de94329ac7cc8808508c Author: Thomas Munro Date: 2018-03-20 00:26:36 +1300 Fix uur_next corruption by replacing global variables with UndoLogControl. Previously we could corrupt our uur_next chain when the undo log you're attached to changes. This broke some later commits. Get last_xact_start and prevlen in UndoLogControl instead of maintaining a local copy. They can be read directly from shmem without locking (but not written) by the backend that is currently attached. prev_txid remains as a global variable, and it needs to be cleared whenever the undo log changes. Perhaps it could be stored in UndoLogControl too, but that leads to some circularities. Thomas Munro, RM43421, reviewed by Rafia Sabih commit a0536401ad8f228ea7bffa08ccbb9f69d62fb4d1 Author: Thomas Munro Date: 2018-03-12 15:29:05 +1300 Remove UndoDiscard and move its state into UndoLogControl. Previously, a per-undo log object "UndoDiscard" was used to track the progress of the undo worker machinery. It was in a shared memory array of fixed size, which isn't going to work. Move that state into the UndoLogControl object for each undo log. Since this change requires undodiscard.c to have direct access to UndoLogControl objects and to iterate over them, create new functions UndoLogNext(), UndoLogGet() with extern linkage to do that. 
A better interface is probably needed -- to review later when we work out the type of access that multi-process undo worker infrastructure will need. Register the LWLock tranches. Thomas Munro, RM43420, reviewed by Dilip Kumar and Amit Kapila commit 709a41dc526b9eecda1d4891f758b3d268af2650 Author: Thomas Munro Date: 2018-03-08 16:01:03 +1300 Move UndoLogControl struct into header to prepare for wider use. A later commit will make use of it from other translation units so let's not define it in undolog.c. This requires hiding the definition when included from FRONTEND code. Perhaps this should have a new header of its own or some other reorganization, but for now let's just do it conditionally. Thomas Munro, RM43420 commit 8b74af923855cd72e1fca40b8822d546fc07a1ab Author: Kuntal Ghosh Date: 2018-03-22 11:07:35 +0530 For calculating latest removable xid, don't use raw xid from page This commit also modifies ZHeapTupleGetTransInfo to work with deleted item pointers when we have a buffer lock on the page. Reviewed by Amit Kapila commit 38345d269e4bb4908b9e48cb3a4a955d2579a0cf Author: Amit Khandekar Date: 2018-03-26 12:23:43 +0530 Support Tid Scan for tables with zheap storage. Use the zheap equivalent of heap_fetch() to fetch the next tid. To support WHERE CURRENT OF, have a zheap-equivalent of the heap_get_latest_tid() function. This function in turn uses the zheap visibility function. For following the ctid chain for non-in-place-updated tuples, have the MVCC visibility function pass back the ctid of the new tuple. Although zheap_get_latest_tid() accepts a snapshot, it may not work with snapshots other than an MVCC snapshot, because the corresponding visibility functions for those snapshots are not modified to return the new tuple ctid. This would be done in later commits, since it is not necessary for Tid Scan. Furthermore, the callers of heap_get_latest_tid() (e.g. currtid_byreloid) should be modified to call zheap_get_latest_tid() for zheap tables. This also would be done in later commits. Patch by Amit Khandekar, reviewed by Amit Kapila and Kuntal Ghosh. commit ec068b0903c8dc9788b06e0405f6d4c94ba24dcc Author: dilip kumar Date: 2018-03-25 22:14:13 -0700 Handling the rollback if error in commit path During the commit, if an error occurs before updating the status in the clog, then we need to track the undo pointers and apply the undo actions. Patch by Dilip Kumar, review by Amit Khandekar. commit 37994ff7060acc5317ccfd8c48c2ef440e4a3504 Author: Kuntal Ghosh Date: 2018-03-23 13:55:50 +0530 Fix README.md for better readability commit 698f1fb1bb35f423891c7ae64b4f6f7f1f1f2074 Author: Kuntal Ghosh Date: 2018-03-21 16:22:49 +0530 Make transaction slots per zheap page as compile-time parameter One can specify the same using --with-trans_slots_per_page= while configuring the postgres installation. Allowed values are 1, 2, 4, 8, 16, 31. By default, it is assigned to 4 slots per page. The changes required for Windows have been done by Ashutosh Sharma. Patch by me, reviewed by Amit Kapila and Ashutosh Sharma commit acbc6636929b21c6df12e4c69964a2c4035652c6 Author: Kuntal Ghosh Date: 2018-03-23 12:33:34 +0530 Bugfix in btree and hash delete items Sizes of xl_btree_delete and xl_hash_vacuum_one_page were not properly calculated. 
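As a side note on the "Make transaction slots per zheap page as compile-time parameter" commit above, here is a minimal sketch of how such a configure-time knob could be enforced at build time. The macro name ZHEAP_PAGE_TRANS_SLOTS is an assumption made for illustration only, not the actual symbol used by the patch.

    /* Illustrative sketch only: a configure-time transaction-slot count
     * wired into the build.  ZHEAP_PAGE_TRANS_SLOTS is a placeholder name. */
    #ifndef ZHEAP_PAGE_TRANS_SLOTS
    #define ZHEAP_PAGE_TRANS_SLOTS 4        /* default: 4 slots per page */
    #endif

    /* The commit above allows only 1, 2, 4, 8, 16 or 31 slots per page;
     * reject anything else at compile time. */
    #if ZHEAP_PAGE_TRANS_SLOTS != 1 && ZHEAP_PAGE_TRANS_SLOTS != 2 && \
        ZHEAP_PAGE_TRANS_SLOTS != 4 && ZHEAP_PAGE_TRANS_SLOTS != 8 && \
        ZHEAP_PAGE_TRANS_SLOTS != 16 && ZHEAP_PAGE_TRANS_SLOTS != 31
    #error "trans_slots_per_page must be 1, 2, 4, 8, 16 or 31"
    #endif

The configure option would simply define this macro, so an unsupported value fails the build instead of producing a broken page layout.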
commit a8bae1f3400d9e98e46268040521f517d556efe9 Author: Rafia Sabih Date: 2018-03-23 12:02:35 +0530 Handling rollbacks in prepared transactions Reviewed by Dilip Kumar and Amit Kapila commit 2fa48d89c290f309fa92b719d1070b5f329f4daa Author: ashu Date: 2018-03-23 10:58:40 +0530 For speculative insertion, store a dummy speculative token in the REDO function (zheap_xlog_insert()) so that, the size of undorecord in DO and REDO function matches with each other. Ashutosh Sharma, Reported by Neha Sharma. commit 574f673a1cf6d494b2ef0817962aee93bb0659a1 Author: Rafia Sabih Date: 2018-03-22 10:51:23 +0530 Fix for warnings introduced by commit 70f35f756834028c Reported by Amit Kapila commit 89cb28898369622578be42f7a131c024b7bb5f05 Author: Kuntal Ghosh Date: 2018-03-19 18:13:52 +0530 Skip unnecessary palloc for deleted item pointers Reviewed by Amit Kapila commit 4765f6669408f9e13e3cebd1c33a668d718b5b98 Author: Rafia Sabih Date: 2018-03-20 16:19:00 +0530 Bug fixes in rollback 1. Check for running transactions using top transaction instead of current transaction state in XactPerfromUndoActionsIfPending 2. Insert start undo record if we rewound the previous start undo record pointer while rollbacking the subtransaction containing the start undo record of the transaction. Issues reported by Neha Sharma, reviewed by Dilip Kumar, Amit Kapila commit 1830381aa7c44d0ba24fb1fc6fc94cab87165705 Author: dilip kumar Date: 2018-03-19 23:44:10 -0700 Removed unwanted function "SetUndoPageLSNs" commit 6a2b8d3f80829987908005941de9c9583577430e Author: ashu Date: 2018-03-19 16:36:54 +0530 Place the if-check used to decide whether the last tuple in a page can be inplace updated or not outside the check to know if a page needs to be pruned, otherwise, even if the tuple to be updated is a last tuple in a page, it won't go for inplace updated if the page pruning is not required. Ashutosh Sharma, Reported by Neha Sharma, Reviewed by Mithun CY. commit 6953e07fcf5bd914c31beb6adcc756ea34974940 Author: Kuntal Ghosh Date: 2018-03-16 14:28:39 +0530 Fix zheap_xlog_multi_insert to release the buffer properly In zheap_xlog_multi_insert, we should unlock and release the buffer even after restoring the block from backup image. Issue reported by Tushar Ahuja commit 350f34b92fcac83aa9447cb8b7685f1f5e3655e4 Author: Mithun CY Date: 2018-03-15 22:09:38 -0700 Bugfix advance latest RemovedXid only if tuple is dead. By Mithun C Y, Review Dilip Kumar commit e63e2365e1ccc78424032fafa8c3a5a012955c1e Author: Thomas Munro Date: 2018-03-08 11:50:50 +1300 Fix incorrect worker name displayed if undo launcher/worker dies. Previously bgw_type was uninitialized, causing at least some systems to show a value that confusingly just happened to be "logical replication launcher" to be displayed by the postmaster in error messages. commit 5541c5927873c34aecbf5bc82b7610fc8b035e94 Author: Kuntal Ghosh Date: 2018-03-14 14:17:48 +0530 Fix UndoFetchRecord for fetching UNDO_MULTI_INSERT For UNDO_MULTI_INSERT undorecords, we've to check whether our offset number falls in the offset range stored in the payload of undorecord. We've added a callback function in UndoFetchRecord to check whether an undorecord satisfies any blocknumber, offset number and xid. Reviewed by Amit Kapila commit 986cfc9f789e51fa796b2afdb8ee093c740172a3 Author: Mithun CY Date: 2018-03-13 22:12:30 -0700 Bug Fix in CopyTupleFromUndoRecord For certain undo record type we will not have tuple data associated with them. For Copying such tuple we need an input tuple. 
Above patch Asserts and address those issues. Patch By Mithun C Y Review By Amit Kapila commit 28078c9b70d8d6d7a9af9f5f5789b3a739af0ca9 Author: Kuntal Ghosh Date: 2018-03-07 16:03:23 +0530 In zheap_prepare_insert, reset visibility bits in infomask/infomask2 Dilip Kumar and Kuntal Ghosh, reviewed by Amit Kapila commit 1cd9b192209084bcea39c8603a1de54e8481e5ee Author: Kuntal Ghosh Date: 2018-03-08 14:26:14 +0530 Enable WAL consistency check for zheap We've added zheap_mask function to mask unimportant fields in zheap page before consistency check. Patch by me, reviewed by Mithun CY commit d1d0d9de831695f218f6390f0fea742ddce98b46 Author: Rafia Sabih Date: 2018-03-07 13:17:25 +0530 Fix compiler warnings commit ac3838b9ab274d2e18e7f4ec19b02ece442d043b Author: Amit Kapila Date: 2018-03-06 17:30:13 +0530 Support Insert .. On Conflict The design is similar to current heap such that we use the speculative token to detect conflicts. We store the speculative token in undo instead of in the tuple header (CTID) simply because zheap’s tuple header doesn’t have CTID. Additionally, we set a bit in tuple header to indicate speculative insertion. ZheapTupleSatisfiesDirty routine checks this bit and fetches a speculative token from undo. Amit Kapila and Ashutosh Sharma commit 96eeccc2d9c095fcea050e3c797b9934c9d1cfcc Author: Kuntal Ghosh Date: 2018-03-06 12:31:21 +0530 Fix MinZHeapTupleSize definition commit 7ba6c3c85e913faa02cb7507093abf648d05a3b2 Author: dilip Date: 2018-03-02 01:17:23 -0800 Commit (db9d0b2988c829b8e0599ebf087a10b98cb9690d) calculated the latest_urec_ptr in case of SUB_INPROGRESS but it should have done that for SUB_ABORT case. Pointed out by Rafia Sabih commit 89cb2efebb0b039786a1ffde7b48ff16b2ece128 Author: Kuntal Ghosh Date: 2018-03-01 13:47:14 +0530 Bug fix in zheap_xlog_multi_insert This fixes a type casting error. It also fixes the condition for changing the offset range. commit 0eb31eb589c1430eecf430a9c04f988b98914d8e Author: dilip Date: 2018-03-01 05:55:22 -0800 Bugfix in rollback Latest_urec_ptr is not calculated before executing undo actions, fixed the same. Patch by me, Reviewed by Amit Kapila. commit 8043baef73e74311eb399305e99e12bb8e86f868 Author: akapila16 Date: 2018-03-01 17:26:52 +0530 Create README.md This document is to help users understand how to use zheap and open issues. Amit Kapila, reviewed by Robert Haas commit 82320f78eec7eb44584d00728ed84e6a89d5d55b Author: ashu Date: 2018-03-01 17:12:41 +0530 Remove memory leak in various zheap related functions. Ashutosh Sharma, reviewed by Amit Kapila. commit 0618b93ac8f431de813f23d5c8a7b68812ef729c Author: ashu Date: 2018-03-01 17:02:12 +0530 Initialize the new zheap page allocated during update operation on zheap tables correctly in zheap_xlog_update(). Ashutosh Sharma, reviewed by Amit Kapila, reported by Tushar Ahuja. commit 729d4cb6b09c98a5ecc57a670f55b468ab598d45 Author: Amit Kapila Date: 2018-03-01 16:34:44 +0530 Design of zheap This readme covers overall design of zheap. This is to help developers and or users to understand the zheap. Later, we might split this into multiple README's. 
Amit Kapila, Robert Haas and Dilip Kumar commit 17ae6f76fb85a1036111eb62f6bf0a2375531306 Author: Kuntal Ghosh Date: 2018-02-28 11:27:23 +0530 Fix memory leak in zheap_multi_insert wal replay Reported by Ashutosh Sharma commit 0d03216692b1d38b646126fd6300de1e619bc2db Author: ashu Date: 2018-02-27 16:30:11 +0530 Allow zheap_lock_tuple() to release lock on a zheap page if the tuple may be updated but the desired lock on a tuple is already acquired. Patch by me, as per the suggestions from Amit Kapila. commit 37911c71c174dd3b95befe2d5f7b44d3dc421f62 Author: dilip Date: 2018-02-25 20:58:02 -0800 Currently, on standby, we don't have DiscardUndoInfo like we have on the master side. So, on master before accessing any undo buffer we hold lock on DiscardUndoInfo in shared mode and undoworker hold that lock in exclusive mode. But on standby side we discard undo directly by WAL. So even though we check that undo is not discarded, but by the time we try to access the buffer undo may get discarded by the wal. Patch fixes the problem by checking the standby recovery conflict with other snapshot. Patch by Dilip Kumar, Reviewed by Ashutosh Sharma and Amit Kapila commit c6457f72c4e7695680118902f4c662bedba3a965 Author: Mithun CY Date: 2018-02-21 01:34:42 -0800 Remove unrelated code of previous commit In commit 5ba1e2f4c605f63a6deca278f39c9bfa05afb239 we have committed some unrelated code this patch removes same. By Mithun C Y commit c04a4c83e19f0e9e10cb6aab0ab12ff8a241880f Author: Kuntal Ghosh Date: 2018-02-19 21:18:29 +0530 Fix zheap insert options for COPY Patch by me with help from Dilip Kumar commit 17aaf8247ae3ef5763f5a164528f5c9f38dbe345 Author: Amit Kapila Date: 2018-02-20 16:37:37 +0530 Remove redundant header inclusions. commit 82681700ad5bd7b976438138d1c970eb6c55720a Author: Mithun CY Date: 2018-02-20 02:26:27 -0800 Check ItemIdIsDeleted on reacquire of BufferLocks In some cases after reacquiring the BufferLocks ItemId's might have been pruned so check if it is deleted before accessing ItemId's. Mithun C Y, reviewed by Amit Kapila commit cf4614bf7a7b294eae264d4123d76584d66f0384 Author: Mithun CY Date: 2018-02-16 05:06:28 -0800 Bugfix in ItemId status check. We have used tuple header flag ZHEAP_INVALID_XACT_SLOT instead of itemId flag ITEMID_XACT_INVALID while checking its status this caused undefined behavior. And, fixed a condition in zheap_search_buffer. Patch By Mithun C Y reviewed by Amit Kapila commit 418c2e1ac7c46497b3d994bb3db5ea6850d66c50 Author: dilip Date: 2018-02-16 02:12:37 -0800 bugfix in FetchTransInfoFromUndo The actual condition to break the while was if undo type is UNDO_INVALID_XACT_SLOT and undo_xid is input xid. This was broken in some of the previous commit. Patch by me reviewed by Amit Kapila commit 051a7a2a3849a1140627de7414f79a868a292c38 Author: Beena Emerson Date: 2018-02-16 14:59:51 +0530 Provide zheap support in check_default_allows_bounds For a zheap partitioned tables, check if the default partition has rows that meet the constraints of the new partition. Beena Emerson, reviewed by Rafia Sabih and Amit Kapila commit a9c5724aca95877300414606bec648e8f512e045 Author: Rafia Sabih Date: 2018-02-13 16:02:37 +0530 To add the storage_engine option in conf.sample file. commit d61ff363ce28733fb518ccfa5fe9d32f71b7994d Author: dilip Date: 2018-02-12 15:29:33 +0530 wal log oldestxid having undo Currently oldestxid having undo is not durable and value is also not sent to standby. 
As part of this patch, this value is included in the checkpoint record so that it remains valid after a server restart. Patch by Dilip Kumar. Reviewed by Kuntal Ghosh and Amit Kapila commit 16fa749cf1c025efb744f1ebd246690a541fe0ac Author: Mithun CY Date: 2018-02-12 02:02:57 -0800 Set page prunable on Insert undo, inplace update with reduced lengths. Patch by Mithun C Y Reviewed by Amit Kapila. commit 50f9fb43a3c302cf4fc206227786fba26ce72167 Author: Thomas Munro Date: 2018-02-09 23:52:42 +1300 Create missing undo log segments during recovery. During recovery, we might discover that the segment files that existed at the time of the checkpoint don't exist, because they'll be deleted by later WAL traffic. We'll create zero-filled files to avoid errors, and trust that the contents of the files will never be needed, because later WAL records discard them. commit bc2ed1a1cefcb56749b3de382c6d6056e7837bd4 Author: Thomas Munro Date: 2018-02-09 23:10:22 +1300 Make undolog meta-data checkpoints consistent. The earlier prototype code captured undo log meta-data at an arbitrary point in time somewhere after the redo point. That worked only for clean shutdowns, at which point it was consistent and correct. To do the job properly, this commit keeps track of two copies of each undo log's meta-data in memory: the current meta-data, and a snapshot as of the last checkpoint. In order to maintain the checkpoint snapshot, every operation that modifies an undo log's meta-data must check if we are now on the other side of a redo point. Since the shared memory access and lock contention would be expensive, we only actually do that while a checkpoint is in progress, from a moment just before the redo point is chosen up until we discover that we are now on the other side of a redo point, which should ideally mean that it happens only once. RM43038, Thomas Munro commit df81f34c6d62ec7601314d5561b8bf1be1596f53 Author: Mithun CY Date: 2018-02-08 23:37:10 -0800 Fix crashes in page access after page pruning. The issue is that in many places, while fetching or updating a tuple, we failed to check if the tuple has been deleted/updated and pruned after we have released the buffer lock. This caused invalid access of pruned tuples from the page. Now we check if the itemid is deleted before accessing the tuple data in the page. commit b26c26c7e0651cbf4cd5fbfe865ee2226b776ee0 Author: Thomas Munro Date: 2018-02-09 10:29:42 +1300 Fix an ordering bug when discarding undo buffers. We need to forget about undo buffers before we remove or recycle undo segment files, since otherwise a concurrent backend might try to write a buffer in order to evict it and discover that the file is gone. We also need the same logic during recovery, so let's refactor that code into a function called from both UndoLogDiscard() and undolog_xlog_discard(). Thomas Munro, based on report from Neha Sharma and diagnosis by Kuntal Ghosh commit 25c38e52afb9d4252117fcb4bc575173b43f7492 Author: ashu Date: 2018-02-08 17:57:44 +0530 Validate CHECK constraints on zheap relations. Ashutosh Sharma, reviewed by Dilip Kumar and Amit Kapila. commit 1ef3bcd30c337456386ccfbafff1a913f111782a Author: ashu Date: 2018-02-08 17:47:55 +0530 Implement Table Rewrite performed during execution of ALTER TABLE command in zheap. Ashutosh Sharma, reviewed by Dilip Kumar and Amit Kapila. 
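To make the "Make undolog meta-data checkpoints consistent" commit above more concrete, here is a minimal sketch of the two-copies-of-meta-data idea. All structure, field and function names below are assumptions invented for the example; they are not the actual undolog code.

    /* Illustrative sketch of keeping a live copy and a checkpoint snapshot of
     * an undo log's meta-data.  Names are placeholders, not real symbols. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct UndoLogMetaDataSketch
    {
        uint64_t insert;        /* stand-ins for the real meta-data fields */
        uint64_t discard;
    } UndoLogMetaDataSketch;

    typedef struct UndoLogControlSketch
    {
        UndoLogMetaDataSketch meta;             /* current meta-data */
        UndoLogMetaDataSketch checkpoint_meta;  /* snapshot as of the redo point */
        bool checkpoint_meta_valid;
    } UndoLogControlSketch;

    /* Called (under the log's lock) before each meta-data modification, but
     * only while a checkpoint is in progress: the first time we notice that
     * we have crossed the redo point, save the current meta-data so that the
     * checkpointer writes that snapshot rather than the live, moving copy. */
    static void
    maybe_capture_checkpoint_snapshot(UndoLogControlSketch *log,
                                      uint64_t current_insert_lsn,
                                      uint64_t redo_point_lsn)
    {
        if (!log->checkpoint_meta_valid && current_insert_lsn >= redo_point_lsn)
        {
            log->checkpoint_meta = log->meta;
            log->checkpoint_meta_valid = true;
        }
    }

After the checkpoint completes, the snapshot would be discarded and the check skipped again until the next checkpoint begins, which is why the commit says the capture should ideally happen only once per log per checkpoint.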
commit 652911807ff33f35a7c297fdde709fbfc30f43be Author: ashu Date: 2018-02-08 17:44:50 +0530 Bugfixes in zheap_to_heap and heap_to_zheap APIs Allow zheap_to_heap and heap_to_zheap to allocate the values and nulls array based on the number of attributes specified in tuple descriptor rather than tuple header. Ashutosh Sharma, reviewed by Dilip Kumar and Amit Kapila. commit fc6d0379c922e8c7cf797ec7c7247eb845c6ddb4 Author: ashu Date: 2018-02-08 17:43:55 +0530 Restrict clustering of zheap tables Ashutosh Sharma, reviewed by Dilip Kumar and Amit Kapila commit fe470e36e6ef990861d210cde6fc659657554ac3 Author: Thomas Munro Date: 2018-02-07 21:56:43 +1300 Don't try to use DSM segments for undo logs in single-user mode. Thomas Munro, per bug report from Kuntal Ghosh commit 5f10daf1c6a1a91b4311f0a578fb9f86eb1fd5cf Author: Amit Kapila Date: 2018-02-06 14:13:29 +0530 Avoid stack overflow in visibility routines Till now, we were traversing the undo chain in a recursive way which could easily lead to stack overflow for very large transactions. Traverse the undo chains in a non-recursive way. Amit Kapila, reviewed by Dilip Kumar commit 2b3c4b122d756e4645c69709ed069eb754705a07 Author: Amit Kapila Date: 2018-02-05 18:28:47 +0530 Implementation of 'For Update and For Share' tuple lock modes This requires tuple to be locked in Exclusive or Shared mode and it will conflict update, delete and other modes of locks. Currently, multiple lockers for shared mode are not supported, that can be done as a separate patch. For other types of lock modes, user will get error "unsupported lock mode". Amit Kapila and Kuntal Ghosh commit 57a987ada64779896e4c5653d7e65a628f51233c Author: Mithun CY Date: 2018-02-05 03:28:56 -0800 Bug fix in bitmap scan of zheap Consider zheap pages when calculating MAX_TUPLES_PER_PAGE Patch by Mithun C Y Review by Amit Kapila commit 35c926be66fb6d991335e93ded5ae59e1919bf2b Author: dilip Date: 2018-02-02 13:37:59 +0530 Warning fix commit 1d455e7504fd2a662c0624f36a42f1b303afdb20 Author: dilip Date: 2018-02-02 11:14:50 +0530 Minor check fix while recovery from worker Reviewed by Rafia Sabih commit 167901ea80db5e86c64e68630ac5c1082acf464b Author: dilip Date: 2018-02-02 09:56:44 +0530 Bugfix in wal recovery Flag, is_first_rec is not reset after allocating the first undolog for the transaction and it was considering transaction header for subsequent allocation for the transaction. Patch by Dilip Kumar Reviewed by Amit Kapila. commit 1ddec06687a1c662d898c865d0a51fe85587040e Author: Rafia Sabih Date: 2018-02-01 17:53:05 +0530 Fix for Xmax value of unmodified zheap tuples Now, output InvalidTransactionId as the value of xmax for the zheap tuples which are unmodified. Previously, it was set to FrozenTransactionId which was incoherent with the behaviour of heap. Reported and reviewed by Ashutosh Sharma commit d8c3e25148045a50a0f0a767cf803dad588e42d1 Author: Rafia Sabih Date: 2018-01-31 18:11:18 +0530 Bug fix for undo actions If backend tries to apply undo actions for a record which is already discarded by undo worker, then exit quietly. Reported by Neha Sharma commit 13ecdf57369df7a70087ce1a814fd5586dc420cb Author: Kuntal Ghosh Date: 2018-01-29 15:18:32 +0530 During recovery, set correct urecptr for non-inplace updates commit 7f8c4546c7456a233b7317a08d730c797e23032f Author: Thomas Munro Date: 2018-01-26 15:18:35 +1300 Fix recovery of undo logs after a standby crash. A standby should never delete undo log meta data files that are referenced by its own control file, or it would fail to start up. 
RM43132, analysis and patch by Ashutosh Sharma, tweaked by me commit 83159870bccd343672cb228580ec5f3c779c6b34 Author: Rafia Sabih Date: 2018-01-25 18:55:09 +0530 Recovery of zheap relations when rollbacks are pending Now, the undo actions are performed at the time of restart for all those transactions that were in progress at the time of system crash or the last time when system was up. This ensures proper recovery of transactions involving zheap relations. A caveat to note here is that undo worker is initialised with the connection to default database 'postgres'. Reviewed by Dilip Kumar and Amit Kapila commit c308be7d90a8a8ffb068203303cf0c1c99ff4fb2 Author: Kuntal Ghosh Date: 2018-01-25 11:00:10 +0530 Fix definition of SizeOfZHeapMultiInsert Reported by Amit Kapila commit 34619026014c203f2bd9a2d151ffae2778140ea4 Author: Kuntal Ghosh Date: 2018-01-24 17:43:10 +0530 Allow ANALYZE on zheap table using VACUUM ANALYZE command Also, we forward the warning to LOG to avoid regression failure of some test cases. commit d5454ef8d5e0556a1fceff48dcf13d55ecc8cffa Author: Kuntal Ghosh Date: 2018-01-24 13:47:56 +0530 Assign table oid while constructing heap tuple from zheap tuple commit 81e55b488c707be50437484d21a1e92b553b291d Author: Amit Kapila Date: 2018-01-24 12:39:39 +0530 Move zheap_insert's buffer modifications inside critical section Previously, it was not done because we thought we need to add the tuple in page before forming the undo record as undo record requires blockid and offset which we can only get after adding the tuple. However, on closer inspection, actually, we only need blockid which we can get from the buffer. Patch by me, reviewed by Rafia Sabih commit 9351f042a09197bcae827889e3b78ddea2e7b68d Author: Kuntal Ghosh Date: 2018-01-02 12:07:15 +0530 Concurrent index creation on zheap tables Add the ability to create indexes 'concurrently' on zheap relations, that is, without blocking concurrent writes to the table. Patch by me, Reviewed and tested by Ashutosh Sharma, Amit Kapila commit 4780ba565205c638a890f053908157d128ea2470 Author: Beena Emerson Date: 2018-01-23 14:44:20 +0530 Support storage_engine option for partitioned relations Allow partitioned table to have the storage_engine option and throw error when user tries to create a partition with storage_engine different from the parent. Reviewed by Rafia Sabih, Ashutosh Sharma commit dec2957cbddadefda7b49fbf9c3a0d920d01a08b Author: Rafia Sabih Date: 2018-01-23 12:28:41 +0530 Bug fix for GiST indexes on ZHeap relations. commit 9444dc2d14dc3f11d4a79156577475afe6e3f393 Author: Kuntal Ghosh Date: 2018-01-23 12:00:34 +0530 Fix warning in nodeSamplescan Reported by Amit Kapila commit bc3ff0e94567b0e85de7830113e84468fa003ed8 Author: Amit Kapila Date: 2018-01-23 11:11:13 +0530 Add comments to elaborate why we always use Top Transaction Id in zheap. commit a1b34455e7e1ca3b48373c192c25773e84b33b95 Author: Mithun CY Date: 2018-01-22 02:46:07 -0800 Bug fix in space reuse Fix Null pointer access. Patch by Mithun C Y reported by Ashutosh Sharma. 
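The "Move zheap_insert's buffer modifications inside critical section" commit above refers to the standard backend ordering: all page changes, the MarkBufferDirty() call and the WAL insertion happen inside one critical section, and the page LSN is set from the returned record pointer. A rough sketch under that assumption follows; it is not the actual zheap_insert() code, and the rmgr/record identifiers are placeholders.

    /* Generic page-modification/WAL pattern, not the real zheap_insert(). */
    #include "postgres.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "utils/rel.h"

    #define SKETCH_RMGR_ID   0   /* placeholder: the real code uses the zheap rmgr */
    #define SKETCH_XLOG_INFO 0   /* placeholder: the real code uses its insert opcode */

    static void
    insert_sketch(Relation relation, Buffer buffer)
    {
        START_CRIT_SECTION();

        /* ... place the tuple on the page; only the block number, available
         * from the buffer itself, is needed to prepare the undo record ... */

        MarkBufferDirty(buffer);

        if (RelationNeedsWAL(relation))
        {
            XLogRecPtr  recptr;

            XLogBeginInsert();
            XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
            /* ... register the record data for the insert ... */
            recptr = XLogInsert(SKETCH_RMGR_ID, SKETCH_XLOG_INFO);
            PageSetLSN(BufferGetPage(buffer), recptr);
        }

        END_CRIT_SECTION();
    }
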
commit 34be2349dca4d238ec3fe99fbe4bbc299ca0c615 Author: Kuntal Ghosh Date: 2018-01-22 12:28:25 +0530 Fix definition of SizeOfZHeapMultiInsert Reported by Amit Kapila commit 0257c1a99b1c1958d6c1ed535d219ab7ad4e73f6 Author: Rafia Sabih Date: 2018-01-19 15:55:49 +0530 Add zheap relevant functions in DefineQueryRewrite Reviewed by Ashutosh Sharma commit 38cef475cb6c6b74765be07b2e1f419661c002f6 Author: dilip Date: 2018-01-19 13:18:40 +0530 Bug fix in space reuse The item is getting accessed without checking whether it's deleted or not. Patch by Dilip Kumar reviewed by Mithun C.Y. commit ce9e110fb857e050aea92c5f1affbf7b7246d6b1 Author: ashu Date: 2018-01-19 11:54:19 +0530 Bugfix in zheap_update(). pfree memory allocated for payload bytes during non-inplace update in zheap_update(). Ashutosh Sharma, With some help from Dilip and Kuntal. commit 079443ce69d8425b42bbd217579753f6ce8d8ee6 Author: dilip Date: 2018-01-17 13:59:30 +0530 Perform undo action in case of error. If there is some error while executing an SQL statement, the patch takes care of applying the undo actions required for rolling back the work done. Patch by Dilip Kumar Reviewed by Rafia Sabih and Amit Kapila commit 9229daa6df42bc3eb60de0b4ef0728ea3a67a011 Author: dilip Date: 2018-01-18 11:08:17 +0530 Block TidScan for the zheap Currently, tidscan is not implemented for zheap, so give an error. commit 878ea5ed9c76ce7481a86d636728cc3e7b8e288a Author: Kuntal Ghosh Date: 2018-01-17 15:47:05 +0530 Bugfix in IndexBuildZHeapRangeScan We should avoid freeing the zheap tuple from the slot in IndexBuildZHeapRangeScan when pageatatime scan mode is used for scanning the underlying relation. Reported by Ashutosh Sharma commit 4ce7628b00bec6b0058ec9aeb4eade82075fd0b5 Author: Rafia Sabih Date: 2018-01-17 17:19:29 +0530 Extend the support of exclusion constraints to zheap relations Reviewed by Ashutosh Sharma commit a842d28fcb9de980a290268d8c557e1749bb232b Author: dilip Date: 2018-01-17 13:58:37 +0530 Bug fix in rewind If we are under a subtransaction then just reuse one slot, because during the rollback of the subtransaction we will rewind the undo insert location and the undo written for invalidating the slot will be overwritten. So, it is better to invalidate only one slot which our transaction is going to use. Patch by Dilip Kumar reviewed by Amit Kapila commit e5e03186920b5a79025956258b17fe87b3d50699 Author: Kuntal Ghosh Date: 2018-01-11 17:48:06 +0530 Fix mask value used in ZHeapTupleHeaderGetNatts ZHeapTupleHeaderGetNatts should use ZHEAP_NATTS_MASK to fetch the number of attributes from t_infomask2. commit 4c6d4c685f52b0a7d998097734d11b5387ee05c3 Author: Kuntal Ghosh Date: 2018-01-11 11:49:25 +0530 Implement TableSampleScan for zheap Reviewed by Mithun CY commit 68bee6f61f146ef71630d3809482a1944ef40ddd Author: Mithun CY Date: 2018-01-10 03:26:41 -0800 Fix cursor fetch backward scan for zheap Reviewed by Kuntal Ghosh commit ed720bd223e3ce9842681ccacd4de84b052c619c Author: Mithun CY Date: 2018-01-10 03:01:21 -0800 Get rid of KeyTest in zheap scan zheap does not support catalog tables, hence there is no need for keytest as in heap. So adding asserts to acknowledge the same. 
Reviewed by Kuntal Ghosh commit cb19200ec6837fa2fb1aef37ad045db5661c892e Author: Kuntal Ghosh Date: 2018-01-10 15:31:23 +0530 Fix brin summarize for zheap Reported by Mithun CY commit 9d6ee4a45df4911854ac32d205eeaab648ce4fb0 Author: Kuntal Ghosh Date: 2018-01-10 11:01:34 +0530 Implement bulk insert strategy for zheap_insert Reported by Ashutosh Sharma commit 8253166c9f8100e6ce75049004787215f5befe37 Author: Kuntal Ghosh Date: 2018-01-08 17:43:42 +0530 Fix sysattributes fetching for the ZHeap Make sysattributes fetching work with 64-bit transaction ids. commit e9a4184a96df7f96b53f1a849cdb75cae6b4ee2a Author: Kuntal Ghosh Date: 2018-01-09 17:27:39 +0530 Handle COPY FROM for non-multi-insert mode commit 78d680eb0deef07ead5b93c70168284a73ea30ec Author: Kuntal Ghosh Date: 2018-01-09 16:11:59 +0530 For triggers, tuple table slot shouldn't try to free memory itself commit 250fef496972efec042ec83a490d377b1621f4a1 Author: Amit Kapila Date: 2018-01-08 14:00:45 +0530 Support Epoch in Zheap pages The idea is to make the change related to 64-bit transaction ids only for the zheap pages whereas heap pages still operate with 32-bit transaction ids. We will need wraparound and freeze vacuums for metadata (system tables) stored in heap, but not for data stored in zheap tables. The way to make 64-bit transaction ids in zheap is to store the epoch along with the transaction id in each transaction slot (which will make each transaction slot 16 bytes: 4 bytes transaction id, 4 bytes epoch, 8 bytes undo pointer). Now, if we somehow ensure that there is no undo for any transaction whose age is 2-billion transactions (the wraparound limit), then we can easily make out the visibility using the current epoch and oldest_xid_having_undo. The way to achieve it is to stop the system if it reaches such a situation. We piggybacked on the existing wraparound machinery to raise different warning messages once the system reaches that stage. If the epoch+xid in the page is less than oldestXidWithEpochHavingUndo then it is all visible; otherwise, the transaction will belong to the current epoch and all the current rules of the current transaction system will work. The reason for relying on the 2-billion transaction age limit is that the current system (transaction-related functions like TransactionIdPrecedes) relies on that and we don't want to change it. We won't need any vacuum for freezing the transaction ids or wraparound vacuums for zheap pages after this commit. Amit Kapila, with some contribution by Dilip Kumar, reviewed and verified by Kuntal Ghosh and Dilip Kumar. commit 0cc6dbc97f9320a21cad0d7deb86a9867b51e97f Author: dilip Date: 2018-01-08 11:19:01 +0530 Bugfix in rollback to savepoint When rollback (partial or complete) is done from the backend then rewind the insert location of the undo log. commit 455bb1d00b8258f0e76b88cbae9705e308c8900a Author: dilip Date: 2018-01-08 10:22:20 +0530 Bugfix in parallel query Added function for parallel begin scan on the zheap table. Patch by me, reported by Mithun C.Y, reviewed by Amit Kapila commit e9d75ef1410647ef223ebf52528d8010ede6f118 Author: Kuntal Ghosh Date: 2017-12-29 12:11:00 +0530 Implement ANALYZE on zheap tables Reviewed by Amit Kapila commit a74fac0c485211a102ffe4208a9549f0f5f5de52 Author: Mithun CY Date: 2018-01-04 19:16:00 -0800 Bug fix for zheap page pruning When pruning we should ignore deleted itemids, as there is no tuple space associated with them. Also, if a tuple is non-inplace updated, then consider the previous space associated with it for pruning. 
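To make the "Support Epoch in Zheap pages" description above concrete, here is a small stand-alone sketch of the 16-byte transaction slot layout and the epoch+xid comparison it describes. The type and function names are invented for the example and do not match the actual zheap code.

    /* Sketch only: layout and 64-bit comparison described in the commit above. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct TransSlotSketch
    {
        uint32_t xid;        /* 4-byte transaction id */
        uint32_t epoch;      /* 4-byte epoch */
        uint64_t urec_ptr;   /* 8-byte undo record pointer */
    } TransSlotSketch;       /* 16 bytes per slot */

    /* Combine epoch and xid into a 64-bit value for the visibility shortcut:
     * anything older than oldest_xid_with_epoch_having_undo is all-visible
     * without consulting undo. */
    static inline uint64_t
    full_xid(uint32_t epoch, uint32_t xid)
    {
        return ((uint64_t) epoch << 32) | xid;
    }

    static inline bool
    slot_is_all_visible(const TransSlotSketch *slot,
                        uint64_t oldest_xid_with_epoch_having_undo)
    {
        return full_xid(slot->epoch, slot->xid) < oldest_xid_with_epoch_having_undo;
    }
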
commit 3d4a58d41d19cb296633801169c31eeb7caf4c23 Author: Rafia Sabih Date: 2018-01-04 17:51:48 +0530 Bug fixes for storage_engine option Ignore the storage_engine option for views or partitioned relations. Avoid adding storage_engine option multiple times when it's given in conf file as well as in CREATE statements. Reported by Ashutosh Sharma Reviewed by Amit Kapila and Ashutosh Sharma commit 34dfda185019a93abdb77894558e37df9cd2b9b1 Author: Kuntal Ghosh Date: 2018-01-02 14:02:53 +0530 Implement SatisfiesNonVacuumable for zheap Patch by me, Reviewed by Amit Kapila commit 63e4455ec4009eae16fc2d4bbcc8c14767f0c5dd Author: Amit Kapila Date: 2017-12-22 13:57:59 +0530 Remove redundant function definition added by commit 8aeb255f. commit fb16b15dd108f1769587a5f81d57ac5e9a9478ab Author: Amit Kapila Date: 2017-12-22 13:51:13 +0530 Support space reuse within a page Allow reuse of space for transactions that deletes or non-in-place updates the tuples or updates the tuples to shorter tuples. It has the capability to reclaim the space for deleted tuples as soon as the transaction that has performed the operation is committed. There is some difference between the way space is reclaimed for transactions that are committed and all-visible vs. the transactions that are committed but still not all-visible. In the former case, we can just indicate in line pointer that the corresponding item is dead whereas for later we need the capability to fetch the prior version of tuple for transactions to which delete is not visible. As of now, we have copied the transaction slot information in line pointer so that we can easily reach prior version of tuple. We try to prune the page for non-in-place updates when there is insufficient space on the page. The other times where we can try to prune the page could be while Inserting if there is insufficient space on a page or when buffer is being evicted or read, but we can do that as a separate patch. Mithun C Y and Amit Kapila. commit a1bce7facfd467c804ccab660f9befb3df8eae1f Author: dilip Date: 2017-12-22 10:55:30 +0530 Avoid using UndoDiscardInfo if it is not initialized If UndoDiscardInfo is not yet initialized, use UndoLogIsDiscarded to check whether an undo record is discarded. It also initialize UndoDiscardInfo for the same log. Reported by Mithun. 
Patch by me commit 4baf44c765524baf9833308af1d3807987f6889e Author: dilip Date: 2017-12-18 17:47:48 +0530 Fix warnings in zheap code Reported by Thomas Munro commit 92c2b94de69c49504f3b7d79d676b364870ff279 Author: Rafia Sabih Date: 2017-12-12 14:53:54 +0530 Bug fix for prevlen of the first record of the log Reviewed by Dilip Kumar commit 6616d18f3080f2d80b21c891d2376ffe58b51614 Author: Kuntal Ghosh Date: 2017-12-08 18:23:09 +0530 Restrict transition table creation for zheap commit 97119d13a22de3759574c30ebc442cf7e00e3a49 Author: Kuntal Ghosh Date: 2017-12-08 16:55:25 +0530 Fix defect in applying undo action in case of abort commit c8287a5eaac5bc934645e4cc9b51da12d8f3f71e Author: Kuntal Ghosh Date: 2017-12-08 18:25:57 +0530 Fix dirty-snapshot behaviour for tuples inserted by aborted transactions Reviewed by Amit Kapila commit 1369b1c6dc27d7eeac63791725f9e010ee09e495 Author: Kuntal Ghosh Date: 2017-12-07 17:50:30 +0530 Restrict foreign key triggers on zheap tables commit 1c1befcf0e8a08ff1f5cdd49110a0f4a0140afbc Author: Kuntal Ghosh Date: 2017-12-07 17:49:51 +0530 Restrict INSERT ON CONFLICT on zheap tables commit e135508ebd7292042d8d12b820198025e569b9b8 Author: dilip Date: 2017-12-07 17:40:47 +0530 Fix issue in PageFreezeSlot Memory was freed without checking for NULL; fixed the same. Reported by Kuntal Ghosh commit 1ba3aebd1b0d1789c99ba816237b487f950e40ff Author: dilip Date: 2017-12-07 16:07:34 +0530 Fix issue in regression test Avoid executing undo action when transaction is not in progress Reported by Kuntal Ghosh commit 787fa1ff9697dcf872546812ad606a7e41950287 Author: Kuntal Ghosh Date: 2017-12-06 15:46:37 +0530 During recovery of inplace-update, fix old tuple header Reviewed by Amit Kapila commit 2eec4974f8da5a7db17524250fd466fbe8995dc7 Author: Kuntal Ghosh Date: 2017-12-06 15:43:52 +0530 Update tuple length during inplace-update During inplace-update, the tuple should be updated to the new tuple. Reviewed by Amit Kapila commit 03014036363d2c494ee80f95d64f5abd5bdae48a Author: Kuntal Ghosh Date: 2017-12-06 15:26:15 +0530 Store block number before preparing the undo record In undorecord, we should store the block number before preparing the undo record. Otherwise, the expected size of the undo record won't consider the size for including the block info. Reviewed by Amit Kapila commit e7fb026e79d9b724edc54d1b78d2725651d77757 Author: dilip Date: 2017-12-07 11:19:04 +0530 Fix defect in applying undo action in case of abort Reviewed by Kuntal Ghosh commit c646ff19e7fdaddba0b8757ce14e8ab0e824329b Author: dilip Date: 2017-12-06 10:35:03 +0530 Fix Typo Reported by Rafia commit dac6bdc623bd7b8502dd6642f784024a50abfb1d Author: dilip Date: 2017-12-05 18:55:41 +0530 Fix warnings commit 59eb32500326cfb4db8b1336d4ab9d97aa6741f4 Author: Amit Kapila Date: 2017-12-05 18:22:01 +0530 Rewind the undo pointer after applying undo actions After applying all the undo actions for a page, we clear the transaction slot on a page if the undo chain for the block is complete; otherwise, rewind the undo pointer to the last record for that block that precedes the last undo record for which the action is replayed. This will help us in knowing whether the undo action is already replayed during recovery. 
Patch by Amit Kapila, reviewed by Dilip Kumar commit 43c6e1c541a4e9d6cc310c1e4b46e514ddfc9f98 Author: Kuntal Ghosh Date: 2017-12-05 14:30:26 +0530 UndoDiscardInfo should be updated under an exclusive lock on the mutex Reviewed by Dilip Kumar commit 41c6d9d337e2b09448ff324991c639793e61889e Author: Kuntal Ghosh Date: 2017-12-05 14:28:00 +0530 UndoGetOneRecord should be called under a shared lock on UndoDiscardInfo Reviewed by Dilip Kumar commit 49b755159affcc0f0e23f3591ddc46d2a9a1e7fc Author: dilip Date: 2017-12-01 15:27:16 +0530 Remove generating the subtransaction id in zheap Currently in heap we need to maintain the subxact id so that we can identify the changes done within a subxact and properly identify the visible tuple if the subxact status is not the same as that of the main transaction. But in zheap we have undo, and with the help of that we can immediately revert the effect of the subxact, so we never need to track the status of the subxact. This will also help in maintaining the limited slots inside the zheap page, because if we assign a separate xid to a subxact then we may need to provide an extra slot for each subxact. Patch by Dilip Kumar reviewed by Amit Kapila commit c0c59ddea242f191eaeb55ddb2598ec31e77b96a Author: Kuntal Ghosh Date: 2017-12-04 17:50:36 +0530 During slot-reuse replay, use physical tuple if not included in wal record If the tuple is not included in the wal record, access the physical tuple to fetch the corresponding slot number for the same tuple. Reported by me, patch by Dilip Kumar. commit a83904b8bb7e5aa81e8906d2daefed70afdacd80 Author: Mithun CY Date: 2017-12-04 20:48:01 -0800 Zheap tuples are locally allocated hence set tts_shouldFree as true. Reviewed by Kuntal commit 5d561693d232b766ed618f48498b9a667cc1ac56 Author: dilip Date: 2017-12-04 17:36:45 +0530 Fix issue in prevlen Return without releasing the lock Reported by Kuntal Ghosh commit b5071ac66cf206801b0de4f0cc9de4390bb59cf8 Author: Kuntal Ghosh Date: 2017-12-04 14:34:27 +0530 Avoid using UndoDiscardInfo just after undo shmem initialization If UndoDiscardInfo is not yet initialized, use UndoLogIsDiscarded to check whether an undo record is discarded. It also initializes UndoDiscardInfo for the same log. Reported by Neha Sharma, patch by me, reviewed by Dilip Kumar commit f916f158f39676de2ab755118924518e3c6f45b1 Author: dilip Date: 2017-11-28 09:16:00 +0530 Make prevlen crash safe Currently the transaction's previous record length is not crash safe, but there is a possibility that a transaction can be spread across checkpoints; in such a case, while preparing the first undo record, we do need to have the length of the previous record which was inserted before the checkpoint. For fixing the same, we are maintaining this value in the undo meta and WAL logging it along with other undo meta. As part of this patch we also removed one unused structure member in the undo meta. 
Patch by Dilip Kumar Reviewed by Amit Kapila commit f20d78fe0d048ba600162e0f5d794399e0dbee13 Author: Kuntal Ghosh Date: 2017-11-22 16:27:25 +0530 Write WAL for zheap_multi_insert Reviewed by Dilip Kumar, Amit Kapila commit ed0f47495ef7de6c0e0e8940b6ac630704b0bf77 Author: Kuntal Ghosh Date: 2017-11-22 16:08:25 +0530 Set uur_type for multi_insert at begining Reviewed by Amit Kapila, Dilip Kumar commit e13cb875476963959cea79edd114fa8d701909e0 Author: Kuntal Ghosh Date: 2017-11-14 14:49:30 +0530 Fix assertion failure in zheap_delete Reported by Neha Sharma commit ee06e282b6bb1c64d6c0820fb2c690778af71f71 Author: Kuntal Ghosh Date: 2017-11-21 20:32:49 +0530 Derive latestRemovedXid for hash deletes by reading zheap pages Reviewed by Amit Kapila commit bf78457e2e7706338d00f60fbe487b315b626ff5 Author: Kuntal Ghosh Date: 2017-11-13 21:02:12 +0530 Derive latestRemovedXid for btree deletes by reading zheap pages Reviewed by Amit Kapila commit 53b6d509674a0aa1b59c00306bdec63c9cad3110 Author: Kuntal Ghosh Date: 2017-11-22 10:26:35 +0530 Implement exclusion constraints for zheap Patch by Ashutosh Sharma. Reviewed by Kuntal Ghosh, Amit Kapila. commit a4f00f80ba4745c45be7d747bdd7d4a6363a78f7 Author: Amit Kapila Date: 2017-11-22 09:33:47 +0530 Implement SnapshotSelf for zheap Returns the visible version of tuple (including effects of previous commands in current transaction) if any, NULL otherwise. This will be required for features like exclusion constraints to work with zheap. Ashutosh Sharma and Amit Kapila commit a1f49ef37d8bbfd19741bc76b0c749000c9d281d Author: Rafia Sabih Date: 2017-11-21 12:13:44 +0530 Bug fix in rollbacks in zheap The patch fixes the case where ending undo record pointer is not provided. This is particularly required in rollback to savepoint type of scenarios. RM 42958, reviewed by Amit Kapila. commit 64ecef026b765bba8262b3acce21b3c308e2b02a Author: Amit Kapila Date: 2017-11-21 08:54:58 +0530 Implement the API to advance the latest xid that has removed the tuple. This will be required by the future patches to implement free space management and xlog for indexes like btree and hash. commit b6d4a835ee52a6c0eec1e2e29c0dc84f866ab42a Author: Amit Kapila Date: 2017-11-20 17:43:12 +0530 Implementation of undo actions for zheap operations Implemented undo actions for delete, update, lock tuple, multi-insert operations. Patch by Amit Kapila, reviewed by Rafia Sabih commit 6828caf8f4d53788756079e0059a0d117e92afe3 Author: Amit Kapila Date: 2017-11-20 17:27:37 +0530 Infrastructure to execute undo actions at page-level Starting from the last undo record of a transaction read the undo records in a backward direction till first undo record of a transaction and accumulate the actions per page. As soon as the action is for a different page, we execute the accumulated actions of previous page. We are logging the complete page for undo actions, so we don't need to record the data for individual operations. We can optimize it by recording the data for individual operations, but again if there are multiple operations, then it might be better to log the complete page. Patch by Amit Kapila, reviewed by Rafia and Dilip commit c00df8c30bd5baa1eb1ea14f89747ee5ea7d7fe0 Author: Kuntal Ghosh Date: 2017-11-20 09:56:20 +0530 Write WAL for zheap_lock_tuple Patch by Ashutosh Sharma and Kuntal Ghosh. Reviewed by Amit Kapila. 
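The "Infrastructure to execute undo actions at page-level" commit above describes walking a transaction's undo records backwards, accumulating actions per page and applying each batch when the block changes. Below is a small stand-alone sketch of that accumulation loop; the types and helpers are invented for the example and are not the real undo API.

    /* Sketch of the per-page accumulate-and-flush loop described above. */
    #include <stddef.h>

    typedef struct UndoRecSketch
    {
        unsigned block;                 /* block the record applies to */
        struct UndoRecSketch *prev;     /* next-older record of the same xact */
    } UndoRecSketch;

    /* Stand-in: the real code would lock the buffer, apply each record's
     * action and WAL-log the resulting page. */
    static void
    apply_actions_to_page(unsigned block, UndoRecSketch **recs, size_t nrecs)
    {
        (void) block; (void) recs; (void) nrecs;
    }

    static void
    execute_undo_actions(UndoRecSketch *last_rec)
    {
        UndoRecSketch *batch[64];
        size_t nbatch = 0;

        /* Walk backwards from the transaction's last undo record towards its
         * first one, grouping consecutive records that touch the same block. */
        for (UndoRecSketch *rec = last_rec; rec != NULL; rec = rec->prev)
        {
            if (nbatch > 0 &&
                (rec->block != batch[0]->block || nbatch == 64))
            {
                /* Different page (or batch full): flush what we have so far. */
                apply_actions_to_page(batch[0]->block, batch, nbatch);
                nbatch = 0;
            }
            batch[nbatch++] = rec;
        }
        if (nbatch > 0)
            apply_actions_to_page(batch[0]->block, batch, nbatch);
    }
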
commit c00823a0787680eb8c0f02b8d998c6cf3bad39b1 Author: Rafia Sabih Date: 2017-11-17 15:59:40 +0530 Bug fix for rollback inserts The page header size was not included in the calculation of prevlen for undo records when the record starts from a new page. The patch fixes the issue. commit b7bda571bb9c01fe2cb9355f9fd498cb6aa06078 Author: dilip Date: 2017-11-16 18:10:42 +0530 Write WAL for invalidating the transaction slot The important consideration for logging the slot-invalidation operation is that we need to identify the tuples pointing to these slots, and also the transaction id in the slot, while performing the redo operation for the undo log. So, we are conditionally (when full page writes are off) writing tuple offsets and the slot xid in WAL as well. In case full page writes are on, we can rely on the tuples and slots from the page. Patch by Dilip Kumar, Reviewed by Kuntal Ghosh and Amit Kapila commit 9c9a8f88ad2b8fea8a319e0d9aad40924e1f5684 Author: dilip Date: 2017-11-16 18:09:18 +0530 Write WAL for slot Freeze operation Write the information of the slots which got frozen, and during recovery we can generate the information of the tuples which are pointing to these slots and mark them frozen, the same as we are doing in the DO operation. Patch by Dilip Kumar, Reviewed by Kuntal Ghosh and Amit Kapila commit f1be5834b52f5023522af5ce862f9c2c40d7048e Author: Amit Kapila Date: 2017-11-15 17:17:17 +0530 Store old version of tuple in WAL for delete Patch by me, reported and reviewed by Ashutosh Sharma commit e1a47f763d17e4bc33af7ae64fa234f7fefbed96 Author: Rafia Sabih Date: 2017-11-14 13:26:53 +0530 Support for SELECT INTO statements for zheap This enables the creation of tables in zheap format through SELECT INTO statements. Reviewed by Amit Kapila. commit e1afa7d5eacde17d322dadb659f1d4c78d628cdb Author: Rafia Sabih Date: 2017-11-14 11:56:43 +0530 GUC for storage_engine Now, we have a GUC -- storage_engine to specify the storage_engine option, which can be modified before server start. Currently, we have two options for this GUC, viz. heap and zheap, with its default value being heap. Note that once this GUC is set to zheap, relations are created in zheap format, irrespective of whether with (storage_engine ='zheap') is specified in the CREATE TABLE statement or not. Reviewed by Amit Kapila. commit 2587d533e90aa99701d3125f2dc0ac00c9a54029 Author: Amit Kapila Date: 2017-11-13 14:11:58 +0530 Use better way to set item pointer. Suggested by Kuntal Ghosh. commit 2c3e87e4e9efdef30663c93ee606c12de10c78de Author: Amit Kapila Date: 2017-11-13 13:09:03 +0530 WAL log update operation in zheap The important consideration for logging the update operation is that we need to generate the entire old tuple while performing the redo operation for the undo log. So, we are conditionally (when full page writes are off) writing the complete tuple in WAL as well. In case full page writes are on, we can rely on the tuple from the page. For the new tuple, we are writing just the diff tuple as we are doing for heap. Patch by me, reviewed by Kuntal Ghosh. commit a6eb456784db76b444121ef171d635e4c4a01e13 Author: Amit Kapila Date: 2017-11-13 08:56:41 +0530 Fix type of undo record in zheap_multi_insert commit d1859b51bd51ac39d8915f40df7ff679c3df0461 Author: Amit Kapila Date: 2017-11-10 18:31:54 +0530 Fix the calculation of previous undo record length When the undo record crosses a page boundary, we need to consider the page header length, which was being ignored. Fix the same. 
commit 4159a0175ce16bd5684314b63a23c0cf9fc8733e Author: Thomas Munro Date: 2017-11-09 22:27:44 +1300 Show unattached undo logs as null xid and pid in pg_stat_undo_logs. commit fce53d149e2b6e67e40eb33b92517cd58673cd4b Author: dilip Date: 2017-11-08 17:13:31 +0530 Enhance zheap_lock_tuple Handled the case when the same tuple is locked multiple times by the same transaction. Patch by me, reviewed by Ashutosh Sharma and Amit Kapila. commit 91cd6e14fa9e2a2a9a4bb929750951ed7cfa68ef Author: dilip Date: 2017-11-08 17:03:49 +0530 Bug fix in trigger for zheap Currently we are always getting the older as well as the newer version of the tuple from zheap. But in case of an inplace update, both have the same tid and will fetch the same tuple. The patch fixes the bug by getting the older version of the tuple from the undo log. Patch by me, reviewed by Ashutosh Sharma and Amit Kapila. commit 0ef921f6fc46aa6a3fd20eac9d199a3e173c1fc1 Author: Kuntal Ghosh Date: 2017-11-06 16:00:43 +0530 Restrict ExecLockRows on zheap relations As of now, we don't support FOR SHARE/FOR UPDATE on zheap tables. So, throw an appropriate error for the same. Reviewed by Dilip Kumar commit 3e73bc3ccd71c28c962749dcda4e54bc8f245022 Author: Amit Kapila Date: 2017-11-08 11:18:50 +0530 WAL log delete operation in zheap The important consideration for logging the delete operation is that we need to generate the entire tuple while performing the redo operation for the undo log. So, we are conditionally (when full page writes are off) writing the complete tuple in WAL as well. In case full page writes are on, we can rely on the tuple from the page. Patch by me, reviewed by Ashutosh Sharma. commit eaf3708c605930073dcb0d64caad1c9dd13efb01 Author: Kuntal Ghosh Date: 2017-11-07 15:42:10 +0530 Fix assert in bitgetzpage commit 01104c4a5af83cff71704bb399172e9c050929c0 Author: dilip Date: 2017-10-27 13:58:08 +0530 Support sysattributes fetching for the ZHeap Properly fetch the xmin, xmax, cmin and cmax from the undo. Patch by Rafia, some defects fixed by me, reviewed by me and Amit. commit 50f4297716c242fee4cb8e34c2903ccef297983f Author: Amit Kapila Date: 2017-10-25 18:11:37 +0530 Undo Action and Replay for Insert Operation The undo insert action is to mark the corresponding item as dead if the relation has any index, and unused otherwise. The need to mark the item as dead is that we can't completely clear the item till the corresponding index entry is marked as dead. Currently undo actions are replayed on Rollback or Rollback to savepoint. Ideally the action should be replayed on error as well, but we will deal with it in a separate patch. Amit Kapila, reviewed and tested by Rafia Sabih commit a461ec04e06d1675d635904caa3d993370c7ef0f Author: Amit Kapila Date: 2017-10-25 17:44:57 +0530 Update prevlen in undo record header with the length of previous record. This is required to traverse the undo chains from the last undo record to the first. Dilip Kumar. commit 66c00153e2de48f9b8ab5c8d2454f87cb2efea6d Author: Kuntal Ghosh Date: 2017-10-23 17:57:39 +0530 Bug fix in UndoLogAdvance During recovery, MyUndoLogState is uninitialized. Hence, it can't be used to fetch the undo control pointer while replaying undo records. Reviewed by Amit Kapila, Dilip Kumar commit 08870389179383469196b243b6b5662bc8d41cb6 Author: Kuntal Ghosh Date: 2017-10-18 14:21:19 +0530 Restrict altering non-empty zheap tables During an ALTER command, sometimes we scan and rewrite a table. For zheap, we have to implement the same in ATRewriteTable. For now, just throw an appropriate error. 
Reviewed by Amit Kapila commit 1207f60fb21542e8b6675335848d373e6aafed3d Author: Kuntal Ghosh Date: 2017-10-18 15:04:01 +0530 ALTER should restrict partitioning/inheritance to same storage_engine commit d43e896ba1e66d0a8246bbb3df58c4f38010fcbd Author: Rafia Sabih Date: 2017-10-17 17:07:05 +0530 Hibernate undo-worker when the system is inactive The undo-worker hibernates for a minimum of 100ms and a maximum of 10s, based on the time the system remained idle. RM42373, reviewed by Mithun Cy commit 9c612f44ad2204510ab13a620a003460889c41db Author: dilip Date: 2017-10-17 16:42:38 +0530 Bug fix in multi-insert uur_xid was not set properly in multi-insert, which was making undo-discard operate based on an invalid xid and was causing a segmentation fault. Reported by Ashutosh, fixed by me Reviewed by Kuntal commit 0023cc82949adc74fdb18c7825cf7e5d1988da49 Author: Kuntal Ghosh Date: 2017-10-04 15:20:38 +0530 Restrict partitioning/inheritance to same storage_engine The descendant/partitioned table inherits the same storage engine as its ancestors. Users should not specify any storage_engine for the descendant/partitioned table. Also, this commit doesn't include any sanity checks for the scenarios where a heap table inherits a zheap table and vice versa. Reviewed by Amit Kapila, Rafia Sabih commit 99623dfb48a055ae2b43db76d9e87747915fec09 Author: Rafia Sabih Date: 2017-10-17 10:39:51 +0530 Avoid iteration over all the backends in UndoDiscard Instead of iterating over all the backends, now iterate over only the active logs to determine the next log for undoDiscard. RM42373, reviewed by Dilip Kumar commit 8f40d7b7802c1904f4c845c744f62a366d2f9a8a Author: Thomas Munro Date: 2017-10-17 13:58:11 +1300 Fix undolog wordsize thinkos that broke 32 bit builds. commit 927c309586f38da7c737594c3beef65f7c51e5cc Author: dilip Date: 2017-10-16 21:11:17 +0530 Bug fix in update transaction start record In recovery, start transaction info was not updated properly: it was fetching prev_xact_start from a local variable, which is fine during a normal run, but InRecovery there is only one process, so we cannot depend upon the local variable; instead we should always fetch that from the log meta data. commit ea6bf762f2a6f5673e8978118edfee99767b844f Author: Thomas Munro Date: 2017-10-12 05:53:59 +1300 Implement UndoLogNextActiveLog(). Provide a simple way to iterate through the set of active undo log numbers. This interface might need some refinement to support UndoPersistence levels in future. RM42373, Rafia Sabih commit 712f332b5ac98b22110929ae2136e4b0c38dffab Author: Kuntal Ghosh Date: 2017-10-11 14:47:58 +0530 Don't discard undo if a transaction rolls back commit 1c2daa5b5a47856a227ad66c7cc08e761e147871 Author: dilip Date: 2017-10-11 09:48:37 +0530 bug fix in Synchronizing undo update with undo discard Update undo record was just checking whether the undo is discarded or not; instead it should acquire the discard lock and read the undo under the lock. Fixed the same. commit 8f17025103fc2c344f56b05770550e50e79b5bb3 Author: Kuntal Ghosh Date: 2017-10-05 16:48:57 +0530 Wrong undorecord expected size for the first record of an undolog When we attach a new undo log, a new transaction is definitely started. Hence, we should set the is_first_rec flag in the undo meta and calculate the undorecord expected size.
Reviewed by Dilip Kumar commit 23be6c5939487c691e77c03daffbdb04dff9afb4 Author: Kuntal Ghosh Date: 2017-10-04 11:36:04 +0530 Fix some compiler warnings commit 21c53521b5a1e98ba21a83ec0e8abd5fc97fa866 Author: ashu Date: 2017-09-28 15:11:33 +0530 Add missing null terminator to the dest string after memcpy(). memcpy() copies the specified bytes of a string from memory area src to memory area dest, but doesn't add a null terminator to the dest string. This patch handles the same in FindLatestUndoCheckpointFile(). Patch by me, reported by Amit Kapila. commit cee1d83ad2888cfe451ec6bf7c31c31ae79b694b Author: dilip Date: 2017-09-28 14:16:05 +0530 Fixed pending issues for WAL recovery Fixed the recovery for the subtransaction and properly updated the start transaction header record. commit 4df8c94e347ee6291f9310840164125e3c57a5fe Author: Amit Kapila Date: 2017-09-28 09:25:43 +0530 WAL log Insert operation in zheap We need to ensure that we log enough information so that during recovery we can construct the unpacked undo record and then insert it in the same way as we insert it for the DO operation. Note that we don't write full page images for undo logs as those are written serially, and we always apply all the WAL records for the undo log, unlike for zheap pages where we don't apply a record if the LSN on the page is greater than the LSN of the WAL record. This is to avoid problems like: if the page header is written after the last checkpoint but the other part is not, then the LSN in the page header will be the latest but not all the data. Patch by me, reviewed by Dilip Kumar. commit 604100e9ff28172ef1c4372745fafd5501e03a98 Author: ashu Date: 2017-09-27 16:45:24 +0530 Throw an error message if index-only scan is performed for zheap tables. Patch by me, per suggestions from Amit Kapila. commit 91630e7f7129b76fe46f29637f4117fca7ef3963 Author: ashu Date: 2017-09-26 16:30:52 +0530 Make the macro 'RelationStorageIsZHeap' (used to determine if the relation storage format is zheap or not) more robust. Patch by me, quickly reviewed by Dilip. commit 31a1c21f3e38e1935a8e8d9061ea307273ead50c Author: ashu Date: 2017-09-26 10:34:17 +0530 Add macro 'UndoCheckPointFilenamePrecedes' to find the oldest undo checkpoint file in the pg_undo directory. RM42357, Patch by me, as per the suggestions from Thomas Munro. commit fb101c53f1c750c6ca37b049694db50118637b61 Author: ashu Date: 2017-09-26 10:29:28 +0530 pg_upgrade: Rename undo checkpoint file after pg_resetwal. During pg_upgrade, the pg_resetwal module is invoked, which resets all WAL related information in the Controlfile, including the last checkpoint LSN, and then a server process is started. But, as the undo checkpoint filename is based on the last checkpoint LSN, the server startup process fails as it is unable to find the undo checkpoint file corresponding to the last checkpoint LSN as mentioned in the Controlfile. This patch fixes that issue and also copies the undo data and checkpoint files from the old to the new cluster during pg_upgrade. TODO: Currently, this patch just adds the logic to copy undo data files from the default location, i.e. 'base/undo', which means if undo data files are present in a non-default location it won't work. In future, when the support for storing undo data files in a non-default location is added, that needs to be handled in pg_upgrade as well. RM42357, Patch by me, reviewed by Thomas Munro.
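As background for the 'WAL log Insert operation in zheap' entry above: the redo split it describes (always replay the undo-log portion, but skip the zheap data page when its LSN is newer than the record) matches the standard XLogReadBufferForRedo idiom applied to the data page only. The following is a rough sketch of that shape, not the actual zheap redo code; the two helper functions and the record layout are assumptions for illustration.

/*
 * Sketch only: contrast between unconditional replay of the undo-log part of
 * a record and LSN-gated replay of the zheap data-page part.  The two static
 * helpers are assumed stand-ins; XLogReadBufferForRedo, PageSetLSN,
 * MarkBufferDirty and UnlockReleaseBuffer are the usual PostgreSQL calls.
 */
#include "postgres.h"
#include "access/xlogutils.h"
#include "storage/bufmgr.h"

static void replay_undo_portion(XLogReaderState *record);				/* assumed */
static void apply_insert_to_page(XLogReaderState *record, Page page);	/* assumed */

static void
zheap_xlog_insert_sketch(XLogReaderState *record)
{
	Buffer		buffer;

	/* Undo pages get no full-page images, so this part always replays. */
	replay_undo_portion(record);

	/* The data page replays only if it is older than this WAL record. */
	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
	{
		Page		page = BufferGetPage(buffer);

		apply_insert_to_page(record, page);
		PageSetLSN(page, record->EndRecPtr);
		MarkBufferDirty(buffer);
	}
	if (BufferIsValid(buffer))
		UnlockReleaseBuffer(buffer);
}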
commit c95f0fda591c3d1e1b3e520570c8e616d4fb42ea Author: dilip Date: 2017-09-25 11:51:06 +0530 Added trigger support for zheap Currently zheap does not support triggers because the whole trigger mechanism is dependent on HeapTuple; this patch provides a conversion between heap and zheap tuples and also an API to fetch a tuple from zheap for triggers. commit 13b7d1735e5444fca61368df9ab515267eadf4be Author: dilip Date: 2017-09-22 15:00:13 +0530 Bugfix in undo discard If a log is not attached to any session it will have an invalid transaction id. In such a case just discard till the current insert location. commit ef93ec6350767d863b30cd8a11e632a1442a22af Author: Kuntal Ghosh Date: 2017-09-21 16:50:30 +0530 Fix update of zheap columns with null For in-place updates, we should copy non-visibility flags of infomask in the updated tuple. Reviewed by Amit Kapila commit 19c54fb49faecbde387be5ed6bde2bc94cbf67bf Author: dilip Date: 2017-09-22 09:49:29 +0530 Bugfix in undo discard mechanism Prior to this, undo was not discarded properly for the last transaction in the undolog, hence oldestxidhavingundo was also not getting updated properly. commit 5e45ce2025a7024937763e71dcde007941057c3f Author: ashu Date: 2017-09-21 20:03:44 +0530 Replace unwanted Assert statement in ZHeapTupleSatisfiesDirty with if-check. Patch by me, reviewed by Amit Kapila, reported by Neha Sharma. commit 943e365fc06f42c264b9c39d78d9a8a94b3754b6 Author: Amit Kapila Date: 2017-09-21 15:00:10 +0530 Use oldestXidHavingUndo to determine if the record is visible Previously we were using RecentGlobalXmin to determine if the record is all visible; that is okay till we have rollbacks. Use the more appropriate xid, i.e. oldestXidHavingUndo, to determine if the record is all visible. Patch by me, reviewed by Dilip and Kuntal. commit 6715131fd875709a5b4d89808117b2e17bb7cc32 Author: Amit Kapila Date: 2017-09-21 14:02:57 +0530 Code refactoring to eliminate duplicate code. We were using similar code to fetch transaction information from undo records in multiple places. Expose a new function to fetch that information and use it in all places. Patch by Amit Kapila, reviewed and tested by Dilip and Kuntal. commit a69d8bb6cc49b4bfb3ef0ad8321508007f7e7d66 Author: ashu Date: 2017-09-20 16:18:19 +0530 Allow INSERT/UPDATE/DELETE RETURNING to work with zheap relations. Patch by me, reviewed by Amit Kapila, reported by Tushar Ahuja. commit 9c886fcb2e34d1ec073ce260a032c5fe3a484695 Author: Kuntal Ghosh Date: 2017-09-19 18:53:55 +0530 Implement Bitmap Scan for zheap Reviewed by Amit Kapila commit 113966f8755edabf26ce62eb54798a5bdb419911 Author: Kuntal Ghosh Date: 2017-09-19 23:44:47 +0530 Implement COPY-TO for zheap relations Reviewed by Amit Kapila commit 414f84fd6026b0e282c10b9fe01f40898d5a22a4 Author: Kuntal Ghosh Date: 2017-09-20 00:12:16 +0530 Index Scan for spgist,gist,hash Reviewed by Amit Kapila commit 95e893242410969de851bdf746b54b69292134ec Author: Kuntal Ghosh Date: 2017-09-20 13:57:24 +0530 Skip vacuum for zheap relations Tell the autovacuum worker to skip zheap relations for vacuum. Also, throw an appropriate error if one intends to vacuum zheap relations with the VACUUM command.
Reviewed by Amit Kapila, Dilip Kumar commit e33f25249fc55d24c919d345b233a12b27e00db2 Author: Kuntal Ghosh Date: 2017-09-18 16:24:18 +0530 Handle MinimalTuple for ztuple Reviewed by Amit Kapila, Ashutosh Sharma commit f0d0850e99b3303a913a3f06998e120f052b0440 Author: Kuntal Ghosh Date: 2017-09-19 22:33:23 +0530 Fix Oid crash in ExecInsert commit aa0eefadeee806f7691cbe2374a942e771ca8d30 Author: dilip Date: 2017-09-19 12:14:37 +0530 zheap mvcc routine was not set properly in RestoreSnapshot which was causing problem in parallel query. This commit fixes the issue commit e43b9eeeb33b7cce358e4a1bede6623151bb4101 Author: ashu Date: 2017-09-18 22:06:22 +0530 Correct the XLOG record type used in UndoLogSetLastXactStartPoint(). RM42384, Ashutosh Sharma, reviewed by Dilip Kumar and reported by Mithun CY. commit 2edda475a0c19b293d61f657c02a4b38221765f5 Author: Amit Kapila Date: 2017-09-18 14:49:31 +0530 Retrieve CTID from undo record when requested. Reported by Neha Sharma. commit f34620869854458a3a671c2112b38109fc240af2 Author: Kuntal Ghosh Date: 2017-09-15 12:11:23 +0530 Move BulkInsertStateData def to genham.h commit 7b30e2d13b1ef23d19ba0d1c521e901d50be21ec Author: Kuntal Ghosh Date: 2017-09-14 14:41:27 +0530 Fix comment style in zheap_multi_insert commit 91094f3776e5aa18b71fb0360d5e754b128998ae Author: Kuntal Ghosh Date: 2017-09-14 13:51:35 +0530 Implement COPY FROM command for zheap relations Reviewed by Amit Kapila, tested by Ashutosh Sharma commit e7de2741292bfce5ddb7c69c102010f1dc81e2ed Author: Kuntal Ghosh Date: 2017-07-21 12:00:38 +0530 Add options in zheap_prepare_insert like heap_prepare_insert. Reviewed by Amit Kapila commit 1d9f79cf415be06824f8893c5e6cdb58bdde3f18 Author: Kuntal Ghosh Date: 2017-09-13 13:40:44 +0530 Fix warnings in zheapgettup_pagemode commit fba570c012c02b96bb3817e33763f03dde1248a1 Author: Kuntal Ghosh Date: 2017-09-12 15:37:26 +0530 Isolation tests for non-inplace updates Reviewed by Amit Kapila commit 51fb8188e05d92a973ee14a057c956fec2caab93 Author: Kuntal Ghosh Date: 2017-09-12 15:42:22 +0530 Regression tests for non-inplace updates Reviewed by Amit Kapila commit 039efbc45bb68a11db71ed0137d619af07951b54 Author: Amit Kapila Date: 2017-09-11 17:40:21 +0530 Fix usage of tuple in visibility API. ZHeapTupleSatisfiesVisibility API doesn't guarantee that the tuple passed won't be freed, so we can't reuse it. commit ff452720310a3a8cbb27ff85625a5276110f9de2 Author: Amit Kapila Date: 2017-09-11 17:26:27 +0530 Support non-inplace-updates Allow tuples to be updated such that a newer version of tuple will be stored separately if any index column is updated or length of new tuple is greater than old tuple. For undo generation, we always generate two undo records one for the deletion of old tuple and another for addition of new tuple. We need separate undo record for new tuple because during visibility checks we sometimes need commandid and that is always stored in undo record. To reach new tuple from old tuple, we need ctid which is stored in old tuples undo record. Patch by me, review and test by Dilip Kumar and Kuntal Ghosh. commit 918aea9d904dd4f2ea82c8e4b0aefd7006cedce8 Author: dilip Date: 2017-09-11 16:44:26 +0530 Support multi-prepare for undo record This will support preparing multiple undo record and all of them can be inserted with one call of InsertPreparedUndo. Reviewed by Amit Kapila. 
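The multi-prepare entry above implies a particular call pattern: prepare all undo records for an operation up front, then copy them into the undo buffers with a single InsertPreparedUndo call inside the critical section. The sketch below illustrates that pattern only; the argument lists of PrepareUndoInsert and InsertPreparedUndo, and the header path for UnpackedUndoRecord, are not shown in this log and are therefore assumptions, not the real zheap signatures.

/*
 * Sketch only: the prepare-several / insert-once pattern for undo records,
 * e.g. for a non-inplace update that needs one record for the old tuple and
 * one for the new tuple.  Signatures below are assumed for illustration.
 */
#include "postgres.h"
#include "miscadmin.h"				/* START_CRIT_SECTION / END_CRIT_SECTION */
#include "access/undorecord.h"		/* assumed header for UnpackedUndoRecord */

extern void PrepareUndoInsert(UnpackedUndoRecord *urec);	/* assumed signature */
extern void InsertPreparedUndo(void);						/* assumed signature */

static void
record_update_undo_sketch(UnpackedUndoRecord *undo_for_old,
						  UnpackedUndoRecord *undo_for_new)
{
	/*
	 * Reserve undo space and pin/lock the needed undo buffers before
	 * entering the critical section; nothing is written yet.
	 */
	PrepareUndoInsert(undo_for_old);
	PrepareUndoInsert(undo_for_new);

	START_CRIT_SECTION();

	/* One call copies every prepared record into the undo buffers. */
	InsertPreparedUndo();

	/* The caller would WAL-log the operation and dirty the buffers here. */

	END_CRIT_SECTION();
}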
commit 4df6058209a1028f328b5976315b9373d8385afc Author: dilip Date: 2017-09-07 10:57:14 +0530 Fix compilation warning commit 3d9616d314008d2c34942d50d81d40558a771eda Author: Kuntal Ghosh Date: 2017-09-06 18:19:48 +0530 CREATE INDEX on non-empty zheap relations This extends the implementation for index creation on zheap relations to non-empty relations as well. Patch by me, reviewed by Amit Kapila commit 295f464ede2c243c7872cd3bc48b4d022b4cf9d8 Author: Amit Kapila Date: 2017-09-06 15:15:30 +0530 Implement ZHeapTupleSatisfiesAny and ZHeapTupleSatisfiesOldestXmin Both these API's are required by the upcoming patch to create an index. These API's implement functionality similar to the heap API's HeapTupleSatisfiesAny and HeapTupleSatisfiesVacuum, but for zheap tuples. commit a1bbc57456a2c0985aad6ca14042ff8690275b9b Author: Kuntal Ghosh Date: 2017-09-01 17:07:30 +0530 storage_engine cannot be modified after table creation storage_engine can be set during CREATE TABLE. But, it can't be set/reset using ALTER TABLE later. Patch by me, reviewed by Mithun CY commit 01e757ed5ff8e86fdece673c2dcee9d737628c5f Author: dilip Date: 2017-09-04 16:59:52 +0530 RM42314: Fixes one part of the problem Implements a better way to know whether the undo record is discarded or not, instead of adding the TRY_CATCH as done in the current code. commit fc5e48996af7ea9398aed4adfce093ba7bba3c53 Author: mithun Date: 2017-09-04 11:58:16 +0530 RM41618: Asynchronous Undo Management - Undo BgWorker Part. Improved the log messages. commit 7231d04084c1d305c374c91c8ac9328186ae90e2 Author: Amit Kapila Date: 2017-09-01 15:30:39 +0530 Fix typo in error message commit b9746b046d25e8db6c36cd28bd49e72f650e3552 Author: mithun Date: 2017-09-01 11:39:53 +0530 RM41618: Asynchronous Undo Management - Undo BgWorker Part. Registered a SIGTERM handler for the undoworker. Review by Kuntal. commit fec2d893806c3b163d4c96e6c2864361ad2eccae Author: ashu Date: 2017-08-31 17:15:20 +0530 Implement PRIMARY KEY for zheap relations. This patch allows PRIMARY KEY to be created on zheap relations. For now, all it does is throw an error message if a user tries to add a NULL or duplicate value on a primary-key column (i.e. when the primary key constraint is violated), but it doesn't protect it from getting inserted into a zheap table as currently ROLLBACK is not implemented in zheap. Therefore, if a user tries to violate a primary key constraint on a zheap table, it will result in data corruption. RM42103, Ashutosh Sharma, reviewed by Kuntal Ghosh and Amit Kapila. Few adjustments done by Amit Kapila. commit 5b704a091e300d12f18e0bba4d6e2083fd652d0b Author: mithun Date: 2017-08-30 08:25:40 +0530 RM41618: Asynchronous Undo Management - Undo BgWorker Part. One undoworker launcher process which runs a loop to periodically examine and discard undos of transactions which are visible to all. This is a very basic form of undoworker which can be used by the undo subsystem for discarding undos. This code will later evolve into a more complex undoworker subsystem. commit 1c16e16158d83850c20602acce9b6ab5e8833b7c Author: dilip Date: 2017-08-29 18:20:28 +0530 Discard Interfaces for undo worker Provide an interface to discard all the undo which was inserted by transaction ids smaller than the input xid. This will also calculate the oldestXidHavingUndo. Reviewed by Amit Kapila commit 5e7384fc7d2684ca0714147a0a1c60fdeed8d7fd Author: dilip Date: 2017-08-29 18:19:07 +0530 Inserting a transaction header in the undo record.
Insert a transaction header in the first undo record written by the transaction. This header will contain the undo record ptr of the next transaction's first undo log. This will be used by the discard api's to process an undolog transaction by transaction. Reviewed by Amit Kapila commit 72ca11a0fe330abd69958c361ecf8fde2369802f Author: dilip Date: 2017-08-29 17:58:11 +0530 Protection against accessing the discarded undo. This is a dirty way to handle the problem; we may need to find some better way for the same. Reviewed by Amit Kapila commit 35c917a1e21c1c17c7de0d026c0820ac4086c0a2 Author: mithun Date: 2017-08-29 16:16:55 +0530 RM42266: Support no movement scans for zheap. It seems NoMovementScanDirection is dead code for heap itself. So got rid of the related code in zheap and added an Assert(false) to indicate the same. commit 38a4d0037a5636e08e00a4e3a352a4b1ded8ba61 Author: Kuntal Ghosh Date: 2017-08-29 15:27:29 +0530 Make PageSetUNDO less chatty commit 44c19cc1a6db1c8bd6a4d21b17d1549c059c77d1 Author: Thomas Munro Date: 2017-08-28 23:00:23 +1200 Fix build problem on Windows. On Windows we have a macro wrapping open() that always needs three arguments. Per complaint from Amit Kapila. commit 1302ec61e5bb482789db809c344b3fed5a9d6dae Author: Amit Kapila Date: 2017-08-28 15:57:26 +0530 Change ValidateTuplesXact so that it always gets a fresh copy of the tuple. If the tuple is updated such that its transaction slot has been changed, then we will never be able to get the correct tuple from undo. To avoid that, we get the latest tuple from the page rather than relying on its in-memory copy. Amit Kapila and Dilip Kumar commit 531f8bf739266d5d0e40512a983cc5b757689810 Author: Amit Kapila Date: 2017-08-28 15:11:39 +0530 Add comments in code to explain the usage of commandid while traversing the undo chain. commit bd9d5215d2deff90615d8108cbdad07d8d4894df Author: Amit Kapila Date: 2017-08-28 14:46:32 +0530 Improve comments in code. commit 91416a25cd2f950a7de078d7661f18e411962c15 Author: Amit Kapila Date: 2017-08-28 14:43:11 +0530 Always write complete tuple in undo of delete. This will ensure that we can reuse the space of deleted tuples as soon as the transaction commits. commit e63edb162c3a18acc7dc10e9510b24b53d2ddfda Author: Amit Kapila Date: 2017-08-28 14:23:54 +0530 Fixed a few warnings. commit ed04b0bc3a955f0abfe712626c13dd3734a3676c Author: ashu Date: 2017-08-25 13:18:05 +0530 Remove isolation test for REPEATABLE READ mode and rename mvcc.spec file with zheap prefix. Commit 0fd69d86 unintentionally added an isolation test for REPEATABLE READ mode. But unfortunately, zheap is currently just restricted to READ COMMITTED mode. Hence, this commit removes the test added for REPEATABLE READ mode and it also renames the test file with a zheap prefix. Patch by Kuntal Ghosh, reviewed by me. commit 953868285ef6d57c83a246603d5e3b615b837b3a Author: Amit Kapila Date: 2017-08-24 21:17:31 +0530 Allow transaction slots to be reused after the transaction is finished. To reuse the slot, it needs to ensure that (a) the xid is committed and all-visible, and (b) if the xid is committed, then write an undo record for all the tuples that belong to that slot. The undo record will cover the transaction information of the committed transaction. The basic idea is that after reuse of the slot, if someone still needs to find the information of the prior transaction that has modified the tuple, the same can be retrieved from undo.
We can reuse the slot even after the transaction has aborted, but for that we need to have the Rollback facility, which is currently not implemented, so we can't reuse slots of aborted transactions. Amit Kapila and Dilip Kumar. commit 13de424924220e65ec7348687210e3752ad0fd50 Author: Kuntal Ghosh Date: 2017-08-22 17:13:52 +0530 Fix crash in EvalPlanQualStart during isolation check We can't use es_result_relation_info in EvalPlanQualStart to decide the type of the underlying relation. Instead, we can use es_range_table. Reported by Amit Kapila. commit 10c6bff9873d3ce203eb0368ee262a67d6f43d79 Author: Kuntal Ghosh Date: 2017-08-11 15:54:28 +0530 Deform zheap tuples for aggregate nodes Aggregate nodes don't recognize zheap tuples. To fix this, we deform a zheap tuple in ExecCopySlotTuple and form a heap tuple from the same. commit 61d3d5e7178203d3070a11872bcdc64d365fce16 Author: Kuntal Ghosh Date: 2017-08-21 17:25:12 +0530 Fix type of urec_fork in UndoRecordRelationDetails commit 3d390e874b93b8f265520d2381cd5227f6f8f60c Author: ashu Date: 2017-08-18 16:26:43 +0530 Add some isolation tests to validate MVCC model with zheap. Thomas Munro and Ashutosh Sharma (The original patch here was by Thomas Munro and it was for heap storage, but I ported it to zheap.) commit 89466c9b213f939809de73ae21e8ec939cec1925 Author: dilip Date: 2017-08-07 14:35:11 +0530 Mark buffer dirty after inserting the undo record in the undo buffer. Reported by Thomas and Amit. commit 8fec492113e3f03eeea42063b23b530b689bfd8e Author: Thomas Munro Date: 2017-08-05 21:10:32 +1200 Correct comment. commit e5818f2fd0b55cab123926db0afc52e832d44dd6 Author: Thomas Munro Date: 2017-08-05 21:08:17 +1200 Define a macro UndoRecPtrFormat for use in printf-style functions. commit 5eedbd8378b0b32cee25ff882a78082a273ac9a6 Author: Thomas Munro Date: 2017-08-04 16:53:50 +1200 Get rid of last_size tracking from the undo storage layer. This problem belongs higher up. commit 93d94231ef7ff2b30a2c8ae5d795ca9129a0f00d Author: Thomas Munro Date: 2017-08-04 16:47:41 +1200 Report all UndoRecPtr values as hex in pg_stat_undo_logs. Previously raw offsets were shown. Let's not confuse ourselves with more than one representation. commit 1bc609f9f3943c70e94a3cd409c32c536871f309 Author: Thomas Munro Date: 2017-08-04 16:47:14 +1200 Raise an error if you try to access a non-existent undo segment. commit 799820348f683ff7b4120fdd3356ea75e98b757f Author: Thomas Munro Date: 2017-08-04 16:34:31 +1200 Make test_undo module more cromulent. Also change the type used for space allocation to size_t rather than anything smaller so that we can test insertion of large amounts of data at once. commit def8e64b77cb8e50ac12768f930ee6dbc3785a38 Author: Amit Kapila Date: 2017-07-26 17:21:29 +0530 Update the transaction id in undorecord header Separately store the latest transaction id that has changed the tuple in the undo record header to ensure that we don't try to process a tuple in the undo chain that is already discarded. As of now, we are using RecentGlobalXmin to prevent it, but upcoming patches that have the Discard undo mechanism will replace it with oldest_xid_that_has_undo. commit 5126ec94af589fb56cfcad9f014f6a8c20b16fff Author: dilip Date: 2017-07-26 15:30:05 +0530 Adding xid in undo header Xid will be used by zheap to traverse the undo chain.
It will not traverse the chain once it gets the undo which has an xid smaller than OldestXidHavingUndo. commit 7f994a6806858bcc2069004fa16f50d37263dd51 Author: Amit Kapila Date: 2017-07-26 15:06:49 +0530 Support EvalPlanQual mechanism for zheap To support the evalplanqual mechanism we need to support locking a tuple (as of now only in LockTupleExclusive mode) to ensure that nobody else can update it once the tuple is qualified by this mechanism. We also need dirty snapshots to fetch the tuple by tupleid. We also need to update ExecScanFetch APIs so that they can recognise zheaptuples. We also need to adjust other visibility routines so that they can understand locked tuples; similarly, we need to adjust zheap_update and zheap_delete to account for locked tuples. Patch by me, reviewed and tested by Dilip. commit a4ac11945ae4201bc1c45ddd4ea676841f281cd3 Author: Thomas Munro Date: 2017-07-25 05:16:51 +1200 Improve loop. Nobody likes degenerate for loops. commit 4980d572117fc14583af00006e741eedc269a7b3 Author: Thomas Munro Date: 2017-07-25 05:13:52 +1200 Fix UndoLogDiscard(). Commit d12ea1cec176c3793cb53b6133d413418c338f62 changed the way that the offset part of an UndoRecPtr maps to a block number from header exclusive to header inclusive, but I failed to change UndoLogDiscard(). Thanks to Dilip Kumar for diagnosis & fix. commit fbd05d66d941cb5410c8cc4b3c0bfe1fa4b9f3a6 Author: Kuntal Ghosh Date: 2017-07-17 14:49:36 +0530 Implement Index Scan using btree on zheap relations This doesn't include the implementation required for fetching multiple versions of a tuple in case of a non-MVCC snapshot. commit 5d9386b049a6dc61286c4140457ee1d728938fa6 Author: Kuntal Ghosh Date: 2017-07-18 10:43:19 +0530 Implement insertion of zheap tuples in btree index It doesn't include the implementation for handling updates on a key column. commit f849ed2274727a5c04988fcace73361f79b0346f Author: Kuntal Ghosh Date: 2017-07-17 14:18:16 +0530 Implement CREATE INDEX on zheap relations USING BTREE. This allows creation of a btree index on *empty* zheap relations. commit 0bab411eeb75256cbcb6e58cf99b493d59158ce3 Author: Amit Kapila Date: 2017-07-14 10:12:17 +0530 Define and use InvalidUndoRecPtr. commit 64433f17dd08fa2db21af689ff673abf16d863ff Author: Kuntal Ghosh Date: 2017-07-11 17:04:22 +0530 Fix an assert failure in PrepareUndoInsert commit 80878d1097ae747de6b257e8be911c16fd57196b Author: Kuntal Ghosh Date: 2017-07-04 12:32:00 +0530 Fix 'storage_engine' reloption to support case-insensitive inputs commit 9d4097fad8db35add630b7da9b98b3d0e95dcfbe Author: Thomas Munro Date: 2017-06-30 23:56:14 +1200 Change the way that undolog.c accounts for page headers. Previously, the offset component of UndoRecPtr was a counter of usable bytes within the undo log. Now it's a straight physical offset into the undo log. This means that it's now possible to have a (corrupt) UndoRecPtr that points at header data, which is nonsensical, but it makes a lot of things simpler. Now segment filenames, insert, discard, end offsets and UndoRecPtr values are all based on the same scale: number of raw bytes from the start of the undo log, regardless of whether those are header or data bytes. Previously, UndoLogAllocate advanced the insert pointer by the exact number of bytes requested in the 'size' argument. That was unusable because it required the caller to allocate header space this way too, and do a bunch of math that required knowing where the insert point already was. Now 'size' means usable bytes; header bytes are automatically accounted for.
Based on a complaint from Dilip. Let's try it this way and see if it makes more sense. commit 717961ad5ae25a6cfad73f7843e718a36b0f889b Author: dilip Date: 2017-06-30 10:29:32 +0530 Fix defect in undo record API In the earlier version there was an assumption that at least the undo record header would fit into the buffer, but actually that was not true: the undo record header can also be split across 2 buffers. The same has been fixed. commit c3cd40c69a91cee5d71fb04432393e49b8b62302 Author: dilip Date: 2017-06-29 16:26:48 +0530 Fix compilation warning in undolog.c commit a56bc2ce68de378fd73f31e48a127f0620c8fc48 Author: Amit Kapila Date: 2017-06-29 16:13:38 +0530 Store command id in undo records. Till now, there was a primitive implementation of command id. We were storing the command id in transaction slots, but that won't work for tuples stored in undo records. Now, store the command id in the undo record and fetch it from undo records during visibility checks. commit abef75d8885d21857afbc9d4487628807e913cb4 Author: dilip Date: 2017-06-29 15:44:44 +0530 Support for command ID in undorecord header commit 779cb3080bf4e4862cd008aa902b55029f545f4e Author: dilip Date: 2017-06-27 10:09:24 +0530 Bug fix in undo buffer list: Reset the undo buffer list commit 99f1adcf642e102df9f87f4763ca299ff0a57733 Author: Amit Kapila Date: 2017-06-23 20:45:43 +0530 Support reuse of transaction slots Currently the zheap page has four transaction slots, which means that there can't be more than four active transactions operating on a page. After four transactions the system will start waiting; this commit will allow the transaction slots to be reused. A transaction slot can be reused if the transaction is committed and all-visible, or if it is aborted and the undo has been applied. As of now, we don't have rollback support so we just rely on the first check to reuse the slot. commit 3d850d24e8aa04087ca55ebf9319e6b56ec5ec63 Author: Amit Kapila Date: 2017-06-23 20:32:51 +0530 Support Zheap in-place update and delete operations This allows zheap tuples to be updated in-place and marked as deleted. This also supports visibility checking for snapshots and traversing the undo chains for non-visible tuples. As of this commit, the support for snapshot visibility with respect to command id is not complete and can give wrong answers. A future commit will support the same. commit 9fe80770ed32020cac9487f93cdf542cd070c4fa Author: dilip Date: 2017-06-23 16:53:26 +0530 Fix warning in undolog and undorecord commit 3075d2186e45ca0fe14add50d1a8bdf82de7e85f Author: dilip Date: 2017-06-23 16:48:07 +0530 Undo API for inserting and fetching the undo records from undo storage Undo API is an interface on top of the undo log storage which provides a way to insert and fetch records from undo logs. The API internally manages the buffers and performs encoding and decoding of the undo records. commit 21d2703d218f05c28638e1ec98559d357e4b94f8 Author: Thomas Munro Date: 2017-06-22 10:27:23 +1200 Changed interface of UndoLogDiscard based on feedback. Previously it took the old pointer and size. Now it just takes the new pointer. In other words you call it with the address of the oldest byte you would like to keep. Also updated various comments, changed the pg_stat_undo_logs view a bit and fixed the rules.out to reflect that. commit 10ccdce79fed7a4dd42321198c8f8a1e9bd26dda Author: Thomas Munro Date: 2017-06-21 13:59:31 +1200 Implemented UndoLogDiscard and related things. Now able to discard and recycle segment files on master and standby.
Assorted other changes: using an LWLock per undo log, instead of a SpinLock. Using base/undo rather than base/9 to hold undo segment files. Renamed segment files as logno.offset, so that they also tell you the UndoRecPtr of the first byte in the segment. Got rid of the separate tracking of 'mvcc' and 'rollback' discard pointers; I don't think that's my job, instead there is just a single 'discard' pointer. Renamed 'capacity' to 'end'. There is plenty more work to be done here... commit 502d0afd95e3b7a6bb089be6efed35ed24cec96e Author: Thomas Munro Date: 2017-06-19 15:45:41 +1200 Fix double lock release bug in ForgetBuffer Don't try to release the partition lock twice if we failed to find the buffer. commit 32ca66260db1675f7fce94e59dbc55d9057bdfdd Author: Thomas Munro Date: 2017-06-01 17:07:43 +1200 Early prototype code for undo log storage. Includes a backport of 767bc028e5f001351feb498acef9a87c123093d6 because we need to be able to create pinned DSM segments without creating a bogus ResourceOwner first. commit 80bb5103f41d28052c8cd11f5393de6eaf1522cd Author: Kuntal Ghosh Date: 2017-06-06 21:47:46 +0530 Fix return type for zheap_insert commit 51f5293a475d6e2175f4e5d87978308e3d7b0a60 Author: Kuntal Ghosh Date: 2017-06-04 19:36:14 +0530 Fix NULL pointer access in RelationStorageIsZHeap commit 7c4f86a14b93151c9281a7cab3408b9d040fee88 Author: Amit Kapila Date: 2017-06-02 16:16:23 +0530 Change UNDO Insert tuple contents. As per current understanding, we don't need to store any information in the undo insert as the command id (cid) is also stored in the transaction slot. commit 31142f2fcde9e84baf531feb4187505dbaf2259d Author: Amit Kapila Date: 2017-06-02 16:12:42 +0530 Few defines for zheap tuple Define infomask flags required for zheap tuples. One of these flags is required for the already committed code in commit a3bb8c3411d1a17fb190b5757bf181d38900800a. commit feea62deb7528b7525090e6623cfc2482c6d20e0 Author: Amit Kapila Date: 2017-06-02 15:40:44 +0530 Implement getsysattribute function for zheap. This is required for the upcoming zheap delete operation patch. Attributes related to transaction information won't give correct information for all cases. I have added a Fixme in the code which needs to be fixed at a later time when we need to use those attributes. commit 7d7bb25644ad0a3fd234e6185d21151e7a7e8af0 Author: Amit Kapila Date: 2017-06-02 07:22:28 +0530 Initialize Zheap page with appropriate special space size. commit 7598316c1645db0e8d7eee55c45a529d5d426544 Author: Kuntal Ghosh Date: 2017-06-01 18:13:11 +0530 Add a 'storage_engine' reloption. It adds the code that decides which storage should be used for a relation. Allowed values are 'heap' and 'zheap'. The default value for this option is 'heap'. Also, it adds a macro 'RelationStorageIsZHeap(relation)' which can be used to decide whether a relation uses zheap or not. Patch by me, reviewed by Mithun C Y commit 2d8452096418348c6f406b595cc75903fde38b61 Author: Kuntal Ghosh Date: 2017-05-31 12:00:05 +0530 Support query quals for ZHeap It implements the qualification check of column names for a ZHeap tuple. commit a0ae8ceb08757b0cdce77780c87267b6dbc7de1f Author: Amit Kapila Date: 2017-05-30 16:31:54 +0530 ZHeap Tuple visibility checks Implements the basic zheap tuple visibility skeleton wherein we can verify if the inserted tuple is visible to our command. The basic idea used is to have a few transaction slots (as of now four; this needs some testing to determine the exact number) in the special space of the zheap page.
Each write operation on a page first needs to reserve a transaction slot in the page and update the same in the tuple header, and then proceed with the actual operation. We need to reserve two bits in the tuple header to remember the information of the transaction slot. There is a need of an additional bit to indicate no transaction slot, in which case the tuple will be all-visible. This usage of the additional bit will be implemented as a separate commit. As of now, each transaction slot contains xid, cid and undo_rec_ptr which helps us to determine tuple visibility. We might not need cid in the transaction slot as that can be retrieved from undo; however, keeping it in the page saves us from fetching the undo record pointer in many cases and, due to alignment, it doesn't cost us much space wise. commit 74983024bedce424427bec359f8f0c0f0615a774 Author: Amit Kapila Date: 2017-05-17 10:59:55 +0530 Handle cases for data_alignment = 4. Currently for four-byte alignment, we align everything to a four-byte boundary, which is not what we want for typalign 'c' or 's'. So align the given offset as per attalign for char and short alignment and at a four-byte boundary for other values of attalign. Patch by me, reported by Robert Haas. commit a802bb024f8041c0122419729c86611f592a262e Author: Amit Kapila Date: 2017-05-11 12:12:49 +0530 Support basic seqscan operation for ZHeap. Make HeapScanDesc aware of ZHeapTuple; this is required to scan. We could create a separate ZHeapScanDesc, but at this stage, I don't see the need of the same. I have written Zheap scan API's like zheap_getnext, zheap_beginscan, etc. As for inserts, it doesn't seem advisable to add a lot of if..else checks in heap scan API's to support the Zheap tuple format. Apart from tuple format, we can't directly use heap visibility routines (fetching transaction info from the tuple will be different), although I have yet to implement the same. Also, we might want to copy the zheap tuple before releasing the buffer lock as zheap tuples can be updated in-place. Note that with this patch only select * can work; to make the where clause work, we need to change a few other *get_attr api's so that they can understand the zheaptuple format. commit ffbc3b0a84451e365457e7a1d1d31a15cd1366cd Author: Amit Kapila Date: 2017-05-02 08:41:19 +0530 Set Tag needs to use zheap tuple when zheap is enabled. commit 63ba474f0d69a4123129ffa4849548e9d53a0146 Author: Amit Kapila Date: 2017-05-02 08:22:29 +0530 Mark data_alignment_zheap as PGDLLIMPORT. This is so that extensions can use it. commit 058618921b6f6319eb4f86909058a5a274fcabd0 Author: Amit Kapila Date: 2017-05-02 08:17:16 +0530 Remove spurious whitespace. commit 171bfaeeca96c2186ba45e5f9930df2cf7eeb5ee Author: amitkapila Date: 2017-05-02 07:54:16 +0530 Add missing Makefile in zheap directory. Commit 53643b6589a6b963b8621e3173d505383d510bb5 forgot to add the Makefile for the zheap directory. commit e70c0c4404021ece54e8bdcb9ce2ec6d3c2d87eb Author: amitkapila Date: 2017-04-28 15:38:56 +0530 Change max heap tuples per page. With shorter tuple headers, the maximum number of tuples that can fit on a page has significantly increased. So changed the calculation in the API's required for the insert operation. I have chosen to expose new API's as doing checks for the type of heap in such lower level API's doesn't seem sensible. For now, I have added new page level api's in zheapam.c. I think later we might want to split them into a separate file. commit 22449b025419c176ed74464ee26bbcc46125a63e Author: amitkapila Date: 2017-04-28 14:56:44 +0530 Support unaligned inserts in zheap.
A new guc data_alignment has been introduced to decide the alignment of tuple data. The data_alignment value 0 indicates no alignment, value 4 indicates align everything at a 4 byte boundary, and any other value indicates align as per the typalign of the attribute. The main objective of this commit is to test the size of data with different alignments. Based on the test results we might want to retain one type of alignment and remove this GUC and associated code. commit eaf9f4b054e9d8e7bfd2fc86bf7c7f4e51a04e20 Author: amitkapila Date: 2017-04-28 12:24:29 +0530 Support Zheap Insert Operation. The main idea of this patch is to support a short tuple header (3 bytes instead of 24 bytes). As of this patch, I have kept the tuple header aligned to 8 bytes and tuple data alignment works as for the main heap. This doesn't include support for toast inserts or speculative inserts (Insert .. On Conflict). Also, we don't support selects. As of now, the support for undo records is dummy, which means records are formed but not stored. Similarly for WAL, there is support for XLOG record insertion, but replay is not done as that also needs some support from undo. I have added a guc enable_zheap to perform zheap inserts which needs to be changed to a reloption or something else, but it serves the purpose as of now. commit ba5b2b31bbd9d8053be2416fb0cad0fda64df053 Author: Robert Haas Date: 2017-03-05 10:18:45 +0530 Throw-away test code for UndoRecordInsert. commit daadef317fa3b8453115ac24c351faed5f277ffd Author: Robert Haas Date: 2017-03-05 08:07:52 +0530 Implement UndoRecordExpectedSize and InsertUndoRecord. commit fe886f35e8b257c6a24fa3729df26432dfb52881 Author: Amit Kapila Date: 2017-02-24 11:42:16 +0530 Update comment to reflect the function name changed in commit 29f4db6d7a158c75fc152351f874b8d4e6af63a0. commit 0c79dde27afa0d432ed53ad7c353204560500611 Author: Thomas Munro Date: 2017-02-23 16:19:10 +0530 Added bootstrap interface; tidied commit b05429f48e3e7104e84428a81884055569e1006d Author: Thomas Munro Date: 2017-02-23 14:57:08 +0530 An early draft of undolog.h. commit cc1c4927fbc236559c6831ab583f1600c96a543d Author: Robert Haas Date: 2017-02-23 14:41:05 +0530 Rename DropBuffer to ForgetBuffer, change API a bit, implement. Also, implement ForgetLocalBuffer. commit 954bf571a3e332e5d0dd16f6ec2eb4c11d7962cf Author: Amit Kapila Date: 2017-02-23 14:02:03 +0530 Expose the DropBuffer API. We don't need a new header file just to expose one API. So move the API to bufmgr.h and remove undobuf.h. commit 28ca1b57f1d509657d6c77eaf07fd0af395e2e5b Author: Amit Kapila Date: 2017-02-23 12:04:24 +0530 Draft header files for undoworker, undoloop and undobuf. --- diff --git a/README.md b/README.md new file mode 100644 index 0000000000..f677606ec3 --- /dev/null +++ b/README.md @@ -0,0 +1,69 @@ +The purpose of this document is to let users know how they can use zheap (a new +storage format for PostgreSQL) and the work that is still pending. This new +storage format provides better control over bloat, reduces the tuple size +and reduces write amplification. The detailed design of zheap is present in +the zheap design document (src/backend/access/zheap/README). + +How do I use zheap? +=================== + +We have provided a storage engine option which you can set when creating a table. +For example: + +create table t_zheap(c1 int, c2 varchar) USING zheap; + +Index creation for zheap tables doesn't need any special syntax. + +You can also set the GUC parameter default_table_access_method. The +default value is "heap", but you can set it to "zheap".
If you do, +all subsequently-created tables will use zheap. + +These interfaces will probably change once the storage format API work is +integrated into PostgreSQL. We’ll adjust this code to use whatever interfaces +are agreed by the PostgreSQL community. + +We have also provided a GUC called data_alignment, which sets the alignment +used for zheap tuples. 0 indicates no alignment, 4 uses a maximum of 4 byte +alignment, and any other value indicates align as per attalign. This also +controls the padding between tuples. This parameter is just for some +experiments to see the impact of alignment on database size. This parameter +will be removed later; we’ll align as described in the zheap design document. + +Each zheap page has a fixed set of transaction slots, each of which contains the +transaction information (transaction id and epoch) and the latest undo record +pointer for that transaction. By default, we have four transaction slots per +page, but this can be changed by setting --with-trans_slots_per_zheap_page=value +while configuring zheap. + +What doesn’t work yet? +====================== +- Logical decoding +- Snapshot too old - We might want to implement this after the first version is +committed as this will work differently for zheap. +- Alter Table Set Tablespace - For this feature to work +correctly in zheap, while copying pages, we need to ensure that pending aborts +get applied before copying the page. + +Tools +- pg_undo_dump, similar to pg_waldump: We would like to develop this utility +as it can be used to view undo record contents and can help us debug problems +related to undo chains. +- We also want to develop tools like pgstattuple, pgrowlocks that +allow us to inspect the contents of database pages at a low level. +- wal consistency checker: This will be used to check for bugs in the WAL redo +routines. Currently, it is quite similar to what we have in the current heap, but +we want to extend it to check the consistency of undo pages similar to how it +checks for data and index pages. + +Open Issues +=========== +- Currently, the TPD pages are not added to FSM even if they can be completely +reused. +- Single user mode: This needs some investigation as to what exactly is required. +I think we need to ensure that undo gets applied without the need to invoke the undo +worker. + +The other pending code related items are tracked on the zheap wiki page: +https://wiki.postgresql.org/wiki/Zheap + +You can find the overall design of zheap in the README: src/backend/access/zheap/README diff --git a/configure b/configure index dce6d98cf6..db72375f20 100755 --- a/configure +++ b/configure @@ -838,6 +838,7 @@ enable_tap_tests with_blocksize with_segsize with_wal_blocksize +with_trans_slots_per_zheap_page with_CC with_llvm enable_depend @@ -1540,6 +1541,8 @@ Optional Packages: --with-segsize=SEGSIZE set table segment size in GB [1] --with-wal-blocksize=BLOCKSIZE set WAL block size in kB [8] + --with-trans_slots_per_zheap_page=SLOTS + set transaction slots per zheap page [4] --with-CC=CMD set compiler (deprecated) --with-llvm build with LLVM based JIT support --with-icu build with ICU support @@ -3800,6 +3803,51 @@ cat >>confdefs.h <<_ACEOF _ACEOF +# +# transaction slots per zheap page +# +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for transaction slots per zheap page" >&5 +$as_echo_n "checking for transaction slots per zheap page... " >&6; } + + + +# Check whether --with-trans_slots_per_zheap_page was given.
+if test "${with_trans_slots_per_zheap_page+set}" = set; then : + withval=$with_trans_slots_per_zheap_page; + case $withval in + yes) + as_fn_error $? "argument required for --with-trans_slots_per_zheap_page option" "$LINENO" 5 + ;; + no) + as_fn_error $? "argument required for --with-trans_slots_per_zheap_page option" "$LINENO" 5 + ;; + *) + trans_slots_per_page=$withval + ;; + esac + +else + trans_slots_per_page=4 +fi + + +case ${trans_slots_per_page} in + 2) ZHEAP_PAGE_TRANS_SLOTS=2;; + 4) ZHEAP_PAGE_TRANS_SLOTS=4;; + 8) ZHEAP_PAGE_TRANS_SLOTS=8;; + 16) ZHEAP_PAGE_TRANS_SLOTS=16;; + 31) ZHEAP_PAGE_TRANS_SLOTS=31;; + *) as_fn_error $? "Invalid transaction slots per zheap page. Allowed values are 2,4,8,16,31." "$LINENO" 5 +esac +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: ${trans_slots_per_page}" >&5 +$as_echo "${trans_slots_per_page}" >&6; } + + +cat >>confdefs.h <<_ACEOF +#define ZHEAP_PAGE_TRANS_SLOTS ${ZHEAP_PAGE_TRANS_SLOTS} +_ACEOF + + # # C compiler # diff --git a/configure.in b/configure.in index e5123ac122..e411047550 100644 --- a/configure.in +++ b/configure.in @@ -343,6 +343,29 @@ AC_DEFINE_UNQUOTED([XLOG_BLCKSZ], ${XLOG_BLCKSZ}, [ Changing XLOG_BLCKSZ requires an initdb. ]) +# +# transaction slots per zheap page +# +AC_MSG_CHECKING([for transaction slots per zheap page]) +PGAC_ARG_REQ(with, trans_slots_per_zheap_page, [SLOTS], [set transaction slots per zheap page [4]], + [trans_slots_per_page=$withval], + [trans_slots_per_page=4]) +case ${trans_slots_per_page} in + 2) ZHEAP_PAGE_TRANS_SLOTS=2;; + 4) ZHEAP_PAGE_TRANS_SLOTS=4;; + 8) ZHEAP_PAGE_TRANS_SLOTS=8;; + 16) ZHEAP_PAGE_TRANS_SLOTS=16;; + 31) ZHEAP_PAGE_TRANS_SLOTS=31;; + *) AC_MSG_ERROR([Invalid transaction slots per zheap page. Allowed values are 2,4,8,16,31.]) +esac +AC_MSG_RESULT([${trans_slots_per_page}]) + +AC_DEFINE_UNQUOTED([ZHEAP_PAGE_TRANS_SLOTS], ${ZHEAP_PAGE_TRANS_SLOTS}, [ + transaction slots per zheap page. By default, it is set to 4. + + Changing ZHEAP_PAGE_TRANS_SLOTS requires an initdb. 
+]) + # # C compiler # diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index e5a581f141..18fca394c9 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -2,17 +2,17 @@ MODULE_big = pageinspect OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ - brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES) + brinfuncs.o ginfuncs.o hashfuncs.o zheapfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.6--1.7.sql \ +DATA = pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \ pageinspect--1.5.sql pageinspect--1.5--1.6.sql \ pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \ pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \ pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" -REGRESS = page btree brin gin hash +REGRESS = page btree brin gin hash zheap ifdef USE_PGXS PG_CONFIG = pg_config diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out index 3fcd9fbe6d..9c8b26709c 100644 --- a/contrib/pageinspect/expected/page.out +++ b/contrib/pageinspect/expected/page.out @@ -1,5 +1,5 @@ CREATE EXTENSION pageinspect; -CREATE TABLE test1 (a int, b int); +CREATE TABLE test1 (a int, b int) USING heap; INSERT INTO test1 VALUES (16777217, 131584); VACUUM test1; -- set up FSM -- The page contents can vary, so just test that it can be read diff --git a/contrib/pageinspect/expected/zheap.out b/contrib/pageinspect/expected/zheap.out new file mode 100644 index 0000000000..e740f648bc --- /dev/null +++ b/contrib/pageinspect/expected/zheap.out @@ -0,0 +1,64 @@ +CREATE TABLE test_zheap (a int, b int) USING zheap; +INSERT INTO test_zheap VALUES (16777217, 131584); +-- The page contents can vary, so just test that it can be read +-- successfully, but don't keep the output. +SELECT pagesize, version FROM page_header(get_raw_page('test_zheap', 1)); + pagesize | version +----------+--------- + 8192 | 4 +(1 row) + +SELECT page_checksum(get_raw_page('test_zheap', 1), 1) IS NOT NULL AS silly_checksum_test; + silly_checksum_test +--------------------- + t +(1 row) + +DROP TABLE test_zheap; +-- check that using any of these functions with a partitioned table would fail +create table test_partitioned (a int) partition by range (a) USING zheap; +select get_raw_page('test_partitioned', 1); -- error about partitioned table +ERROR: cannot get raw page from partitioned table "test_partitioned" +-- a regular table which is a member of a partition set should work though +create table test_part1 partition of test_partitioned for values from ( 1 ) to (100) USING zheap; +select get_raw_page('test_part1', 1); -- get farther and error about empty table +ERROR: block number 1 is out of range for relation "test_part1" +drop table test_partitioned; +-- The tuple contents can vary, so we perform some basic testing of zheap_page_items. +-- We perform all the tuple modifications in a single transaction so that t_slot +-- doesn't change if we change trancsation slots in page during compile time. +-- Because of the same reason, we cannot check for all possibile output for +-- t_infomask_info (for example: slot-reused, multilock, l-nokey-ex etc). 
+create table test_zheap (a int, b text) USING zheap WITH (autovacuum_enabled=false); +begin; +insert into test_zheap (a) select generate_series(1,6); +update test_zheap set a=10 where a=2; +update test_zheap set b='abcd' where a=3; +delete from test_zheap where a=4; +select * from test_zheap where a=5 for share; + a | b +---+--- + 5 | +(1 row) + +select * from test_zheap where a=6 for update; + a | b +---+--- + 6 | +(1 row) + +commit; +select lp,lp_flags,t_slot,t_infomask2,t_infomask,t_hoff,t_bits, + t_infomask_info from zheap_page_items(get_raw_page('test_zheap', 1)); + lp | lp_flags | t_slot | t_infomask2 | t_infomask | t_hoff | t_bits | t_infomask_info +----+----------+--------+-------------+------------+--------+----------+----------------- + 1 | 1 | 1 | 2050 | 1 | 6 | 10000000 | + 2 | 1 | 1 | 2050 | 33 | 6 | 10000000 | {in-updated} + 3 | 1 | 1 | 2050 | 65 | 6 | 10000000 | {updated} + 4 | 1 | 1 | 2050 | 1041 | 6 | 10000000 | {deleted,l-ex} + 5 | 1 | 1 | 2050 | 897 | 6 | 10000000 | {l-share} + 6 | 1 | 1 | 2050 | 1153 | 6 | 10000000 | {l-ex} + 7 | 1 | 1 | 2050 | 2 | 5 | | +(7 rows) + +drop table test_zheap; diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql new file mode 100644 index 0000000000..b538af7508 --- /dev/null +++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql @@ -0,0 +1,39 @@ +/* contrib/pageinspect/pageinspect--1.7--1.8.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pageinspect UPDATE TO '1.8'" to load this file. \quit + +-- +-- zheap functions +-- + +-- +-- zheap_page_items() +-- +CREATE FUNCTION zheap_page_items(IN page bytea, + OUT lp smallint, + OUT lp_off smallint, + OUT lp_flags smallint, + OUT lp_len smallint, + OUT t_slot smallint, + OUT t_infomask2 integer, + OUT t_infomask integer, + OUT t_hoff smallint, + OUT t_bits text, + OUT t_data bytea, + OUT t_infomask_info text[]) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'zheap_page_items' +LANGUAGE C STRICT PARALLEL SAFE; + +-- +-- zheap_page_slots() +-- +CREATE FUNCTION zheap_page_slots(IN page bytea, + OUT slot_id smallint, + OUT epoch int4, + OUT xid int4, + OUT undoptr int8) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'zheap_page_slots' +LANGUAGE C STRICT PARALLEL SAFE; diff --git a/contrib/pageinspect/pageinspect.control b/contrib/pageinspect/pageinspect.control index dcfc61f22d..f8cdf526c6 100644 --- a/contrib/pageinspect/pageinspect.control +++ b/contrib/pageinspect/pageinspect.control @@ -1,5 +1,5 @@ # pageinspect extension comment = 'inspect the contents of database pages at a low level' -default_version = '1.7' +default_version = '1.8' module_pathname = '$libdir/pageinspect' relocatable = true diff --git a/contrib/pageinspect/sql/page.sql b/contrib/pageinspect/sql/page.sql index 8ac9991837..8f0ef62cdc 100644 --- a/contrib/pageinspect/sql/page.sql +++ b/contrib/pageinspect/sql/page.sql @@ -1,6 +1,6 @@ CREATE EXTENSION pageinspect; -CREATE TABLE test1 (a int, b int); +CREATE TABLE test1 (a int, b int) USING heap; INSERT INTO test1 VALUES (16777217, 131584); VACUUM test1; -- set up FSM diff --git a/contrib/pageinspect/sql/zheap.sql b/contrib/pageinspect/sql/zheap.sql new file mode 100644 index 0000000000..0b7b6bd4ae --- /dev/null +++ b/contrib/pageinspect/sql/zheap.sql @@ -0,0 +1,38 @@ +CREATE TABLE test_zheap (a int, b int) USING zheap; +INSERT INTO test_zheap VALUES (16777217, 131584); + +-- The page contents can vary, so just test that it can be read +-- successfully, but don't keep the 
output. + +SELECT pagesize, version FROM page_header(get_raw_page('test_zheap', 1)); + +SELECT page_checksum(get_raw_page('test_zheap', 1), 1) IS NOT NULL AS silly_checksum_test; + +DROP TABLE test_zheap; + +-- check that using any of these functions with a partitioned table would fail +create table test_partitioned (a int) partition by range (a) USING zheap; +select get_raw_page('test_partitioned', 1); -- error about partitioned table + +-- a regular table which is a member of a partition set should work though +create table test_part1 partition of test_partitioned for values from ( 1 ) to (100) USING zheap; +select get_raw_page('test_part1', 1); -- get farther and error about empty table +drop table test_partitioned; + +-- The tuple contents can vary, so we perform some basic testing of zheap_page_items. +-- We perform all the tuple modifications in a single transaction so that t_slot +-- doesn't change if we change trancsation slots in page during compile time. +-- Because of the same reason, we cannot check for all possibile output for +-- t_infomask_info (for example: slot-reused, multilock, l-nokey-ex etc). +create table test_zheap (a int, b text) USING zheap WITH (autovacuum_enabled=false); +begin; +insert into test_zheap (a) select generate_series(1,6); +update test_zheap set a=10 where a=2; +update test_zheap set b='abcd' where a=3; +delete from test_zheap where a=4; +select * from test_zheap where a=5 for share; +select * from test_zheap where a=6 for update; +commit; +select lp,lp_flags,t_slot,t_infomask2,t_infomask,t_hoff,t_bits, + t_infomask_info from zheap_page_items(get_raw_page('test_zheap', 1)); +drop table test_zheap; diff --git a/contrib/pageinspect/zheapfuncs.c b/contrib/pageinspect/zheapfuncs.c new file mode 100644 index 0000000000..834bfce917 --- /dev/null +++ b/contrib/pageinspect/zheapfuncs.c @@ -0,0 +1,429 @@ +/*------------------------------------------------------------------------- + * + * zheapfuncs.c + * Functions to investigate zheap pages + * + * We check the input to these functions for corrupt pointers etc. that + * might cause crashes, but at the same time we try to print out as much + * information as possible, even if it's nonsense. That's because if a + * page is corrupt, we don't know why and how exactly it is corrupt, so we + * let the user judge it. + * + * These functions are restricted to superusers for the fear of introducing + * security holes if the input checking isn't as water-tight as it should be. + * You'd need to be superuser to obtain a raw page image anyway, so + * there's hardly any use case for using these without superuser-rights + * anyway. + * + * Copyright (c) 2007-2018, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pageinspect/zheapfuncs.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "pageinspect.h" + +#include "access/htup_details.h" +#include "access/zheap.h" +#include "funcapi.h" +#include "catalog/pg_type.h" +#include "miscadmin.h" +#include "utils/array.h" +#include "utils/builtins.h" +#include "utils/rel.h" + +static void decode_infomask(ZHeapTupleHeader ztuphdr, Datum *values, bool *nulls); + +/* + * bits_to_text + * + * Converts a bits8-array of 'len' bits to a human-readable + * c-string representation. + */ +static char * +bits_to_text(bits8 *bits, int len) +{ + int i; + char *str; + + str = palloc(len + 1); + + for (i = 0; i < len; i++) + str[i] = (bits[(i / 8)] & (1 << (i % 8))) ? 
'1' : '0'; + + str[i] = '\0'; + + return str; +} + +/* + * decode_infomask + * + * Converts tuple infomask into an array describing the flags marked in + * tuple infomask. + */ +static void +decode_infomask(ZHeapTupleHeader ztuphdr, Datum *values, bool *nulls) +{ + ArrayBuildState *raw_attrs; + raw_attrs = initArrayResult(TEXTOID, CurrentMemoryContext, false); + if (ZHeapTupleHasMultiLockers(ztuphdr->t_infomask) || + IsZHeapTupleModified(ztuphdr->t_infomask) || + ZHeapTupleHasInvalidXact(ztuphdr->t_infomask)) + { + if (ZHeapTupleHasInvalidXact(ztuphdr->t_infomask)) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("slot-reused"), + false, TEXTOID, CurrentMemoryContext); + } + if (ZHeapTupleHasMultiLockers(ztuphdr->t_infomask)) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("multilock"), + false, TEXTOID, CurrentMemoryContext); + } + if (ztuphdr->t_infomask & ZHEAP_DELETED) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("deleted"), + false, TEXTOID, CurrentMemoryContext); + } + if (ztuphdr->t_infomask & ZHEAP_UPDATED) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("updated"), + false, TEXTOID, CurrentMemoryContext); + } + if (ztuphdr->t_infomask & ZHEAP_INPLACE_UPDATED) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("in-updated"), + false, TEXTOID, CurrentMemoryContext); + } + if ((ztuphdr->t_infomask & ZHEAP_XID_SHR_LOCK) == ZHEAP_XID_SHR_LOCK) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("l-share"), + false, TEXTOID, CurrentMemoryContext); + } + else if (ztuphdr->t_infomask & ZHEAP_XID_NOKEY_EXCL_LOCK) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("l-nokey-ex"), + false, TEXTOID, CurrentMemoryContext); + } + else if (ztuphdr->t_infomask & ZHEAP_XID_KEYSHR_LOCK) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("l-keyshare"), + false, TEXTOID, CurrentMemoryContext); + } + if (ztuphdr->t_infomask & ZHEAP_XID_EXCL_LOCK) + { + raw_attrs = accumArrayResult(raw_attrs, CStringGetTextDatum("l-ex"), + false, TEXTOID, CurrentMemoryContext); + } + *values = makeArrayResult(raw_attrs, CurrentMemoryContext); + } + else + *nulls = true; +} + +/* + * zheap_page_items + * + * Allows inspection of line pointers and tuple headers of a zheap page. 
+ */ +PG_FUNCTION_INFO_V1(zheap_page_items); + +typedef struct zheap_page_items_state +{ + TupleDesc tupd; + Page page; + uint16 offset; +} zheap_page_items_state; + +Datum +zheap_page_items(PG_FUNCTION_ARGS) +{ + bytea *raw_page = PG_GETARG_BYTEA_P(0); + zheap_page_items_state *inter_call_data = NULL; + FuncCallContext *fctx; + int raw_page_size; + + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + (errmsg("must be superuser to use raw page functions")))); + + raw_page_size = VARSIZE(raw_page) - VARHDRSZ; + + if (SRF_IS_FIRSTCALL()) + { + TupleDesc tupdesc; + MemoryContext mctx; + int num_trans_slots; + + if (raw_page_size < SizeOfPageHeaderData) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("input page too small (%d bytes)", raw_page_size))); + + fctx = SRF_FIRSTCALL_INIT(); + mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx); + + inter_call_data = palloc(sizeof(zheap_page_items_state)); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + inter_call_data->tupd = tupdesc; + + inter_call_data->offset = FirstOffsetNumber; + inter_call_data->page = VARDATA(raw_page); + + fctx->max_calls = PageGetMaxOffsetNumber(inter_call_data->page); + fctx->user_fctx = inter_call_data; + + /* + * We cannot check whether this is a zheap page or not. But, we can + * check whether pd_special is set correctly so that it contains the + * expected number of transaction slots in the special space. + */ + num_trans_slots = (raw_page_size - ((PageHeader) + (inter_call_data->page))->pd_special) + / sizeof(ZHeapPageOpaqueData); + + if (num_trans_slots != ZHEAP_PAGE_TRANS_SLOTS) + elog(ERROR, "zheap page contains unexpected number of transaction" + "slots: %d, expecting %d", num_trans_slots, ZHEAP_PAGE_TRANS_SLOTS); + + MemoryContextSwitchTo(mctx); + } + + fctx = SRF_PERCALL_SETUP(); + inter_call_data = fctx->user_fctx; + + if (fctx->call_cntr < fctx->max_calls) + { + Page page = inter_call_data->page; + HeapTuple resultTuple; + Datum result; + ItemId id; + Datum values[11]; + bool nulls[11]; + uint16 lp_offset; + uint16 lp_flags; + uint16 lp_len; + + memset(nulls, 0, sizeof(nulls)); + + /* Extract information from the line pointer */ + + id = PageGetItemId(page, inter_call_data->offset); + + lp_offset = ItemIdGetOffset(id); + lp_flags = ItemIdGetFlags(id); + lp_len = ItemIdGetLength(id); + + values[0] = UInt16GetDatum(inter_call_data->offset); + values[1] = UInt16GetDatum(lp_offset); + values[2] = UInt16GetDatum(lp_flags); + values[3] = UInt16GetDatum(lp_len); + + /* + * We do just enough validity checking to make sure we don't reference + * data outside the page passed to us. The page could be corrupt in + * many other ways, but at least we won't crash. + */ + if (ItemIdHasStorage(id) && + lp_len >= MinZHeapTupleSize && + lp_offset + lp_len <= raw_page_size) + { + ZHeapTupleHeader ztuphdr; + bytea *tuple_data_bytea; + int tuple_data_len; + + /* Extract information from the tuple header */ + ztuphdr = (ZHeapTupleHeader) PageGetItem(page, id); + + values[4] = UInt16GetDatum(ZHeapTupleHeaderGetXactSlot(ztuphdr)); + + values[5] = UInt32GetDatum(ztuphdr->t_infomask2); + values[6] = UInt32GetDatum(ztuphdr->t_infomask); + values[7] = UInt8GetDatum(ztuphdr->t_hoff); + + /* + * We already checked that the item is completely within the raw + * page passed to us, with the length given in the line pointer. 
+ * Let's check that t_hoff doesn't point over lp_len, before using + * it to access t_bits. + */ + if (ztuphdr->t_hoff >= SizeofZHeapTupleHeader && + ztuphdr->t_hoff <= lp_len) + { + if (ztuphdr->t_infomask & ZHEAP_HASNULL) + { + int bits_len; + + bits_len = + BITMAPLEN(ZHeapTupleHeaderGetNatts(ztuphdr)) * BITS_PER_BYTE; + values[8] = CStringGetTextDatum( + bits_to_text(ztuphdr->t_bits, bits_len)); + } + else + nulls[8] = true; + + } + else + { + nulls[8] = true; + } + + /* Copy raw tuple data into bytea attribute */ + tuple_data_len = lp_len - ztuphdr->t_hoff; + tuple_data_bytea = (bytea *) palloc(tuple_data_len + VARHDRSZ); + SET_VARSIZE(tuple_data_bytea, tuple_data_len + VARHDRSZ); + memcpy(VARDATA(tuple_data_bytea), (char *) ztuphdr + ztuphdr->t_hoff, + tuple_data_len); + values[9] = PointerGetDatum(tuple_data_bytea); + + decode_infomask(ztuphdr, &values[10], &nulls[10]); + } + else + { + /* + * The line pointer is not used, or it's invalid. Set the remaining + * output columns (4 through 10) to NULL. + */ + int i; + + for (i = 4; i <= 10; i++) + nulls[i] = true; + } + + /* Build and return the result tuple. */ + resultTuple = heap_form_tuple(inter_call_data->tupd, values, nulls); + result = HeapTupleGetDatum(resultTuple); + + inter_call_data->offset++; + + SRF_RETURN_NEXT(fctx, result); + } + else + SRF_RETURN_DONE(fctx); +} + +/* + * zheap_page_slots + * + * Allows inspection of transaction slots of a zheap page. + */ +PG_FUNCTION_INFO_V1(zheap_page_slots); + +typedef struct zheap_page_slots_state +{ + TupleDesc tupd; + Page page; + uint16 slot_id; +} zheap_page_slots_state; + +Datum +zheap_page_slots(PG_FUNCTION_ARGS) +{ + bytea *raw_page = PG_GETARG_BYTEA_P(0); + zheap_page_slots_state *inter_call_data = NULL; + FuncCallContext *fctx; + int raw_page_size; + + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + (errmsg("must be superuser to use raw page functions")))); + + raw_page_size = VARSIZE(raw_page) - VARHDRSZ; + + if (SRF_IS_FIRSTCALL()) + { + TupleDesc tupdesc; + MemoryContext mctx; + int num_trans_slots; + + if (raw_page_size < SizeOfPageHeaderData) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("input page too small (%d bytes)", raw_page_size))); + + fctx = SRF_FIRSTCALL_INIT(); + mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx); + + inter_call_data = palloc(sizeof(zheap_page_slots_state)); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + inter_call_data->tupd = tupdesc; + + inter_call_data->slot_id = 0; + inter_call_data->page = VARDATA(raw_page); + + fctx->user_fctx = inter_call_data; + + /* + * We cannot check whether this is a zheap page or not. But, we can + * check whether pd_special is set correctly so that it contains the + * expected number of transaction slots in the special space. + */ + num_trans_slots = (raw_page_size - ((PageHeader) + (inter_call_data->page))->pd_special) + / sizeof(ZHeapPageOpaqueData); + + if (num_trans_slots != ZHEAP_PAGE_TRANS_SLOTS) + elog(ERROR, "zheap page contains unexpected number of transaction " + "slots: %d, expecting %d", num_trans_slots, ZHEAP_PAGE_TRANS_SLOTS); + + /* + * If the page has a TPD slot, the last slot is used as the TPD slot. + * In that case, it will not have any information about transactions.
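+ * The TPD slot itself is therefore not reported by this function; only
+ * the in-page transaction slots are returned. An illustrative call on a
+ * raw page image (output naturally depends on the page contents) is:
+ *
+ *     SELECT * FROM zheap_page_slots(get_raw_page('test_zheap', 1));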
+ */ + if (ZHeapPageHasTPDSlot((PageHeader) inter_call_data->page)) + num_trans_slots--; + fctx->max_calls = num_trans_slots; + + MemoryContextSwitchTo(mctx); + } + + fctx = SRF_PERCALL_SETUP(); + inter_call_data = fctx->user_fctx; + + if (fctx->call_cntr < fctx->max_calls) + { + Page page = inter_call_data->page; + HeapTuple resultTuple; + Datum result; + Datum values[4]; + bool nulls[4]; + ZHeapPageOpaque opaque; + TransInfo transinfo; + + memset(nulls, 0, sizeof(nulls)); + + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + transinfo = opaque->transinfo[inter_call_data->slot_id]; + + /* Fetch transaction and undo information from slot */ + values[0] = UInt16GetDatum(inter_call_data->slot_id + 1); + values[1] = UInt32GetDatum(transinfo.xid_epoch); + values[2] = UInt32GetDatum(transinfo.xid); + values[3] = UInt64GetDatum(transinfo.urec_ptr); + + /* Build and return the result tuple. */ + resultTuple = heap_form_tuple(inter_call_data->tupd, values, nulls); + result = HeapTupleGetDatum(resultTuple); + + inter_call_data->slot_id++; + + SRF_RETURN_NEXT(fctx, result); + } + else + SRF_RETURN_DONE(fctx); +} diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 4a7121a51f..ca15739f73 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -7245,6 +7245,41 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; + + undo_tablespaces (string) + + undo_tablespaces configuration parameter + + tablespacetemporary + + + + This variable specifies tablespaces in which to store undo data, when + undo-aware storage managers (initially "zheap") perform writes. + + + + The value is a list of names of tablespaces. When there is more than + one name in the list, PostgreSQL chooses an + arbitrary one. If the name doesn't correspond to an existing + tablespace, the next name is tried, and so on until all names have + been tried. If no valid tablespace is specified, an error is raised. + The validation of the name doesn't happen until the first attempt to + write undo data. + + + + The variable can only be changed before the first statement is + executed in a transaction. + + + + The default value is an empty string, which results in all undo data + being stored in the default tablespace. + + + + check_function_bodies (boolean) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b3336ea9be..a3614e7394 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -18463,6 +18463,11 @@ SELECT collation for ('foo' COLLATE "de_DE"); timestamp with time zone + + oldest_xid_with_epoch_having_undo + xid + + diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 96bcc3a63b..eeeb3f9391 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -332,6 +332,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser + + pg_stat_undo_logspg_stat_undo_logs + One row for each undo log, showing current pointers, + transactions and backends. + See for details. + + + @@ -549,7 +557,6 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser into the kernel's handling of I/O. - <structname>pg_stat_activity</structname> View @@ -1646,6 +1653,30 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser TwophaseFileWriteWaiting for a write of a two phase state file. + + UndoCheckpointRead + Waiting for a read from an undo checkpoint file. + + + UndoCheckpointSync + Waiting for changes to an undo checkpoint file to reach stable storage.
+ + + UndoCheckpointWrite + Waiting for a write to an undo checkpoint file. + + + UndoFileRead + Waiting for a read from an undo data file. + + + UndoFileSync + Waiting for changes to an undo data file to reach stable storage. + + + UndoFileWrite + Waiting for a write to an undo data file. + WALBootstrapSync Waiting for WAL to reach stable storage during bootstrapping. @@ -1722,6 +1753,80 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i +
+ <structname>pg_stat_undo_logs</structname> View + + + + + Column + Type + Description + + + + + + log_number + oid + Identifier of this undo log + + + persistence + text + Persistence level of data stored in this undo log; one of + permanent, unlogged or + temporary. + + + tablespace + text + Tablespace that holds physical storage of this undo log. + + + discard + text + Location of the oldest data in this undo log. + + + insert + text + Location where the next data will be written in this undo + log. + + + end + text + Location one byte past the end of the allocated physical storage + backing this undo log. + + + xid + xid + Transaction currently attached to this undo log + for writing. + + + pid + integer + Process ID of the backend currently attached to this undo log + for writing. + + + +
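As a minimal illustration of how this view fits together with the undo_tablespaces setting documented above (a sketch only: the tablespace name undo_ts is assumed to already exist, and the reported values will vary):

    SET undo_tablespaces = 'undo_ts';
    CREATE TABLE zt (a int) USING zheap;
    INSERT INTO zt VALUES (1);
    SELECT log_number, persistence, tablespace, xid, pid FROM pg_stat_undo_logs;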
+ + + The pg_stat_undo_logs view will have one row for + each undo log that exists. Undo logs are extents within a contiguous + addressing space that have their own head and tail pointers. + Each backend that has written undo data is associated with one or more undo + logs, and is the only backend that is allowed to write data to those undo + logs. Backends can be associated with up to three undo logs at a time, + because different undo logs are used for the undo data associated with + permanent, unlogged and temporary relations. + + <structname>pg_stat_replication</structname> View diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml index 8ef2ac8010..a693182f72 100644 --- a/doc/src/sgml/storage.sgml +++ b/doc/src/sgml/storage.sgml @@ -141,6 +141,11 @@ Item Subdirectory containing state files for prepared transactions + + pg_undo + Subdirectory containing undo log meta-data files + + pg_wal Subdirectory containing WAL (Write Ahead Log) files @@ -686,6 +691,57 @@ erased (they will be recreated automatically as needed). + + +Undo Logs + + + Undo Logs + + + +Undo logs hold data that is used for rolling back and for implementing +MVCC in access managers that are undo-aware (currently "zheap"). The storage +format of undo logs is optimized for reusing existing files. + + + +Undo data exists in a 64-bit address space broken up into numbered undo logs +that represent 1TB extents, for efficient management. The space is further +broken up into 1MB segment files, for physical storage. The name of each file +is the address of the first byte in the file, with a period inserted after +the part that indicates the undo log number. + + + +Each undo log is created in a particular tablespace and stores data for a +particular persistence level. +Undo logs are global in the sense that they don't belong to any particular +database and may contain undo data from relations in any database. +Undo files backing undo logs in the default tablespace are stored under +PGDATA/base/undo, and for other +tablespaces under undo in the appropriate tablespace +directory. The system view can be +used to see the cluster's current list of undo logs along with their +tablespaces and persistence levels. + + + +Just as relations can have one of the three persistence levels permanent, +unlogged or temporary, the undo data that is generated by modifying them must +be stored in an undo log of the same persistence level. This enables the +undo data to be discarded at appropriate times along with the relations that +reference it. + + + +Undo log files contain standard page headers as described in the next section, +but the format of the rest of the page is determined by the undo-aware +access method that reads and writes it. + + + + Database Page Layout diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile index 0880e0a8bb..42f1beedff 100644 --- a/src/backend/access/Makefile +++ b/src/backend/access/Makefile @@ -9,6 +9,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \ - table tablesample transam + table tablesample transam undo zheap include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c index 06dd628a5b..a04bfb7d3b 100644 --- a/src/backend/access/common/heaptuple.c +++ b/src/backend/access/common/heaptuple.c @@ -64,16 +64,6 @@ #include "utils/expandeddatum.h" -/* Does att's datatype allow packing into the 1-byte-header varlena format? */ -#define ATT_IS_PACKABLE(att) \ - ((att)->attlen == -1 && (att)->attstorage != 'p') -/* Use this if it's already known varlena */ -#define VARLENA_ATT_IS_PACKABLE(att) \ - ((att)->attstorage != 'p') - -static Datum getmissingattr(TupleDesc tupleDesc, int attnum, bool *isnull); - - /* ---------------------------------------------------------------- * misc support routines * ---------------------------------------------------------------- @@ -82,7 +72,7 @@ static Datum getmissingattr(TupleDesc tupleDesc, int attnum, bool *isnull); /* * Return the missing value of an attribute, or NULL if there isn't one. */ -static Datum +Datum getmissingattr(TupleDesc tupleDesc, int attnum, bool *isnull) { diff --git a/src/backend/access/common/tupconvert.c b/src/backend/access/common/tupconvert.c index fc88aa376a..6daa76893e 100644 --- a/src/backend/access/common/tupconvert.c +++ b/src/backend/access/common/tupconvert.c @@ -21,6 +21,7 @@ #include "access/htup_details.h" #include "access/tupconvert.h" #include "executor/tuptable.h" +#include "access/zheap.h" #include "utils/builtins.h" diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index ab5aaff156..bd603d9069 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -21,6 +21,7 @@ #include "access/xlogutils.h" #include "access/xlog.h" #include "access/transam.h" +#include "access/zheap.h" #include "storage/procarray.h" #include "miscadmin.h" @@ -1070,26 +1071,72 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid)); hitemid = PageGetItemId(hpage, hoffnum); - /* - * Follow any redirections until we find something useful. - */ - while (ItemIdIsRedirected(hitemid)) + if (!(xlrec->flags & XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP)) { - hoffnum = ItemIdGetRedirect(hitemid); - hitemid = PageGetItemId(hpage, hoffnum); - CHECK_FOR_INTERRUPTS(); + /* + * Follow any redirections until we find something useful. + */ + while (ItemIdIsRedirected(hitemid)) + { + hoffnum = ItemIdGetRedirect(hitemid); + hitemid = PageGetItemId(hpage, hoffnum); + CHECK_FOR_INTERRUPTS(); + } } /* * If the heap item has storage, then read the header and use that to * set latestRemovedXid. * + * We have special handling for zheap tuples that are deleted and + * don't have storage. + * * Some LP_DEAD items may not be accessible, so we ignore them. 
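+ * For zheap, an item that is marked deleted may have no tuple storage
+ * left at all; in that case t_data is set to NULL and the deleting
+ * transaction's XID is fetched from the page's transaction slot (or the
+ * undo chain) via ZHeapTupleGetTransInfo.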
*/ - if (ItemIdHasStorage(hitemid)) + if ((xlrec->flags & XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP) && + ItemIdIsDeleted(hitemid)) { - htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid); - HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid); + TransactionId xid; + ZHeapTupleData ztup; + + ztup.t_self = itup->t_tid; + ztup.t_len = ItemIdGetLength(hitemid); + ztup.t_tableOid = InvalidOid; + ztup.t_data = NULL; + ZHeapTupleGetTransInfo(&ztup, hbuffer, NULL, NULL, &xid, NULL, NULL, + false); + if (TransactionIdDidCommit(xid) && + TransactionIdFollows(xid, latestRemovedXid)) + latestRemovedXid = xid; + } + else if (ItemIdHasStorage(hitemid)) + { + if ((xlrec->flags & XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP) != 0) + { + ZHeapTupleHeader ztuphdr; + ZHeapTupleData ztup; + + ztuphdr = (ZHeapTupleHeader) PageGetItem(hpage, hitemid); + ztup.t_self = itup->t_tid; + ztup.t_len = ItemIdGetLength(hitemid); + ztup.t_tableOid = InvalidOid; + ztup.t_data = ztuphdr; + + if (ztuphdr->t_infomask & ZHEAP_DELETED + || ztuphdr->t_infomask & ZHEAP_UPDATED) + { + TransactionId xid; + + ZHeapTupleGetTransInfo(&ztup, hbuffer, NULL, NULL, &xid, + NULL, NULL, false); + ZHeapTupleHeaderAdvanceLatestRemovedXid(ztuphdr, xid, &latestRemovedXid); + } + } + else + { + htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid); + HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid); + } } else if (ItemIdIsDead(hitemid)) { diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c index 3eb722ce26..a2f9693cce 100644 --- a/src/backend/access/hash/hashinsert.c +++ b/src/backend/access/hash/hashinsert.c @@ -25,7 +25,7 @@ #include "storage/predicate.h" static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, - RelFileNode hnode); + Relation heapRel); /* * _hash_doinsert() -- Handle insertion of a single index tuple. @@ -138,7 +138,7 @@ restart_insert: if (IsBufferCleanupOK(buf)) { - _hash_vacuum_one_page(rel, metabuf, buf, heapRel->rd_node); + _hash_vacuum_one_page(rel, metabuf, buf, heapRel); if (PageGetFreeSpace(page) >= itemsz) break; /* OK, now we have enough space */ @@ -337,7 +337,7 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups, static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, - RelFileNode hnode) + Relation heapRel) { OffsetNumber deletable[MaxOffsetNumber]; int ndeletable = 0; @@ -394,8 +394,10 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, xl_hash_vacuum_one_page xlrec; XLogRecPtr recptr; - xlrec.hnode = hnode; + xlrec.hnode = heapRel->rd_node; xlrec.ntuples = ndeletable; + xlrec.flags = RelationStorageIsZHeap(heapRel) ? 
+ XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP : 0; XLogBeginInsert(); XLogRegisterBuffer(0, buf, REGBUF_STANDARD); diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index f769d828ff..8c3427c476 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -89,9 +89,6 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf, static Bitmapset *HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols, HeapTuple oldtup, HeapTuple newtup); -static bool heap_acquire_tuplock(Relation relation, ItemPointer tid, - LockTupleMode mode, LockWaitPolicy wait_policy, - bool *have_tuple_lock); static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask, uint16 old_infomask2, TransactionId add_to_xmax, LockTupleMode mode, bool is_update, @@ -127,36 +124,7 @@ static bool ProjIndexIsUnchanged(Relation relation, HeapTuple oldtup, HeapTuple * Don't look at lockstatus/updstatus directly! Use get_mxact_status_for_lock * instead. */ -static const struct -{ - LOCKMODE hwlock; - int lockstatus; - int updstatus; -} - tupleLockExtraInfo[MaxLockTupleMode + 1] = -{ - { /* LockTupleKeyShare */ - AccessShareLock, - MultiXactStatusForKeyShare, - -1 /* KeyShare does not allow updating tuples */ - }, - { /* LockTupleShare */ - RowShareLock, - MultiXactStatusForShare, - -1 /* Share does not allow updating tuples */ - }, - { /* LockTupleNoKeyExclusive */ - ExclusiveLock, - MultiXactStatusForNoKeyUpdate, - MultiXactStatusNoKeyUpdate - }, - { /* LockTupleExclusive */ - AccessExclusiveLock, - MultiXactStatusForUpdate, - MultiXactStatusUpdate - } -}; /* Get the LOCKMODE for a given MultiXactStatus */ #define LOCKMODE_from_mxstatus(status) \ @@ -169,8 +137,6 @@ static const struct */ #define LockTupleTuplock(rel, tup, mode) \ LockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) -#define UnlockTupleTuplock(rel, tup, mode) \ - UnlockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) #define ConditionalLockTupleTuplock(rel, tup, mode) \ ConditionalLockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) @@ -433,7 +399,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber page) else valid = HeapTupleSatisfies(&loctup, snapshot, buffer); - CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, &loctup, + CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, + (void *) &loctup, buffer, snapshot); if (valid) @@ -648,7 +615,8 @@ heapgettup(HeapScanDesc scan, */ valid = HeapTupleSatisfies(tuple, snapshot, scan->rs_cbuf); - CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, tuple, + CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, + (void *) tuple, scan->rs_cbuf, snapshot); if (valid && key != NULL) @@ -1768,9 +1736,10 @@ heap_fetch(Relation relation, valid = HeapTupleSatisfies(tuple, snapshot, buffer); if (valid) - PredicateLockTuple(relation, tuple, snapshot); + PredicateLockTid(relation, &(tuple->t_self), snapshot, + HeapTupleHeaderGetXmin(tuple->t_data)); - CheckForSerializableConflictOut(valid, relation, tuple, buffer, snapshot); + CheckForSerializableConflictOut(valid, relation, (void *) tuple, buffer, snapshot); LockBuffer(buffer, BUFFER_LOCK_UNLOCK); @@ -1908,7 +1877,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, /* If it's visible per the snapshot, we must return it */ valid = HeapTupleSatisfies(heapTuple, snapshot, buffer); - CheckForSerializableConflictOut(valid, relation, heapTuple, + CheckForSerializableConflictOut(valid, relation, (void *) heapTuple, buffer, 
snapshot); /* reset to original, non-redirected, tid */ heapTuple->t_self = *tid; @@ -1916,7 +1885,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, if (valid) { ItemPointerSetOffsetNumber(tid, offnum); - PredicateLockTuple(relation, heapTuple, snapshot); + PredicateLockTid(relation, &(heapTuple)->t_self, snapshot, + HeapTupleHeaderGetXmin(heapTuple->t_data)); if (all_dead) *all_dead = false; return true; @@ -2082,7 +2052,7 @@ heap_get_latest_tid(Relation relation, * result candidate. */ valid = HeapTupleSatisfies(&tp, snapshot, buffer); - CheckForSerializableConflictOut(valid, relation, &tp, buffer, snapshot); + CheckForSerializableConflictOut(valid, relation, (void *) &tp, buffer, snapshot); if (valid) *tid = ctid; @@ -3036,7 +3006,7 @@ l1: * being visible to the scan (i.e., an exclusive buffer content lock is * continuously held from this point until the tuple delete is visible). */ - CheckForSerializableConflictIn(relation, &tp, buffer); + CheckForSerializableConflictIn(relation, &(tp.t_self), buffer); /* replace cid with a combo cid if necessary */ HeapTupleHeaderAdjustCmax(tp.t_data, &cid, &iscombo); @@ -3962,7 +3932,7 @@ l2: * will include checking the relation level, there is no benefit to a * separate check for the new tuple. */ - CheckForSerializableConflictIn(relation, &oldtup, buffer); + CheckForSerializableConflictIn(relation, &(oldtup.t_self), buffer); /* * At this point newbuf and buffer are both pinned and locked, and newbuf @@ -5114,7 +5084,7 @@ out_unlocked: * Returns false if it was unable to obtain the lock; this can only happen if * wait_policy is Skip. */ -static bool +bool heap_acquire_tuplock(Relation relation, ItemPointer tid, LockTupleMode mode, LockWaitPolicy wait_policy, bool *have_tuple_lock) { diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 95513dfec8..80c6e5ed82 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -1399,9 +1399,10 @@ heapam_scan_bitmap_pagescan(TableScanDesc sscan, if (valid) { scan->rs_vistuples[ntup++] = offnum; - PredicateLockTuple(scan->rs_scan.rs_rd, &loctup, snapshot); + PredicateLockTid(scan->rs_scan.rs_rd, &loctup.t_self, snapshot, + HeapTupleHeaderGetXmin(loctup.t_data)); } - CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, &loctup, + CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, (void *) &loctup, buffer, snapshot); } } @@ -1627,7 +1628,7 @@ heapam_scan_sample_next_tuple(TableScanDesc sscan, struct SampleScanState *scans /* in pagemode, heapgetpage did this for us */ if (!pagemode) - CheckForSerializableConflictOut(visible, scan->rs_scan.rs_rd, tuple, + CheckForSerializableConflictOut(visible, scan->rs_scan.rs_rd, (void *) tuple, scan->rs_cbuf, scan->rs_scan.rs_snapshot); /* Try next tuple from same page. 
*/ diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c index 1ac1a20c1d..eb03fae9a4 100644 --- a/src/backend/access/heap/heapam_visibility.c +++ b/src/backend/access/heap/heapam_visibility.c @@ -760,6 +760,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot, Assert(htup->t_tableOid != InvalidOid); snapshot->xmin = snapshot->xmax = InvalidTransactionId; + snapshot->subxid = InvalidSubTransactionId; snapshot->speculativeToken = 0; if (!HeapTupleHeaderXminCommitted(tuple)) @@ -1842,3 +1843,73 @@ HeapTupleSatisfies(HeapTuple stup, Snapshot snapshot, Buffer buffer) return false; /* keep compiler quiet */ } + +/* + * This is a helper function for CheckForSerializableConflictOut. + * + * Check to see whether the tuple has been written to by a concurrent + * transaction, either to create it not visible to us, or to delete it + * while it is visible to us. The "visible" bool indicates whether the + * tuple is visible to us, while HeapTupleSatisfiesVacuum checks what else + * is going on with it. The caller should have a share lock on the buffer. + */ +bool +HeapTupleHasSerializableConflictOut(bool visible, HeapTuple tuple, Buffer buffer, + TransactionId *xid) +{ + HTSV_Result htsvResult; + htsvResult = HeapTupleSatisfiesVacuum(tuple, TransactionXmin, buffer); + switch (htsvResult) + { + case HEAPTUPLE_LIVE: + if (visible) + return false; + *xid = HeapTupleHeaderGetXmin(tuple->t_data); + break; + case HEAPTUPLE_RECENTLY_DEAD: + if (!visible) + return false; + *xid = HeapTupleHeaderGetUpdateXid(tuple->t_data); + break; + case HEAPTUPLE_DELETE_IN_PROGRESS: + *xid = HeapTupleHeaderGetUpdateXid(tuple->t_data); + break; + case HEAPTUPLE_INSERT_IN_PROGRESS: + *xid = HeapTupleHeaderGetXmin(tuple->t_data); + break; + case HEAPTUPLE_DEAD: + return false; + default: + + /* + * The only way to get to this default clause is if a new value is + * added to the enum type without adding it to this switch + * statement. That's a bug, so elog. + */ + elog(ERROR, "unrecognized return value from HeapTupleSatisfiesVacuum: %u", htsvResult); + + /* + * In spite of having all enum values covered and calling elog on + * this default, some compilers think this is a code path which + * allows xid to be used below without initialization. Silence + * that warning. + */ + *xid = InvalidTransactionId; + } + Assert(TransactionIdIsValid(*xid)); + Assert(TransactionIdFollowsOrEquals(*xid, TransactionXmin)); + + /* + * Find top level xid. Bail out if xid is too early to be a conflict, or + * if it's our own xid. + */ + if (TransactionIdEquals(*xid, GetTopTransactionIdIfAny())) + return false; + *xid = SubTransGetTopmostTransaction(*xid); + if (TransactionIdPrecedes(*xid, TransactionXmin)) + return false; + if (TransactionIdEquals(*xid, GetTopTransactionIdIfAny())) + return false; + + return true; +} diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index b8b5871559..856a14c90b 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -19,6 +19,8 @@ #include "access/hio.h" #include "access/htup_details.h" #include "access/visibilitymap.h" +#include "access/zheap.h" +#include "access/zhtup.h" #include "storage/bufmgr.h" #include "storage/freespace.h" #include "storage/lmgr.h" @@ -76,7 +78,7 @@ RelationPutHeapTuple(Relation relation, /* * Read in a buffer, using bulk-insert strategy if bistate isn't NULL. 
*/ -static Buffer +Buffer ReadBufferBI(Relation relation, BlockNumber targetBlock, BulkInsertState bistate) { @@ -118,7 +120,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock, * must not be InvalidBuffer. If both buffers are specified, buffer1 must * be less than buffer2. */ -static void +void GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, BlockNumber block1, BlockNumber block2, Buffer *vmbuffer1, Buffer *vmbuffer2) @@ -174,7 +176,7 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, * amount which ramps up as the degree of contention ramps up, but limiting * the result to some sane overall value. */ -static void +void RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) { BlockNumber blockNum, @@ -216,7 +218,17 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) BufferGetBlockNumber(buffer), RelationGetRelationName(relation)); - PageInit(page, BufferGetPageSize(buffer), 0); + if (RelationStorageIsZHeap(relation)) + { + Assert(BufferGetBlockNumber(buffer) != ZHEAP_METAPAGE); + ZheapInitPage(page, BufferGetPageSize(buffer)); + freespace = PageGetZHeapFreeSpace(page); + } + else + { + PageInit(page, BufferGetPageSize(buffer), 0); + freespace = PageGetHeapFreeSpace(page); + } /* * We mark all the new buffers dirty, but do nothing to write them @@ -227,8 +239,6 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) /* we'll need this info below */ blockNum = BufferGetBlockNumber(buffer); - freespace = PageGetHeapFreeSpace(page); - UnlockReleaseBuffer(buffer); /* Remember first block number thus added. */ diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c index 486cde4aff..385f1b20f5 100644 --- a/src/backend/access/heap/tuptoaster.c +++ b/src/backend/access/heap/tuptoaster.c @@ -71,20 +71,10 @@ typedef struct toast_compress_header static void toast_delete_datum(Relation rel, Datum value, bool is_speculative); static Datum toast_save_datum(Relation rel, Datum value, struct varlena *oldexternal, int options); -static bool toastrel_valueid_exists(Relation toastrel, Oid valueid); -static bool toastid_valueid_exists(Oid toastrelid, Oid valueid); static struct varlena *toast_fetch_datum(struct varlena *attr); static struct varlena *toast_fetch_datum_slice(struct varlena *attr, int32 sliceoffset, int32 length); static struct varlena *toast_decompress_datum(struct varlena *attr); -static int toast_open_indexes(Relation toastrel, - LOCKMODE lock, - Relation **toastidxs, - int *num_indexes); -static void toast_close_indexes(Relation *toastidxs, int num_indexes, - LOCKMODE lock); -static void init_toast_snapshot(Snapshot toast_snapshot); - /* ---------- * heap_tuple_fetch_attr - @@ -1787,7 +1777,7 @@ toast_delete_datum(Relation rel, Datum value, bool is_speculative) * toast rows with that ID; see notes for GetNewOidWithIndex(). * ---------- */ -static bool +bool toastrel_valueid_exists(Relation toastrel, Oid valueid) { bool result = false; @@ -1835,7 +1825,7 @@ toastrel_valueid_exists(Relation toastrel, Oid valueid) * As above, but work from toast rel's OID not an open relation * ---------- */ -static bool +bool toastid_valueid_exists(Oid toastrelid, Oid valueid) { bool result; @@ -2289,7 +2279,7 @@ toast_decompress_datum(struct varlena *attr) * relation in this array. It is the responsibility of the caller of this * function to close the indexes as well as free them. 
*/ -static int +int toast_open_indexes(Relation toastrel, LOCKMODE lock, Relation **toastidxs, @@ -2348,7 +2338,7 @@ toast_open_indexes(Relation toastrel, * Close an array of indexes for a toast relation and free it. This should * be called for a set of indexes opened previously with toast_open_indexes. */ -static void +void toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock) { int i; @@ -2367,7 +2357,7 @@ toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock) * just use the oldest one. This is safe: at worst, we will get a "snapshot * too old" error that might have been avoided otherwise. */ -static void +void init_toast_snapshot(Snapshot toast_snapshot) { Snapshot snapshot = GetOldestSnapshot(); diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 429f9ad52a..36139e39f4 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -44,6 +44,7 @@ #include "access/transam.h" #include "access/visibilitymap.h" #include "access/xlog.h" +#include "access/zhtup.h" #include "catalog/storage.h" #include "commands/dbcommands.h" #include "commands/progress.h" @@ -83,15 +84,6 @@ #define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */ #define VACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */ -/* - * When a table has no indexes, vacuum the FSM after every 8GB, approximately - * (it won't be exact because we only vacuum FSM after processing a heap page - * that has some removable tuples). When there are indexes, this is ignored, - * and we vacuum FSM after each index/heap cleaning pass. - */ -#define VACUUM_FSM_EVERY_PAGES \ - ((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ)) - /* * Guesstimation of number of dead tuples per page. This is used to * provide an upper limit to memory allocated when vacuuming small @@ -111,35 +103,6 @@ */ #define PREFETCH_SIZE ((BlockNumber) 32) -typedef struct LVRelStats -{ - /* hasindex = true means two-pass strategy; false means one-pass */ - bool hasindex; - /* Overall statistics about rel */ - BlockNumber old_rel_pages; /* previous value of pg_class.relpages */ - BlockNumber rel_pages; /* total number of pages */ - BlockNumber scanned_pages; /* number of pages we examined */ - BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */ - BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */ - BlockNumber tupcount_pages; /* pages whose tuples we counted */ - double old_live_tuples; /* previous value of pg_class.reltuples */ - double new_rel_tuples; /* new estimated total # of tuples */ - double new_live_tuples; /* new estimated total # of live tuples */ - double new_dead_tuples; /* new estimated total # of dead tuples */ - BlockNumber pages_removed; - double tuples_deleted; - BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */ - /* List of TIDs of tuples we intend to delete */ - /* NB: this list is ordered by TID address */ - int num_dead_tuples; /* current # of entries */ - int max_dead_tuples; /* # slots allocated in array */ - ItemPointer dead_tuples; /* array of ItemPointerData */ - int num_index_scans; - TransactionId latestRemovedXid; - bool lock_waiter_detected; -} LVRelStats; - - /* A few variables that don't seem worth passing around as parameters */ static int elevel = -1; @@ -156,21 +119,12 @@ static void lazy_scan_heap(Relation onerel, int options, bool aggressive); static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); static bool lazy_check_needs_freeze(Buffer buf, bool *hastup); 
-static void lazy_vacuum_index(Relation indrel, - IndexBulkDeleteResult **stats, - LVRelStats *vacrelstats); -static void lazy_cleanup_index(Relation indrel, - IndexBulkDeleteResult *stats, - LVRelStats *vacrelstats); static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer, int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer); -static bool should_attempt_truncation(LVRelStats *vacrelstats); -static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats); static BlockNumber count_nondeletable_pages(Relation onerel, - LVRelStats *vacrelstats); + LVRelStats *vacrelstats, + BufferAccessStrategy vac_strategy); static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks); -static void lazy_record_dead_tuple(LVRelStats *vacrelstats, - ItemPointer itemptr); static bool lazy_tid_reaped(ItemPointer itemptr, void *state); static int vac_cmp_itemptr(const void *left, const void *right); static bool heap_page_is_all_visible(Relation rel, Buffer buf, @@ -287,7 +241,7 @@ heap_vacuum_rel(Relation onerel, int options, VacuumParams *params, * Optionally truncate the relation. */ if (should_attempt_truncation(vacrelstats)) - lazy_truncate_heap(onerel, vacrelstats); + lazy_truncate_heap(onerel, vacrelstats, vac_strategy); /* Report that we are now doing final cleanup */ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, @@ -746,7 +700,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, for (i = 0; i < nindexes; i++) lazy_vacuum_index(Irel[i], &indstats[i], - vacrelstats); + vacrelstats, + vac_strategy); /* * Report that we are now vacuuming the heap. We also increase @@ -1385,7 +1340,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, for (i = 0; i < nindexes; i++) lazy_vacuum_index(Irel[i], &indstats[i], - vacrelstats); + vacrelstats, + vac_strategy); /* Report that we are now vacuuming the heap */ hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP; @@ -1413,7 +1369,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, /* Do post-vacuum cleanup and statistics update for each index */ for (i = 0; i < nindexes; i++) - lazy_cleanup_index(Irel[i], indstats[i], vacrelstats); + lazy_cleanup_index(Irel[i], indstats[i], vacrelstats, vac_strategy); /* If no indexes, make log report that lazy_vacuum_heap would've made */ if (vacuumed_pages) @@ -1682,10 +1638,11 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup) * Delete all the index entries pointing to tuples listed in * vacrelstats->dead_tuples, and update running statistics. */ -static void +void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, - LVRelStats *vacrelstats) + LVRelStats *vacrelstats, + BufferAccessStrategy vac_strategy) { IndexVacuumInfo ivinfo; PGRUsage ru0; @@ -1714,10 +1671,11 @@ lazy_vacuum_index(Relation indrel, /* * lazy_cleanup_index() -- do post-vacuum cleanup for one index relation. */ -static void +void lazy_cleanup_index(Relation indrel, IndexBulkDeleteResult *stats, - LVRelStats *vacrelstats) + LVRelStats *vacrelstats, + BufferAccessStrategy vac_strategy) { IndexVacuumInfo ivinfo; PGRUsage ru0; @@ -1790,7 +1748,7 @@ lazy_cleanup_index(Relation indrel, * called for before we actually do it. If you change the logic here, be * careful to depend only on fields that lazy_scan_heap updates on-the-fly. 
*/ -static bool +bool should_attempt_truncation(LVRelStats *vacrelstats) { BlockNumber possibly_freeable; @@ -1808,8 +1766,9 @@ should_attempt_truncation(LVRelStats *vacrelstats) /* * lazy_truncate_heap - try to truncate off any empty pages at the end */ -static void -lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats) +void +lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats, + BufferAccessStrategy vac_strategy) { BlockNumber old_rel_pages = vacrelstats->rel_pages; BlockNumber new_rel_pages; @@ -1889,7 +1848,8 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats) * other backends could have added tuples to these pages whilst we * were vacuuming. */ - new_rel_pages = count_nondeletable_pages(onerel, vacrelstats); + new_rel_pages = count_nondeletable_pages(onerel, vacrelstats, + vac_strategy); if (new_rel_pages >= old_rel_pages) { @@ -1937,7 +1897,8 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats) * Returns number of nondeletable pages (last nonempty page + 1). */ static BlockNumber -count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats) +count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats, + BufferAccessStrategy vac_strategy) { BlockNumber blkno; BlockNumber prefetchedUntil; @@ -2050,6 +2011,17 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats) * this page. We formerly thought that DEAD tuples could be * thrown away, but that's not so, because we'd not have cleaned * out their index entries. + * + * XXX - This function is used by both heap and zheap and the + * behavior must be same in both the cases. However, for zheap, + * there could be some unused items that contain pending xact + * information for the current transaction. It is okay to + * truncate such pages as even if the transaction rolled back + * after this point, we won't be reclaiming the truncated pages + * or making the unused items back to dead. We can add Assert + * to check if the pending xact is the current transaction, but to + * do that we need some storage engine specific check which seems + * too much for the purpose for which it is required. 
*/ if (ItemIdIsUsed(itemid)) { @@ -2113,7 +2085,7 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks) /* * lazy_record_dead_tuple - remember one deletable tuple */ -static void +void lazy_record_dead_tuple(LVRelStats *vacrelstats, ItemPointer itemptr) { diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c index 695567b4b0..0ed915a58e 100644 --- a/src/backend/access/heap/visibilitymap.c +++ b/src/backend/access/heap/visibilitymap.c @@ -87,6 +87,8 @@ #include "access/heapam_xlog.h" #include "access/visibilitymap.h" +#include "access/zheapam_xlog.h" +#include "access/zheap.h" #include "access/xlog.h" #include "miscadmin.h" #include "storage/bufmgr.h" @@ -285,7 +287,10 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf, #endif Assert(InRecovery || XLogRecPtrIsInvalid(recptr)); - Assert(InRecovery || BufferIsValid(heapBuf)); + + /* For zheap we do not set heapBuf's status hence can be invalid */ + Assert(RelationStorageIsZHeap(rel) || + (InRecovery || BufferIsValid(heapBuf))); Assert(flags & VISIBILITYMAP_VALID_BITS); /* Check that we have the right heap page pinned, if present */ @@ -312,20 +317,32 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf, if (XLogRecPtrIsInvalid(recptr)) { Assert(!InRecovery); - recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf, - cutoff_xid, flags); - - /* - * If data checksums are enabled (or wal_log_hints=on), we - * need to protect the heap page from being torn. - */ - if (XLogHintBitIsNeeded()) + if (RelationStorageIsZHeap(rel)) { - Page heapPage = BufferGetPage(heapBuf); - - /* caller is expected to set PD_ALL_VISIBLE first */ - Assert(PageIsAllVisible(heapPage)); - PageSetLSN(heapPage, recptr); + recptr = log_zheap_visible(rel->rd_node, heapBuf, vmBuf, + cutoff_xid, flags); + /* + * We do not have a page wise visibility flag in zheap. + * So no need to set LSN on zheap page. + */ + } + else + { + recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf, + cutoff_xid, flags); + + /* + * If data checksums are enabled (or wal_log_hints=on), we + * need to protect the heap page from being torn. 
+ */ + if (XLogHintBitIsNeeded()) + { + Page heapPage = BufferGetPage(heapBuf); + + /* caller is expected to set PD_ALL_VISIBLE first */ + Assert(PageIsAllVisible(heapPage)); + PageSetLSN(heapPage, recptr); + } } } PageSetLSN(page, recptr); diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index fe5af31f87..23271d0cde 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -206,7 +206,7 @@ index_insert(Relation indexRelation, if (!(indexRelation->rd_amroutine->ampredlocks)) CheckForSerializableConflictIn(indexRelation, - (HeapTuple) NULL, + (ItemPointer) NULL, InvalidBuffer); return indexRelation->rd_amroutine->aminsert(indexRelation, values, isnull, diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index b2ad95f970..ed7c3163af 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -57,7 +57,7 @@ static TransactionId _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel, Buffer buf, OffsetNumber offset, ScanKey itup_scankey, IndexUniqueCheck checkUnique, bool *is_unique, - uint32 *speculativeToken); + uint32 *speculativeToken, SubTransactionId *subxid); static void _bt_findinsertloc(Relation rel, Buffer *bufptr, OffsetNumber *offsetptr, @@ -250,10 +250,12 @@ top: { TransactionId xwait; uint32 speculativeToken; + SubTransactionId subxid = InvalidSubTransactionId; offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false); xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey, - checkUnique, &is_unique, &speculativeToken); + checkUnique, &is_unique, &speculativeToken, + &subxid); if (TransactionIdIsValid(xwait)) { @@ -267,9 +269,12 @@ top: */ if (speculativeToken) SpeculativeInsertionWait(xwait, speculativeToken); + else if (subxid != InvalidSubTransactionId) + SubXactLockTableWait(xwait, subxid, rel, &itup->t_tid, + XLTW_InsertIndex); else - XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex); - + XactLockTableWait(xwait, rel, &itup->t_tid, + XLTW_InsertIndex); /* start over... */ if (stack) _bt_freestack(stack); @@ -331,7 +336,7 @@ static TransactionId _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel, Buffer buf, OffsetNumber offset, ScanKey itup_scankey, IndexUniqueCheck checkUnique, bool *is_unique, - uint32 *speculativeToken) + uint32 *speculativeToken, SubTransactionId *subxid) { TupleDesc itupdesc = RelationGetDescr(rel); int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); @@ -449,6 +454,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel, _bt_relbuf(rel, nbuf); /* Tell _bt_doinsert to wait... */ *speculativeToken = SnapshotDirty.speculativeToken; + *subxid = SnapshotDirty.subxid; return xwait; } diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index 4082103fe2..d893a00ed5 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -1067,6 +1067,8 @@ _bt_delitems_delete(Relation rel, Buffer buf, xlrec_delete.hnode = heapRel->rd_node; xlrec_delete.nitems = nitems; + xlrec_delete.flags = RelationStorageIsZHeap(heapRel) ? 
+ XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP : 0; XLogBeginInsert(); XLogRegisterBuffer(0, buf, REGBUF_STANDARD); diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 67a94cb80a..395e642501 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -21,6 +21,7 @@ #include "access/transam.h" #include "access/xlog.h" #include "access/xlogutils.h" +#include "access/zheap.h" #include "storage/procarray.h" #include "miscadmin.h" @@ -543,7 +544,6 @@ btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record) ItemId iitemid, hitemid; IndexTuple itup; - HeapTupleHeader htuphdr; BlockNumber hblkno; OffsetNumber hoffnum; TransactionId latestRemovedXid = InvalidTransactionId; @@ -622,27 +622,75 @@ btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record) hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid)); hitemid = PageGetItemId(hpage, hoffnum); - /* - * Follow any redirections until we find something useful. - */ - while (ItemIdIsRedirected(hitemid)) + if (!(xlrec->flags & XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP)) { - hoffnum = ItemIdGetRedirect(hitemid); - hitemid = PageGetItemId(hpage, hoffnum); - CHECK_FOR_INTERRUPTS(); + /* + * Follow any redirections until we find something useful. + */ + while (ItemIdIsRedirected(hitemid)) + { + hoffnum = ItemIdGetRedirect(hitemid); + hitemid = PageGetItemId(hpage, hoffnum); + CHECK_FOR_INTERRUPTS(); + } } /* * If the heap item has storage, then read the header and use that to * set latestRemovedXid. * + * We have special handling for zheap tuples that are deleted and + * don't have storage. + * * Some LP_DEAD items may not be accessible, so we ignore them. */ - if (ItemIdHasStorage(hitemid)) + if ((xlrec->flags & XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP) && + ItemIdIsDeleted(hitemid)) { - htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid); + TransactionId xid; + ZHeapTupleData ztup; + + ztup.t_self = itup->t_tid; + ztup.t_len = ItemIdGetLength(hitemid); + ztup.t_tableOid = InvalidOid; + ztup.t_data = NULL; + ZHeapTupleGetTransInfo(&ztup, hbuffer, NULL, NULL, &xid, NULL, NULL, + false); + if (TransactionIdDidCommit(xid) && + TransactionIdFollows(xid, latestRemovedXid)) + latestRemovedXid = xid; + } + else if (ItemIdHasStorage(hitemid)) + { + if ((xlrec->flags & XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP) != 0) + { + ZHeapTupleHeader ztuphdr; + ZHeapTupleData ztup; + + ztuphdr = (ZHeapTupleHeader) PageGetItem(hpage, hitemid); + ztup.t_self = itup->t_tid; + ztup.t_len = ItemIdGetLength(hitemid); + ztup.t_tableOid = InvalidOid; + ztup.t_data = ztuphdr; + + if (ztuphdr->t_infomask & ZHEAP_DELETED + || ztuphdr->t_infomask & ZHEAP_UPDATED) + { + TransactionId xid; + + ZHeapTupleGetTransInfo(&ztup, hbuffer, NULL, NULL, &xid, + NULL, NULL, false); + elog(DEBUG1, "TransactionId: %d",xid); + ZHeapTupleHeaderAdvanceLatestRemovedXid(ztuphdr, xid, &latestRemovedXid); + } + } + else + { + HeapTupleHeader htuphdr; + htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid); - HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid); + HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid); + } } else if (ItemIdIsDead(hitemid)) { diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile index 5514db1dda..9d3cdf8233 100644 --- a/src/backend/access/rmgrdesc/Makefile +++ b/src/backend/access/rmgrdesc/Makefile @@ -11,6 +11,7 @@ include $(top_builddir)/src/Makefile.global OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \ 
gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \ mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \ - smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o + smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o tpddesc.o undoactiondesc.o \ + undologdesc.o xactdesc.o xlogdesc.o zheapamdesc.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/rmgrdesc/tpddesc.c b/src/backend/access/rmgrdesc/tpddesc.c new file mode 100644 index 0000000000..a41c2ddff6 --- /dev/null +++ b/src/backend/access/rmgrdesc/tpddesc.c @@ -0,0 +1,73 @@ +/*------------------------------------------------------------------------- + * + * tpddesc.c + * rmgr descriptor routines for access/undo/tpdxlog.c + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/rmgrdesc/tpddesc.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd_xlog.h" + +void +tpd_desc(StringInfo buf, XLogReaderState *record) +{ + char *rec = XLogRecGetData(record); + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + info &= XLOG_TPD_OPMASK; + if (info == XLOG_ALLOCATE_TPD_ENTRY) + { + xl_tpd_allocate_entry *xlrec = (xl_tpd_allocate_entry *) rec; + + appendStringInfo(buf, "prevblk %u nextblk %u offset %u", + xlrec->prevblk, xlrec->nextblk, xlrec->offnum); + } + else if (info == XLOG_TPD_FREE_PAGE) + { + xl_tpd_free_page *xlrec = (xl_tpd_free_page *) rec; + + appendStringInfo(buf, "prevblk %u nextblk %u", + xlrec->prevblkno, xlrec->nextblkno); + } +} + +const char * +tpd_identify(uint8 info) +{ + const char *id = NULL; + + switch (info & ~XLR_INFO_MASK) + { + case XLOG_ALLOCATE_TPD_ENTRY: + id = "ALLOCATE TPD ENTRY"; + break; + case XLOG_ALLOCATE_TPD_ENTRY | XLOG_TPD_INIT_PAGE: + id = "ALLOCATE TPD ENTRY+INIT"; + break; + case XLOG_TPD_CLEAN: + id = "TPD CLEAN"; + break; + case XLOG_TPD_CLEAR_LOCATION: + id = "TPD CLEAR LOCATION"; + break; + case XLOG_INPLACE_UPDATE_TPD_ENTRY: + id = "INPLACE UPDATE TPD ENTRY"; + break; + case XLOG_TPD_FREE_PAGE: + id = "TPD FREE PAGE"; + break; + case XLOG_TPD_CLEAN_ALL_ENTRIES: + id = "TPD CLEAN ALL ENTRIES"; + break; + } + + return id; +} diff --git a/src/backend/access/rmgrdesc/undoactiondesc.c b/src/backend/access/rmgrdesc/undoactiondesc.c new file mode 100644 index 0000000000..0a23fece4a --- /dev/null +++ b/src/backend/access/rmgrdesc/undoactiondesc.c @@ -0,0 +1,67 @@ +/*------------------------------------------------------------------------- + * + * undoactiondesc.c + * rmgr descriptor routines for access/undo/undoactionxlog.c + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/rmgrdesc/undoactiondesc.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/undoaction_xlog.h" + +void +undoaction_desc(StringInfo buf, XLogReaderState *record) +{ + char *rec = XLogRecGetData(record); + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + if (info == XLOG_UNDO_PAGE) + { + uint8 *flags = (uint8 *) rec; + + appendStringInfo(buf, "page_contains_tpd_slot: %c ", + (*flags & XLU_PAGE_CONTAINS_TPD_SLOT) ? 'T' : 'F'); + appendStringInfo(buf, "is_page_initialized: %c ", + (*flags & XLU_INIT_PAGE) ? 
'T' : 'F'); + if (*flags & XLU_PAGE_CONTAINS_TPD_SLOT) + { + xl_undoaction_page *xlrec = + (xl_undoaction_page *) ((char *) flags + sizeof(uint8)); + + appendStringInfo(buf, "urec_ptr %lu xid %u trans_slot_id %u", + xlrec->urec_ptr, xlrec->xid, xlrec->trans_slot_id); + } + } + else if (info == XLOG_UNDO_RESET_SLOT) + { + xl_undoaction_reset_slot *xlrec = (xl_undoaction_reset_slot *) rec; + + appendStringInfo(buf, "urec_ptr %lu trans_slot_id %u", + xlrec->urec_ptr, xlrec->trans_slot_id); + } +} + +const char * +undoaction_identify(uint8 info) +{ + const char *id = NULL; + + switch (info & ~XLR_INFO_MASK) + { + case XLOG_UNDO_PAGE: + id = "UNDO PAGE"; + break; + case XLOG_UNDO_RESET_SLOT: + id = "UNDO RESET SLOT"; + break; + } + + return id; +} diff --git a/src/backend/access/rmgrdesc/undologdesc.c b/src/backend/access/rmgrdesc/undologdesc.c new file mode 100644 index 0000000000..5855b9b49e --- /dev/null +++ b/src/backend/access/rmgrdesc/undologdesc.c @@ -0,0 +1,104 @@ +/*------------------------------------------------------------------------- + * + * undologdesc.c + * rmgr descriptor routines for access/undo/undolog.c + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/rmgrdesc/undologdesc.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/undolog.h" +#include "access/undolog_xlog.h" + +void +undolog_desc(StringInfo buf, XLogReaderState *record) +{ + char *rec = XLogRecGetData(record); + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + if (info == XLOG_UNDOLOG_CREATE) + { + xl_undolog_create *xlrec = (xl_undolog_create *) rec; + + appendStringInfo(buf, "logno %u", xlrec->logno); + } + else if (info == XLOG_UNDOLOG_EXTEND) + { + xl_undolog_extend *xlrec = (xl_undolog_extend *) rec; + + appendStringInfo(buf, "logno %u end " UndoLogOffsetFormat, + xlrec->logno, xlrec->end); + } + else if (info == XLOG_UNDOLOG_ATTACH) + { + xl_undolog_attach *xlrec = (xl_undolog_attach *) rec; + + appendStringInfo(buf, "logno %u xid %u", xlrec->logno, xlrec->xid); + } + else if (info == XLOG_UNDOLOG_META) + { + xl_undolog_meta *xlrec = (xl_undolog_meta *) rec; + + appendStringInfo(buf, "logno %u xid %u insert " UndoLogOffsetFormat + " last_xact_start " UndoLogOffsetFormat + " prevlen=%d" + " is_first_record=%d", + xlrec->logno, xlrec->xid, xlrec->meta.insert, + xlrec->meta.last_xact_start, + xlrec->meta.prevlen, + xlrec->meta.is_first_rec); + } + else if (info == XLOG_UNDOLOG_DISCARD) + { + xl_undolog_discard *xlrec = (xl_undolog_discard *) rec; + + appendStringInfo(buf, "logno %u discard " UndoLogOffsetFormat " end " + UndoLogOffsetFormat, + xlrec->logno, xlrec->discard, xlrec->end); + } + else if (info == XLOG_UNDOLOG_REWIND) + { + xl_undolog_rewind *xlrec = (xl_undolog_rewind *) rec; + + appendStringInfo(buf, "logno %u insert " UndoLogOffsetFormat " prevlen %d", + xlrec->logno, xlrec->insert, xlrec->prevlen); + } + +} + +const char * +undolog_identify(uint8 info) +{ + const char *id = NULL; + + switch (info & ~XLR_INFO_MASK) + { + case XLOG_UNDOLOG_CREATE: + id = "CREATE"; + break; + case XLOG_UNDOLOG_EXTEND: + id = "EXTEND"; + break; + case XLOG_UNDOLOG_ATTACH: + id = "ATTACH"; + break; + case XLOG_UNDOLOG_META: + id = "UNDO_META"; + break; + case XLOG_UNDOLOG_DISCARD: + id = "DISCARD"; + break; + case XLOG_UNDOLOG_REWIND: + id = "REWIND"; + break; + } + + return id; +} 
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c index 00741c7b09..987e39c830 100644 --- a/src/backend/access/rmgrdesc/xlogdesc.c +++ b/src/backend/access/rmgrdesc/xlogdesc.c @@ -47,7 +47,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record) "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; " "oldest xid %u in DB %u; oldest multi %u in DB %u; " "oldest/newest commit timestamp xid: %u/%u; " - "oldest running xid %u; %s", + "oldest running xid %u; " + "oldest xid with epoch having undo " UINT64_FORMAT "; %s", (uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo, checkpoint->ThisTimeLineID, checkpoint->PrevTimeLineID, @@ -63,6 +64,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record) checkpoint->oldestCommitTsXid, checkpoint->newestCommitTsXid, checkpoint->oldestActiveXid, + checkpoint->oldestXidWithEpochHavingUndo, (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online"); } else if (info == XLOG_NEXTOID) diff --git a/src/backend/access/rmgrdesc/zheapamdesc.c b/src/backend/access/rmgrdesc/zheapamdesc.c new file mode 100644 index 0000000000..a5d88c21b0 --- /dev/null +++ b/src/backend/access/rmgrdesc/zheapamdesc.c @@ -0,0 +1,185 @@ +/*------------------------------------------------------------------------- + * + * zheapamdesc.c + * rmgr descriptor routines for access/zheap/zheapamxlog.c + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/rmgrdesc/zheapamdesc.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/zheapam_xlog.h" + +void +zheap_desc(StringInfo buf, XLogReaderState *record) +{ + char *rec = XLogRecGetData(record); + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + info &= XLOG_ZHEAP_OPMASK; + if (info == XLOG_ZHEAP_CLEAN) + { + xl_zheap_clean *xlrec = (xl_zheap_clean *) rec; + + appendStringInfo(buf, "remxid %u", xlrec->latestRemovedXid); + } + else if (info == XLOG_ZHEAP_INSERT) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_insert *xlrec = (xl_zheap_insert *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "off %u, blkprev %lu", xlrec->offnum, xlundohdr->blkprev); + } + else if(info == XLOG_ZHEAP_MULTI_INSERT) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_multi_insert *xlrec = (xl_zheap_multi_insert *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "%d tuples", xlrec->ntuples); + } + else if (info == XLOG_ZHEAP_DELETE) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_delete *xlrec = (xl_zheap_delete *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "off %u, trans_slot %u, hasUndoTuple: %c, blkprev %lu", + xlrec->offnum, xlrec->trans_slot_id, + (xlrec->flags & XLZ_HAS_DELETE_UNDOTUPLE) ? 'T' : 'F', + xlundohdr->blkprev); + } + else if (info == XLOG_ZHEAP_UPDATE) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_update *xlrec = (xl_zheap_update *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "oldoff %u, trans_slot %u, hasUndoTuple: %c, newoff: %u, blkprev %lu", + xlrec->old_offnum, xlrec->old_trans_slot_id, + (xlrec->flags & XLZ_HAS_UPDATE_UNDOTUPLE) ? 
'T' : 'F', + xlrec->new_offnum, + xlundohdr->blkprev); + } + else if (info == XLOG_ZHEAP_FREEZE_XACT_SLOT) + { + xl_zheap_freeze_xact_slot *xlrec = (xl_zheap_freeze_xact_slot *) rec; + + appendStringInfo(buf, "latest frozen xid %u nfrozen %u", + xlrec->lastestFrozenXid, xlrec->nFrozen); + } + else if (info == XLOG_ZHEAP_INVALID_XACT_SLOT) + { + uint16 nCompletedSlots = *(uint16 *) rec; + + appendStringInfo(buf, "completed_slots %u", nCompletedSlots); + } + else if (info == XLOG_ZHEAP_LOCK) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_lock *xlrec = (xl_zheap_lock *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "off %u, xid %u, trans_slot_id %u", + xlrec->offnum, xlrec->prev_xid, xlrec->trans_slot_id); + } +} + +void +zheap2_desc(StringInfo buf, XLogReaderState *record) +{ + char *rec = XLogRecGetData(record); + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + info &= XLOG_ZHEAP_OPMASK; + if (info == XLOG_ZHEAP_CONFIRM) + { + xl_zheap_confirm *xlrec = (xl_zheap_confirm *) rec; + + appendStringInfo(buf, "off %u: flags %u", xlrec->offnum, xlrec->flags); + } + else if (info == XLOG_ZHEAP_UNUSED) + { + xl_undo_header *xlundohdr = (xl_undo_header *) rec; + xl_zheap_unused *xlrec = (xl_zheap_unused *) ((char *) xlundohdr + SizeOfUndoHeader); + + appendStringInfo(buf, "remxid %u, trans_slot_id %u, blkprev %lu", + xlrec->latestRemovedXid, xlrec->trans_slot_id, + xlundohdr->blkprev); + } + else if (info == XLOG_ZHEAP_VISIBLE) + { + xl_zheap_visible *xlrec = (xl_zheap_visible *) rec; + + appendStringInfo(buf, "cutoff xid %u flags %d", + xlrec->cutoff_xid, xlrec->flags); + } +} + +const char * +zheap_identify(uint8 info) +{ + const char *id = NULL; + + switch (info & ~XLR_INFO_MASK) + { + case XLOG_ZHEAP_CLEAN: + id = "CLEAN"; + break; + case XLOG_ZHEAP_INSERT: + id = "INSERT"; + break; + case XLOG_ZHEAP_INSERT | XLOG_ZHEAP_INIT_PAGE: + id = "INSERT+INIT"; + break; + case XLOG_ZHEAP_DELETE: + id = "DELETE"; + break; + case XLOG_ZHEAP_UPDATE: + id = "UPDATE"; + break; + case XLOG_ZHEAP_UPDATE | XLOG_ZHEAP_INIT_PAGE: + id = "UPDATE+INIT"; + break; + case XLOG_ZHEAP_FREEZE_XACT_SLOT: + id = "FREEZE_XACT_SLOT"; + break; + case XLOG_ZHEAP_INVALID_XACT_SLOT: + id = "INVALID_XACT_SLOT"; + break; + case XLOG_ZHEAP_LOCK: + id = "LOCK"; + break; + case XLOG_ZHEAP_MULTI_INSERT: + id = "MULTI_INSERT"; + break; + case XLOG_ZHEAP_MULTI_INSERT | XLOG_ZHEAP_INIT_PAGE: + id = "MULTI_INSERT+INIT"; + break; + } + + return id; +} + +const char * +zheap2_identify(uint8 info) +{ + const char *id = NULL; + + switch (info & ~XLR_INFO_MASK) + { + case XLOG_ZHEAP_CONFIRM: + id = "CONFIRM"; + break; + case XLOG_ZHEAP_UNUSED: + id = "UNUSED"; + break; + case XLOG_ZHEAP_VISIBLE: + id = "VISIBLE"; + break; + } + + return id; +} diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c index 9368b56c4c..00aa180aee 100644 --- a/src/backend/access/transam/rmgr.c +++ b/src/backend/access/transam/rmgr.c @@ -18,8 +18,12 @@ #include "access/multixact.h" #include "access/nbtxlog.h" #include "access/spgxlog.h" +#include "access/tpd_xlog.h" +#include "access/undoaction_xlog.h" +#include "access/undolog_xlog.h" #include "access/xact.h" #include "access/xlog_internal.h" +#include "access/zheapam_xlog.h" #include "catalog/storage_xlog.h" #include "commands/dbcommands_xlog.h" #include "commands/sequence.h" diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index e65dccc6a2..46ec852ca5 100644 --- 
a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -915,6 +915,13 @@ typedef struct TwoPhaseFileHeader uint16 gidlen; /* length of the GID - GID follows the header */ XLogRecPtr origin_lsn; /* lsn of this record at origin node */ TimestampTz origin_timestamp; /* time of prepare at origin node */ + + /* + * We need the locations of start and end undo record pointers when rollbacks + * are to be performed for prepared transactions using zheap relations. + */ + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr end_urec_ptr[UndoPersistenceLevels]; } TwoPhaseFileHeader; /* @@ -989,7 +996,8 @@ save_state_data(const void *data, uint32 len) * Initializes data structure and inserts the 2PC file header record. */ void -StartPrepare(GlobalTransaction gxact) +StartPrepare(GlobalTransaction gxact, UndoRecPtr *start_urec_ptr, + UndoRecPtr *end_urec_ptr) { PGPROC *proc = &ProcGlobal->allProcs[gxact->pgprocno]; PGXACT *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno]; @@ -1020,6 +1028,11 @@ StartPrepare(GlobalTransaction gxact) hdr.database = proc->databaseId; hdr.prepared_at = gxact->prepared_at; hdr.owner = gxact->owner; + + /* save the start and end undo record pointers */ + memcpy(hdr.start_urec_ptr, start_urec_ptr, sizeof(hdr.start_urec_ptr)); + memcpy(hdr.end_urec_ptr, end_urec_ptr, sizeof(hdr.end_urec_ptr)); + hdr.nsubxacts = xactGetCommittedChildren(&children); hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels); hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels); @@ -1452,6 +1465,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit) RelFileNode *delrels; int ndelrels; SharedInvalidationMessage *invalmsgs; + int i; + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr end_urec_ptr[UndoPersistenceLevels]; /* * Validate the GID, and lock the GXACT to ensure that two backends do not @@ -1489,6 +1505,38 @@ FinishPreparedTransaction(const char *gid, bool isCommit) invalmsgs = (SharedInvalidationMessage *) bufptr; bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage)); + /* save the start and end undo record pointers */ + memcpy(start_urec_ptr, hdr->start_urec_ptr, sizeof(start_urec_ptr)); + memcpy(end_urec_ptr, hdr->end_urec_ptr, sizeof(end_urec_ptr)); + + /* + * Perform undo actions, if there are undologs for this transaction. + * We need to perform undo actions while we are still in transaction. + * Never push rollbacks of temp tables to undo worker. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (end_urec_ptr[i] != InvalidUndoRecPtr && !isCommit) + { + bool result = false; + uint64 rollback_size = 0; + + if (i != UNDO_TEMP) + rollback_size = end_urec_ptr[i] - start_urec_ptr[i]; + + if (rollback_size >= rollback_overflow_size * 1024 * 1024) + result = PushRollbackReq(end_urec_ptr[i], start_urec_ptr[i], InvalidOid); + + /* + * ZBORKED: set rellock = true, as we do *not* actually have all + * the locks, but that'll probably deadlock? 
+ */ + if (!result) + execute_undo_actions(end_urec_ptr[i], start_urec_ptr[i], true, + true, true); + } + } + /* compute latestXid among all children */ latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children); diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c index a5eb29e01a..79a217e25c 100644 --- a/src/backend/access/transam/varsup.c +++ b/src/backend/access/transam/varsup.c @@ -284,9 +284,21 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid) TransactionId xidStopLimit; TransactionId xidWrapLimit; TransactionId curXid; + TransactionId oldestXidHavingUndo; Assert(TransactionIdIsNormal(oldest_datfrozenxid)); + /* + * To determine the last safe xid that can be allocated, we need to + * consider oldestXidHavingUndo. The oldestXidHavingUndo will be only + * valid for zheap storage engine, so it won't impact any other storage + * engine. + */ + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + if (TransactionIdIsValid(oldestXidHavingUndo)) + oldest_datfrozenxid = Min(oldest_datfrozenxid, oldestXidHavingUndo); + /* * The place where we actually get into deep trouble is halfway around * from the oldest potentially-existing XID. (This calculation is @@ -354,6 +366,13 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid) curXid = ShmemVariableCache->nextXid; LWLockRelease(XidGenLock); + /* + * Fixme - The messages in below code need some adjustment for zheap. + * They should reflect that the system needs to discard the undo. We + * can add it once we have a pluggable storage API which might provide + * us some way to distinguish among differnt storage engines. + */ + /* Log the info */ ereport(DEBUG1, (errmsg("transaction ID wrap limit is %u, limited by database with OID %u", diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index d967400384..766ae2c3b5 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -30,6 +30,7 @@ #include "access/xlog.h" #include "access/xloginsert.h" #include "access/xlogutils.h" +#include "access/tpd.h" #include "catalog/namespace.h" #include "catalog/pg_enum.h" #include "catalog/storage.h" @@ -41,6 +42,7 @@ #include "libpq/pqsignal.h" #include "miscadmin.h" #include "pgstat.h" +#include "postmaster/undoloop.h" #include "replication/logical.h" #include "replication/logicallauncher.h" #include "replication/origin.h" @@ -66,6 +68,8 @@ #include "utils/timestamp.h" #include "pg_trace.h" +#define AtAbort_ResetUndoBuffers ResetUndoBuffers() +#define AtAbort_ResetTPDBuffers ResetTPDBuffers() /* * User-tweakable parameters @@ -188,8 +192,12 @@ typedef struct TransactionStateData bool prevXactReadOnly; /* entry-time xact r/o state */ bool startedInRecovery; /* did we start in recovery? */ bool didLogXid; /* has xid been included in WAL record? */ - int parallelModeLevel; /* Enter/ExitParallelMode counter */ - struct TransactionStateData *parent; /* back link to parent */ + int parallelModeLevel; /* Enter/ExitParallelMode counter */ + bool subXactLock; /* has lock created for subtransaction? 
*/ + /* start and end undo record location for each persistence level */ + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels]; + struct TransactionStateData *parent; /* back link to parent */ } TransactionStateData; typedef TransactionStateData *TransactionState; @@ -274,6 +282,20 @@ typedef struct SubXactCallbackItem static SubXactCallbackItem *SubXact_callbacks = NULL; +/* Location in undo log from where to start applying the undo actions. */ +static UndoRecPtr UndoActionStartPtr[UndoPersistenceLevels] = + {InvalidUndoRecPtr, + InvalidUndoRecPtr, + InvalidUndoRecPtr}; + +/* Location in undo log up to which undo actions need to be applied. */ +static UndoRecPtr UndoActionEndPtr[UndoPersistenceLevels] = + {InvalidUndoRecPtr, + InvalidUndoRecPtr, + InvalidUndoRecPtr}; + +/* Do we need to perform any undo actions? */ +static bool PerformUndoActions = false; /* local function prototypes */ static void AssignTransactionId(TransactionState s); @@ -616,6 +638,28 @@ AssignTransactionId(TransactionState s) } } +/* + * SetCurrentSubTransactionLocked + */ +void +SetCurrentSubTransactionLocked() +{ + TransactionState s = CurrentTransactionState; + + s->subXactLock = true; +} + +/* + * HasCurrentSubTransactionLock + */ +bool +HasCurrentSubTransactionLock() +{ + TransactionState s = CurrentTransactionState; + + return s->subXactLock; +} + /* * GetCurrentSubTransactionId */ @@ -627,6 +671,17 @@ GetCurrentSubTransactionId(void) return s->subTransactionId; } +/* + * GetCurrentTransactionResOwner + */ +ResourceOwner +GetCurrentTransactionResOwner(void) +{ + TransactionState s = CurrentTransactionState; + + return s->curTransactionOwner; +} + /* * SubTransactionIsActive * @@ -675,6 +730,15 @@ GetCurrentCommandId(bool used) return currentCommandId; } +/* + * GetCurrentCommandIdUsed + */ +bool +GetCurrentCommandIdUsed(void) +{ + return currentCommandIdUsed; +} + /* * SetParallelStartTimestamps * @@ -911,6 +975,24 @@ IsInParallelMode(void) return CurrentTransactionState->parallelModeLevel != 0; } +/* + * SetCurrentUndoLocation + */ +void +SetCurrentUndoLocation(UndoRecPtr urec_ptr) +{ + UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr)); + UndoPersistence upersistence = log->meta.persistence; + /* + * Set the start undo record pointer for first undo record in a + * subtransaction. + */ + if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence])) + CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr; + CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr; + +} + /* * CommandCounterIncrement */ @@ -1800,6 +1882,7 @@ StartTransaction(void) { TransactionState s; VirtualTransactionId vxid; + int i; /* * Let's just make sure the state stack is empty @@ -1878,6 +1961,14 @@ StartTransaction(void) nUnreportedXids = 0; s->didLogXid = false; + /* initialize undo record locations for the transaction */ + for(i = 0; i < UndoPersistenceLevels; i++) + { + s->start_urec_ptr[i] = InvalidUndoRecPtr; + s->latest_urec_ptr[i] = InvalidUndoRecPtr; + } + s->subXactLock = false; + /* * must initialize resource-management stuff first */ @@ -2152,6 +2243,10 @@ CommitTransaction(void) AtEOXact_ApplyLauncher(true); pgstat_report_xact_timestamp(0); + /* In single user mode, discard all the undo logs, once committed. 
*/ + if (!IsUnderPostmaster) + UndoLogDiscardAll(); + CurrentResourceOwner = NULL; ResourceOwnerDelete(TopTransactionResourceOwner); s->curTransactionOwner = NULL; @@ -2187,7 +2282,7 @@ CommitTransaction(void) * NB: if you change this routine, better look at CommitTransaction too! */ static void -PrepareTransaction(void) +PrepareTransaction(UndoRecPtr *start_urec_ptr, UndoRecPtr *end_urec_ptr) { TransactionState s = CurrentTransactionState; TransactionId xid = GetCurrentTransactionId(); @@ -2335,7 +2430,7 @@ PrepareTransaction(void) * PREPARED; in particular, pay attention to whether things should happen * before or after releasing the transaction's locks. */ - StartPrepare(gxact); + StartPrepare(gxact, start_urec_ptr, end_urec_ptr); AtPrepare_Notify(); AtPrepare_Locks(); @@ -2632,6 +2727,8 @@ AbortTransaction(void) AtEOXact_PgStat(false); AtEOXact_ApplyLauncher(false); pgstat_report_xact_timestamp(0); + AtAbort_ResetUndoBuffers; + AtAbort_ResetTPDBuffers; } /* @@ -2767,6 +2864,12 @@ void CommitTransactionCommand(void) { TransactionState s = CurrentTransactionState; + UndoRecPtr end_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + int i; + + memcpy(start_urec_ptr, s->start_urec_ptr, sizeof(start_urec_ptr)); + memcpy(end_urec_ptr, s->latest_urec_ptr, sizeof(end_urec_ptr)); switch (s->blockState) { @@ -2856,7 +2959,7 @@ CommitTransactionCommand(void) * return to the idle state. */ case TBLOCK_PREPARE: - PrepareTransaction(); + PrepareTransaction(start_urec_ptr, end_urec_ptr); s->blockState = TBLOCK_DEFAULT; break; @@ -2902,6 +3005,24 @@ CommitTransactionCommand(void) { CommitSubTransaction(); s = CurrentTransactionState; /* changed by pop */ + + /* + * Update the end undo record pointer if it's not valid with + * the currently popped transaction's end undo record pointer. + * This is particularly required when the first command of + * the transaction is of type which does not require an undo, + * e.g. savepoint x. + * Accordingly, update the start undo record pointer. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(end_urec_ptr[i])) + end_urec_ptr[i] = s->latest_urec_ptr[i]; + + if (UndoRecPtrIsValid(s->start_urec_ptr[i])) + start_urec_ptr[i] = s->start_urec_ptr[i]; + } + } while (s->blockState == TBLOCK_SUBCOMMIT); /* If we had a COMMIT command, finish off the main xact too */ if (s->blockState == TBLOCK_END) @@ -2913,7 +3034,7 @@ CommitTransactionCommand(void) else if (s->blockState == TBLOCK_PREPARE) { Assert(s->parent == NULL); - PrepareTransaction(); + PrepareTransaction(start_urec_ptr, end_urec_ptr); s->blockState = TBLOCK_DEFAULT; } else @@ -3007,7 +3128,18 @@ void AbortCurrentTransaction(void) { TransactionState s = CurrentTransactionState; + int i; + /* + * The undo actions are allowed to be executed at the end of statement + * execution when we are not in transaction block, otherwise they are + * executed when user explicitly ends the transaction. + * + * So if we are in a transaction block don't set the PerformUndoActions + * because this flag will be set when user explicitly issue rollback or + * rollback to savepoint. 
+ */ + PerformUndoActions = false; switch (s->blockState) { case TBLOCK_DEFAULT: @@ -3041,6 +3173,16 @@ AbortCurrentTransaction(void) AbortTransaction(); CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; + + /* + * We are outside the transaction block so remember the required + * information to perform undo actions and also set the + * PerformUndoActions so that we execute it before completing this + * command. + */ + PerformUndoActions = true; + memcpy (UndoActionStartPtr, s->latest_urec_ptr, sizeof(UndoActionStartPtr)); + memcpy (UndoActionEndPtr, s->start_urec_ptr, sizeof(UndoActionEndPtr)); break; /* @@ -3077,6 +3219,9 @@ AbortCurrentTransaction(void) AbortTransaction(); CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; + + /* Failed during commit, so we need to perform the undo actions. */ + PerformUndoActions = true; break; /* @@ -3096,6 +3241,9 @@ AbortCurrentTransaction(void) case TBLOCK_ABORT_END: CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; + + /* Failed during commit, so we need to perform the undo actions. */ + PerformUndoActions = true; break; /* @@ -3106,6 +3254,12 @@ AbortCurrentTransaction(void) AbortTransaction(); CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; + + /* + * Failed while executing the rollback command, need perform any + * pending undo actions. + */ + PerformUndoActions = true; break; /* @@ -3117,6 +3271,12 @@ AbortCurrentTransaction(void) AbortTransaction(); CleanupTransaction(); s->blockState = TBLOCK_DEFAULT; + + /* + * Perform any pending actions if failed while preparing the + * transaction. + */ + PerformUndoActions = true; break; /* @@ -3139,6 +3299,17 @@ AbortCurrentTransaction(void) case TBLOCK_SUBCOMMIT: case TBLOCK_SUBABORT_PENDING: case TBLOCK_SUBRESTART: + /* + * If we are here and still UndoActionStartPtr is valid that means + * the subtransaction failed while executing the undo action, so + * store its undo action start point in parent so that parent can + * start its undo action from this point. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(UndoActionStartPtr[i])) + s->parent->latest_urec_ptr[i] = UndoActionStartPtr[i]; + } AbortSubTransaction(); CleanupSubTransaction(); AbortCurrentTransaction(); @@ -3155,6 +3326,109 @@ AbortCurrentTransaction(void) } } +/* + * XactPerformUndoActionsIfPending - Execute pending undo actions. + * + * If the parent transaction state is valid (when there is an error in the + * subtransaction and rollback to savepoint is executed), then allow to + * perform undo actions in it, otherwise perform them in a new transaction. + */ +void +XactPerformUndoActionsIfPending() +{ + TransactionState s = CurrentTransactionState; + uint64 rollback_size = 0; + bool new_xact = true, result = false, no_pending_action = true; + UndoRecPtr parent_latest_urec_ptr[UndoPersistenceLevels]; + int i = 0; + + if (!PerformUndoActions) + return; + + /* If there is no undo log for any persistence level, then return. */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(UndoActionStartPtr[i])) + { + no_pending_action = false; + break; + } + } + + if (no_pending_action) + { + PerformUndoActions = false; + return; + } + + /* + * Execute undo actions under parent transaction, if any. Otherwise start + * a new transaction. 
+ */ + if (GetTopTransactionIdIfAny() != InvalidTransactionId) + { + memcpy(parent_latest_urec_ptr, s->latest_urec_ptr, + sizeof (parent_latest_urec_ptr)); + new_xact = false; + } + + /* + * If this is a large rollback request then push it to undo-worker + * through RollbackHT, undo-worker will perform it's undo actions later. + * Never push the rollbacks for temp tables. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(UndoActionStartPtr[i])) + continue; + + if (i == UNDO_TEMP) + goto perform_rollback; + else + rollback_size = UndoActionStartPtr[i] - UndoActionEndPtr[i]; + + if (new_xact && rollback_size > rollback_overflow_size * 1024 * 1024) + result = PushRollbackReq(UndoActionStartPtr[i], UndoActionEndPtr[i], InvalidOid); + + if (!result) + { +perform_rollback: + if (new_xact) + { + TransactionState xact; + + /* Start a new transaction for performing the rollback */ + StartTransactionCommand(); + xact = CurrentTransactionState; + + /* + * Store the previous transactions start and end undo record + * pointers into this transaction's state so that if there is + * some error while performing undo actions we can restart + * from begining. + */ + memcpy(xact->start_urec_ptr, UndoActionEndPtr, + sizeof(UndoActionEndPtr)); + memcpy(xact->latest_urec_ptr, UndoActionStartPtr, + sizeof(UndoActionStartPtr)); + } + + execute_undo_actions(UndoActionStartPtr[i], UndoActionEndPtr[i], + new_xact, true, true); + + if (new_xact) + CommitTransactionCommand(); + else + { + /* Restore parent's state. */ + s->latest_urec_ptr[i] = parent_latest_urec_ptr[i]; + } + } + } + + PerformUndoActions = false; +} + /* * PreventInTransactionBlock * @@ -3556,6 +3830,10 @@ EndTransactionBlock(void) { TransactionState s = CurrentTransactionState; bool result = false; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels]; + int i ; + + memcpy(latest_urec_ptr, s->latest_urec_ptr, sizeof(latest_urec_ptr)); switch (s->blockState) { @@ -3601,6 +3879,16 @@ EndTransactionBlock(void) elog(FATAL, "EndTransactionBlock: unexpected state %s", BlockStateAsString(s->blockState)); s = s->parent; + + /* + * We are calculating latest_urec_ptr, even though its a commit + * case. This is to handle any error during the commit path. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(latest_urec_ptr[i])) + latest_urec_ptr[i] = s->latest_urec_ptr[i]; + } } if (s->blockState == TBLOCK_INPROGRESS) s->blockState = TBLOCK_END; @@ -3626,6 +3914,12 @@ EndTransactionBlock(void) elog(FATAL, "EndTransactionBlock: unexpected state %s", BlockStateAsString(s->blockState)); s = s->parent; + + for (i = 0; i < UndoPersistenceLevels; i++) + { + if(!UndoRecPtrIsValid(latest_urec_ptr[i])) + latest_urec_ptr[i] = s->latest_urec_ptr[i]; + } } if (s->blockState == TBLOCK_INPROGRESS) s->blockState = TBLOCK_ABORT_PENDING; @@ -3678,6 +3972,18 @@ EndTransactionBlock(void) break; } + /* + * We need to perform undo actions if the transaction is failed. Remember + * the required information to perform undo actions at the end of + * statement execution. 
+ */ + if (!result) + PerformUndoActions = true; + + memcpy(UndoActionStartPtr, latest_urec_ptr, sizeof(UndoActionStartPtr)); + memcpy(UndoActionEndPtr, TopTransactionStateData.start_urec_ptr, + sizeof(UndoActionEndPtr)); + return result; } @@ -3691,6 +3997,10 @@ void UserAbortTransactionBlock(void) { TransactionState s = CurrentTransactionState; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels]; + int i ; + + memcpy(latest_urec_ptr, s->latest_urec_ptr, sizeof(latest_urec_ptr)); switch (s->blockState) { @@ -3729,6 +4039,12 @@ UserAbortTransactionBlock(void) elog(FATAL, "UserAbortTransactionBlock: unexpected state %s", BlockStateAsString(s->blockState)); s = s->parent; + for(i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(latest_urec_ptr[i])) + latest_urec_ptr[i] = s->latest_urec_ptr[i]; + } + } if (s->blockState == TBLOCK_INPROGRESS) s->blockState = TBLOCK_ABORT_PENDING; @@ -3786,6 +4102,54 @@ UserAbortTransactionBlock(void) BlockStateAsString(s->blockState)); break; } + + /* + * Remember the required information for performing undo actions. So that + * if there is any failure in executing the undo action we can execute + * it later. + */ + memcpy (UndoActionStartPtr, latest_urec_ptr, sizeof(UndoActionStartPtr)); + memcpy (UndoActionEndPtr, s->start_urec_ptr, sizeof(UndoActionEndPtr)); + + /* + * If we are in a valid transaction state then execute the undo action here + * itself, otherwise we have already stored the required information for + * executing the undo action later. + */ + if (CurrentTransactionState->state == TRANS_INPROGRESS) + { + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (latest_urec_ptr[i]) + { + if (i == UNDO_TEMP) + execute_undo_actions(UndoActionStartPtr[i], UndoActionEndPtr[i], + false, true, true); + else + { + uint64 size = latest_urec_ptr[i] - s->start_urec_ptr[i]; + bool result = false; + + /* + * If this is a large rollback request then push it to undo-worker + * through RollbackHT, undo-worker will perform it's undo actions + * later. + */ + if (size >= rollback_overflow_size * 1024 * 1024) + result = PushRollbackReq(UndoActionStartPtr[i], UndoActionEndPtr[i], InvalidOid); + + if (!result) + { + execute_undo_actions(UndoActionStartPtr[i], UndoActionEndPtr[i], + true, true, true); + UndoActionStartPtr[i] = InvalidUndoRecPtr; + } + } + } + } + } + else + PerformUndoActions = true; } /* @@ -3935,6 +4299,12 @@ ReleaseSavepoint(const char *name) TransactionState s = CurrentTransactionState; TransactionState target, xact; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + int i = 0; + + memcpy(latest_urec_ptr, s->latest_urec_ptr, sizeof(latest_urec_ptr)); + memcpy(start_urec_ptr, s->start_urec_ptr, sizeof(start_urec_ptr)); /* * Workers synchronize transaction state at the beginning of each parallel @@ -4028,8 +4398,34 @@ ReleaseSavepoint(const char *name) if (xact == target) break; xact = xact->parent; + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(latest_urec_ptr[i])) + latest_urec_ptr[i] = xact->latest_urec_ptr[i]; + + if (UndoRecPtrIsValid(xact->start_urec_ptr[i])) + start_urec_ptr[i] = xact->start_urec_ptr[i]; + } + + Assert(PointerIsValid(xact)); } + + /* + * Before cleaning up the current sub transaction state, overwrite parent + * transaction's latest_urec_ptr with current transaction's latest_urec_ptr + * so that in case parent transaction get aborted we will not skip + * performing undo for this transaction. 
Also set the start_urec_ptr if + * parent start_urec_ptr is not valid. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(latest_urec_ptr[i])) + xact->parent->latest_urec_ptr[i] = latest_urec_ptr[i]; + if (!UndoRecPtrIsValid(xact->parent->start_urec_ptr[i])) + xact->parent->start_urec_ptr[i] = start_urec_ptr[i]; + } + } /* @@ -4044,6 +4440,12 @@ RollbackToSavepoint(const char *name) TransactionState s = CurrentTransactionState; TransactionState target, xact; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels]; + UndoRecPtr start_urec_ptr[UndoPersistenceLevels]; + int i = 0; + + memcpy(latest_urec_ptr, s->latest_urec_ptr, sizeof(latest_urec_ptr)); + memcpy(start_urec_ptr, s->start_urec_ptr, sizeof(start_urec_ptr)); /* * Workers synchronize transaction state at the beginning of each parallel @@ -4143,6 +4545,15 @@ RollbackToSavepoint(const char *name) BlockStateAsString(xact->blockState)); xact = xact->parent; Assert(PointerIsValid(xact)); + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (!UndoRecPtrIsValid(latest_urec_ptr[i])) + latest_urec_ptr[i] = xact->latest_urec_ptr[i]; + + if (UndoRecPtrIsValid(xact->start_urec_ptr[i])) + start_urec_ptr[i] = xact->start_urec_ptr[i]; + } + } /* And mark the target as "restart pending" */ @@ -4153,6 +4564,34 @@ RollbackToSavepoint(const char *name) else elog(FATAL, "RollbackToSavepoint: unexpected state %s", BlockStateAsString(xact->blockState)); + + /* + * Remember the required information for performing undo actions. So that + * if there is any failure in executing the undo action we can execute + * it later. + */ + memcpy (UndoActionStartPtr, latest_urec_ptr, sizeof(UndoActionStartPtr)); + memcpy (UndoActionEndPtr, start_urec_ptr, sizeof(UndoActionEndPtr)); + + /* + * If we are in a valid transaction state then execute the undo action here + * itself, otherwise we have already stored the required information for + * executing the undo action later. + */ + if (s->state == TRANS_INPROGRESS) + { + for ( i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(latest_urec_ptr[i])) + { + execute_undo_actions(latest_urec_ptr[i], start_urec_ptr[i], false, true, false); + xact->latest_urec_ptr[i] = InvalidUndoRecPtr; + UndoActionStartPtr[i] = InvalidUndoRecPtr; + } + } + } + else + PerformUndoActions = true; } /* @@ -4240,6 +4679,7 @@ void ReleaseCurrentSubTransaction(void) { TransactionState s = CurrentTransactionState; + int i; /* * Workers synchronize transaction state at the beginning of each parallel @@ -4258,6 +4698,22 @@ ReleaseCurrentSubTransaction(void) BlockStateAsString(s->blockState)); Assert(s->state == TRANS_INPROGRESS); MemoryContextSwitchTo(CurTransactionContext); + + /* + * Before cleaning up the current sub transaction state, overwrite parent + * transaction's latest_urec_ptr with current transaction's latest_urec_ptr + * so that in case parent transaction get aborted we will not skip + * performing undo for this transaction. 
+ */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(s->latest_urec_ptr[i])) + s->parent->latest_urec_ptr[i] = s->latest_urec_ptr[i]; + + if (!UndoRecPtrIsValid(s->parent->start_urec_ptr[i])) + s->parent->start_urec_ptr[i] = s->start_urec_ptr[i]; + } + CommitSubTransaction(); s = CurrentTransactionState; /* changed by pop */ Assert(s->state == TRANS_INPROGRESS); @@ -4274,6 +4730,14 @@ void RollbackAndReleaseCurrentSubTransaction(void) { TransactionState s = CurrentTransactionState; + UndoRecPtr latest_urec_ptr[UndoPersistenceLevels] = {InvalidUndoRecPtr, + InvalidUndoRecPtr, + InvalidUndoRecPtr}; + UndoRecPtr start_urec_ptr[UndoPersistenceLevels] = {InvalidUndoRecPtr, + InvalidUndoRecPtr, + InvalidUndoRecPtr}; + UndoRecPtr parent_latest_urec_ptr[UndoPersistenceLevels]; + int i; /* * Unlike ReleaseCurrentSubTransaction(), this is nominally permitted @@ -4320,6 +4784,19 @@ RollbackAndReleaseCurrentSubTransaction(void) if (s->blockState == TBLOCK_SUBINPROGRESS) AbortSubTransaction(); + /* + * Remember the required information to perform undo actions before + * cleaning up the subtransaction state. + */ + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(s->latest_urec_ptr[i])) + { + latest_urec_ptr[i] = s->latest_urec_ptr[i]; + start_urec_ptr[i] = s->start_urec_ptr[i]; + } + } + /* And clean it up, too */ CleanupSubTransaction(); @@ -4328,6 +4805,30 @@ RollbackAndReleaseCurrentSubTransaction(void) s->blockState == TBLOCK_INPROGRESS || s->blockState == TBLOCK_IMPLICIT_INPROGRESS || s->blockState == TBLOCK_STARTED); + + for (i = 0; i < UndoPersistenceLevels; i++) + { + if (UndoRecPtrIsValid(latest_urec_ptr[i])) + { + parent_latest_urec_ptr[i] = s->latest_urec_ptr[i]; + + /* + * Store the undo action start point in the parent state so that + * we can apply undo actions these undos also during rollback of + * parent transaction in case of error while applying the undo + * actions. + */ + s->latest_urec_ptr[i] = latest_urec_ptr[i]; + execute_undo_actions(latest_urec_ptr[i], start_urec_ptr[i], false, + true, true); + + /* Restore parent state. */ + s->latest_urec_ptr[i] = parent_latest_urec_ptr[i]; + } + } + + /* Successfully performed undo actions so reset the flag. 
*/ + PerformUndoActions = false; } /* @@ -4541,6 +5042,7 @@ static void StartSubTransaction(void) { TransactionState s = CurrentTransactionState; + int i; if (s->state != TRANS_DEFAULT) elog(WARNING, "StartSubTransaction while in %s state", @@ -4558,6 +5060,14 @@ StartSubTransaction(void) AtSubStart_Notify(); AfterTriggerBeginSubXact(); + /* initialize undo record locations for the transaction */ + for(i = 0; i < UndoPersistenceLevels; i++) + { + s->start_urec_ptr[i] = InvalidUndoRecPtr; + s->latest_urec_ptr[i] = InvalidUndoRecPtr; + } + + s->subXactLock = false; s->state = TRANS_INPROGRESS; /* diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c80b14ed97..5b37702b7c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -31,6 +31,7 @@ #include "access/transam.h" #include "access/tuptoaster.h" #include "access/twophase.h" +#include "access/undolog.h" #include "access/xact.h" #include "access/xlog_internal.h" #include "access/xloginsert.h" @@ -972,6 +973,7 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt); XLogRecPtr XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn, + XLogRecPtr OldRedoRecPtr, uint8 flags) { XLogCtlInsert *Insert = &XLogCtl->Insert; @@ -1066,6 +1068,21 @@ XLogInsertRecord(XLogRecData *rdata, return InvalidXLogRecPtr; } + /* + * If the redo point is changed and wal need to include the undo attach + * information i.e. (this is the first WAL which after the checkpoint). + * then return from here so that the caller can restart. + */ + if (rechdr->xl_rmid == RM_ZHEAP_ID && + OldRedoRecPtr != InvalidXLogRecPtr && + OldRedoRecPtr != RedoRecPtr && + NeedUndoMetaLog(RedoRecPtr)) + { + WALInsertLockRelease(); + END_CRIT_SECTION(); + return InvalidXLogRecPtr; + } + /* * Reserve space for the record in the WAL. This also sets the xl_prev * pointer. 
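The early return above lets a zheap caller notice that the redo point moved after it decided whether to include undo meta-data in the record, and restart record assembly. A hedged sketch of that caller-side pattern, using the XLogInsertExtended() wrapper added later in this patch; the registration comments stand in for whatever data the specific operation logs:

    uint8       info = XLOG_ZHEAP_INSERT;   /* whichever zheap opcode is being logged */
    XLogRecPtr  recptr;
    XLogRecPtr  redo_ptr;
    bool        do_page_writes;

    do
    {
        /* Capture the redo point and full-page-write setting to assemble against. */
        GetFullPageWriteInfo(&redo_ptr, &do_page_writes);

        XLogBeginInsert();
        /*
         * ... XLogRegisterBuffer()/XLogRegisterData() calls go here; the undo
         * log meta-data is registered only when NeedUndoMetaLog(redo_ptr)
         * reports that this is the first such record since the redo point ...
         */

        /* Returns InvalidXLogRecPtr if the redo point changed underneath us. */
        recptr = XLogInsertExtended(RM_ZHEAP_ID, info, redo_ptr, do_page_writes);
    } while (recptr == InvalidXLogRecPtr);

This mirrors the retry loop inside XLogInsert() itself, except that the decision about what to register is pushed out to the caller.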
@@ -4493,6 +4510,8 @@ WriteControlFile(void) ControlFile->float4ByVal = FLOAT4PASSBYVAL; ControlFile->float8ByVal = FLOAT8PASSBYVAL; + ControlFile->zheap_page_trans_slots = ZHEAP_PAGE_TRANS_SLOTS; + /* Contents are protected with a CRC */ INIT_CRC32C(ControlFile->crc); COMP_CRC32C(ControlFile->crc, @@ -4725,6 +4744,13 @@ ReadControlFile(void) " but the server was compiled without USE_FLOAT8_BYVAL."), errhint("It looks like you need to recompile or initdb."))); #endif + if (ControlFile->zheap_page_trans_slots != ZHEAP_PAGE_TRANS_SLOTS) + ereport(FATAL, + (errmsg("database files are incompatible with server"), + errdetail("The database cluster was initialized with ZHEAP_PAGE_TRANS_SLOTS %d," + " but the server was compiled with ZHEAP_PAGE_TRANS_SLOTS %d.", + ControlFile->zheap_page_trans_slots, (int) ZHEAP_PAGE_TRANS_SLOTS), + errhint("It looks like you need to recompile or initdb."))); wal_segment_size = ControlFile->xlog_seg_size; @@ -5169,6 +5195,7 @@ BootStrapXLOG(void) checkPoint.newestCommitTsXid = InvalidTransactionId; checkPoint.time = (pg_time_t) time(NULL); checkPoint.oldestActiveXid = InvalidTransactionId; + checkPoint.oldestXidWithEpochHavingUndo = InvalidTransactionId; ShmemVariableCache->nextXid = checkPoint.nextXid; ShmemVariableCache->nextOid = checkPoint.nextOid; @@ -6603,6 +6630,10 @@ StartupXLOG(void) (errmsg_internal("commit timestamp Xid oldest/newest: %u/%u", checkPoint.oldestCommitTsXid, checkPoint.newestCommitTsXid))); + ereport(DEBUG1, + (errmsg_internal("oldest xid with epoch having undo: " UINT64_FORMAT, + checkPoint.oldestXidWithEpochHavingUndo))); + if (!TransactionIdIsNormal(checkPoint.nextXid)) ereport(PANIC, (errmsg("invalid next transaction ID"))); @@ -6620,6 +6651,10 @@ StartupXLOG(void) XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch; XLogCtl->ckptXid = checkPoint.nextXid; + /* Read oldest xid having undo from checkpoint and set in proc global. */ + pg_atomic_write_u64(&ProcGlobal->oldestXidWithEpochHavingUndo, + checkPoint.oldestXidWithEpochHavingUndo); + /* * Initialize replication slots, before there's a chance to remove * required resources. @@ -6693,6 +6728,9 @@ StartupXLOG(void) */ restoreTwoPhaseData(); + /* Recover undo log meta data corresponding to this checkpoint. */ + StartupUndoLogs(ControlFile->checkPointCopy.redo); + lastFullPageWrites = checkPoint.fullPageWrites; RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo; @@ -7315,7 +7353,13 @@ StartupXLOG(void) * end-of-recovery steps fail. */ if (InRecovery) + { ResetUnloggedRelations(UNLOGGED_RELATION_INIT); + ResetUndoLogs(UNDO_UNLOGGED); + } + + /* Always reset temporary undo logs. */ + ResetUndoLogs(UNDO_TEMP); /* * We don't need the latch anymore. It's not strictly necessary to disown @@ -8312,6 +8356,35 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch) *epoch = ckptXidEpoch; } +/* + * GetEpochForXid - get the epoch associated with the xid + */ +uint32 +GetEpochForXid(TransactionId xid) +{ + uint32 ckptXidEpoch; + TransactionId ckptXid; + + SpinLockAcquire(&XLogCtl->info_lck); + ckptXidEpoch = XLogCtl->ckptXidEpoch; + ckptXid = XLogCtl->ckptXid; + SpinLockRelease(&XLogCtl->info_lck); + + /* Xid can be on either side when near wrap-around. Xid is certainly + * logically later than ckptXid. So if it's numerically less, it must + * have wrapped into the next epoch. OTOH, if it is numerically more, + * but logically lesser, then it belongs to previous epoch. 
+ */ + if (xid > ckptXid && + TransactionIdPrecedes(xid, ckptXid)) + ckptXidEpoch--; + else if (xid < ckptXid && + TransactionIdFollows(xid, ckptXid)) + ckptXidEpoch++; + + return ckptXidEpoch; +} + /* * This must be called ONCE during postmaster or standalone-backend shutdown */ @@ -8752,6 +8825,9 @@ CreateCheckPoint(int flags) checkPoint.nextOid += ShmemVariableCache->oidCount; LWLockRelease(OidGenLock); + checkPoint.oldestXidWithEpochHavingUndo = + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo); + MultiXactGetCheckptMulti(shutdown, &checkPoint.nextMulti, &checkPoint.nextMultiOffset, @@ -9020,6 +9096,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags) CheckPointSnapBuild(); CheckPointLogicalRewriteHeap(); CheckPointBuffers(flags); /* performs all required fsyncs */ + CheckPointUndoLogs(checkPointRedo, ControlFile->checkPointCopy.redo); CheckPointReplicationOrigin(); /* We deliberately delay 2PC checkpointing as long as possible */ CheckPointTwoPhase(checkPointRedo); @@ -9661,6 +9738,9 @@ xlog_redo(XLogReaderState *record) MultiXactAdvanceOldest(checkPoint.oldestMulti, checkPoint.oldestMultiDB); + pg_atomic_write_u64(&ProcGlobal->oldestXidWithEpochHavingUndo, + checkPoint.oldestXidWithEpochHavingUndo); + /* * No need to set oldestClogXid here as well; it'll be set when we * redo an xl_clog_truncate if it changed since initialization. @@ -9719,6 +9799,8 @@ xlog_redo(XLogReaderState *record) /* ControlFile->checkPointCopy always tracks the latest ckpt XID */ ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch; ControlFile->checkPointCopy.nextXid = checkPoint.nextXid; + ControlFile->checkPointCopy.oldestXidWithEpochHavingUndo = + checkPoint.oldestXidWithEpochHavingUndo; /* Update shared-memory copy of checkpoint XID/epoch */ SpinLockAcquire(&XLogCtl->info_lck); @@ -9726,6 +9808,12 @@ xlog_redo(XLogReaderState *record) XLogCtl->ckptXid = checkPoint.nextXid; SpinLockRelease(&XLogCtl->info_lck); + ControlFile->checkPointCopy.oldestXidWithEpochHavingUndo = + checkPoint.oldestXidWithEpochHavingUndo; + + /* Write an undo log metadata snapshot. */ + CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo); + /* * We should've already switched to the new TLI before replaying this * record. @@ -9765,6 +9853,9 @@ xlog_redo(XLogReaderState *record) MultiXactAdvanceNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset); + pg_atomic_write_u64(&ProcGlobal->oldestXidWithEpochHavingUndo, + checkPoint.oldestXidWithEpochHavingUndo); + /* * NB: This may perform multixact truncation when replaying WAL * generated by an older primary. @@ -9785,6 +9876,9 @@ xlog_redo(XLogReaderState *record) XLogCtl->ckptXid = checkPoint.nextXid; SpinLockRelease(&XLogCtl->info_lck); + /* Write an undo log metadata snapshot. 
*/ + CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo); + /* TLI should not change in an on-line checkpoint */ if (checkPoint.ThisTimeLineID != ThisTimeLineID) ereport(PANIC, diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index 034d5b3b62..ce6a8c1a83 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -459,7 +459,8 @@ XLogInsert(RmgrId rmid, uint8 info) rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites, &fpw_lsn); - EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags); + EndPos = XLogInsertRecord(rdt, fpw_lsn, InvalidXLogRecPtr, + curinsert_flags); } while (EndPos == InvalidXLogRecPtr); XLogResetInsertion(); @@ -467,6 +468,63 @@ XLogInsert(RmgrId rmid, uint8 info) return EndPos; } +/* + * XLogInsertExtended + * Like XLogInsert, but with extra options. + * + * The internal logic of this function is almost same as XLogInsert, but there + * are some differences: unlike XLogInsert, this function will not retry for WAL + * insert if the page image inclusion decision got changed instead it will + * return immediately, and it will not calculate the latest value of RedoRecPtr + * like XLogInsert, instead it will take that as input from caller so that if + * the caller has not included the tuple info (because page image is not present + * in the WAL) it can start over again if including page image decision got + * changed later during WAL insertion. + */ +XLogRecPtr +XLogInsertExtended(RmgrId rmid, uint8 info, XLogRecPtr RedoRecPtr, + bool doPageWrites) +{ + XLogRecPtr EndPos; + XLogRecPtr fpw_lsn; + XLogRecData *rdt; + + /* XLogBeginInsert() must have been called. */ + if (!begininsert_called) + elog(ERROR, "XLogBeginInsert was not called"); + + /* + * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and + * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me. + */ + if ((info & ~(XLR_RMGR_INFO_MASK | + XLR_SPECIAL_REL_UPDATE | + XLR_CHECK_CONSISTENCY)) != 0) + elog(PANIC, "invalid xlog info mask %02X", info); + + TRACE_POSTGRESQL_WAL_INSERT(rmid, info); + + /* + * In bootstrap mode, we don't actually log anything but XLOG resources; + * return a phony record pointer. + */ + if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID) + { + XLogResetInsertion(); + EndPos = SizeOfXLogLongPHD; /* start of 1st chkpt record */ + return EndPos; + } + + rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites, + &fpw_lsn); + + EndPos = XLogInsertRecord(rdt, fpw_lsn, RedoRecPtr, curinsert_flags); + + XLogResetInsertion(); + + return EndPos; +} + /* * Assemble a WAL record from the registered data and buffers into an * XLogRecData chain, ready for insertion with XLogInsertRecord(). @@ -783,8 +841,14 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, * Fill in the fields in the record header. Prev-link is filled in later, * once we know where in the WAL the record will be inserted. The CRC does * not include the record header yet. + * + * Since zheap storage always use TopTransactionId, if this xlog is for the + * zheap then get the TopTransactionId. 
*/ - rechdr->xl_xid = GetCurrentTransactionIdIfAny(); + if (rmid == RM_ZHEAP_ID) + rechdr->xl_xid = GetTopTransactionIdIfAny(); + else + rechdr->xl_xid = GetCurrentTransactionIdIfAny(); rechdr->xl_tot_len = total_len; rechdr->xl_info = info; rechdr->xl_rmid = rmid; diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 4ecdc9220f..8718a4fdfc 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -462,7 +462,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, { /* page exists in file */ buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, - mode, NULL); + mode, NULL, RELPERSISTENCE_PERMANENT); } else { @@ -487,7 +487,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, ReleaseBuffer(buffer); } buffer = ReadBufferWithoutRelcache(rnode, forknum, - P_NEW, mode, NULL); + P_NEW, mode, NULL, + RELPERSISTENCE_PERMANENT); } while (BufferGetBlockNumber(buffer) < blkno); /* Handle the corner case that P_NEW returns non-consecutive pages */ @@ -497,7 +498,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, LockBuffer(buffer, BUFFER_LOCK_UNLOCK); ReleaseBuffer(buffer); buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, - mode, NULL); + mode, NULL, + RELPERSISTENCE_PERMANENT); } } diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile new file mode 100644 index 0000000000..585f15cdb1 --- /dev/null +++ b/src/backend/access/undo/Makefile @@ -0,0 +1,17 @@ +#------------------------------------------------------------------------- +# +# Makefile-- +# Makefile for access/undo +# +# IDENTIFICATION +# src/backend/access/undo/Makefile +# +#------------------------------------------------------------------------- + +subdir = src/backend/access/undo +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global + +OBJS = undodiscard.o undoinsert.o undolog.o undorecord.o undoaction.o undoactionxlog.o + +include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/undo/README b/src/backend/access/undo/README new file mode 100644 index 0000000000..9ba81f960d --- /dev/null +++ b/src/backend/access/undo/README @@ -0,0 +1,169 @@ +src/backend/access/undo/README + +Undo Logs +========= + +The undo log subsystem provides a way to store data that is needed for +a limited time. Undo data is generated whenever zheap relations are +modified, but it is only useful until (1) the generating transaction +is committed or rolled back and (2) there is no snapshot that might +need it for MVCC purposes. See src/backend/access/zheap/README for +more information on zheap. The undo log subsystem is concerned with +raw storage optimized for efficient recycling and buffered random +access. + +Like redo data (the WAL), undo data consists of records identified by +their location within a 64 bit address space. Unlike redo data, the +addressing space is internally divided up unto multiple numbered logs. +The first 24 bits of an UndoRecPtr identify the undo log number, and +the remaining 40 bits address the space within that undo log. Higher +level code (zheap) is largely oblivious to this internal structure and +deals mostly in opaque UndoRecPtr values. + +Using multiple undo logs instead of a single uniform space avoids the +contention that would result from a single insertion point, since each +session can be given sole access to write data into a given undo log. +It also allows for parallelized space reclamation. 
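To make the 24/40 bit split described above concrete, here is a minimal sketch of how an UndoRecPtr decomposes. The real accessors are macros in undolog.h (UndoRecPtrGetLogNo is used elsewhere in this patch); the helper names and constant names below are illustrative only.

    #define UNDO_LOG_NUMBER_BITS   24
    #define UNDO_LOG_OFFSET_BITS   40      /* 24 + 40 = 64 */

    /* High-order 24 bits: which undo log the address lives in. */
    static inline uint32
    sketch_urp_logno(UndoRecPtr urp)
    {
        return (uint32) (urp >> UNDO_LOG_OFFSET_BITS);
    }

    /* Low-order 40 bits: byte position within that undo log. */
    static inline uint64
    sketch_urp_offset(UndoRecPtr urp)
    {
        return urp & ((UINT64_C(1) << UNDO_LOG_OFFSET_BITS) - 1);
    }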
+
+Like redo data, undo data is stored on disk in numbered segment files
+that are recycled as required.  Unlike redo data, undo data is
+accessed through the buffer pool.  In this respect it is similar to
+regular relation data.  Buffer content is written out to disk during
+checkpoints and whenever it is evicted to make space for another page.
+However, unlike regular relation data, undo data has a chance of never
+being written to disk at all: if a page is allocated and then
+later discarded without an intervening checkpoint and without an
+eviction provoked by memory pressure, then no disk IO is generated.
+
+Keeping the undo data physically separate from redo data and accessing
+it through the existing shared buffers mechanism allows it to be
+accessed efficiently for MVCC purposes.
+
+Meta-Data
+=========
+
+At any given time the set of undo logs that exists is tracked in
+shared memory and can be inspected in the pg_stat_undo_logs view.  For
+each undo log, a set of properties called the undo log's meta-data is
+tracked:
+
+* the tablespace that holds its segment files
+* the persistence level (permanent, unlogged, temporary)
+* the "discard" pointer; data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point
+
+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted.  At all times discard <=
+insert <= end.  When discard == insert, the undo log is empty
+(everything that has ever been inserted has since been discarded).
+The insert pointer advances when regular backends allocate new space,
+and the discard pointer usually advances when an undo worker process
+determines that no session could need the data either for rollback or
+for finding old versions of tuples to satisfy a snapshot.  In some
+special cases, including single-user mode and temporary undo logs, the
+discard pointer might also be advanced synchronously by a foreground
+session.
+
+In order to provide constant time access to undo log meta-data given
+an UndoRecPtr, there is conceptually an array of UndoLogControl
+objects indexed by undo log number.  Since that array would be too
+large and since we expect the set of active undo log numbers to be
+small and clustered, we only keep small ranges of that logical array
+in memory at a time.  We use the higher order bits of the undo log
+number to identify a 'bank' (array fragment), and then the lower order
+bits to identify a slot within the bank.  Each bank is backed by a DSM
+segment.  We expect to need just 1 or 2 such DSM segments to exist at
+any time.
+
+The meta-data for all undo logs is written to disk at every
+checkpoint.  It is stored in files under PGDATA/pg_undo/, using the
+checkpoint's redo point (a WAL LSN) as its filename.  At startup time,
+the redo point's file can be used to restore all undo logs' meta-data
+as of the moment of the redo point into shared memory.  Changes to the
+discard pointer and end pointer are WAL-logged by undolog.c and will
+bring the in-memory meta-data up to date in the event of recovery
+after a crash.  Changes to insert pointers are included in other WAL
+records (see below).
+
+Responsibility for creating, deleting and recycling undo log segment
+files and WAL logging the associated meta-data changes lies with
+src/backend/storage/undo/undolog.c.
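Taken together with the xl_undolog_meta fields printed by undologdesc.c earlier in this patch (insert, last_xact_start, prevlen, is_first_rec), the per-log meta-data described in this section amounts to roughly the following; the layout and field types are assumed, the authoritative definition lives in undolog.h:

    typedef struct UndoLogMetaDataSketch
    {
        Oid         tablespace;         /* tablespace holding the segment files */
        char        persistence;        /* permanent, unlogged or temporary */
        uint64      discard;            /* everything before this has been discarded */
        uint64      insert;             /* new data is written here */
        uint64      end;                /* a new segment file is needed beyond this */
        uint64      last_xact_start;    /* as carried in xl_undolog_meta */
        uint16      prevlen;            /* as carried in xl_undolog_meta */
        bool        is_first_rec;       /* as carried in xl_undolog_meta */
    } UndoLogMetaDataSketch;

    /*
     * All three pointers move strictly forwards and always satisfy:
     *     discard <= insert <= end
     */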
+
+Persistence Levels and Tablespaces
+==================================
+
+When new undo log space is requested by client code, the persistence
+level of the relation being modified and the current value of the GUC
+"undo_tablespaces" control which undo log is selected.  If the
+session is already attached to a suitable undo log and it hasn't run
+out of address space, it can be used immediately.  Otherwise a
+suitable undo log must be either found or created.  The system should
+stabilize on one undo log per active writing backend (or more if
+different tablespaces or persistence levels are used).
+
+When an unlogged relation is modified, undo data generated by the
+operation must be stored in an unlogged undo log.  This causes the
+undo data to be deleted along with all unlogged relations during
+recovery from a non-shutdown checkpoint.  Likewise, temporary
+relations require special treatment: their buffers are backend-local
+and they cannot be accessed by other backends, including undo workers.
+
+Non-empty undo logs in a tablespace prevent the tablespace from being
+dropped.
+
+Undo Log Contents
+=================
+
+Undo log contents are written into 1MB segment files under
+PGDATA/base/undo/ or PGDATA/pg_tblspc/VERSION/undo/ using filenames
+that encode the address (UndoRecPtr) of their first byte.  A period
+'.' separates the undo log number part from the offset part, for the
+benefit of human administrators.
+
+Undo logs are page-oriented and use regular PostgreSQL page headers
+including checksums (if enabled) and LSNs.  An UndoRecPtr can be used
+to obtain a buffer and an offset within the buffer, and then regular
+buffer locking and page LSN rules apply.  While space is allocated by
+asking for a given number of usable bytes (not including page
+headers), client code is responsible for stepping over the page
+headers and advancing to the next page.
+
+Responsibility for WAL-logging the contents of the undo log lies with
+client code (i.e. zheap).  While undolog.c WAL-logs all meta-data
+changes except insert points, and checkpoints all meta-data including
+insert points, client code is responsible for allocating undo log
+space in the same sequence at recovery time.  This avoids having to
+WAL-log insertion points explicitly and separately for every insertion
+into an undo log, greatly reducing WAL traffic.  (WAL is still
+generated by undolog.c whenever a 1MB segment boundary is crossed,
+since that also advances the end pointer.)
+
+One complication of this scheme for implicit insert pointer movement
+is that recovery doesn't naturally have access to the association
+between transactions and undo logs.  That is, while 'do' sessions have
+a currently attached undo log from which they allocate new space,
+recovery is performed by a single startup process which has no concept
+of the sessions that generated the WAL it is replaying.  For that
+reason, an xid->undo log number map is maintained at recovery time.
+At 'do' time, a WAL record is emitted the first time any permanent
+undo log is used in a given transaction, so that the mapping can be
+recovered at redo time.  That allows a stream of allocations to be
+directed to the appropriate undo logs so that the same resulting
+stream of undo log pointers can be produced.  (Unlogged and temporary
+undo logs don't have this problem since they aren't used at recovery
+time.)
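As a sketch of the segment naming rule just described: the 1MB segment size, the directory layout and the period separator come from this README, while the field widths and the hexadecimal formatting below are assumptions.

    #define UNDO_SEGMENT_SIZE   (1024 * 1024)   /* 1MB segments, per this README */

    /*
     * Build the path of the segment containing a given in-log offset, for an
     * undo log in the default tablespace, e.g. "base/undo/000001.0000700000".
     */
    static void
    sketch_undo_segment_path(uint32 logno, uint64 offset, char *buf, size_t len)
    {
        uint64      seg_start = offset - (offset % UNDO_SEGMENT_SIZE);

        snprintf(buf, len, "base/undo/%06X.%010llX",
                 logno, (unsigned long long) seg_start);
    }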
+ +Another complication is that the checkpoint files written under pg_undo +may contain inconsistent data during recovery from an online checkpoint +(after a crash or base backup). To compensate for this, client code +must arrange to log an undo log meta-data record when inserting the +first WAL record that might cause undo log access during recovery. +This is conceptually similar to full page images after checkpoints, +but limited to one meta-data WAL record per undo log per checkpoint. + +src/backend/storage/buffer/bufmgr.c is unaware of the existence of +undo log as a separate category of buffered data. Reading and writing +of buffered undo log pages is handled by a new storage manager in +src/backend/storage/smgr/undo_file.c. See +src/backend/storage/smgr/README for more details. diff --git a/src/backend/access/undo/undoaction.c b/src/backend/access/undo/undoaction.c new file mode 100644 index 0000000000..16eb1ef04d --- /dev/null +++ b/src/backend/access/undo/undoaction.c @@ -0,0 +1,1631 @@ +/*------------------------------------------------------------------------- + * + * undoaction.c + * execute undo actions + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/backend/access/undo/undoaction.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/tpd.h" +#include "access/undoaction_xlog.h" +#include "access/undolog.h" +#include "access/undorecord.h" +#include "access/visibilitymap.h" +#include "access/xact.h" +#include "access/zheap.h" +#include "nodes/pg_list.h" +#include "pgstat.h" +#include "postmaster/undoloop.h" +#include "storage/block.h" +#include "storage/buf.h" +#include "storage/bufmgr.h" +#include "utils/relfilenodemap.h" +#include "utils/syscache.h" +#include "miscadmin.h" +#include "storage/shmem.h" +#include "access/undodiscard.h" + +#define ROLLBACK_HT_SIZE 1024 + +static bool execute_undo_actions_page(List *luinfo, UndoRecPtr urec_ptr, + Oid reloid, TransactionId xid, BlockNumber blkno, + bool blk_chain_complete, bool norellock); +static inline void undo_action_insert(Relation rel, Page page, OffsetNumber off, + TransactionId xid); +static void RollbackHTRemoveEntry(UndoRecPtr start_urec_ptr); + +/* This is the hash table to store all the rollabck requests. */ +static HTAB *RollbackHT; + +/* undo record information */ +typedef struct UndoRecInfo +{ + UndoRecPtr urp; /* undo recptr (undo record location). */ + UnpackedUndoRecord *uur; /* actual undo record. */ +} UndoRecInfo; + +/* + * execute_undo_actions - Execute the undo actions + * + * from_urecptr - undo record pointer from where to start applying undo action. + * to_urecptr - undo record pointer upto which point apply undo action. + * nopartial - true if rollback is for complete transaction. + * rewind - whether to rewind the insert location of the undo log or not. + * Only the backend executed the transaction can rewind, but + * any other process e.g. undo worker should not rewind it. + * Because, if the backend have already inserted new undo records + * for the next transaction and if we rewind then we will loose + * the undo record inserted for the new transaction. + * rellock - if the caller already has the lock on the required relation, + * then this flag is false, i.e. we do not need to acquire any + * lock here. If the flag is true then we need to acquire lock + * here itself, because caller will not be having any lock. 
+ * When we are performing undo actions for prepared transactions, + * or for rollback to savepoint, we need not to lock as we already + * have the lock on the table. In cases like error or when + * rollbacking from the undo worker we need to have proper locks. + */ +void +execute_undo_actions(UndoRecPtr from_urecptr, UndoRecPtr to_urecptr, + bool nopartial, bool rewind, bool rellock) +{ + UnpackedUndoRecord *uur = NULL; + UndoRecPtr urec_ptr, prev_urec_ptr, prev_blkprev; + UndoRecPtr save_urec_ptr; + Oid prev_reloid = InvalidOid; + ForkNumber prev_fork = InvalidForkNumber; + BlockNumber prev_block = InvalidBlockNumber; + List *luinfo = NIL; + bool more_undo; + TransactionId xid = InvalidTransactionId; + UndoRecInfo *urec_info; + + Assert(from_urecptr != InvalidUndoRecPtr); + /* + * If the location upto which rollback need to be done is not provided, + * then rollback the complete transaction. + * FIXME: this won't work if undolog crossed the limit of 1TB, because + * then from_urecptr and to_urecptr will be from different lognos. + */ + if (to_urecptr == InvalidUndoRecPtr) + { + UndoLogNumber logno = UndoRecPtrGetLogNo(from_urecptr); + to_urecptr = UndoLogGetLastXactStartPoint(logno); + } + + prev_blkprev = save_urec_ptr = urec_ptr = from_urecptr; + + if (nopartial) + { + uur = UndoFetchRecord(urec_ptr, InvalidBlockNumber, InvalidOffsetNumber, + InvalidTransactionId, NULL, NULL); + if (uur == NULL) + return; + + xid = uur->uur_xid; + UndoRecordRelease(uur); + uur = NULL; + + /* + * Grab the undo action apply lock before start applying the undo action + * this will prevent applying undo actions concurrently. If we do not + * get the lock that mean its already being applied concurrently or the + * discard worker might be pushing its request to the rollback hash + * table + */ + if (!ConditionTransactionUndoActionLock(xid)) + return; + } + + prev_urec_ptr = InvalidUndoRecPtr; + while (prev_urec_ptr != to_urecptr) + { + Oid reloid = InvalidOid; + uint16 urec_prevlen; + + more_undo = true; + + prev_urec_ptr = urec_ptr; + + /* Fetch the undo record for given undo_recptr. */ + uur = UndoFetchRecord(urec_ptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, NULL, NULL); + + if (uur != NULL) + reloid = uur->uur_reloid; + + /* + * If the record is already discarded by undo worker or if the relation + * is dropped or truncated, then we cannot fetch record successfully. + * Hence, exit quietly. + * + * Note: reloid remains InvalidOid for a discarded record. + */ + if (!OidIsValid(reloid)) + { + /* release the undo records for which action has been replayed */ + while (luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) linitial(luinfo); + + UndoRecordRelease(urec_info->uur); + pfree(urec_info); + luinfo = list_delete_first(luinfo); + } + + /* Release the undo action lock before returning. */ + if (nopartial) + TransactionUndoActionLockRelease(xid); + + /* Release the just-fetched record */ + if (uur != NULL) + UndoRecordRelease(uur); + + return; + } + + xid = uur->uur_xid; + + /* Collect the undo records that belong to the same page. */ + if (!OidIsValid(prev_reloid) || + (prev_reloid == reloid && + prev_fork == uur->uur_fork && + prev_block == uur->uur_block && + prev_blkprev == urec_ptr)) + { + prev_reloid = reloid; + prev_fork = uur->uur_fork; + prev_block = uur->uur_block; + + /* Prepare an undo record information element. 
*/ + urec_info = palloc(sizeof(UndoRecInfo)); + urec_info->urp = urec_ptr; + urec_info->uur = uur; + + luinfo = lappend(luinfo, urec_info); + urec_prevlen = uur->uur_prevlen; + save_urec_ptr = uur->uur_blkprev; + + /* The undo chain must continue till we reach to_urecptr */ + if (urec_prevlen > 0 && urec_ptr != to_urecptr) + { + urec_ptr = UndoGetPrevUndoRecptr(urec_ptr, urec_prevlen); + prev_blkprev = uur->uur_blkprev; + continue; + } + else + more_undo = false; + } + else + { + more_undo = true; + } + + /* + * If no more undo is left to be processed and we are rolling back the + * complete transaction, then we can consider that the undo chain for a + * block is complete. + * If the previous undo pointer in the page is invalid, then also the + * undo chain for the current block is completed. + */ + if ((!more_undo && nopartial) || !UndoRecPtrIsValid(save_urec_ptr)) + { + execute_undo_actions_page(luinfo, save_urec_ptr, prev_reloid, + xid, prev_block, true, rellock); + } + else + { + execute_undo_actions_page(luinfo, save_urec_ptr, prev_reloid, + xid, prev_block, false, rellock); + } + + /* release the undo records for which action has been replayed */ + while (luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) linitial(luinfo); + + UndoRecordRelease(urec_info->uur); + pfree(urec_info); + luinfo = list_delete_first(luinfo); + } + + /* + * There are still more records to process, so keep moving backwards + * in the chain. + */ + if (more_undo) + { + /* Prepare an undo record information element. */ + urec_info = palloc(sizeof(UndoRecInfo)); + urec_info->urp = urec_ptr; + urec_info->uur = uur; + luinfo = lappend(luinfo, urec_info); + + prev_reloid = reloid; + prev_fork = uur->uur_fork; + prev_block = uur->uur_block; + save_urec_ptr = uur->uur_blkprev; + + /* + * Continue to process the records if this is not the last undo + * record in chain. + */ + urec_prevlen = uur->uur_prevlen; + if (urec_prevlen > 0 && urec_ptr != to_urecptr) + urec_ptr = UndoGetPrevUndoRecptr(urec_ptr, urec_prevlen); + else + break; + } + else + break; + } + + /* Apply the undo actions for the remaining records. */ + if (list_length(luinfo)) + { + execute_undo_actions_page(luinfo, save_urec_ptr, prev_reloid, + xid, prev_block, nopartial ? true : false, + rellock); + + /* release the undo records for which action has been replayed */ + while (luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) linitial(luinfo); + + UndoRecordRelease(urec_info->uur); + pfree(urec_info); + luinfo = list_delete_first(luinfo); + } + } + + if (rewind) + { + /* Read the current log from undo */ + UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(to_urecptr)); + + /* Read the prevlen from the first record of this transaction. */ + uur = UndoFetchRecord(to_urecptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, + NULL, NULL); + /* + * If undo is already discarded before we rewind, then do nothing. + */ + if (uur == NULL) + return; + + + /* + * In ZGetMultiLockMembers we fetch the undo record without a + * buffer lock so it's possible that a transaction in the slot + * can rollback and rewind the undo record pointer. To prevent + * that we acquire the rewind lock before rewinding the undo record + * pointer and the same lock will be acquire by ZGetMultiLockMembers + * in shared mode. Other places where we fetch the undo record we + * don't need this lock as we are doing that under the buffer lock. 
+ * So remember to acquire the rewind lock in shared mode wherever we + * are fetching the undo record of non commited transaction without + * buffer lock. + */ + LWLockAcquire(&log->rewind_lock, LW_EXCLUSIVE); + UndoLogRewind(to_urecptr, uur->uur_prevlen); + LWLockRelease(&log->rewind_lock); + + UndoRecordRelease(uur); + } + + if (nopartial) + { + /* + * Set undo action apply completed in the transaction header if this is + * a main transaction and we have not rewound its undo. + */ + if (!rewind) + { + /* + * Undo action is applied so delete the hash table entry and release + * the undo action lock. + */ + RollbackHTRemoveEntry(from_urecptr); + + /* + * Prepare and update the progress of the undo action apply in the + * transaction header. + */ + PrepareUpdateUndoActionProgress(to_urecptr, 1); + + START_CRIT_SECTION(); + + /* Update the progress in the transaction header. */ + UndoRecordUpdateTransInfo(); + + /* WAL log the undo apply progress. */ + { + xl_undoapply_progress xlrec; + + xlrec.urec_ptr = to_urecptr; + xlrec.progress = 1; + + /* + * FIXME : We need to register undo buffers and set LSN for them + * that will be required for FPW of the undo buffers. + */ + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + + (void) XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_APPLY_PROGRESS); + } + + END_CRIT_SECTION(); + UnlockReleaseUndoBuffers(); + } + + TransactionUndoActionLockRelease(xid); + } +} + +/* + * process_and_execute_undo_actions_page + * + * Collect all the undo for the input buffer and execute. Here, we don't know + * the to_urecptr and we can not collect from undo meta data also like we do in + * execute_undo_actions, because we might be applying undo of some old + * transaction and may be from different undo log as well. + * + * from_urecptr - undo record pointer from where to start applying the undo. + * rel - relation descriptor for which undo to be applied. + * buffer - buffer for which unto to be processed. + * epoch - epoch of the xid passed. + * xid - aborted transaction id whose effects needs to be reverted. + * slot_no - transaction slot number of xid. + */ +void +process_and_execute_undo_actions_page(UndoRecPtr from_urecptr, Relation rel, + Buffer buffer, uint32 epoch, + TransactionId xid, int slot_no) +{ + UnpackedUndoRecord *uur = NULL; + UndoRecPtr urec_ptr = from_urecptr; + List *luinfo = NIL; + Page page; + UndoRecInfo *urec_info; + bool actions_applied = false; + + /* + * Process and collect the undo for the block until we reach the first + * record of the transaction. + * + * Fixme: This can lead to unbounded use of memory, so we should collect + * the undo in chunks based on work_mem or some other memory unit. + */ + do + { + /* Fetch the undo record for given undo_recptr. */ + uur = UndoFetchRecord(urec_ptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, + NULL, NULL); + /* + * If the record is already discarded by undo worker, or the xid we + * want to rollback has already applied its undo actions then just + * cleanup the slot and exit. + */ + if(uur == NULL || uur->uur_xid != xid) + { + if (uur != NULL) + UndoRecordRelease(uur); + break; + } + + /* Prepare an undo element. */ + urec_info = palloc(sizeof(UndoRecInfo)); + urec_info->urp = urec_ptr; + urec_info->uur = uur; + + /* Collect the undo records. */ + luinfo = lappend(luinfo, urec_info); + urec_ptr = uur->uur_blkprev; + + /* + * If we have exhausted the undo chain for the slot, then we are done. 
+ */ + if (!UndoRecPtrIsValid(urec_ptr)) + break; + } while (true); + + if (list_length(luinfo)) + actions_applied = execute_undo_actions_page(luinfo, urec_ptr, + rel->rd_id, xid, + BufferGetBlockNumber(buffer), + true, + false); + /* Release undo records and undo elements*/ + while (luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) linitial(luinfo); + + UndoRecordRelease(urec_info->uur); + pfree(urec_info); + luinfo = list_delete_first(luinfo); + } + + /* + * Clear the transaction id from the slot. We expect that if the undo + * actions are applied by execute_undo_actions_page then it would have + * cleared the xid, otherwise we will clear it here. + */ + if (!actions_applied) + { + int slot_no; + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + page = BufferGetPage(buffer); + slot_no = PageGetTransactionSlotId(rel, buffer, epoch, xid, &urec_ptr, + true, false, NULL); + /* + * If someone has already cleared the transaction info, then we don't + * need to do anything. + */ + if (slot_no != InvalidXactSlotId) + { + START_CRIT_SECTION(); + + /* Clear the epoch and xid from the slot. */ + PageSetTransactionSlotInfo(buffer, slot_no, 0, + InvalidTransactionId, urec_ptr); + MarkBufferDirty(buffer); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + xl_undoaction_reset_slot xlrec; + + xlrec.flags = 0; + xlrec.urec_ptr = urec_ptr; + xlrec.trans_slot_id = slot_no; + + XLogBeginInsert(); + + XLogRegisterData((char *) &xlrec, SizeOfUndoActionResetSlot); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + /* Register tpd buffer if the slot belongs to tpd page. */ + if (slot_no > ZHEAP_PAGE_TRANS_SLOTS) + { + xlrec.flags |= XLU_RESET_CONTAINS_TPD_SLOT; + RegisterTPDBuffer(page, 1); + } + + recptr = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_RESET_SLOT); + + PageSetLSN(page, recptr); + if (xlrec.flags & XLU_RESET_CONTAINS_TPD_SLOT) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + } + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + UnlockReleaseTPDBuffers(); + } +} + +/* + * undo_action_insert - perform the undo action for insert + * + * This will mark the tuple as dead so that the future access to it can't see + * this tuple. We mark it as unused if there is no other index pointing to + * it, otherwise mark it as dead. + */ +static inline void +undo_action_insert(Relation rel, Page page, OffsetNumber off, + TransactionId xid) +{ + ItemId lp; + bool relhasindex; + + /* + * This will mark the tuple as dead so that the future + * access to it can't see this tuple. We mark it as + * unused if there is no other index pointing to it, + * otherwise mark it as dead. + */ + relhasindex = RelationGetForm(rel)->relhasindex; + lp = PageGetItemId(page, off); + Assert(ItemIdIsNormal(lp)); + if (relhasindex) + { + ItemIdSetDead(lp); + } + else + { + ItemIdSetUnused(lp); + /* Set hint bit for ZPageAddItem */ + PageSetHasFreeLinePointers(page); + } + + ZPageSetPrunable(page, xid); +} + +/* + * execute_undo_actions_page - Execute the undo actions for a page + * + * After applying all the undo actions for a page, we clear the transaction + * slot on a page if the undo chain for block is complete, otherwise just + * rewind the undo pointer to the last record for that block that precedes + * the last undo record for which action is replayed. + * + * luinfo - list of undo records (along with their location) for which undo + * action needs to be replayed. + * urec_ptr - undo record pointer to which we need to rewind. + * reloid - OID of relation on which undo actions needs to be applied. 
+ * blkno - block number on which undo actions needs to be applied. + * blk_chain_complete - indicates whether the undo chain for block is + * complete. + * nopartial - true if rollback is for complete transaction. If we are not + * rolling back the complete transaction then we need to apply the + * undo action for UNDO_INVALID_XACT_SLOT also because in such + * case we will rewind the insert undo location. + * rellock - if the caller already has the lock on the required relation, + * then this flag is false, i.e. we do not need to acquire any + * lock here. If the flag is true then we need to acquire lock + * here itself, because caller will not be having any lock. + * When we are performing undo actions for prepared transactions, + * or for rollback to savepoint, we need not to lock as we already + * have the lock on the table. In cases like error or when + * rollbacking from the undo worker we need to have proper locks. + * + * returns true, if successfully applied the undo actions, otherwise, false. + */ +static bool +execute_undo_actions_page(List *luinfo, UndoRecPtr urec_ptr, Oid reloid, + TransactionId xid, BlockNumber blkno, + bool blk_chain_complete, bool rellock) +{ + ListCell *l_iter; + Relation rel; + Buffer buffer; + Page page; + UndoRecPtr slot_urec_ptr; + uint32 epoch; + int slot_no = 0; + int tpd_map_size = 0; + char *tpd_offset_map = NULL; + UndoRecInfo *urec_info = (UndoRecInfo *) linitial(luinfo); + Buffer vmbuffer = InvalidBuffer; + bool need_init = false; + bool tpd_page_locked = false; + bool is_tpd_map_updated = false; + + /* + * FIXME: If reloid is not valid then we have nothing to do. In future, + * we might want to do it differently for transactions that perform both + * DDL and DML operations. + */ + if (!OidIsValid(reloid)) + { + elog(LOG, "ignoring undo for invalid reloid"); + return false; + } + + if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(reloid))) + return false; + + /* + * If the action is executed by backend as a result of rollback, we must + * already have an appropriate lock on relation. + */ + if (rellock) + rel = heap_open(reloid, RowExclusiveLock); + else + rel = heap_open(reloid, NoLock); + + if (RelationGetNumberOfBlocks(rel) <= blkno) + { + /* + * This is possible if the underlying relation is truncated just before + * taking the relation lock above. + */ + heap_close(rel, NoLock); + return false; + } + + buffer = ReadBuffer(rel, blkno); + + /* + * If there is a undo action of type UNDO_ITEMID_UNUSED then might need + * to clear visibility_map. Since we cannot call visibilitymap_pin or + * visibilitymap_status within a critical section it shall be called + * here and let it be before taking the buffer lock on page. + */ + foreach(l_iter, luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) lfirst(l_iter); + UnpackedUndoRecord *uur = urec_info->uur; + + if (uur->uur_type == UNDO_ITEMID_UNUSED) + { + visibilitymap_pin(rel, blkno, &vmbuffer); + break; + } + } + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + page = BufferGetPage(buffer); + + /* + * Identify the slot number for this transaction. As we never allow undo + * more than 2-billion transactions, we can compute epoch from xid. + * + * Here, we will always take a lock on the tpd_page, if there is a tpd + * slot on the page. This is required because sometimes we only come to + * know that we need to update the tpd page after applying the undo record. 
+ * Now, the case where this can happen is when during DO operation the + * slot of previous updater is a non-TPD slot, but by the time we came for + * rollback it became a TPD slot which means this information won't be even + * recorded in undo. + */ + epoch = GetEpochForXid(xid); + slot_no = PageGetTransactionSlotId(rel, buffer, epoch, xid, + &slot_urec_ptr, true, true, + &tpd_page_locked); + + /* + * If undo action has been already applied for this page then skip + * the process altogether. If we didn't find a slot corresponding to + * xid, we consider the transaction is already rolledback. + * + * The logno of slot's undo record pointer must be same as the logno + * of undo record to be applied. + */ + if (slot_no == InvalidXactSlotId || + (UndoRecPtrGetLogNo(slot_urec_ptr) != UndoRecPtrGetLogNo(urec_info->urp)) || + (UndoRecPtrGetLogNo(slot_urec_ptr) == UndoRecPtrGetLogNo(urec_ptr) && + slot_urec_ptr <= urec_ptr)) + { + UnlockReleaseBuffer(buffer); + heap_close(rel, NoLock); + + UnlockReleaseTPDBuffers(); + + return false; + } + + /* + * We might need to update the TPD offset map while applying undo actions, + * so get the size of the TPD offset map and allocate the memory to fetch + * that outside the critical section. It is quite possible that the TPD + * entry is already pruned by this time, in which case, we will mark the + * slot as frozen. + * + * XXX It would have been better if we fetch the tpd map only when + * required, but that won't be possible in all cases. Sometimes + * we will come to know only during processing particular undo record. + * Now, we can process the undo records partially outside critical section + * such that we know whether we need TPD map or not, but that seems to + * be overkill. + */ + if (tpd_page_locked) + { + tpd_map_size = TPDPageGetOffsetMapSize(buffer); + if (tpd_map_size > 0) + tpd_offset_map = palloc(tpd_map_size); + } + + START_CRIT_SECTION(); + + foreach(l_iter, luinfo) + { + UndoRecInfo *urec_info = (UndoRecInfo *) lfirst(l_iter); + UnpackedUndoRecord *uur = urec_info->uur; + + /* Skip already applied undo. */ + if (slot_urec_ptr < urec_info->urp) + continue; + + switch (uur->uur_type) + { + case UNDO_INSERT: + { + int i, + nline; + ItemId lp; + + undo_action_insert(rel, page, uur->uur_offset, xid); + + nline = PageGetMaxOffsetNumber(page); + need_init = true; + for (i = FirstOffsetNumber; i <= nline; i++) + { + lp = PageGetItemId(page, i); + if (ItemIdIsUsed(lp) || ItemIdHasPendingXact(lp)) + { + need_init = false; + break; + } + } + } + break; + case UNDO_MULTI_INSERT: + { + OffsetNumber start_offset; + OffsetNumber end_offset; + OffsetNumber iter_offset; + int i, + nline; + ItemId lp; + + start_offset = ((OffsetNumber *) uur->uur_payload.data)[0]; + end_offset = ((OffsetNumber *) uur->uur_payload.data)[1]; + + for (iter_offset = start_offset; + iter_offset <= end_offset; + iter_offset++) + { + undo_action_insert(rel, page, iter_offset, xid); + } + + nline = PageGetMaxOffsetNumber(page); + need_init = true; + for (i = FirstOffsetNumber; i <= nline; i++) + { + lp = PageGetItemId(page, i); + if (ItemIdIsUsed(lp) || ItemIdHasPendingXact(lp)) + { + need_init = false; + break; + } + } + } + break; + case UNDO_DELETE: + case UNDO_UPDATE: + case UNDO_INPLACE_UPDATE: + { + ItemId lp; + ZHeapTupleHeader zhtup; + TransactionId slot_xid; + Size offset = 0; + uint32 undo_tup_len; + int trans_slot; + uint16 infomask; + int prev_trans_slot; + + /* Copy the entire tuple from undo. 
*/ + lp = PageGetItemId(page, uur->uur_offset); + Assert(ItemIdIsNormal(lp)); + zhtup = (ZHeapTupleHeader) PageGetItem(page, lp); + infomask = zhtup->t_infomask; + trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + + undo_tup_len = *((uint32 *) &uur->uur_tuple.data[offset]); + ItemIdChangeLen(lp, undo_tup_len); + /* skip ctid and tableoid stored in undo tuple */ + offset += sizeof(uint32) + sizeof(ItemPointerData) + + sizeof(Oid); + memcpy(zhtup, + (ZHeapTupleHeader) &uur->uur_tuple.data[offset], + undo_tup_len); + + /* + * Fetch previous transaction slot on tuple formed from + * undo record. + */ + prev_trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + + /* + * If the previous version of the tuple points to a TPD + * slot then we need to update the slot in the offset map + * of the TPD entry. But, only if we still have a valid + * TPD entry for the page otherwise the old tuple version + * must be all visible and we can mark the slot as frozen. + */ + if (uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT && + tpd_offset_map) + { + TransactionId prev_slot_xid; + + /* Fetch TPD slot from the undo. */ + if (uur->uur_type == UNDO_UPDATE) + prev_trans_slot = *(int *) ((char *) uur->uur_payload.data + + sizeof(ItemPointerData)); + else + prev_trans_slot = *(int *) uur->uur_payload.data; + + /* + * If the previous transaction slot points to a TPD + * slot then we need to update the slot in the offset + * map of the TPD entry. + * + * This is the case where during DO operation the + * previous updater belongs to a non-TPD slot whereas + * now the same slot has become a TPD slot. In such + * cases, we need to update offset-map. + */ + GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + prev_trans_slot, + NULL, + &prev_slot_xid, + NULL, + false, + true); + + TPDPageSetOffsetMapSlot(buffer, prev_trans_slot, + uur->uur_offset); + + /* Here, we updated TPD offset map, so need to log. */ + if (!is_tpd_map_updated) + is_tpd_map_updated = true; + + /* + * If transaction slot to which tuple point is not + * same as the previous transaction slot, so that we + * need to mark the tuple with a special flag. + */ + if (uur->uur_prevxid != prev_slot_xid) + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + else if (uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + { + ZHeapTupleHeaderSetXactSlot(zhtup, ZHTUP_SLOT_FROZEN); + } + else if (prev_trans_slot == ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) page)) + { + TransactionId prev_slot_xid; + + /* TPD page must be locked by now. */ + Assert(tpd_page_locked); + + /* + * If the previous transaction slot points to a TPD + * slot then we need to update the slot in the offset + * map of the TPD entry. + * + * This is the case where during DO operation the + * previous updater belongs to a non-TPD slot whereas + * now the same slot has become a TPD slot. In such + * cases, we need to update offset-map. + */ + GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + prev_trans_slot, + NULL, + &prev_slot_xid, + NULL, + false, + true); + TPDPageSetOffsetMapSlot(buffer, + ZHEAP_PAGE_TRANS_SLOTS + 1, + uur->uur_offset); + + /* Here, we updated TPD offset map, so need to log. */ + if (!is_tpd_map_updated) + is_tpd_map_updated = true; + + /* + * If transaction slot to which tuple point is not + * same as the previous transaction slot, so that we + * need to mark the tuple with a special flag. 
+ */ + if (uur->uur_prevxid != prev_slot_xid) + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + else + { + trans_slot = GetTransactionSlotInfo(buffer, + uur->uur_offset, + trans_slot, + NULL, + &slot_xid, + NULL, + false, + false); + + if (TransactionIdEquals(uur->uur_prevxid, + FrozenTransactionId)) + { + /* + * If the previous xid is frozen, then we can + * safely mark the tuple as frozen. + */ + ZHeapTupleHeaderSetXactSlot(zhtup, + ZHTUP_SLOT_FROZEN); + } + else if (trans_slot != ZHTUP_SLOT_FROZEN && + uur->uur_prevxid != slot_xid) + { + /* + * If the transaction slot to which tuple point got + * reused by this time, then we need to mark the + * tuple with a special flag. See comments atop + * PageFreezeTransSlots. + */ + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + } + + /* + * We always need to retain the strongest locker + * information on the the tuple (as part of infomask and + * infomask2), if there are multiple lockers on a tuple. + * This is because the conflict detection mechanism works + * based on strongest locker. See + * zheap_update/zheap_delete. We have allowed to override + * the transaction slot information with whatever is + * present in undo as we have taken care during DO + * operation that it contains previous strongest locker + * information. See compute_new_xid_infomask. + */ + if (ZHeapTupleHasMultiLockers(infomask)) + { + /* ZHeapTupleHeaderSetXactSlot(zhtup, trans_slot); */ + zhtup->t_infomask |= ZHEAP_MULTI_LOCKERS; + zhtup->t_infomask &= ~(zhtup->t_infomask & + ZHEAP_LOCK_MASK); + zhtup->t_infomask |= infomask & ZHEAP_LOCK_MASK; + + /* + * If the tuple originally has INVALID_XACT_SLOT set, + * then we need to retain it as that must be the + * information of strongest locker. + */ + if (ZHeapTupleHasInvalidXact(infomask)) + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + } + break; + case UNDO_XID_LOCK_ONLY: + case UNDO_XID_LOCK_FOR_UPDATE: + { + ItemId lp; + ZHeapTupleHeader zhtup, undo_tup_hdr; + uint16 infomask; + + /* Copy the entire tuple from undo. */ + lp = PageGetItemId(page, uur->uur_offset); + Assert(ItemIdIsNormal(lp)); + zhtup = (ZHeapTupleHeader) PageGetItem(page, lp); + infomask = zhtup->t_infomask; + + /* + * Override the tuple header values with values retrieved + * from undo record except when there are multiple + * lockers. In such cases, we want to retain the strongest + * locker information present in infomask and infomask2. + */ + undo_tup_hdr = (ZHeapTupleHeader) uur->uur_tuple.data; + zhtup->t_hoff = undo_tup_hdr->t_hoff; + + if (!(ZHeapTupleHasMultiLockers(infomask))) + { + int trans_slot; + int prev_trans_slot PG_USED_FOR_ASSERTS_ONLY; + TransactionId slot_xid; + + zhtup->t_infomask2 = undo_tup_hdr->t_infomask2; + zhtup->t_infomask = undo_tup_hdr->t_infomask; + + /* + * We need to set the previous slot for tuples that are + * locked for update as such tuples changed the slot + * while acquiring the lock. + */ + if (uur->uur_type == UNDO_XID_LOCK_ONLY) + { + /* + * Set the slot in the tpd offset map. for detailed + * comments refer undo actions of update/delete. + */ + if ((uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) && + tpd_offset_map) + { + TransactionId prev_slot_xid; + + prev_trans_slot = *(int *)((char *)uur->uur_payload.data + + sizeof(LockTupleMode)); + /* + * If the previous transaction slot points to a TPD + * slot then we need to update the slot in the offset + * map of the TPD entry. 
+ * + * This is the case where during DO operation the + * previous updater belongs to a non-TPD slot whereas + * now the same slot has become a TPD slot. In such + * cases, we need to update offset-map. + */ + GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + prev_trans_slot, + NULL, + &prev_slot_xid, + NULL, + false, + true); + + TPDPageSetOffsetMapSlot(buffer, prev_trans_slot, + uur->uur_offset); + + /* + * Here, we updated TPD offset map, so need to + * log. + */ + if (!is_tpd_map_updated) + is_tpd_map_updated = true; + + /* + * If transaction slot to which tuple point is not + * same as the previous transaction slot, so that we + * need to mark the tuple with a special flag. + */ + if (prev_slot_xid != uur->uur_prevxid) + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + else if (uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + prev_trans_slot = ZHTUP_SLOT_FROZEN; + else + prev_trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + + trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + trans_slot = GetTransactionSlotInfo(buffer, + uur->uur_offset, + trans_slot, + NULL, + &slot_xid, + NULL, + false, + false); + + /* + * For a non multi locker case, the slot in undo (and + * hence on tuple) must be either a frozen slot or the + * previous slot. Generally, we always set the multi-locker + * bit on the tuple whenever the tuple slot is not frozen. + * But, if the tuple is inserted/modified by the same + * transaction that later takes a lock on it, we keep the + * transaction slot as it is. + * See compute_new_xid_infomask for details. + */ + Assert(trans_slot == ZHTUP_SLOT_FROZEN || + trans_slot == prev_trans_slot); + } + else + { + /* + * Fetch previous transaction slot on tuple formed from + * undo record. + */ + prev_trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + + /* + * If the previous version of the tuple points to a TPD + * slot then we need to update the slot in the offset map + * of the TPD entry. But, only if we still have a valid + * TPD entry for the page otherwise the old tuple version + * must be all visible and we can mark the slot as frozen. + */ + if (uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT && + tpd_offset_map) + { + TransactionId prev_slot_xid; + + prev_trans_slot = *(int *)((char *)uur->uur_payload.data + sizeof(LockTupleMode)); + + /* + * If the previous transaction slot points to a TPD + * slot then we need to update the slot in the offset + * map of the TPD entry. + * + * This is the case where during DO operation the + * previous updater belongs to a non-TPD slot whereas + * now the same slot has become a TPD slot. In such + * cases, we need to update offset-map. + */ + GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + prev_trans_slot, + NULL, + &prev_slot_xid, + NULL, + false, + true); + + TPDPageSetOffsetMapSlot(buffer, prev_trans_slot, + uur->uur_offset); + + /* Here, we updated TPD offset map, so need to + * log. + */ + if (!is_tpd_map_updated) + is_tpd_map_updated = true; + + /* + * If transaction slot to which tuple point is not + * same as the previous transaction slot, so that we + * need to mark the tuple with a special flag. + */ + if (prev_slot_xid != uur->uur_prevxid) + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + else if (uur->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + { + ZHeapTupleHeaderSetXactSlot(zhtup, ZHTUP_SLOT_FROZEN); + } + else if (prev_trans_slot == ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) page)) + { + TransactionId prev_slot_xid; + + /* TPD page must be locked by now. 
*/ + Assert(tpd_page_locked); + + /* + * If the previous transaction slot points to a TPD + * slot then we need to update the slot in the offset + * map of the TPD entry. + * + * This is the case where during DO operation the + * previous updater belongs to a non-TPD slot whereas + * now the same slot has become a TPD slot. In such + * cases, we need to update offset-map. + */ + GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + prev_trans_slot, + NULL, + &prev_slot_xid, + NULL, + false, + true); + + TPDPageSetOffsetMapSlot(buffer, + ZHEAP_PAGE_TRANS_SLOTS + 1, + uur->uur_offset); + + /* Here, we updated TPD offset map, so need to + * log. + */ + if (!is_tpd_map_updated) + is_tpd_map_updated = true; + + if (prev_slot_xid != uur->uur_prevxid) + { + /* + * Here, transaction slot to which tuple point is not + * same as the previous transaction slot, so that we + * need to mark the tuple with a special flag. + */ + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + } + else + { + trans_slot = ZHeapTupleHeaderGetXactSlot(zhtup); + trans_slot = GetTransactionSlotInfo(buffer, + uur->uur_offset, + trans_slot, + NULL, + &slot_xid, + NULL, + false, + false); + + if (TransactionIdEquals(uur->uur_prevxid, + FrozenTransactionId)) + { + /* + * If the previous xid is frozen, then we can + * safely mark the tuple as frozen. + */ + ZHeapTupleHeaderSetXactSlot(zhtup, + ZHTUP_SLOT_FROZEN); + } + else if (trans_slot != ZHTUP_SLOT_FROZEN && + uur->uur_prevxid != slot_xid) + { + /* + * If the transaction slot to which tuple point got + * reused by this time, then we need to mark the + * tuple with a special flag. See comments atop + * PageFreezeTransSlots. + */ + zhtup->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + } + } + } + } + break; + case UNDO_XID_MULTI_LOCK_ONLY: + break; + case UNDO_ITEMID_UNUSED: + { + int item_count, i; + OffsetNumber *unused; + + unused = ((OffsetNumber *) uur->uur_payload.data); + item_count = (uur->uur_payload.len / sizeof(OffsetNumber)); + + /* + * We need to preserve all the unused items in zheap so + * that they can't be reused till the corresponding index + * entries are removed. So, marking them dead is + * a sufficient indication for the index to remove the + * entry in index. + */ + for (i = 0; i < item_count; i++) + { + ItemId itemid; + + itemid = PageGetItemId(page, unused[i]); + ItemIdSetDead(itemid); + } + + /* clear visibility map */ + Assert(BufferIsValid(vmbuffer)); + visibilitymap_clear(rel, blkno, vmbuffer, + VISIBILITYMAP_VALID_BITS); + + } + break; + default: + elog(ERROR, "unsupported undo record type"); + } + } + + /* + * If the undo chain for the block is complete then set the xid in the slot + * as InvalidTransactionId. But, rewind the slot urec_ptr to the previous + * urec_ptr in the slot. This is to make sure if any transaction reuse the + * transaction slot and rollback then put back the previous transaction's + * urec_ptr. + */ + if (blk_chain_complete) + { + epoch = 0; + xid = InvalidTransactionId; + } + + PageSetTransactionSlotInfo(buffer, slot_no, epoch, xid, urec_ptr); + + MarkBufferDirty(buffer); + + /* + * We are logging the complete page for undo actions, so we don't need to + * record the data for individual operations. We can optimize it by + * recording the data for individual operations, but again if there are + * multiple operations, then it might be better to log the complete page. + * So we can have some threshold above which we always log the complete + * page. 
+ */ + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + uint8 flags = 0; + + if (slot_no > ZHEAP_PAGE_TRANS_SLOTS) + flags |= XLU_PAGE_CONTAINS_TPD_SLOT; + if (BufferIsValid(vmbuffer)) + flags |= XLU_PAGE_CLEAR_VISIBILITY_MAP; + if (is_tpd_map_updated) + { + /* TPD page must be locked. */ + Assert(tpd_page_locked); + /* tpd_offset_map must be non-null. */ + Assert(tpd_offset_map); + flags |= XLU_CONTAINS_TPD_OFFSET_MAP; + } + if (need_init) + flags |= XLU_INIT_PAGE; + + XLogBeginInsert(); + + XLogRegisterData((char *) &flags, sizeof(uint8)); + XLogRegisterBuffer(0, buffer, REGBUF_FORCE_IMAGE | REGBUF_STANDARD); + + /* Log the TPD details, if the transaction slot belongs to TPD. */ + if (flags & XLU_PAGE_CONTAINS_TPD_SLOT) + { + xl_undoaction_page xlrec; + + xlrec.urec_ptr = urec_ptr; + xlrec.xid = xid; + xlrec.trans_slot_id = slot_no; + XLogRegisterData((char *) &xlrec, SizeOfUndoActionPage); + } + + /* + * Log the TPD offset map if we have modified it. + * + * XXX Another option could be that we track all the offset map entries + * of TPD which got modified while applying the undo and only log those + * information into the WAL. + */ + if (is_tpd_map_updated) + { + /* Fetch the TPD offset map and write into the WAL record. */ + TPDPageGetOffsetMap(buffer, tpd_offset_map, tpd_map_size); + XLogRegisterData((char *) tpd_offset_map, tpd_map_size); + } + + if (flags & XLU_PAGE_CONTAINS_TPD_SLOT || + flags & XLU_CONTAINS_TPD_OFFSET_MAP) + { + RegisterTPDBuffer(page, 1); + } + + recptr = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_PAGE); + + PageSetLSN(page, recptr); + if (flags & XLU_PAGE_CONTAINS_TPD_SLOT || + flags & XLU_CONTAINS_TPD_OFFSET_MAP) + TPDPageSetLSN(page, recptr); + } + + /* + * During rollback, if all the itemids are marked as unused, we need + * to initialize the page, so that the next insertion can see the + * page as initialized. This serves two purposes (a) On next insertion, + * we can safely set the XLOG_ZHEAP_INIT_PAGE flag in WAL (OTOH, if we + * don't initialize the page here and set the flag, wal consistency + * checker can complain), (b) we don't accumulate the dead space in the + * page. + * + * Note that we initialize the page after writing WAL because the TPD + * routines use last slot in page to determine TPD block number. + */ + if (need_init) + ZheapInitPage(page, (Size) BLCKSZ); + + END_CRIT_SECTION(); + + /* Free TPD offset map memory. */ + if (tpd_offset_map) + pfree(tpd_offset_map); + + /* + * Release any remaining pin on visibility map page. + */ + if (BufferIsValid(vmbuffer)) + ReleaseBuffer(vmbuffer); + + UnlockReleaseBuffer(buffer); + UnlockReleaseTPDBuffers(); + + heap_close(rel, NoLock); + + return true; +} + +/* + * To return the size of the hash-table for rollbacks. + */ +int +RollbackHTSize(void) +{ + return hash_estimate_size(ROLLBACK_HT_SIZE, sizeof(RollbackHashEntry)); +} + +/* + * To initialize the hash-table for rollbacks in shared memory + * for the given size. + */ +void +InitRollbackHashTable(void) +{ + HASHCTL info; + MemSet(&info, 0, sizeof(info)); + + info.keysize = sizeof(UndoRecPtr); + info.entrysize = sizeof(RollbackHashEntry); + info.hash = tag_hash; + + RollbackHT = ShmemInitHash("Undo actions Lookup Table", + ROLLBACK_HT_SIZE, ROLLBACK_HT_SIZE, &info, + HASH_ELEM | HASH_FUNCTION | HASH_FIXED_SIZE); +} + +/* + * To push the rollback requests from backend to the hash-table. + * Return true if the request is successfully added, else false + * and the caller may execute undo actions itself. 
+ */ +bool +PushRollbackReq(UndoRecPtr start_urec_ptr, UndoRecPtr end_urec_ptr, Oid dbid) +{ + bool found = false; + RollbackHashEntry *rh; + + /* Do not push any rollback request if working in single user-mode */ + if (!IsUnderPostmaster) + return false; + /* + * If the location upto which rollback need to be done is not provided, + * then rollback the complete transaction. + */ + if (start_urec_ptr == InvalidUndoRecPtr) + { + UndoLogNumber logno = UndoRecPtrGetLogNo(end_urec_ptr); + start_urec_ptr = UndoLogGetLastXactStartPoint(logno); + } + + Assert(UndoRecPtrIsValid(start_urec_ptr)); + + /* If there is no space to accomodate new request, then we can't proceed. */ + if (RollbackHTIsFull()) + return false; + + if(!UndoRecPtrIsValid(end_urec_ptr)) + { + UndoLogNumber logno = UndoRecPtrGetLogNo(start_urec_ptr); + end_urec_ptr = UndoLogGetLastXactStartPoint(logno); + } + + LWLockAcquire(RollbackHTLock, LW_EXCLUSIVE); + + rh = (RollbackHashEntry *) hash_search(RollbackHT, &start_urec_ptr, + HASH_ENTER_NULL, &found); + if (!rh) + { + LWLockRelease(RollbackHTLock); + return false; + } + /* We shouldn't try to push the same rollback request again. */ + if (!found) + { + rh->start_urec_ptr = start_urec_ptr; + rh->end_urec_ptr = end_urec_ptr; + rh->dbid = (dbid == InvalidOid) ? MyDatabaseId : dbid; + } + LWLockRelease(RollbackHTLock); + + return true; +} + +/* + * To perform the undo actions for the transactions whose rollback + * requests are in hash table. Sequentially, scan the hash-table + * and perform the undo-actions for the respective transactions. + * Once, the undo-actions are applied, remove the entry from the + * hash table. + */ +void +RollbackFromHT(Oid dbid) +{ + UndoRecPtr start[ROLLBACK_HT_SIZE]; + UndoRecPtr end[ROLLBACK_HT_SIZE]; + RollbackHashEntry *rh; + HASH_SEQ_STATUS status; + int i = 0; + + /* Fetch the rollback requests */ + LWLockAcquire(RollbackHTLock, LW_SHARED); + + Assert(hash_get_num_entries(RollbackHT) <= ROLLBACK_HT_SIZE); + hash_seq_init(&status, RollbackHT); + while (RollbackHT != NULL && + (rh = (RollbackHashEntry *) hash_seq_search(&status)) != NULL) + { + if (rh->dbid == dbid) + { + start[i] = rh->start_urec_ptr; + end[i] = rh->end_urec_ptr; + i++; + } + } + + LWLockRelease(RollbackHTLock); + + /* Execute the rollback requests */ + while(--i >= 0) + { + Assert(UndoRecPtrIsValid(start[i])); + Assert(UndoRecPtrIsValid(end[i])); + + StartTransactionCommand(); + execute_undo_actions(start[i], end[i], true, false, true); + CommitTransactionCommand(); + } +} + +/* + * Remove the rollback request entry from the rollback hash table. + */ +static void +RollbackHTRemoveEntry(UndoRecPtr start_urec_ptr) +{ + LWLockAcquire(RollbackHTLock, LW_EXCLUSIVE); + + hash_search(RollbackHT, &start_urec_ptr, HASH_REMOVE, NULL); + + LWLockRelease(RollbackHTLock); +} + +/* + * To check if the rollback requests in the hash table are all + * completed or not. This is required because we don't not want to + * expose RollbackHT in xact.c, where it is required to ensure + * that we push the resuests only when there is some space in + * the hash-table. + */ +bool +RollbackHTIsFull(void) +{ + bool result = false; + + LWLockAcquire(RollbackHTLock, LW_SHARED); + + if (hash_get_num_entries(RollbackHT) >= ROLLBACK_HT_SIZE) + result = true; + + LWLockRelease(RollbackHTLock); + + return result; +} + +/* + * Get database list from the rollback hash table. 
+ */ +List * +RollbackHTGetDBList() +{ + HASH_SEQ_STATUS status; + RollbackHashEntry *rh; + List *dblist = NIL; + + /* Fetch the rollback requests */ + LWLockAcquire(RollbackHTLock, LW_SHARED); + + hash_seq_init(&status, RollbackHT); + while (RollbackHT != NULL && + (rh = (RollbackHashEntry *) hash_seq_search(&status)) != NULL) + dblist = list_append_unique_oid(dblist, rh->dbid); + + LWLockRelease(RollbackHTLock); + + return dblist; +} + +/* + * ConditionTransactionUndoActionLock + * + * Insert a lock showing that the undo action for given transaction is in + * progress. This is only done for the main transaction not for the + * sub-transaction. + */ +bool +ConditionTransactionUndoActionLock(TransactionId xid) +{ + LOCKTAG tag; + + SET_LOCKTAG_TRANSACTION_UNDOACTION(tag, xid); + + if (LOCKACQUIRE_NOT_AVAIL == LockAcquire(&tag, ExclusiveLock, false, true)) + return false; + else + return true; +} + +/* + * TransactionUndoActionLockRelease + * + * Delete the lock showing that the undo action given transaction ID is in + * progress. + */ +void +TransactionUndoActionLockRelease(TransactionId xid) +{ + LOCKTAG tag; + + SET_LOCKTAG_TRANSACTION_UNDOACTION(tag, xid); + + LockRelease(&tag, ExclusiveLock, false); +} diff --git a/src/backend/access/undo/undoactionxlog.c b/src/backend/access/undo/undoactionxlog.c new file mode 100644 index 0000000000..1b8c6306ba --- /dev/null +++ b/src/backend/access/undo/undoactionxlog.c @@ -0,0 +1,233 @@ +/*------------------------------------------------------------------------- + * + * undoactionxlog.c + * WAL replay logic for undo actions. + * + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/undo/undoactionxlog.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/undoaction_xlog.h" +#include "access/visibilitymap.h" +#include "access/xlog.h" +#include "access/xlogutils.h" +#include "access/zheap.h" + +#if 0 +static void +undo_xlog_insert(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_insert *xlrec = (xl_undo_insert *) XLogRecGetData(record); + Buffer buffer; + Page page; + ItemId lp; + XLogRedoAction action; + + action = XLogReadBufferForRedo(record, 0, &buffer); + if (action == BLK_NEEDS_REDO) + { + page = BufferGetPage(buffer); + + lp = PageGetItemId(page, xlrec->offnum); + if (xlrec->relhasindex) + { + ItemIdSetDead(lp); + } + else + { + ItemIdSetUnused(lp); + /* Set hint bit for ZPageAddItem */ + /*PageSetHasFreeLinePointers(page);*/ + } + + PageSetLSN(BufferGetPage(buffer), lsn); + MarkBufferDirty(buffer); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); +} +#endif + +/* + * replay of undo page operation + */ +static void +undo_xlog_page(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer buf; + xl_undoaction_page *xlrec = NULL; + char *offsetmap = NULL, + *data = NULL; + XLogRedoAction action; + uint8 *flags = (uint8 *) XLogRecGetData(record); + + if (*flags & XLU_PAGE_CONTAINS_TPD_SLOT || + *flags & XLU_CONTAINS_TPD_OFFSET_MAP) + { + data = (char *) flags + sizeof(uint8); + if (*flags & XLU_PAGE_CONTAINS_TPD_SLOT) + { + xlrec = (xl_undoaction_page *) data; + data += sizeof(xl_undoaction_page); + } + if (*flags & XLU_CONTAINS_TPD_OFFSET_MAP) + offsetmap = data; + } + + if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED) + elog(ERROR, 
"Undo page record did not contain a full-page image"); + + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + uint32 xid_epoch = 0; + + /* + * We need to replay the record for TPD only when this record contains + * slot from TPD. + */ + Assert(*flags & XLU_PAGE_CONTAINS_TPD_SLOT || + *flags & XLU_CONTAINS_TPD_OFFSET_MAP); + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + if (*flags & XLU_PAGE_CONTAINS_TPD_SLOT) + { + if (TransactionIdIsValid(xlrec->xid)) + xid_epoch = GetEpochForXid(xlrec->xid); + TPDPageSetTransactionSlotInfo(buf, xlrec->trans_slot_id, + xid_epoch, + xlrec->xid, xlrec->urec_ptr); + } + + if (offsetmap) + TPDPageSetOffsetMap(buf, offsetmap); + + TPDPageSetLSN(BufferGetPage(buf), lsn); + } + } + + if (*flags & XLU_PAGE_CLEAR_VISIBILITY_MAP) + { + Relation reln; + Buffer vmbuffer = InvalidBuffer; + RelFileNode target_node; + BlockNumber blkno; + + XLogRecGetBlockTag(record, 0, &target_node, NULL, &blkno); + reln = CreateFakeRelcacheEntry(target_node); + visibilitymap_pin(reln, blkno, &vmbuffer); + visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + FreeFakeRelcacheEntry(reln); + } + + /* + * Reset Page only at the end if asked, page level flag + * PD_PAGE_HAS_TPD_SLOT and TPD slot are needed before that TPD routines. + */ + if (*flags & XLU_INIT_PAGE) + ZheapInitPage(BufferGetPage(buf), (Size) BLCKSZ); + + UnlockReleaseBuffer(buf); + UnlockReleaseTPDBuffers(); +} + +/* + * replay of undo reset slot operation + */ +static void +undo_xlog_reset_xid(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undoaction_reset_slot *xlrec = (xl_undoaction_reset_slot *) XLogRecGetData(record); + Buffer buf; + XLogRedoAction action; + + action = XLogReadBufferForRedo(record, 0, &buf); + + /* + * Reseting the TPD slot is handled separately so only handle the page + * slot here. + */ + if (action == BLK_NEEDS_REDO && + xlrec->trans_slot_id <= ZHEAP_PAGE_TRANS_SLOTS) + { + Page page; + ZHeapPageOpaque opaque; + int slot_no = xlrec->trans_slot_id; + + page = BufferGetPage(buf); + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + opaque->transinfo[slot_no - 1].xid_epoch = 0; + opaque->transinfo[slot_no - 1].xid = InvalidTransactionId; + opaque->transinfo[slot_no - 1].urec_ptr = xlrec->urec_ptr; + + PageSetLSN(page, lsn); + MarkBufferDirty(buf); + } + + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + Assert(xlrec->flags & XLU_RESET_CONTAINS_TPD_SLOT); + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + TPDPageSetTransactionSlotInfo(buf, xlrec->trans_slot_id, + 0, InvalidTransactionId, + xlrec->urec_ptr); + TPDPageSetLSN(BufferGetPage(buf), lsn); + } + } + + if (BufferIsValid(buf)) + UnlockReleaseBuffer(buf); + UnlockReleaseTPDBuffers(); +} + +/* + * Replay of undo apply progress. + */ +static void +undo_xlog_apply_progress(XLogReaderState *record) +{ + xl_undoapply_progress *xlrec = (xl_undoapply_progress *) XLogRecGetData(record); + + /* Update the progress in the transaction header. 
*/ + PrepareUpdateUndoActionProgress(xlrec->urec_ptr, xlrec->progress); + UndoRecordUpdateTransInfo(); + UnlockReleaseUndoBuffers(); +} + +void +undoaction_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info) + { + case XLOG_UNDO_PAGE: + undo_xlog_page(record); + break; + case XLOG_UNDO_RESET_SLOT: + undo_xlog_reset_xid(record); + break; + case XLOG_UNDO_APPLY_PROGRESS: + undo_xlog_apply_progress(record); + break; + default: + elog(PANIC, "undoaction_redo: unknown op code %u", info); + } +} diff --git a/src/backend/access/undo/undodiscard.c b/src/backend/access/undo/undodiscard.c new file mode 100644 index 0000000000..2464fb6dc2 --- /dev/null +++ b/src/backend/access/undo/undodiscard.c @@ -0,0 +1,469 @@ +/*------------------------------------------------------------------------- + * + * undodiscard.c + * discard undo records + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/backend/access/undo/undodiscard.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/xact.h" +#include "access/xlog.h" +#include "access/undolog.h" +#include "access/undodiscard.h" +#include "catalog/pg_tablespace.h" +#include "miscadmin.h" +#include "storage/block.h" +#include "storage/buf.h" +#include "storage/bufmgr.h" +#include "storage/shmem.h" +#include "storage/proc.h" +#include "utils/resowner.h" +#include "postmaster/undoloop.h" + +static UndoRecPtr FetchLatestUndoPtrForXid(UndoRecPtr urecptr, + UnpackedUndoRecord *uur_start, + UndoLogControl *log); + +/* + * Discard the undo for the log + * + * Search the undo log, get the start record for each transaction until we get + * the transaction with xid >= xmin or an invalid xid. Then call undolog + * routine to discard upto that point and update the memory structure for the + * log slot. We set the hibernate flag if we do not have any undo logs, this + * flag is passed to the undo worker wherein it determines if system is idle + * and it should sleep for sometime. + * + * Return the oldest xid remaining in this undo log (which should be >= xmin, + * since we'll discard everything older). Return InvalidTransactionId if the + * undo log is empty. + */ +static TransactionId +UndoDiscardOneLog(UndoLogControl *log, TransactionId xmin, bool *hibernate) +{ + UndoRecPtr undo_recptr, next_insert, from_urecptr; + UndoRecPtr next_urecptr = InvalidUndoRecPtr; + UnpackedUndoRecord *uur = NULL; + bool need_discard = false; + bool log_complete = false; + TransactionId undoxid = InvalidTransactionId; + TransactionId xid = log->oldest_xid; + TransactionId latest_discardxid = InvalidTransactionId; + uint32 epoch = 0; + + undo_recptr = log->oldest_data; + + /* There might not be any undo log and hibernation might be needed. */ + *hibernate = true; + + /* Loop until we run out of discardable transactions. */ + do + { + bool pending_abort = false; + + next_insert = UndoLogGetNextInsertPtr(log->logno, xid); + + /* + * If the next insert location in the undo log is same as the oldest + * data for the log then there is nothing more to discard in this log + * so discard upto this point. + */ + if (next_insert == undo_recptr) + { + /* + * If the discard location and the insert location is same then + * there is nothing to discard. 
+ */ + if (undo_recptr == log->oldest_data) + break; + else + log_complete = true; + } + else + { + /* Fetch the undo record for given undo_recptr. */ + uur = UndoFetchRecord(undo_recptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, + NULL, NULL); + + Assert(uur != NULL); + + if (!TransactionIdDidCommit(uur->uur_xid) && + TransactionIdPrecedes(uur->uur_xid, xmin) && + uur->uur_progress == 0) + { + /* + * At the time of recovery, we might not have a valid next undo + * record pointer and in that case we'll calculate the location + * of from pointer using the last record of next insert + * location. + */ + if (ConditionTransactionUndoActionLock(uur->uur_xid)) + { + TransactionId xid = uur->uur_xid; + UndoLogControl *log = NULL; + UndoLogNumber logno; + + logno = UndoRecPtrGetLogNo(undo_recptr); + log = UndoLogGet(logno); + + /* + * If the corresponding log got rewinded to a location + * prior to undo_recptr, the undo actions are already + * applied. + */ + if (MakeUndoRecPtr(logno, log->meta.insert) > undo_recptr) + { + UndoRecordRelease(uur); + + /* Fetch the undo record under undo action lock. */ + uur = UndoFetchRecord(undo_recptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, + NULL, NULL); + /* + * If the undo actions for the aborted transaction is + * already applied then continue discarding the undo log + * otherwise discard till current point and stop processing + * this undo log. + * Also, check this is indeed the transaction id we're + * looking for. It is possible that after rewinding + * some other transaction has inserted an undo record. + */ + if (uur->uur_xid == xid && uur->uur_progress == 0) + { + from_urecptr = FetchLatestUndoPtrForXid(undo_recptr, uur, log); + (void)PushRollbackReq(from_urecptr, undo_recptr, uur->uur_dbid); + pending_abort = true; + } + } + + TransactionUndoActionLockRelease(xid); + } + else + pending_abort = true; + } + + next_urecptr = uur->uur_next; + undoxid = uur->uur_xid; + xid = undoxid; + epoch = uur->uur_xidepoch; + } + + /* we can discard upto this point. */ + if (TransactionIdFollowsOrEquals(undoxid, xmin) || + next_urecptr == InvalidUndoRecPtr || + UndoRecPtrGetLogNo(next_urecptr) != log->logno || + log_complete || pending_abort) + { + /* Hey, I got some undo log to discard, can not hibernate now. */ + *hibernate = false; + + if (uur != NULL) + UndoRecordRelease(uur); + + /* + * If Transaction id is smaller than the xmin that means this must + * be the last transaction in this undo log, so we need to get the + * last insert point in this undo log and discard till that point. + * Also, if the transaction has pending abort, we stop discarding + * undo from the same location. + */ + if (TransactionIdPrecedes(undoxid, xmin) && !pending_abort) + { + UndoRecPtr next_insert = InvalidUndoRecPtr; + + /* + * Get the last insert location for this transaction Id, if it + * returns invalid pointer that means there is new transaction + * has started for this undolog. So we need to refetch the undo + * and continue the process. + */ + next_insert = UndoLogGetNextInsertPtr(log->logno, undoxid); + if (!UndoRecPtrIsValid(next_insert)) + continue; + + undo_recptr = next_insert; + need_discard = true; + epoch = 0; + latest_discardxid = undoxid; + undoxid = InvalidTransactionId; + } + + LWLockAcquire(&log->discard_lock, LW_EXCLUSIVE); + + /* + * If no more pending undo logs then set the oldest transaction to + * InvalidTransactionId. 
+ */ + if (log_complete) + { + log->oldest_xid = InvalidTransactionId; + log->oldest_xidepoch = 0; + } + else + { + log->oldest_xid = undoxid; + log->oldest_xidepoch = epoch; + } + + log->oldest_data = undo_recptr; + LWLockRelease(&log->discard_lock); + + if (need_discard) + UndoLogDiscard(undo_recptr, latest_discardxid); + + break; + } + + /* + * This transaction is smaller than the xmin so lets jump to the next + * transaction. + */ + undo_recptr = next_urecptr; + latest_discardxid = undoxid; + + if(uur != NULL) + { + UndoRecordRelease(uur); + uur = NULL; + } + + need_discard = true; + } while (true); + + return undoxid; +} + +/* + * Discard the undo for all the transaction whose xid is smaller than xmin + * + * Check the DiscardInfo memory array for each slot (every undo log) , process + * the undo log for all the slot which have xid smaller than xmin or invalid + * xid. Fetch the record from the undo log transaction by transaction until we + * find the xid which is not smaller than xmin. + */ +void +UndoDiscard(TransactionId oldestXmin, bool *hibernate) +{ + TransactionId oldestXidHavingUndo = oldestXmin; + uint64 epoch = GetEpochForXid(oldestXmin); + UndoLogControl *log = NULL; + + /* + * TODO: Ideally we'd arrange undo logs so that we can efficiently find + * those with oldest_xid < oldestXmin, but for now we'll just scan all of + * them. + */ + while ((log = UndoLogNext(log))) + { + TransactionId oldest_xid = InvalidTransactionId; + + /* We can't process temporary undo logs. */ + if (log->meta.persistence == UNDO_TEMP) + continue; + + /* + * If the first xid of the undo log is smaller than the xmin the try + * to discard the undo log. + */ + if (TransactionIdPrecedes(log->oldest_xid, oldestXmin)) + { + /* + * If the XID in the discard entry is invalid then start scanning + * from the first valid undorecord in the log. + */ + if (!TransactionIdIsValid(log->oldest_xid)) + { + UndoRecPtr urp = UndoLogGetFirstValidRecord(log->logno); + + if (!UndoRecPtrIsValid(urp)) + continue; + + LWLockAcquire(&log->discard_lock, LW_SHARED); + log->oldest_data = urp; + LWLockRelease(&log->discard_lock); + } + + /* Process the undo log. */ + oldest_xid = UndoDiscardOneLog(log, oldestXmin, hibernate); + } + + if (TransactionIdIsValid(oldest_xid) && + TransactionIdPrecedes(oldest_xid, oldestXidHavingUndo)) + { + oldestXidHavingUndo = oldest_xid; + epoch = GetEpochForXid(oldest_xid); + } + } + + /* + * Update the oldestXidWithEpochHavingUndo in the shared memory. + * + * XXX In future if multiple worker can perform discard then we may need + * to use compare and swap for updating the shared memory value. + */ + pg_atomic_write_u64(&ProcGlobal->oldestXidWithEpochHavingUndo, + MakeEpochXid(epoch, oldestXidHavingUndo)); +} + +/* + * To discard all the logs. Particularly required in single user mode. + * At the commit time, discard all the undo logs. + */ +void +UndoLogDiscardAll() +{ + UndoLogControl *log = NULL; + + Assert(!IsUnderPostmaster); + + while ((log = UndoLogNext(log))) + { + /* + * Process the undo log. No locks are required for discard, + * since this called only in single-user mode. Similarly, + * no transaction id is required here because WAL-logging the + * xid till whom the undo is discarded will not be required + * for single user mode. + */ + UndoLogDiscard(MakeUndoRecPtr(log->logno, log->meta.insert), + InvalidTransactionId); + } + +} +/* + * Fetch the latest urec pointer for the transaction. 
+ */ +UndoRecPtr +FetchLatestUndoPtrForXid(UndoRecPtr urecptr, UnpackedUndoRecord *uur_start, + UndoLogControl *log) +{ + UndoRecPtr next_urecptr, from_urecptr; + uint16 prevlen; + UndoLogOffset next_insert; + UnpackedUndoRecord *uur; + bool refetch = false; + + uur = uur_start; + + while (true) + { + /* Fetch the undo record again if required. */ + if (refetch) + { + uur = UndoFetchRecord(urecptr, InvalidBlockNumber, + InvalidOffsetNumber, InvalidTransactionId, + NULL, NULL); + refetch = false; + } + + next_urecptr = uur->uur_next; + prevlen = UndoLogGetPrevLen(log->logno); + + /* + * If this is the last transaction in the log then calculate the latest + * urec pointer using the next insert location of the undo log. Otherwise, + * calculate it using the next transaction's start pointer. + */ + if (uur->uur_next == InvalidUndoRecPtr) + { + /* + * If a new transaction has already started in this log while we + * were fetching the next insert location, re-fetch the undo + * record. + */ + next_insert = UndoLogGetNextInsertPtr(log->logno, uur->uur_xid); + if (!UndoRecPtrIsValid(next_insert)) + { + if (uur != uur_start) + UndoRecordRelease(uur); + refetch = true; + continue; + } + + from_urecptr = UndoGetPrevUndoRecptr(next_insert, prevlen); + break; + } + else if ((UndoRecPtrGetLogNo(next_urecptr) != log->logno) && + UndoLogIsDiscarded(next_urecptr)) + { + /* + * If next_urecptr is in a different undo log and is already + * discarded, the undo actions of this transaction that live in the + * next log have already been executed, and we only need to execute + * the ones remaining in this log. + */ + next_insert = UndoLogGetNextInsertPtr(log->logno, uur->uur_xid); + + Assert(UndoRecPtrIsValid(next_insert)); + from_urecptr = UndoGetPrevUndoRecptr(next_insert, prevlen); + break; + } + else + { + UnpackedUndoRecord *next_uur; + + next_uur = UndoFetchRecord(next_urecptr, + InvalidBlockNumber, + InvalidOffsetNumber, + InvalidTransactionId, + NULL, NULL); + /* + * If the next_urecptr is in the same log then calculate the + * from pointer using prevlen. + */ + if (UndoRecPtrGetLogNo(next_urecptr) == log->logno) + { + from_urecptr = + UndoGetPrevUndoRecptr(next_urecptr, next_uur->uur_prevlen); + UndoRecordRelease(next_uur); + break; + } + else + { + /* + * The transaction overflowed into the next log, so restart + * the processing from the next log. + */ + log = UndoLogGet(UndoRecPtrGetLogNo(next_urecptr)); + if (uur != uur_start) + UndoRecordRelease(uur); + uur = next_uur; + continue; + } + } + } + + if (uur != uur_start) + UndoRecordRelease(uur); + + return from_urecptr; +} + +/* + * Discard the undo logs for temp tables. + */ +void +TempUndoDiscard(UndoLogNumber logno) +{ + UndoLogControl *log = UndoLogGet(logno); + + /* + * Discard the undo log for temp tables only. Ensure that there is + * something to be discarded there. + */ + Assert(log->meta.persistence == UNDO_TEMP); + + /* Process the undo log.
*/ + UndoLogDiscard(MakeUndoRecPtr(log->logno, log->meta.insert), + InvalidTransactionId); +} diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c new file mode 100644 index 0000000000..ccb8f6b351 --- /dev/null +++ b/src/backend/access/undo/undoinsert.c @@ -0,0 +1,1245 @@ +/*------------------------------------------------------------------------- + * + * undoinsert.c + * entry points for inserting undo records + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/backend/access/undo/undoinsert.c + * + * NOTES: + * Undo record layout: + * + * Undo records are stored in sequential order in the undo log. Each + * transaction's first undo record (a.k.a. the transaction header) points to + * the next transaction's start header. Transaction headers are linked so that + * the discard worker can read the undo log transaction by transaction and + * avoid reading each undo record. + * + * Handling multiple logs: + * + * It is possible that the undo records of a transaction are spread across + * multiple undo logs, and we need some special handling while inserting the + * undo so that discard and rollback work sanely. + * + * If an undo record goes to the next log, then we insert a transaction header + * for the first record in the new log and update the transaction header with + * this new log's location. This allows us to connect transactions across logs + * when the same transaction spans multiple logs (for this we keep track of the + * previous logno in the undo log meta), which is required to find the latest + * undo record pointer of an aborted transaction for executing the undo actions + * before discard. If the next log gets processed first, we don't need to trace + * back the actual start pointer of the transaction; in that case we can + * execute the undo actions from the current log only, because the undo pointer + * in the slot will be rewound and that is enough to avoid executing the same + * actions again. However, it is possible that after executing the undo actions + * the undo pointer gets discarded; later, while processing the previous log, + * we might try to fetch an undo record from the discarded log while chasing + * the transaction header chain. To avoid this situation we first check whether + * the next_urec of the transaction is already discarded; if so, there is no + * need to access it and we start executing from the last undo record in the + * current log. + * + * We only connect to the next log if the same transaction spreads to the next + * log; otherwise we don't. + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/subtrans.h" +#include "access/xact.h" +#include "access/xlog.h" +#include "access/undorecord.h" +#include "access/undoinsert.h" +#include "catalog/pg_tablespace.h" +#include "storage/block.h" +#include "storage/buf.h" +#include "storage/buf_internals.h" +#include "storage/bufmgr.h" +#include "miscadmin.h" +#include "commands/tablecmds.h" + +/* + * XXX Do we want to support an undo tuple size that is more than BLCKSZ? + * If not, then an undo record can spread across 2 buffers at the most. + */ +#define MAX_BUFFER_PER_UNDO 2 + +/* + * This defines the number of undo records that can be prepared before + * calling insert by default. If you need to prepare more than + * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize + * first.
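+ * For example (an illustrative sketch), an operation such as multi-insert that needs one undo record per tuple and therefore more than + * MAX_PREPARED_UNDO records must call UndoSetPrepareSize() with the total count before making its PrepareUndoInsert() calls.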
+ */ +#define MAX_PREPARED_UNDO 2 + +/* + * Consider buffers needed for updating previous transaction's + * starting undo record. Hence increased by 1. + */ +#define MAX_UNDO_BUFFERS (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO + +/* + * Previous top transaction id which inserted the undo. Whenever a new main + * transaction try to prepare an undo record we will check if its txid not the + * same as prev_txid then we will insert the start undo record. + */ +static TransactionId prev_txid[UndoPersistenceLevels] = { 0 }; + +/* Undo block number to buffer mapping. */ +typedef struct UndoBuffers +{ + UndoLogNumber logno; /* Undo log number */ + BlockNumber blk; /* block number */ + Buffer buf; /* buffer allocated for the block */ +} UndoBuffers; + +static UndoBuffers def_buffers[MAX_UNDO_BUFFERS]; +static int buffer_idx; + +/* + * Structure to hold the prepared undo information. + */ +typedef struct PreparedUndoSpace +{ + UndoRecPtr urp; /* undo record pointer */ + UnpackedUndoRecord *urec; /* undo record */ + int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array + * index */ +} PreparedUndoSpace; + +static PreparedUndoSpace def_prepared[MAX_PREPARED_UNDO]; +static int prepare_idx; +static int max_prepared_undo = MAX_PREPARED_UNDO; +static UndoRecPtr prepared_urec_ptr = InvalidUndoRecPtr; +static bool update_prev_header = false; + +/* + * By default prepared_undo and undo_buffer points to the static memory. + * In case caller wants to support more than default max_prepared undo records + * then the limit can be increased by calling UndoSetPrepareSize function. + * Therein, dynamic memory will be allocated and prepared_undo and undo_buffer + * will start pointing to newly allocated memory, which will be released by + * UnlockReleaseUndoBuffers and these variables will again set back to their + * default values. + */ +static PreparedUndoSpace *prepared_undo = def_prepared; +static UndoBuffers *undo_buffer = def_buffers; + +/* + * Structure to hold the previous transaction's undo update information. This + * is populated while current transaction is updating its undo record pointer + * in previous transactions first undo record. + */ +typedef struct XactUndoRecordInfo +{ + UndoRecPtr urecptr; /* txn's start urecptr */ + int idx_undo_buffers[MAX_BUFFER_PER_UNDO]; + UnpackedUndoRecord uur; /* undo record header */ +} XactUndoRecordInfo; + +static XactUndoRecordInfo xact_urec_info; + +/* Prototypes for static functions. */ +static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec, + UndoRecPtr urp, RelFileNode rnode, + UndoPersistence persistence); +static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr, + bool log_switched); +static int UndoGetBufferSlot(RelFileNode rnode, BlockNumber blk, + ReadBufferMode rbm, + UndoPersistence persistence); +static bool UndoRecordIsValid(UndoLogControl * log, + UndoRecPtr urp); + +/* + * Check whether the undo record is discarded or not. If it's already discarded + * return false otherwise return true. + * + * Caller must hold lock on log->discard_lock. This function will release the + * lock if return false otherwise lock will be held on return and the caller + * need to release it. 
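+ * + * Typical caller pattern (an illustrative sketch; it mirrors the callers in this file): + * + * LWLockAcquire(&log->discard_lock, LW_SHARED); + * if (!UndoRecordIsValid(log, urp)) + * return; + * ... read the undo record ... + * LWLockRelease(&log->discard_lock); + * + * where the early return relies on UndoRecordIsValid having already released the lock.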
+ */ +static bool +UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp) +{ + Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED)); + + if (log->oldest_data == InvalidUndoRecPtr) + { + /* + * oldest_data is only initialized when the DiscardWorker first time + * attempts to discard undo logs so we can not rely on this value to + * identify whether the undo record pointer is already discarded or + * not so we can check it by calling undo log routine. If its not yet + * discarded then we have to reacquire the log->discard_lock so that + * the doesn't get discarded concurrently. + */ + LWLockRelease(&log->discard_lock); + if (UndoLogIsDiscarded(urp)) + return false; + LWLockAcquire(&log->discard_lock, LW_SHARED); + } + + /* Check again if it's already discarded. */ + if (urp < log->oldest_data) + { + LWLockRelease(&log->discard_lock); + return false; + } + + return true; +} + +/* + * Prepare to update the previous transaction's next undo pointer to maintain + * the transaction chain in the undo. This will read the header of the first + * undo record of the previous transaction and lock the necessary buffers. + * The actual update will be done by UndoRecordUpdateTransInfo under the + * critical section. + */ +static void +UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched) +{ + UndoRecPtr xact_urp; + Buffer buffer = InvalidBuffer; + BlockNumber cur_blk; + RelFileNode rnode; + UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr); + UndoLogControl *log; + Page page; + int already_decoded = 0; + int starting_byte; + int bufidx; + int index = 0; + + log = UndoLogGet(logno); + + if (log_switched) + { + Assert(log->meta.prevlogno != InvalidUndoLogNumber); + log = UndoLogGet(log->meta.prevlogno); + } + + /* + * Temporary undo logs are discarded on transaction commit so we don't + * need to do anything. + */ + if (log->meta.persistence == UNDO_TEMP) + return; + + /* + * We can read the previous transaction's location without locking, + * because only the backend attached to the log can write to it (or we're + * in recovery). + */ + Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched); + + if (log->meta.last_xact_start == 0) + xact_urp = InvalidUndoRecPtr; + else + xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start); + + /* + * The absence of previous transaction's undo indicate that this backend + * is preparing its first undo in which case we have nothing to update. + */ + if (!UndoRecPtrIsValid(xact_urp)) + return; + + /* + * Acquire the discard lock before accessing the undo record so that + * discard worker doesn't remove the record while we are in process of + * reading it. + */ + LWLockAcquire(&log->discard_lock, LW_SHARED); + + /* + * The absence of previous transaction's undo indicate that this backend + * is preparing its first undo in which case we have nothing to update. + * UndoRecordIsValid will release the lock if it returns false. + */ + if (!UndoRecordIsValid(log, xact_urp)) + return; + + UndoRecPtrAssignRelFileNode(rnode, xact_urp); + cur_blk = UndoRecPtrGetBlockNum(xact_urp); + starting_byte = UndoRecPtrGetPageOffset(xact_urp); + + /* + * Read undo record header in by calling UnpackUndoRecord, if the undo + * record header is split across buffers then we need to read the complete + * header by invoking UnpackUndoRecord multiple times. 
+ */ + while (true) + { + bufidx = UndoGetBufferSlot(rnode, cur_blk, + RBM_NORMAL, + log->meta.persistence); + xact_urec_info.idx_undo_buffers[index++] = bufidx; + buffer = undo_buffer[bufidx].buf; + page = BufferGetPage(buffer); + + if (UnpackUndoRecord(&xact_urec_info.uur, page, starting_byte, + &already_decoded, true)) + break; + + /* Could not fetch the complete header so go to the next block. */ + starting_byte = UndoLogBlockHeaderSize; + cur_blk++; + } + + xact_urec_info.uur.uur_next = urecptr; + xact_urec_info.urecptr = xact_urp; + LWLockRelease(&log->discard_lock); +} + +/* + * Update the progress of the undo record in the transaction header. + */ +void +PrepareUpdateUndoActionProgress(UndoRecPtr urecptr, int progress) +{ + Buffer buffer = InvalidBuffer; + BlockNumber cur_blk; + RelFileNode rnode; + UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr); + UndoLogControl *log; + Page page; + int already_decoded = 0; + int starting_byte; + int bufidx; + int index = 0; + + log = UndoLogGet(logno); + + if (log->meta.persistence == UNDO_TEMP) + return; + + UndoRecPtrAssignRelFileNode(rnode, urecptr); + cur_blk = UndoRecPtrGetBlockNum(urecptr); + starting_byte = UndoRecPtrGetPageOffset(urecptr); + + while (true) + { + bufidx = UndoGetBufferSlot(rnode, cur_blk, + RBM_NORMAL, + log->meta.persistence); + xact_urec_info.idx_undo_buffers[index++] = bufidx; + buffer = undo_buffer[bufidx].buf; + page = BufferGetPage(buffer); + + if (UnpackUndoRecord(&xact_urec_info.uur, page, starting_byte, + &already_decoded, true)) + break; + + starting_byte = UndoLogBlockHeaderSize; + cur_blk++; + } + + xact_urec_info.urecptr = urecptr; + xact_urec_info.uur.uur_progress = progress; +} + +/* + * Overwrite the first undo record of the previous transaction to update its + * next pointer. This will just insert the already prepared record by + * UndoRecordPrepareTransInfo. This must be called under the critical section. + * This will just overwrite the undo header not the data. + */ +void +UndoRecordUpdateTransInfo(void) +{ + UndoLogNumber logno = UndoRecPtrGetLogNo(xact_urec_info.urecptr); + Page page; + int starting_byte; + int already_written = 0; + int idx = 0; + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + UndoLogControl *log; + + log = UndoLogGet(logno); + urec_ptr = xact_urec_info.urecptr; + + /* + * Acquire the discard lock before accessing the undo record so that + * discard worker can't remove the record while we are in process of + * reading it. + */ + LWLockAcquire(&log->discard_lock, LW_SHARED); + + if (!UndoRecordIsValid(log, urec_ptr)) + return; + + /* + * Update the next transactions start urecptr in the transaction header. + */ + starting_byte = UndoRecPtrGetPageOffset(urec_ptr); + + do + { + Buffer buffer; + int buf_idx; + + buf_idx = xact_urec_info.idx_undo_buffers[idx]; + buffer = undo_buffer[buf_idx].buf; + page = BufferGetPage(buffer); + + /* Overwrite the previously written undo. */ + if (InsertUndoRecord(&xact_urec_info.uur, page, starting_byte, &already_written, true)) + { + MarkBufferDirty(buffer); + break; + } + + MarkBufferDirty(buffer); + starting_byte = UndoLogBlockHeaderSize; + idx++; + + Assert(idx < MAX_BUFFER_PER_UNDO); + } while (true); + + LWLockRelease(&log->discard_lock); +} + +/* + * Find the block number in undo buffer array, if it's present then just return + * its index otherwise search the buffer and insert an entry and lock the buffer + * in exclusive mode. + * + * Undo log insertions are append-only. 
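+ * (For example, two undo records prepared in the same operation that land on the same block share a single pinned and locked buffer: the loop + * below matches on (logno, blkno) and returns the existing slot's index instead of reading the block again.)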
If the caller is writing new data + * that begins exactly at the beginning of a page, then there cannot be any + * useful data after that point. In that case RBM_ZERO can be passed in as + * rbm so that we can skip a useless read of a disk block. In all other + * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't + * happen to be already in the buffer pool. + */ +static int +UndoGetBufferSlot(RelFileNode rnode, + BlockNumber blk, + ReadBufferMode rbm, + UndoPersistence persistence) +{ + int i; + Buffer buffer; + + /* Don't do anything, if we already have a buffer pinned for the block. */ + for (i = 0; i < buffer_idx; i++) + { + /* + * It's not enough to just compare the block number because the + * undo_buffer might holds the undo from different undo logs (e.g when + * previous transaction start header is in previous undo log) so + * compare (logno + blkno). + */ + if ((blk == undo_buffer[i].blk) && + (undo_buffer[i].logno == rnode.relNode)) + { + /* caller must hold exclusive lock on buffer */ + Assert(BufferIsLocal(undo_buffer[i].buf) || + LWLockHeldByMeInMode(BufferDescriptorGetContentLock( + GetBufferDescriptor(undo_buffer[i].buf - 1)), + LW_EXCLUSIVE)); + break; + } + } + + /* + * We did not find the block so allocate the buffer and insert into the + * undo buffer array + */ + if (i == buffer_idx) + { + /* + * Fetch the buffer in which we want to insert the undo record. + */ + buffer = ReadBufferWithoutRelcache(rnode, + UndoLogForkNum, + blk, + rbm, + NULL, + RelPersistenceForUndoPersistence(persistence)); + + /* Lock the buffer */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + undo_buffer[buffer_idx].buf = buffer; + undo_buffer[buffer_idx].blk = blk; + undo_buffer[buffer_idx].logno = rnode.relNode; + buffer_idx++; + } + + return i; +} + +/* + * Calculate total size required by nrecords and allocate them in bulk. This is + * required for some operation which can allocate multiple undo record in one + * WAL operation e.g multi-insert. If we don't allocate undo space for all the + * record (which are inserted under one WAL) together than there is possibility + * that both of them go under different undo log. And, currently during + * recovery we don't have mechanism to map xid to multiple log number during one + * WAL operation. So in short all the operation under one WAL must allocate + * their undo from the same undo log. + */ +static UndoRecPtr +UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords, + TransactionId txid, UndoPersistence upersistence, + xl_undolog_meta *undometa) +{ + UnpackedUndoRecord *urec = NULL; + UndoLogControl *log; + UndoRecordSize size; + UndoRecPtr urecptr; + bool need_xact_hdr = false; + bool log_switched = false; + int i; + + /* There must be at least one undo record. */ + if (nrecords <= 0) + elog(ERROR, "cannot allocate space for zero undo records"); + + /* Is this the first undo record of the transaction? */ + if ((InRecovery && IsTransactionFirstRec(txid)) || + (!InRecovery && prev_txid[upersistence] != txid)) + need_xact_hdr = true; + +resize: + size = 0; + + for (i = 0; i < nrecords; i++) + { + urec = undorecords + i; + + /* + * Prepare the transacion header for the first undo record of + * transaction. XXX there is also an option that instead of adding the + * information to this record we can prepare a new record which only + * contain transaction informations. 
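+ * For example (an illustrative sketch): a multi-insert producing ten undo records must allocate space for all ten here in one call; if six of + * them went to log A and four to log B, replaying the single WAL record could not recreate that split, because during recovery the xid can be + * mapped to only one undo log per WAL operation.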
+ */ + if (need_xact_hdr && i == 0) + { + urec->uur_next = InvalidUndoRecPtr; + urec->uur_xidepoch = GetEpochForXid(txid); + urec->uur_progress = 0; + + /* During recovery, get the database id from the undo log state. */ + if (InRecovery) + urec->uur_dbid = UndoLogStateGetDatabaseId(); + else + urec->uur_dbid = MyDatabaseId; + + /* Set uur_info to include the transaction header. */ + urec->uur_info |= UREC_INFO_TRANSACTION; + } + else + { + /* + * It is okay to initialize these variables with invalid values as + * these are used only with the first record of transaction. + */ + urec->uur_next = InvalidUndoRecPtr; + urec->uur_xidepoch = 0; + urec->uur_dbid = 0; + urec->uur_progress = 0; + } + + /* Calculate the size of the undo record based on the info required. */ + UndoRecordSetInfo(urec); + size += UndoRecordExpectedSize(urec); + } + + if (InRecovery) + urecptr = UndoLogAllocateInRecovery(txid, size, upersistence); + else + urecptr = UndoLogAllocate(size, upersistence); + + log = UndoLogGet(UndoRecPtrGetLogNo(urecptr)); + + /* + * By now, we must be attached to some undo log unless we are in recovery. + */ + Assert(AmAttachedToUndoLog(log) || InRecovery); + + /* + * We can consider the log as switched if this is the first record of the + * log and not the first record of the transaction i.e. same transaction + * continued from the previous log. + */ + if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) && + log->meta.prevlogno != InvalidUndoLogNumber) + log_switched = true; + + /* + * If we've rewound all the way back to the start of the transaction by + * rolling back the first subtransaction (which we can't detect until + * after we've allocated some space), we'll need a new transaction header. + * If we weren't already generating one, then do it now. + */ + if (!need_xact_hdr && + (log->meta.insert == log->meta.last_xact_start || + UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize)) + { + need_xact_hdr = true; + urec->uur_info = 0; /* force recomputation of info bits */ + goto resize; + } + + /* Update the previous transaction's start undo record, if required. */ + if (need_xact_hdr || log_switched) + { + /* Don't update our own start header. */ + if (log->meta.last_xact_start != log->meta.insert) + UndoRecordPrepareTransInfo(urecptr, log_switched); + + /* Remember the current transaction's xid. */ + prev_txid[upersistence] = txid; + + /* Store the current transaction's start undorecptr in the undo log. */ + UndoLogSetLastXactStartPoint(urecptr); + update_prev_header = false; + } + + /* Copy undometa before advancing the insert location. */ + if (undometa) + { + undometa->meta = log->meta; + undometa->logno = log->logno; + undometa->xid = log->xid; + } + + /* + * If the insertion is for temp table then register an on commit + * action for discarding the undo logs. + */ + if (upersistence == UNDO_TEMP) + { + /* + * We only need to register when we are inserting in temp undo logs + * for the first time after the discard. + */ + if (log->meta.insert == log->meta.discard) + { + /* + * XXX Here, we are overriding the first parameter of function + * which is a unsigned int with an integer argument, that should + * work fine because logno will always be positive. + */ + register_on_commit_action(log->logno, ONCOMMIT_TEMP_DISCARD); + } + } + + UndoLogAdvance(urecptr, size, upersistence); + + return urecptr; +} + +/* + * Call UndoSetPrepareSize to set the value of how many undo records can be + * prepared before we can insert them. 
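+ * + * Typical multi-record usage (an illustrative sketch; the local variable names here are hypothetical): + * + * UndoSetPrepareSize(undorecords, nrecords, xid, persistence, &undometa); + * for (i = 0; i < nrecords; i++) + * urecptr[i] = PrepareUndoInsert(&undorecords[i], xid, persistence, NULL); + * START_CRIT_SECTION(); + * InsertPreparedUndo(); + * ... XLogInsert() the WAL record describing the operation ... + * END_CRIT_SECTION(); + * UnlockReleaseUndoBuffers(); + *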
If the size is greater than + * MAX_PREPARED_UNDO then it will allocate extra memory to hold the extra + * prepared undo. + * + * This is normally used when more than one undo record needs to be prepared. + */ +void +UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords, + TransactionId xid, UndoPersistence upersistence, + xl_undolog_meta *undometa) +{ + TransactionId txid; + + /* Get the top transaction id. */ + if (xid == InvalidTransactionId) + { + Assert(!InRecovery); + txid = GetTopTransactionId(); + } + else + { + Assert(InRecovery || (xid == GetTopTransactionId())); + txid = xid; + } + + prepared_urec_ptr = UndoRecordAllocate(undorecords, nrecords, txid, + upersistence, undometa); + if (nrecords <= MAX_PREPARED_UNDO) + return; + + prepared_undo = palloc0(nrecords * sizeof(PreparedUndoSpace)); + + /* + * Consider buffers needed for updating previous transaction's starting + * undo record. Hence increased by 1. + */ + undo_buffer = palloc0((nrecords + 1) * MAX_BUFFER_PER_UNDO * + sizeof(UndoBuffers)); + max_prepared_undo = nrecords; +} + +/* + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you + * intended to insert. Upon return, the necessary undo buffers are pinned and + * locked. + * + * This should be done before any critical section is established, since it + * can fail. + * + * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise, + * it refers to the top transaction id because undo log only stores mapping + * for the top most transactions. + */ +UndoRecPtr +PrepareUndoInsert(UnpackedUndoRecord *urec, TransactionId xid, + UndoPersistence upersistence, xl_undolog_meta *undometa) +{ + UndoRecordSize size; + UndoRecPtr urecptr; + RelFileNode rnode; + UndoRecordSize cur_size = 0; + BlockNumber cur_blk; + TransactionId txid; + int starting_byte; + int index = 0; + int bufidx; + ReadBufferMode rbm; + + /* Already reached maximum prepared limit. */ + if (prepare_idx == max_prepared_undo) + elog(ERROR, "already reached the maximum prepared limit"); + + + if (xid == InvalidTransactionId) + { + /* During recovery, we must have a valid transaction id. */ + Assert(!InRecovery); + txid = GetTopTransactionId(); + } + else + { + /* + * Assign the top transaction id because undo log only stores mapping + * for the top most transactions. + */ + Assert(InRecovery || (xid == GetTopTransactionId())); + txid = xid; + } + + if (!UndoRecPtrIsValid(prepared_urec_ptr)) + urecptr = UndoRecordAllocate(urec, 1, txid, upersistence, undometa); + else + urecptr = prepared_urec_ptr; + + /* advance the prepared ptr location for next record. */ + size = UndoRecordExpectedSize(urec); + if (UndoRecPtrIsValid(prepared_urec_ptr)) + { + UndoLogOffset insert = UndoRecPtrGetOffset(prepared_urec_ptr); + + insert = UndoLogOffsetPlusUsableBytes(insert, size); + prepared_urec_ptr = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert); + } + + cur_blk = UndoRecPtrGetBlockNum(urecptr); + UndoRecPtrAssignRelFileNode(rnode, urecptr); + starting_byte = UndoRecPtrGetPageOffset(urecptr); + + /* + * If we happen to be writing the very first byte into this page, then + * there is no need to read from disk. + */ + if (starting_byte == UndoLogBlockHeaderSize) + rbm = RBM_ZERO; + else + rbm = RBM_NORMAL; + + do + { + bufidx = UndoGetBufferSlot(rnode, cur_blk, rbm, upersistence); + if (cur_size == 0) + cur_size = BLCKSZ - starting_byte; + else + cur_size += BLCKSZ - UndoLogBlockHeaderSize; + + /* undo record can't use buffers more than MAX_BUFFER_PER_UNDO. 
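+ * (For example, a record that begins near the end of one page simply spills onto the next page; assuming an undo record never exceeds BLCKSZ, + * as noted at the definition of MAX_BUFFER_PER_UNDO, it can never need a third buffer.)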
*/ + Assert(index < MAX_BUFFER_PER_UNDO); + + /* Keep track of the buffers we have pinned and locked. */ + prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx; + + /* + * If we need more pages, they'll all be new, so we can definitely skip + * reading from disk. + */ + rbm = RBM_ZERO; + cur_blk++; + } while (cur_size < size); + + /* + * Save the undo record information to be used later by InsertPreparedUndo + * to insert the prepared record. + */ + prepared_undo[prepare_idx].urec = urec; + prepared_undo[prepare_idx].urp = urecptr; + prepare_idx++; + + return urecptr; +} + +/* + * Insert a previously-prepared undo record. This will write the actual undo + * record into the buffers already pinned and locked in PrepareUndoInsert, + * and mark them dirty. This step should be performed after entering a + * critical section; it should never fail. + */ +void +InsertPreparedUndo(void) +{ + Page page; + int starting_byte; + int already_written; + int bufidx = 0; + int idx; + uint16 undo_len = 0; + UndoRecPtr urp; + UnpackedUndoRecord *uur; + UndoLogOffset offset; + UndoLogControl *log; + + /* There must be at least one prepared undo record. */ + Assert(prepare_idx > 0); + + /* + * This must be called under a critical section or we must be in recovery. + */ + Assert(InRecovery || CritSectionCount > 0); + + for (idx = 0; idx < prepare_idx; idx++) + { + uur = prepared_undo[idx].urec; + urp = prepared_undo[idx].urp; + + already_written = 0; + bufidx = 0; + starting_byte = UndoRecPtrGetPageOffset(urp); + offset = UndoRecPtrGetOffset(urp); + + log = UndoLogGet(UndoRecPtrGetLogNo(urp)); + Assert(AmAttachedToUndoLog(log) || InRecovery); + + /* + * Store the previous undo record length in the header. We can read + * meta.prevlen without locking, because only we can write to it. + */ + uur->uur_prevlen = log->meta.prevlen; + + /* + * If starting a new log then there is no prevlen to store, except when + * the same transaction is continuing from the previous undo log; read + * the detailed comment atop this file. + */ + if (offset == UndoLogBlockHeaderSize) + { + if (log->meta.prevlogno != InvalidUndoLogNumber) + { + UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno); + uur->uur_prevlen = prevlog->meta.prevlen; + } + else + uur->uur_prevlen = 0; + } + + /* + * If starting from a new page then consider the block header size in + * the prevlen calculation. + */ + else if (starting_byte == UndoLogBlockHeaderSize) + uur->uur_prevlen += UndoLogBlockHeaderSize; + + undo_len = 0; + + do + { + PreparedUndoSpace undospace = prepared_undo[idx]; + Buffer buffer; + + buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf; + page = BufferGetPage(buffer); + + /* + * Initialize the page whenever we try to write the first record + * in a page. We start writing immediately after the block header. + */ + if (starting_byte == UndoLogBlockHeaderSize) + PageInit(page, BLCKSZ, 0); + + /* + * Try to insert the record into the current page. If it doesn't + * succeed, then call the routine again with the next page. + */ + if (InsertUndoRecord(uur, page, starting_byte, &already_written, false)) + { + undo_len += already_written; + MarkBufferDirty(buffer); + break; + } + + MarkBufferDirty(buffer); + + /* + * If we are switching to the next block, then count the header + * in the total undo length. + */ + starting_byte = UndoLogBlockHeaderSize; + undo_len += UndoLogBlockHeaderSize; + bufidx++; + + /* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers.
*/ + Assert(bufidx < MAX_BUFFER_PER_UNDO); + } while (true); + + UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), undo_len); + + /* + * Link the transactions in the same log so that we can discard all + * the transaction's undo log in one-shot. + */ + if (UndoRecPtrIsValid(xact_urec_info.urecptr)) + UndoRecordUpdateTransInfo(); + + /* + * Set the current undo location for a transaction. This is required + * to perform rollback during abort of transaction. + */ + SetCurrentUndoLocation(urp); + } +} + +/* + * Reset the global variables related to undo buffers. This is required at the + * transaction abort and while releasing the undo buffers. + */ +void +ResetUndoBuffers(void) +{ + int i; + + for (i = 0; i < buffer_idx; i++) + { + undo_buffer[i].blk = InvalidBlockNumber; + undo_buffer[i].buf = InvalidBuffer; + } + + xact_urec_info.urecptr = InvalidUndoRecPtr; + + /* Reset the prepared index. */ + prepare_idx = 0; + buffer_idx = 0; + prepared_urec_ptr = InvalidUndoRecPtr; + + /* + * max_prepared_undo limit is changed so free the allocated memory and + * reset all the variable back to their default value. + */ + if (max_prepared_undo > MAX_PREPARED_UNDO) + { + pfree(undo_buffer); + pfree(prepared_undo); + undo_buffer = def_buffers; + prepared_undo = def_prepared; + max_prepared_undo = MAX_PREPARED_UNDO; + } +} + +/* + * Unlock and release the undo buffers. This step must be performed after + * exiting any critical section where we have perfomed undo actions. + */ +void +UnlockReleaseUndoBuffers(void) +{ + int i; + + for (i = 0; i < buffer_idx; i++) + UnlockReleaseBuffer(undo_buffer[i].buf); + + ResetUndoBuffers(); +} + +/* + * Helper function for UndoFetchRecord. It will fetch the undo record pointed + * by urp and unpack the record into urec. This function will not release the + * pin on the buffer if complete record is fetched from one buffer, so caller + * can reuse the same urec to fetch the another undo record which is on the + * same block. Caller will be responsible to release the buffer inside urec + * and set it to invalid if it wishes to fetch the record from another block. + */ +static UnpackedUndoRecord * +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode, + UndoPersistence persistence) +{ + Buffer buffer = urec->uur_buffer; + Page page; + int starting_byte = UndoRecPtrGetPageOffset(urp); + int already_decoded = 0; + BlockNumber cur_blk; + bool is_undo_rec_split = false; + + cur_blk = UndoRecPtrGetBlockNum(urp); + + /* If we already have a buffer pin then no need to allocate a new one. */ + if (!BufferIsValid(buffer)) + { + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk, + RBM_NORMAL, NULL, + RelPersistenceForUndoPersistence(persistence)); + + urec->uur_buffer = buffer; + } + + while (true) + { + LockBuffer(buffer, BUFFER_LOCK_SHARE); + page = BufferGetPage(buffer); + + /* + * XXX This can be optimized to just fetch header first and only if + * matches with block number and offset then fetch the complete + * record. + */ + if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false)) + break; + + starting_byte = UndoLogBlockHeaderSize; + is_undo_rec_split = true; + + /* + * The record spans more than a page so we would have copied it (see + * UnpackUndoRecord). In such cases, we can release the buffer. + */ + urec->uur_buffer = InvalidBuffer; + UnlockReleaseBuffer(buffer); + + /* Go to next block. 
*/ + cur_blk++; + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk, + RBM_NORMAL, NULL, + RelPersistenceForUndoPersistence(persistence)); + } + + /* + * If we have copied the data then release the buffer, otherwise, just + * unlock it. + */ + if (is_undo_rec_split) + UnlockReleaseBuffer(buffer); + else + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + return urec; +} + +/* + * ResetUndoRecord - Helper function for UndoFetchRecord to reset the current + * record. + */ +static void +ResetUndoRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode *rnode, + RelFileNode *prevrec_rnode) +{ + /* + * If we have a valid buffer pinned then just ensure that we want to find + * the next tuple from the same block. Otherwise release the buffer and + * set it invalid + */ + if (BufferIsValid(urec->uur_buffer)) + { + /* + * Undo buffer will be changed if the next undo record belongs to a + * different block or undo log. + */ + if ((UndoRecPtrGetBlockNum(urp) != + BufferGetBlockNumber(urec->uur_buffer)) || + (prevrec_rnode->relNode != rnode->relNode)) + { + ReleaseBuffer(urec->uur_buffer); + urec->uur_buffer = InvalidBuffer; + } + } + else + { + /* + * If there is not a valid buffer in urec->uur_buffer that means we + * had copied the payload data and tuple data so free them. + */ + if (urec->uur_payload.data) + pfree(urec->uur_payload.data); + if (urec->uur_tuple.data) + pfree(urec->uur_tuple.data); + } + + /* Reset the urec before fetching the tuple */ + urec->uur_tuple.data = NULL; + urec->uur_tuple.len = 0; + urec->uur_payload.data = NULL; + urec->uur_payload.len = 0; +} + +/* + * Fetch the next undo record for given blkno, offset and transaction id (if + * valid). The same tuple can be modified by multiple transactions, so during + * undo chain traversal sometimes we need to distinguish based on transaction + * id. Callers that don't have any such requirement can pass + * InvalidTransactionId. + * + * Start the search from urp. Caller need to call UndoRecordRelease to release the + * resources allocated by this function. + * + * urec_ptr_out is undo record pointer of the qualified undo record if valid + * pointer is passed. + * + * callback function decides whether particular undo record satisfies the + * condition of caller. + */ +UnpackedUndoRecord * +UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset, + TransactionId xid, UndoRecPtr *urec_ptr_out, + SatisfyUndoRecordCallback callback) +{ + RelFileNode rnode, + prevrec_rnode = {0}; + UnpackedUndoRecord *urec = NULL; + int logno; + + if (urec_ptr_out) + *urec_ptr_out = InvalidUndoRecPtr; + + urec = palloc0(sizeof(UnpackedUndoRecord)); + UndoRecPtrAssignRelFileNode(rnode, urp); + + /* Find the undo record pointer we are interested in. */ + while (true) + { + UndoLogControl *log; + + logno = UndoRecPtrGetLogNo(urp); + log = UndoLogGet(logno); + if (log == NULL) + { + if (BufferIsValid(urec->uur_buffer)) + ReleaseBuffer(urec->uur_buffer); + return NULL; + } + + /* + * Prevent UndoDiscardOneLog() from discarding data while we try to + * read it. Usually we would acquire log->mutex to read log->meta + * members, but in this case we know that discard can't move without + * also holding log->discard_lock. + */ + LWLockAcquire(&log->discard_lock, LW_SHARED); + if (!UndoRecordIsValid(log, urp)) + { + if (BufferIsValid(urec->uur_buffer)) + ReleaseBuffer(urec->uur_buffer); + return NULL; + } + + /* Fetch the current undo record. 
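+ * (Illustrative note: for a caller walking a tuple's undo chain, blkno/offset identify the tuple and the callback decides whether this record + * is the one of interest; if blkno is InvalidBlockNumber we simply return the first record fetched, otherwise we keep following uur_blkprev.)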
*/ + urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence); + LWLockRelease(&log->discard_lock); + + if (blkno == InvalidBlockNumber) + break; + + /* Check whether the undorecord satisfies conditions */ + if (callback(urec, blkno, offset, xid)) + break; + + urp = urec->uur_blkprev; + prevrec_rnode = rnode; + + /* Get rnode for the current undo record pointer. */ + UndoRecPtrAssignRelFileNode(rnode, urp); + + /* Reset the current undorecord before fetching the next. */ + ResetUndoRecord(urec, urp, &rnode, &prevrec_rnode); + } + + if (urec_ptr_out) + *urec_ptr_out = urp; + return urec; +} + +/* + * Return the previous undo record pointer. + * + * This API can switch to the previous log if the current log is exhausted, + * so the caller shouldn't use it where that is not expected. + */ +UndoRecPtr +UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen) +{ + UndoLogNumber logno = UndoRecPtrGetLogNo(urp); + UndoLogOffset offset = UndoRecPtrGetOffset(urp); + + /* + * We have reached to the first undo record of this undo log, so fetch the + * previous undo record of the transaction from the previous log. + */ + if (offset == UndoLogBlockHeaderSize) + { + UndoLogControl *prevlog, + *log; + + log = UndoLogGet(logno); + + Assert(log->meta.prevlogno != InvalidUndoLogNumber); + + /* Fetch the previous log control. */ + prevlog = UndoLogGet(log->meta.prevlogno); + logno = log->meta.prevlogno; + offset = prevlog->meta.insert; + } + + /* calculate the previous undo record pointer */ + return MakeUndoRecPtr(logno, offset - prevlen); +} + +/* + * Release the resources allocated by UndoFetchRecord. + */ +void +UndoRecordRelease(UnpackedUndoRecord *urec) +{ + /* + * If the undo record has a valid buffer then just release the buffer + * otherwise free the tuple and payload data. + */ + if (BufferIsValid(urec->uur_buffer)) + { + ReleaseBuffer(urec->uur_buffer); + } + else + { + if (urec->uur_payload.data) + pfree(urec->uur_payload.data); + if (urec->uur_tuple.data) + pfree(urec->uur_tuple.data); + } + + pfree(urec); +} + +/* + * Called whenever we attach to a new undo log, so that we forget about our + * translation-unit private state relating to the log we were last attached + * to. + */ +void +UndoRecordOnUndoLogChange(UndoPersistence persistence) +{ + prev_txid[persistence] = InvalidTransactionId; +} diff --git a/src/backend/access/undo/undolog.c b/src/backend/access/undo/undolog.c new file mode 100644 index 0000000000..71be715bda --- /dev/null +++ b/src/backend/access/undo/undolog.c @@ -0,0 +1,2719 @@ +/*------------------------------------------------------------------------- + * + * undolog.c + * management of undo logs + * + * PostgreSQL undo log manager. This module is responsible for lifecycle + * management of undo logs and backing files, associating undo logs with + * backends, allocating and managing space within undo logs. + * + * For the code that reads and writes blocks of data, see undofile.c. 
+ * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/backend/access/undo/undolog.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/transam.h" +#include "access/undolog.h" +#include "access/undolog_xlog.h" +#include "access/xact.h" +#include "access/xlog.h" +#include "access/xlogreader.h" +#include "catalog/catalog.h" +#include "catalog/pg_tablespace.h" +#include "commands/tablespace.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "nodes/execnodes.h" +#include "pgstat.h" +#include "storage/buf.h" +#include "storage/bufmgr.h" +#include "storage/dsm.h" +#include "storage/fd.h" +#include "storage/ipc.h" +#include "storage/lwlock.h" +#include "storage/procarray.h" +#include "storage/shmem.h" +#include "storage/standby.h" +#include "storage/undofile.h" +#include "utils/builtins.h" +#include "utils/guc.h" +#include "utils/memutils.h" +#include "utils/varlena.h" + +#include +#include + +/* + * Number of bits of an undo log number used to identify a bank of + * UndoLogControl objects. This allows us to break up our array of + * UndoLogControl objects into many smaller arrays, called banks, and find our + * way to an UndoLogControl object in O(1) complexity in two steps. + */ +#define UndoLogBankBits 14 +#define UndoLogBanks (1 << UndoLogBankBits) + +/* Extract the undo bank number from an undo log number (upper bits). */ +#define UndoLogNoGetBankNo(logno) \ + ((logno) >> (UndoLogNumberBits - UndoLogBankBits)) + +/* Extract the slot within a bank from an undo log number (lower bits). */ +#define UndoLogNoGetSlotNo(logno) \ + ((logno) & ((1 << (UndoLogNumberBits - UndoLogBankBits)) - 1)) + +/* + * During recovery we maintain a mapping of transaction ID to undo logs + * numbers. We do this with another two-level array, so that we use memory + * only for chunks of the array that overlap with the range of active xids. + */ +#define UndoLogXidLowBits 16 + +/* + * Number of high bits. + */ +#define UndoLogXidHighBits \ + (sizeof(TransactionId) * CHAR_BIT - UndoLogXidLowBits) + +/* Extract the upper bits of an xid, for undo log mapping purposes. */ +#define UndoLogGetXidHigh(xid) ((xid) >> UndoLogXidLowBits) + +/* Extract the lower bits of an xid, for undo log mapping purposes. */ +#define UndoLogGetXidLow(xid) ((xid) & ((1 << UndoLogXidLowBits) - 1)) + +/* + * Main control structure for undo log management in shared memory. + */ +typedef struct UndoLogSharedData +{ + UndoLogNumber free_lists[UndoPersistenceLevels]; + int low_bankno; /* the lowest bank */ + int high_bankno; /* one past the highest bank */ + UndoLogNumber low_logno; /* the lowest logno */ + UndoLogNumber high_logno; /* one past the highest logno */ + + /* + * Array of DSM handles pointing to the arrays of UndoLogControl objects. + * We don't expect there to be many banks active at a time -- usually 1 or + * 2, but we need random access by log number so we arrange them into + * 'banks'. + */ + dsm_handle banks[UndoLogBanks]; +} UndoLogSharedData; + +/* + * Per-backend state for the undo log module. + * Backend-local pointers to undo subsystem state in shared memory. + */ +struct +{ + UndoLogSharedData *shared; + + /* + * The control object for the undo logs that this backend is currently + * attached to at each persistence level. 
+ */ + UndoLogControl *logs[UndoPersistenceLevels]; + + /* The DSM segments used to hold banks of control objects. */ + dsm_segment *bank_segments[UndoLogBanks]; + + /* + * The address where each bank of control objects is mapped into memory in + * this backend. We map banks into memory on demand, and (for now) they + * stay mapped in until every backend that mapped them exits. + */ + UndoLogControl *banks[UndoLogBanks]; + + /* + * The lowest log number that might currently be mapped into this backend. + */ + int low_logno; + + /* + * If the undo_tablespaces GUC changes we'll remember to examine it and + * attach to a new undo log using this flag. + */ + bool need_to_choose_tablespace; + + /* + * During recovery, the startup process maintains a mapping of xid to undo + * log number, instead of using 'log' above. This is not used in regular + * backends and can be in backend-private memory so long as recovery is + * single-process. This map references UNDO_PERMANENT logs only, since + * temporary and unlogged relations don't have WAL to replay. + */ + UndoLogNumber **xid_map; + + /* + * The slot for the oldest xids still running. We advance this during + * checkpoints to free up chunks of the map. + */ + uint16 xid_map_oldest_chunk; + + /* Current dbid. Used during recovery. */ + Oid dbid; +} MyUndoLogState; + +/* GUC variables */ +char *undo_tablespaces = NULL; + +static UndoLogControl *get_undo_log_by_number(UndoLogNumber logno); +static void ensure_undo_log_number(UndoLogNumber logno); +static void attach_undo_log(UndoPersistence level, Oid tablespace); +static void detach_current_undo_log(UndoPersistence level, bool exhausted); +static void extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end); +static void undo_log_before_exit(int code, Datum value); +static void forget_undo_buffers(int logno, UndoLogOffset old_discard, + UndoLogOffset new_discard, + bool drop_tail); +static bool choose_undo_tablespace(bool force_detach, Oid *oid); +static void undolog_xid_map_gc(void); +static void undolog_bank_gc(void); + +PG_FUNCTION_INFO_V1(pg_stat_get_undo_logs); + +/* + * Return the amount of traditional smhem required for undo log management. + * Extra shared memory will be managed using DSM segments. + */ +Size +UndoLogShmemSize(void) +{ + return sizeof(UndoLogSharedData); +} + +/* + * Initialize the undo log subsystem. Called in each backend. + */ +void +UndoLogShmemInit(void) +{ + bool found; + + MyUndoLogState.shared = (UndoLogSharedData *) + ShmemInitStruct("UndoLogShared", UndoLogShmemSize(), &found); + + if (!IsUnderPostmaster) + { + UndoLogSharedData *shared = MyUndoLogState.shared; + int i; + + Assert(!found); + + /* + * We start with no undo logs. StartUpUndoLogs() will recreate undo + * logs that were known at last checkpoint. + */ + memset(shared, 0, sizeof(*shared)); + for (i = 0; i < UndoPersistenceLevels; ++i) + shared->free_lists[i] = InvalidUndoLogNumber; + shared->low_bankno = 0; + shared->high_bankno = 0; + } + else + Assert(found); +} + +void +UndoLogInit(void) +{ + before_shmem_exit(undo_log_before_exit, 0); +} + +/* + * Figure out which directory holds an undo log based on tablespace. + */ +static void +UndoLogDirectory(Oid tablespace, char *dir) +{ + if (tablespace == DEFAULTTABLESPACE_OID || + tablespace == InvalidOid) + snprintf(dir, MAXPGPATH, "base/undo"); + else + snprintf(dir, MAXPGPATH, "pg_tblspc/%u/%s/undo", + tablespace, TABLESPACE_VERSION_DIRECTORY); + + /* XXX Should we use UndoLogDatabaseOid (9) instead of "undo"? 
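+ * + * (For example, with the current naming scheme, segment 0 of undo log 7 in the default tablespace lives at base/undo/000007.0000000000; see + * UndoLogSegmentPath below for how the name is built from the log number and starting byte offset.)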
*/ + + /* + * XXX Should we add an extra directory between log number and segment + * files? If all undo logs are in the same directory then + * fsync(directory) may create contention in the OS between unrelated + * backends that as they rotate segment files. + */ +} + +/* + * Compute the pathname to use for an undo log segment file. + */ +void +UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace, char *path) +{ + char dir[MAXPGPATH]; + + /* Figure out which directory holds the segment, based on tablespace. */ + UndoLogDirectory(tablespace, dir); + + /* + * Build the path from log number and offset. The pathname is the + * UndoRecPtr of the first byte in the segment in hexadecimal, with a + * period inserted between the components. + */ + snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno, + segno * UndoLogSegmentSize); +} + +/* + * Iterate through the set of currently active logs. + * + * TODO: This probably needs to be replaced. For the use of UndoDiscard, + * maybe we should instead have an ordered data structure organized by + * oldest_xid so that undo workers only have to consume logs from one end of + * the queue when they have an oldest xmin. For the use of undo_file.c we'll + * need something completely different anyway (watch this space). For now we + * just stupidly visit all undo logs in the range [log_logno, high_logno), + * which is obviously not ideal. + */ +UndoLogControl * +UndoLogNext(UndoLogControl *log) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + + if (log == NULL) + { + UndoLogNumber low_logno; + + LWLockAcquire(UndoLogLock, LW_SHARED); + low_logno = shared->low_logno; + LWLockRelease(UndoLogLock); + + return get_undo_log_by_number(low_logno); + } + else + { + UndoLogNumber high_logno; + + LWLockAcquire(UndoLogLock, LW_SHARED); + high_logno = shared->high_logno; + LWLockRelease(UndoLogLock); + + if (log->logno + 1 == high_logno) + return NULL; + + return get_undo_log_by_number(log->logno + 1); + } +} + +/* + * Check if an undo log position has been discarded. 'point' must be an undo + * log pointer that was allocated at some point in the past, otherwise the + * result is undefined. + */ +bool +UndoLogIsDiscarded(UndoRecPtr point) +{ + UndoLogControl *log = get_undo_log_by_number(UndoRecPtrGetLogNo(point)); + bool result; + + /* + * If we don't recognize the log number, it's either entirely discarded or + * it's never been allocated (ie from the future) and our result is + * undefined. + */ + if (log == NULL) + return true; + + /* + * XXX For a super cheap locked operation, it's better to use LW_EXLUSIVE + * even though we don't need exclusivity, right? + */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + result = UndoRecPtrGetOffset(point) < log->meta.discard; + LWLockRelease(&log->mutex); + + return result; +} + +/* + * Store latest transaction's start undo record point in undo meta data. It + * will fetched by the backend when it's reusing the undo log and preparing + * its first undo. + */ +void +UndoLogSetLastXactStartPoint(UndoRecPtr point) +{ + UndoLogNumber logno = UndoRecPtrGetLogNo(point); + UndoLogControl *log = get_undo_log_by_number(logno); + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.last_xact_start = UndoRecPtrGetOffset(point); + LWLockRelease(&log->mutex); +} + +/* + * Fetch the previous transaction's start undo record point. Return Invalid + * undo pointer if backend is not attached to any log. 
+ */ +UndoRecPtr +UndoLogGetLastXactStartPoint(UndoLogNumber logno) +{ + UndoLogControl *log = get_undo_log_by_number(logno); + uint64 last_xact_start = 0; + + if (unlikely(log == NULL)) + return InvalidUndoRecPtr; + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + last_xact_start = log->meta.last_xact_start; + LWLockRelease(&log->mutex); + + if (last_xact_start == 0) + return InvalidUndoRecPtr; + + return MakeUndoRecPtr(logno, last_xact_start); +} + +/* + * Store last undo record's length on undo meta so that it can be persistent + * across restart. + */ +void +UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen) +{ + UndoLogControl *log = get_undo_log_by_number(logno); + + Assert(log != NULL); + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.prevlen = prevlen; + LWLockRelease(&log->mutex); +} + +/* + * Get the last undo record's length. + */ +uint16 +UndoLogGetPrevLen(UndoLogNumber logno) +{ + UndoLogControl *log = get_undo_log_by_number(logno); + uint16 prevlen; + + Assert(log != NULL); + + LWLockAcquire(&log->mutex, LW_SHARED); + prevlen = log->meta.prevlen; + LWLockRelease(&log->mutex); + + return prevlen; +} + +/* + * Is this record is the first record for any transaction. + */ +bool +IsTransactionFirstRec(TransactionId xid) +{ + uint16 high_bits = UndoLogGetXidHigh(xid); + uint16 low_bits = UndoLogGetXidLow(xid); + UndoLogNumber logno; + UndoLogControl *log; + + Assert(InRecovery); + + if (MyUndoLogState.xid_map == NULL) + elog(ERROR, "xid to undo log number map not initialized"); + if (MyUndoLogState.xid_map[high_bits] == NULL) + elog(ERROR, "cannot find undo log number for xid %u", xid); + + logno = MyUndoLogState.xid_map[high_bits][low_bits]; + log = get_undo_log_by_number(logno); + if (log == NULL) + elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid); + + return log->meta.is_first_rec; +} + +/* + * Detach from the undo log we are currently attached to, returning it to the + * free list if it still has space. + */ +static void +detach_current_undo_log(UndoPersistence persistence, bool exhausted) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + UndoLogControl *log = MyUndoLogState.logs[persistence]; + + Assert(log != NULL); + + MyUndoLogState.logs[persistence] = NULL; + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->pid = InvalidPid; + log->xid = InvalidTransactionId; + if (exhausted) + log->meta.status = UNDO_LOG_STATUS_EXHAUSTED; + LWLockRelease(&log->mutex); + + /* Push back onto the appropriate freelist. */ + if (!exhausted) + { + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + log->next_free = shared->free_lists[persistence]; + shared->free_lists[persistence] = log->logno; + LWLockRelease(UndoLogLock); + } +} + +static void +undo_log_before_exit(int code, Datum arg) +{ + int i; + + for (i = 0; i < UndoPersistenceLevels; ++i) + { + if (MyUndoLogState.logs[i] != NULL) + detach_current_undo_log(i, false); + } +} + +/* + * Create a fully allocated empty segment file on disk for the byte starting + * at 'end'. + */ +static void +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace, + UndoLogOffset end) +{ + struct stat stat_buffer; + off_t size; + char path[MAXPGPATH]; + void *zeroes; + size_t nzeroes = 8192; + int fd; + + UndoLogSegmentPath(logno, end / UndoLogSegmentSize, tablespace, path); + + /* + * Create and fully allocate a new file. If we crashed and recovered + * then the file might already exist, so use flags that tolerate that. + * It's also possible that it exists but is too short, in which case + * we'll write the rest. 
We don't really care what's in the file, we + * just want to make sure that the filesystem has allocated physical + * blocks for it, so that non-COW filesystems will report ENOSPC now + * rather than later when the space is needed and we'll avoid creating + * files with holes. + */ + fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY); + if (fd < 0 && tablespace != 0) + { + char undo_path[MAXPGPATH]; + + /* Try creating the undo directory for this tablespace. */ + UndoLogDirectory(tablespace, undo_path); + if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST) + { + char *parentdir; + + if (errno != ENOENT || !InRecovery) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + undo_path))); + + /* + * In recovery, it's possible that the tablespace directory + * doesn't exist because a later WAL record removed the whole + * tablespace. In that case we create a regular directory to + * stand in for it. This is similar to the logic in + * TablespaceCreateDbspace(). + */ + + /* create two parents up if not exist */ + parentdir = pstrdup(undo_path); + get_parent_directory(parentdir); + get_parent_directory(parentdir); + /* Can't create parent and it doesn't already exist? */ + if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + parentdir))); + pfree(parentdir); + + /* create one parent up if not exist */ + parentdir = pstrdup(undo_path); + get_parent_directory(parentdir); + /* Can't create parent and it doesn't already exist? */ + if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + parentdir))); + pfree(parentdir); + + if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + undo_path))); + } + + fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY); + } + if (fd < 0) + elog(ERROR, "could not create new file \"%s\": %m", path); + if (fstat(fd, &stat_buffer) < 0) + elog(ERROR, "could not stat \"%s\": %m", path); + size = stat_buffer.st_size; + + /* A buffer full of zeroes we'll use to fill up new segment files. */ + zeroes = palloc0(nzeroes); + + while (size < UndoLogSegmentSize) + { + ssize_t written; + + written = write(fd, zeroes, Min(nzeroes, UndoLogSegmentSize - size)); + if (written < 0) + elog(ERROR, "cannot initialize undo log segment file \"%s\": %m", + path); + size += written; + } + + /* Flush the contents of the file to disk. */ + if (pg_fsync(fd) != 0) + elog(ERROR, "cannot fsync file \"%s\": %m", path); + CloseTransientFile(fd); + + pfree(zeroes); + + elog(LOG, "created undo segment \"%s\"", path); /* XXX: remove me */ +} + +/* + * Create a new undo segment, when it is unexpectedly not present. + */ +void +UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno) +{ + Assert(InRecovery); + allocate_empty_undo_segment(logno, tablespace, segno * UndoLogSegmentSize); +} + +/* + * Create and zero-fill a new segment for the undo log we are currently + * attached to. 
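+ * + * (Illustrative sketch of the flow below: if meta.end currently sits at a segment boundary below the requested new_end, we create zero-filled + * segment files until the log's end reaches new_end, fsync the parent directory, and then, outside recovery, WAL-log an XLOG_UNDOLOG_EXTEND + * record so that the same segments can be recreated after a crash.)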
+ */ +static void +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) +{ + UndoLogControl *log; + char dir[MAXPGPATH]; + size_t end; + + log = get_undo_log_by_number(logno); + + Assert(log != NULL); + Assert(log->meta.end % UndoLogSegmentSize == 0); + Assert(new_end % UndoLogSegmentSize == 0); + Assert(MyUndoLogState.logs[log->meta.persistence] == log || InRecovery); + + /* + * Create all the segments needed to increase 'end' to the requested + * size. This is quite expensive, so we will try to avoid it completely + * by renaming files into place in UndoLogDiscard instead. + */ + end = log->meta.end; + while (end < new_end) + { + allocate_empty_undo_segment(logno, log->meta.tablespace, end); + end += UndoLogSegmentSize; + } + + /* + * Flush the parent dir so that the directory metadata survives a crash + * after this point. + */ + UndoLogDirectory(log->meta.tablespace, dir); + fsync_fname(dir, true); + + /* + * If we're not in recovery, we need to WAL-log the creation of the new + * file(s). We do that after the above filesystem modifications, in + * violation of the data-before-WAL rule as exempted by + * src/backend/access/transam/README. This means that it's possible for + * us to crash having made some or all of the filesystem changes but + * before WAL logging, but in that case we'll eventually try to create the + * same segment(s) again which is tolerated. + */ + if (!InRecovery) + { + xl_undolog_extend xlrec; + XLogRecPtr ptr; + + xlrec.logno = logno; + xlrec.end = end; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND); + XLogFlush(ptr); + } + + /* + * We didn't need to acquire the mutex to read 'end' above because only + * we write to it. But we need the mutex to update it, because the + * checkpointer might read it concurrently. + * + * XXX It's possible for meta.end to be higher already during + * recovery, because of the timing of a checkpoint; in that case we did + * nothing above and we shouldn't update shmem here. That interaction + * needs more analysis. + */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + if (log->meta.end < end) + log->meta.end = end; + LWLockRelease(&log->mutex); +} + +/* + * Get an insertion point that is guaranteed to be backed by enough space to + * hold 'size' bytes of data. To actually write into the undo log, client + * code should call this first and then use bufmgr routines to access buffers + * and provide WAL logs and redo handlers. In other words, while this module + * looks after making sure the undo log has sufficient space and the undo meta + * data is crash safe, the *contents* of the undo log and (indirectly) the + * insertion point are the responsibility of client code. + * + * XXX As an optimization, we could take a third argument 'discard_last'. If + * the caller knows that the last transaction it committed is all visible and + * has its undo pointer, it could supply that value. Then while we hold + * log->mutex we could check if log->meta.discard == discard_last, and if it's + * in the same undo log segment as the current insert then it could cheaply + * update it in shmem and include the value in the existing + * XLOG_UNDOLOG_ATTACH WAL record. We'd be leaving the heavier lifting of + * dealing with segment roll-over to undo workers, but avoiding work for undo + * workers by folding a super cheap common case into the next foreground xact. + * (Not sure how we actually avoid waking up the undo work though...) 
+ * + * XXX Problem: if foreground processes can move the discard pointer as well + * as background processes (undo workers), then how is the undo worker + * supposed to access the undo data pointed to by the discard pointer so that + * it can read the xid? We certainly don't want to hold the undo log lock + * while doing stuff like that, because it would interfere with read-only + * sessions that need to check the discard pointer. Possible solution: we may + * need a way to 'pin' the discard pointer while the undo worker is + * considering what to do. If we add 'discard_last' as described in the + * previous paragraph, that optimisation would need to be skipped if the + * foreground process running UndoLogAllocate sees that the discard pointer is + * currently pinned by a background worker. Going to sit on this thought for + * a little while before writing any code... need to contemplate undo workers + * some more. + * + * Returns an undo log insertion point that can be converted to a buffer tag + * and an insertion point within a buffer page using the macros above. + */ +UndoRecPtr +UndoLogAllocate(size_t size, UndoPersistence persistence) +{ + UndoLogControl *log = MyUndoLogState.logs[persistence]; + UndoLogOffset new_insert; + UndoLogNumber prevlogno = InvalidUndoLogNumber; + TransactionId logxid; + + /* + * We may need to attach to an undo log, either because this is the first + * time this backend as needed to write to an undo log at all or because + * the undo_tablespaces GUC was changed. When doing that, we'll need + * interlocking against tablespaces being concurrently dropped. + */ + + retry: + /* See if we need to check the undo_tablespaces GUC. */ + if (unlikely(MyUndoLogState.need_to_choose_tablespace || log == NULL)) + { + Oid tablespace; + bool need_to_unlock; + + need_to_unlock = + choose_undo_tablespace(MyUndoLogState.need_to_choose_tablespace, + &tablespace); + attach_undo_log(persistence, tablespace); + if (need_to_unlock) + LWLockRelease(TablespaceCreateLock); + log = MyUndoLogState.logs[persistence]; + log->meta.prevlogno = prevlogno; + MyUndoLogState.need_to_choose_tablespace = false; + } + + /* + * If this is the first time we've allocated undo log space in this + * transaction, we'll record the xid->undo log association so that it can + * be replayed correctly. Before that, we set the first record flag to + * false. + */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.is_first_rec = false; + logxid = log->xid; + + if (logxid != GetTopTransactionId()) + { + xl_undolog_attach xlrec; + + /* + * While we have the lock, check if we have been forcibly detached by + * DROP TABLESPACE. That can only happen between transactions (see + * DetachUndoLogsInsTablespace()) so we only have to check for it + * in this branch. + */ + if (log->pid == InvalidPid) + { + LWLockRelease(&log->mutex); + log = NULL; + goto retry; + } + log->xid = GetTopTransactionId(); + log->meta.is_first_rec = true; + LWLockRelease(&log->mutex); + + /* Skip the attach record for unlogged and temporary tables. */ + if (persistence == UNDO_PERMANENT) + { + xlrec.xid = GetTopTransactionId(); + xlrec.logno = log->logno; + xlrec.dbid = MyDatabaseId; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_ATTACH); + } + } + else + { + LWLockRelease(&log->mutex); + } + + /* + * 'size' is expressed in usable non-header bytes. 
Figure out how far we + * have to move insert to create space for 'size' usable bytes (stepping + * over any intervening headers). + */ + Assert(log->meta.insert % BLCKSZ >= UndoLogBlockHeaderSize); + new_insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size); + Assert(new_insert % BLCKSZ >= UndoLogBlockHeaderSize); + + /* + * We don't need to acquire log->mutex to read log->meta.insert and + * log->meta.end, because this backend is the only one that can + * modify them. + */ + if (unlikely(new_insert > log->meta.end)) + { + if (new_insert > UndoLogMaxSize) + { + /* This undo log is entirely full. Get a new one. */ + if (logxid == GetTopTransactionId()) + { + /* + * If the same transaction is split over two undo logs then + * store the previous log number in new log. See detailed + * comments in undorecord.c file header. + */ + prevlogno = log->logno; + } + log = NULL; + detach_current_undo_log(persistence, true); + goto retry; + } + /* + * Extend the end of this undo log to cover new_insert (in other words + * round up to the segment size). + */ + extend_undo_log(log->logno, + new_insert + UndoLogSegmentSize - + new_insert % UndoLogSegmentSize); + Assert(new_insert <= log->meta.end); + } + + return MakeUndoRecPtr(log->logno, log->meta.insert); +} + +/* + * In recovery, we expect the xid to map to a known log which already has + * enough space in it. + */ +UndoRecPtr +UndoLogAllocateInRecovery(TransactionId xid, size_t size, + UndoPersistence level) +{ + uint16 high_bits = UndoLogGetXidHigh(xid); + uint16 low_bits = UndoLogGetXidLow(xid); + UndoLogNumber logno; + UndoLogControl *log; + + /* + * The sequence of calls to UndoLogAllocateRecovery during REDO (recovery) + * must match the sequence of calls to UndoLogAllocate during DO, for any + * given session. The XXX_redo code for any UNDO-generating operation + * must use UndoLogAllocateRecovery rather than UndoLogAllocate, because + * it must supply the extra 'xid' argument so that we can find out which + * undo log number to use. During DO, that's tracked per-backend, but + * during REDO the original backends/sessions are lost and we have only + * the Xids. + */ + Assert(InRecovery); + + /* + * Look up the undo log number for this xid. The mapping must already + * have been created by an XLOG_UNDOLOG_ATTACH record emitted during the + * first call to UndoLogAllocate for this xid after the most recent + * checkpoint. + */ + if (MyUndoLogState.xid_map == NULL) + elog(ERROR, "xid to undo log number map not initialized"); + if (MyUndoLogState.xid_map[high_bits] == NULL) + elog(ERROR, "cannot find undo log number for xid %u", xid); + logno = MyUndoLogState.xid_map[high_bits][low_bits]; + if (logno == InvalidUndoLogNumber) + elog(ERROR, "cannot find undo log number for xid %u", xid); + + /* + * This log must already have been created by XLOG_UNDOLOG_CREATE records + * emitted by UndoLogAllocate. + */ + log = get_undo_log_by_number(logno); + if (log == NULL) + elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid); + + /* + * This log must already have been extended to cover the requested size by + * XLOG_UNDOLOG_EXTEND records emitted by UndoLogAllocate, or by + * XLOG_UNDLOG_DISCARD records recycling segments. + */ + if (log->meta.end < UndoLogOffsetPlusUsableBytes(log->meta.insert, size)) + elog(ERROR, + "unexpectedly couldn't allocate %zu bytes in undo log number %d", + size, logno); + + /* + * By this time we have allocated a undo log in transaction so after this + * it will not be first undo record for the transaction. 
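+ *
+ * For illustration, the DO/REDO symmetry described above amounts to roughly
+ * the following calling pattern (a hypothetical sketch; buffer access and
+ * WAL details elided):
+ *
+ *    DO:   urp = UndoLogAllocate(size, persistence);
+ *          ... write and WAL-log the undo record ...
+ *          UndoLogAdvance(urp, size, persistence);
+ *
+ *    REDO: urp = UndoLogAllocateInRecovery(xid, size, persistence);
+ *          ... redo the undo record write ...
+ *          UndoLogAdvance(urp, size, persistence);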
+ */
+ log->meta.is_first_rec = false;
+
+ return MakeUndoRecPtr(logno, log->meta.insert);
+}
+
+/*
+ * Advance the insertion pointer by 'size' usable (non-header) bytes.
+ *
+ * Caller must WAL-log this operation first, and must replay it during
+ * recovery.
+ */
+void
+UndoLogAdvance(UndoRecPtr insertion_point, size_t size, UndoPersistence persistence)
+{
+ UndoLogControl *log = NULL;
+ UndoLogNumber logno = UndoRecPtrGetLogNo(insertion_point);
+
+ /*
+ * During recovery, MyUndoLogState is not initialized, so we have to look
+ * the undo log up by number instead of using the attached log.
+ */
+ log = (InRecovery) ? get_undo_log_by_number(logno)
+ : MyUndoLogState.logs[persistence];
+
+ Assert(log != NULL);
+ Assert(InRecovery || logno == log->logno);
+ Assert(UndoRecPtrGetOffset(insertion_point) == log->meta.insert);
+
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+ log->meta.insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+ LWLockRelease(&log->mutex);
+}
+
+/*
+ * Advance the discard pointer in one undo log, discarding all undo data
+ * relating to one or more whole transactions. The passed-in undo pointer is
+ * the address of the oldest data that the caller would like to keep, and the
+ * affected undo log is implied by this pointer, ie
+ * UndoRecPtrGetLogNo(discard_point).
+ *
+ * The caller asserts that there will be no attempts to access the undo log
+ * region being discarded after this moment. This operation will cause the
+ * relevant buffers to be dropped immediately, without writing any data out to
+ * disk. Any attempt to read the buffers (except a partial buffer at the end
+ * of this range which will remain) may result in IO errors, because the
+ * underlying segment file may have been physically removed.
+ *
+ * Only one backend should call this for a given undo log concurrently, or
+ * data structures will become corrupted. It is expected that the caller will
+ * be an undo worker; only one undo worker should be working on a given undo
+ * log at a time.
+ *
+ * XXX Special case for when we wrapped past the end of an undo log, spilling
+ * into a new one. How do we discard that? Essentially we'll be discarding
+ * the whole undo log, but not sure how the caller should know that or deal
+ * with it and how this code should handle it.
+ */
+void
+UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid)
+{
+ UndoLogNumber logno = UndoRecPtrGetLogNo(discard_point);
+ UndoLogControl *log = get_undo_log_by_number(logno);
+ UndoLogOffset old_discard;
+ UndoLogOffset discard = UndoRecPtrGetOffset(discard_point);
+ UndoLogOffset end;
+ int segno;
+ int new_segno;
+ bool need_to_flush_wal = false;
+
+ if (log == NULL)
+ elog(ERROR, "cannot advance discard pointer for unknown undo log %d",
+ logno);
+
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+ if (discard > log->meta.insert)
+ elog(ERROR, "cannot move discard point past insert point");
+ old_discard = log->meta.discard;
+ if (discard < old_discard)
+ elog(ERROR, "cannot move discard pointer backwards");
+ end = log->meta.end;
+ LWLockRelease(&log->mutex);
+
+ /*
+ * Drop all buffers holding this undo data out of the buffer pool (except
+ * the last one, if the new location is in the middle of it somewhere), so
+ * that the contained data doesn't ever touch the disk. The caller
+ * promises that this data will not be needed again. We have to drop the
+ * buffers from the buffer pool before removing files, otherwise a
+ * concurrent session might try to write the block to evict the buffer.
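+ *
+ * For illustration (hypothetical numbers, 8kB blocks): advancing the discard
+ * pointer from offset 24000 to offset 40000 drops blocks 24000/8192 = 2 and
+ * 3, while block 40000/8192 = 4 is kept because it still contains the byte
+ * at the new discard pointer.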
+ */ + forget_undo_buffers(logno, old_discard, discard, false); + + /* + * Check if we crossed a segment boundary and need to do some synchronous + * filesystem operations. + */ + segno = old_discard / UndoLogSegmentSize; + new_segno = discard / UndoLogSegmentSize; + if (segno < new_segno) + { + int recycle; + UndoLogOffset pointer; + + /* + * We always WAL-log discards, but we only need to flush the WAL if we + * have performed a filesystem operation. + */ + need_to_flush_wal = true; + + /* + * XXX When we rename or unlink a file, it's possible that some + * backend still has it open because it has recently read a page from + * it. smgr/undofile.c in any such backend will eventually close it, + * because it considers that fd to belong to the file with the name + * that we're unlinking or renaming and it doesn't like to keep more + * than one open at a time. No backend should ever try to read from + * such a file descriptor; that is what it means when we say that the + * caller of UndoLogDiscard() asserts that there will be no attempts + * to access the discarded range of undo log! In the case of a + * rename, if a backend were to attempt to read undo data in the range + * being discarded, it would read entirely the wrong data. + * + * XXX What defenses could we build against that happening due to + * bugs/corruption? One way would be for undofile.c to refuse to read + * buffers from before the current discard point, but currently + * undofile.c doesn't need to deal with shmem/locks. That may be + * false economy, but we really don't want reader to have to wait to + * acquire the undo log lock just to read undo data while we are doing + * filesystem stuff in here. + */ + + /* + * XXX Decide how many segments to recycle (= rename from tail + * position to head position). + * + * XXX For now it's always 1 unless there is already a spare one, but + * we could have an adaptive algorithm with the following goals: + * + * (1) handle future workload without having to create new segment + * files from scratch + * + * (2) reduce the rate of fsyncs require for recycling by doing + * several at once + */ + if (log->meta.end - log->meta.insert < UndoLogSegmentSize) + recycle = 1; + else + recycle = 0; + + /* Rewind to the start of the segment. */ + pointer = segno * UndoLogSegmentSize; + + while (pointer < new_segno * UndoLogSegmentSize) + { + char discard_path[MAXPGPATH]; + + /* + * Before removing the file, make sure that undofile_sync knows + * that it might be missing. + */ + undofile_forgetsync(log->logno, + log->meta.tablespace, + pointer / UndoLogSegmentSize); + + UndoLogSegmentPath(logno, pointer / UndoLogSegmentSize, + log->meta.tablespace, discard_path); + + /* Can we recycle the oldest segment? */ + if (recycle > 0) + { + char recycle_path[MAXPGPATH]; + + /* + * End points one byte past the end of the current undo space, + * ie to the first byte of the segment file we want to create. 
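+ *
+ * Illustrative example (assuming, say, 1MB segments): if the segment
+ * holding the old discard pointer starts at offset 3MB and end is
+ * currently 10MB, that segment's file is renamed to become the segment
+ * starting at 10MB and end advances to 11MB, so a later extension can be
+ * satisfied without zero-filling a brand new file.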
+ */ + UndoLogSegmentPath(logno, end / UndoLogSegmentSize, + log->meta.tablespace, recycle_path); + if (rename(discard_path, recycle_path) == 0) + { + elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */ + end += UndoLogSegmentSize; + --recycle; + } + else + { + elog(ERROR, "could not rename \"%s\" to \"%s\": %m", + discard_path, recycle_path); + } + } + else + { + if (unlink(discard_path) == 0) + elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */ + else + elog(ERROR, "could not unlink \"%s\": %m", discard_path); + } + pointer += UndoLogSegmentSize; + } + } + + /* WAL log the discard. */ + { + xl_undolog_discard xlrec; + XLogRecPtr ptr; + + xlrec.logno = logno; + xlrec.discard = discard; + xlrec.end = end; + xlrec.latestxid = xid; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_DISCARD); + + if (need_to_flush_wal) + XLogFlush(ptr); + } + + /* Update shmem to show the new discard and end pointers. */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.discard = discard; + log->meta.end = end; + LWLockRelease(&log->mutex); +} + +Oid +UndoRecPtrGetTablespace(UndoRecPtr ptr) +{ + UndoLogNumber logno = UndoRecPtrGetLogNo(ptr); + UndoLogControl *log = get_undo_log_by_number(logno); + + /* + * XXX What should the behaviour of this function be if you ask for the + * tablespace of a discarded log, where even the shmem bank is gone? + */ + + /* + * No need to acquire log->mutex, because log->meta.tablespace is constant + * for the lifetime of the log. TODO: will it always be? No I'm going to change that! + */ + if (log != NULL) + return log->meta.tablespace; + else + return InvalidOid; +} + +/* + * Return first valid UndoRecPtr for a given undo logno. If logno is invalid + * then return InvalidUndoRecPtr. + */ +UndoRecPtr +UndoLogGetFirstValidRecord(UndoLogNumber logno) +{ + UndoLogControl *log = get_undo_log_by_number(logno); + + if (log == NULL || log->meta.discard == log->meta.insert) + return InvalidUndoRecPtr; + + return MakeUndoRecPtr(logno, log->meta.discard); +} + +/* + * Return the Next insert location. This will also validate the input xid + * if latest insert point is not for the same transaction id then this will + * return Invalid Undo pointer. + */ +UndoRecPtr +UndoLogGetNextInsertPtr(UndoLogNumber logno, TransactionId xid) +{ + UndoLogControl *log = get_undo_log_by_number(logno); + TransactionId logxid; + UndoRecPtr insert; + + LWLockAcquire(&log->mutex, LW_SHARED); + insert = log->meta.insert; + logxid = log->xid; + LWLockRelease(&log->mutex); + + if (TransactionIdIsValid(logxid) && !TransactionIdEquals(logxid, xid)) + return InvalidUndoRecPtr; + + return MakeUndoRecPtr(logno, insert); +} + +/* + * Rewind the undo log insert position also set the prevlen in the mata + */ +void +UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen) +{ + UndoLogNumber logno = UndoRecPtrGetLogNo(insert_urp); + UndoLogControl *log = get_undo_log_by_number(logno); + UndoLogOffset insert = UndoRecPtrGetOffset(insert_urp); + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.insert = insert; + log->meta.prevlen = prevlen; + + /* + * Force the wal log on next undo allocation. So that during recovery undo + * insert location is consistent with normal allocation. + */ + log->need_attach_wal_record = true; + LWLockRelease(&log->mutex); + + /* WAL log the rewind. 
*/ + { + xl_undolog_rewind xlrec; + + xlrec.logno = logno; + xlrec.insert = insert; + xlrec.prevlen = prevlen; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_REWIND); + } +} + +/* + * Delete unreachable files under pg_undo. Any files corresponding to LSN + * positions before the previous checkpoint are no longer needed. + */ +static void +CleanUpUndoCheckPointFiles(XLogRecPtr checkPointRedo) +{ + DIR *dir; + struct dirent *de; + char path[MAXPGPATH]; + char oldest_path[MAXPGPATH]; + + /* + * If a base backup is in progress, we can't delete any checkpoint + * snapshot files because one of them corresponds to the backup label but + * there could be any number of checkpoints during the backup. + */ + if (BackupInProgress()) + return; + + /* Otherwise keep only those >= the previous checkpoint's redo point. */ + snprintf(oldest_path, MAXPGPATH, "%016" INT64_MODIFIER "X", + checkPointRedo); + dir = AllocateDir("pg_undo"); + while ((de = ReadDir(dir, "pg_undo")) != NULL) + { + /* + * Assume that fixed width uppercase hex strings sort the same way as + * the values they represent, so we can use strcmp to identify undo + * log snapshot files corresponding to checkpoints that we don't need + * anymore. This assumption holds for ASCII. + */ + if (!(strlen(de->d_name) == UNDO_CHECKPOINT_FILENAME_LENGTH)) + continue; + + if (UndoCheckPointFilenamePrecedes(de->d_name, oldest_path)) + { + snprintf(path, MAXPGPATH, "pg_undo/%s", de->d_name); + if (unlink(path) != 0) + elog(ERROR, "could not unlink file \"%s\": %m", path); + } + } + FreeDir(dir); +} + +/* + * Write out the undo log meta data to the pg_undo directory. The actual + * contents of undo logs is in shared buffers and therefore handled by + * CheckPointBuffers(), but here we record the table of undo logs and their + * properties. + */ +void +CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + UndoLogMetaData *serialized = NULL; + UndoLogNumber low_logno; + UndoLogNumber high_logno; + UndoLogNumber logno; + size_t serialized_size = 0; + char *data; + char path[MAXPGPATH]; + int num_logs; + int fd; + pg_crc32c crc; + + /* + * Take this opportunity to check if we can free up any DSM segments and + * also some entries in the checkpoint file by forgetting about entirely + * discarded undo logs. Otherwise both would eventually grow large. + */ + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + while (shared->low_logno < shared->high_logno) + { + UndoLogControl *log; + + log = get_undo_log_by_number(shared->low_logno); + if (log->meta.status != UNDO_LOG_STATUS_DISCARDED) + break; + + /* + * If this was the last slot in a bank, the bank is no longer needed. + * The shared memory will be given back to the operating system once + * every attached backend runs undolog_bank_gc(). + */ + if (UndoLogNoGetSlotNo(shared->low_logno + 1) == 0) + shared->banks[UndoLogNoGetBankNo(shared->low_logno)] = + DSM_HANDLE_INVALID; + + ++shared->low_logno; + } + LWLockRelease(UndoLogLock); + + /* Detach from any banks that we don't need if low_logno advanced. */ + undolog_bank_gc(); + + /* + * We acquire UndoLogLock to prevent any undo logs from being created or + * discarded while we build a snapshot of them. This isn't expected to + * take long on a healthy system because the number of active logs should + * be around the number of backends. 
Holding this lock won't prevent + * concurrent access to the undo log, except when segments need to be + * added or removed. + */ + LWLockAcquire(UndoLogLock, LW_SHARED); + + low_logno = shared->low_logno; + high_logno = shared->high_logno; + num_logs = high_logno - low_logno; + + /* + * Rather than doing the file IO while we hold the lock, we'll copy it + * into a palloc'd buffer. + */ + if (num_logs > 0) + { + serialized_size = sizeof(UndoLogMetaData) * num_logs; + serialized = (UndoLogMetaData *) palloc0(serialized_size); + + for (logno = low_logno; logno != high_logno; ++logno) + { + UndoLogControl *log; + + log = get_undo_log_by_number(logno); + if (log == NULL) /* XXX can this happen? */ + continue; + + /* Capture snapshot while holding the mutex. */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->need_attach_wal_record = true; + memcpy(&serialized[logno], &log->meta, sizeof(UndoLogMetaData)); + LWLockRelease(&log->mutex); + } + } + + LWLockRelease(UndoLogLock); + + /* Dump into a file under pg_undo. */ + snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X", + checkPointRedo); + pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE); + fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY); + if (fd < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create file \"%s\": %m", path))); + + /* Compute header checksum. */ + INIT_CRC32C(crc); + COMP_CRC32C(crc, &low_logno, sizeof(low_logno)); + COMP_CRC32C(crc, &high_logno, sizeof(high_logno)); + FIN_CRC32C(crc); + + /* Write out range of active log numbers + crc. */ + if ((write(fd, &low_logno, sizeof(low_logno)) != sizeof(low_logno)) || + (write(fd, &high_logno, sizeof(high_logno)) != sizeof(high_logno)) || + (write(fd, &crc, sizeof(crc)) != sizeof(crc))) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\": %m", path))); + + /* Write out the meta data for all undo logs in that range. */ + data = (char *) serialized; + INIT_CRC32C(crc); + while (serialized_size > 0) + { + ssize_t written; + + written = write(fd, data, serialized_size); + if (written < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\": %m", path))); + COMP_CRC32C(crc, data, written); + serialized_size -= written; + data += written; + } + FIN_CRC32C(crc); + + if (write(fd, &crc, sizeof(crc)) != sizeof(crc)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\": %m", path))); + + + /* Flush file and directory entry. */ + pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC); + pg_fsync(fd); + CloseTransientFile(fd); + fsync_fname("pg_undo", true); + pgstat_report_wait_end(); + + if (serialized) + pfree(serialized); + + CleanUpUndoCheckPointFiles(priorCheckPointRedo); + undolog_xid_map_gc(); +} + +void +StartupUndoLogs(XLogRecPtr checkPointRedo) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + char path[MAXPGPATH]; + int logno; + int fd; + pg_crc32c crc; + pg_crc32c new_crc; + + /* If initdb is calling, there is no file to read yet. */ + if (IsBootstrapProcessingMode()) + return; + + /* Open the pg_undo file corresponding to the given checkpoint. */ + snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X", + checkPointRedo); + pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_READ); + fd = OpenTransientFile(path, O_RDONLY | PG_BINARY); + if (fd < 0) + elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path); + + /* Read the active log number range. 
*/ + if ((read(fd, &shared->low_logno, sizeof(shared->low_logno)) + != sizeof(shared->low_logno)) || + (read(fd, &shared->high_logno, sizeof(shared->high_logno)) + != sizeof(shared->high_logno)) || + (read(fd, &crc, sizeof(crc)) != sizeof(crc))) + elog(ERROR, "pg_undo file \"%s\" is corrupted", path); + + /* Verify the header checksum. */ + INIT_CRC32C(new_crc); + COMP_CRC32C(new_crc, &shared->low_logno, sizeof(shared->low_logno)); + COMP_CRC32C(new_crc, &shared->high_logno, sizeof(shared->high_logno)); + FIN_CRC32C(new_crc); + + if (crc != new_crc) + elog(ERROR, + "pg_undo file \"%s\" has incorrect checksum", path); + + /* Initialize all the logs and set up the freelist. */ + INIT_CRC32C(new_crc); + for (logno = shared->low_logno; logno < shared->high_logno; ++logno) + { + UndoLogControl *log; + + /* Get a zero-initialized control objects. */ + ensure_undo_log_number(logno); + log = get_undo_log_by_number(logno); + + /* Read in the meta data for this undo log. */ + if (read(fd, &log->meta, sizeof(log->meta)) != sizeof(log->meta)) + elog(ERROR, "corrupted pg_undo meta data in file \"%s\": %m", + path); + COMP_CRC32C(new_crc, &log->meta, sizeof(log->meta)); + + /* + * At normal start-up, or during recovery, all active undo logs start + * out on the appropriate free list. + */ + log->pid = InvalidPid; + log->xid = InvalidTransactionId; + if (log->meta.status == UNDO_LOG_STATUS_ACTIVE) + { + log->next_free = shared->free_lists[log->meta.persistence]; + shared->free_lists[log->meta.persistence] = logno; + } + } + FIN_CRC32C(new_crc); + + /* Verify body checksum. */ + if (read(fd, &crc, sizeof(crc)) != sizeof(crc)) + elog(ERROR, "pg_undo file \"%s\" is corrupted", path); + if (crc != new_crc) + elog(ERROR, + "pg_undo file \"%s\" has incorrect checksum", path); + + CloseTransientFile(fd); + pgstat_report_wait_end(); +} + +/* + * WAL-LOG undo log meta data information before inserting the first WAL after + * the checkpoint for any undo log. + */ +void +LogUndoMetaData(xl_undolog_meta *xlrec) +{ + XLogRecPtr RedoRecPtr; + bool doPageWrites; + XLogRecPtr recptr; + +prepare_xlog: + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + + if (NeedUndoMetaLog(RedoRecPtr)) + { + XLogBeginInsert(); + XLogRegisterData((char *) xlrec, sizeof(xl_undolog_meta)); + recptr = XLogInsertExtended(RM_UNDOLOG_ID, XLOG_UNDOLOG_META, + RedoRecPtr, doPageWrites); + if (recptr == InvalidXLogRecPtr) + goto prepare_xlog; + + UndoLogSetLSN(recptr); + } +} + +/* + * Check whether we need to log undolog meta or not. + */ +bool +NeedUndoMetaLog(XLogRecPtr redo_point) +{ + UndoLogControl *log = MyUndoLogState.logs[UNDO_PERMANENT]; + + /* + * If the current session is not attached to any undo log then we don't + * need to log meta. It is quite possible that some operations skip + * writing undo, so those won't be attached to any undo log. + */ + if (log == NULL) + return false; + + Assert(AmAttachedToUndoLog(log)); + + if (log->lsn <= redo_point) + return true; + + return false; +} + +/* + * Update the WAL lsn in the undo. This is to test whether we need to include + * the xid to logno mapping information in the next WAL or not. + */ +void +UndoLogSetLSN(XLogRecPtr lsn) +{ + UndoLogControl *log = MyUndoLogState.logs[UNDO_PERMANENT]; + + Assert(AmAttachedToUndoLog(log)); + log->lsn = lsn; +} + +/* + * Get an UndoLogControl pointer for a given logno. This may require + * attaching to a DSM segment if it isn't already attached in this backend. + * Return NULL if there is no such logno because it has been entirely + * discarded. 
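+ *
+ * A typical caller therefore checks for NULL, along these lines (sketch
+ * based on the callers in this file):
+ *
+ *    UndoLogControl *log = get_undo_log_by_number(logno);
+ *
+ *    if (log == NULL)
+ *        elog(ERROR, "unknown undo log %d", logno);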
+ */ +static UndoLogControl * +get_undo_log_by_number(UndoLogNumber logno) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + int bankno = UndoLogNoGetBankNo(logno); + int slotno = UndoLogNoGetSlotNo(logno); + + /* See if we need to attach to the bank that holds logno. */ + if (unlikely(MyUndoLogState.banks[bankno] == NULL)) + { + dsm_segment *segment; + + if (shared->banks[bankno] != DSM_HANDLE_INVALID) + { + segment = dsm_attach(shared->banks[bankno]); + if (segment != NULL) + { + MyUndoLogState.bank_segments[bankno] = segment; + MyUndoLogState.banks[bankno] = dsm_segment_address(segment); + dsm_pin_mapping(segment); + } + } + + if (unlikely(MyUndoLogState.banks[bankno] == NULL)) + return NULL; + } + + return &MyUndoLogState.banks[bankno][slotno]; +} + +UndoLogControl * +UndoLogGet(UndoLogNumber logno) +{ + /* TODO just rename the above function */ + return get_undo_log_by_number(logno); +} + +/* + * We write the undo log number into each UndoLogControl object. + */ +static void +initialize_undo_log_bank(int bankno, UndoLogControl *bank) +{ + int i; + int logs_per_bank = 1 << (UndoLogNumberBits - UndoLogBankBits); + + for (i = 0; i < logs_per_bank; ++i) + { + bank[i].logno = logs_per_bank * bankno + i; + LWLockInitialize(&bank[i].mutex, LWTRANCHE_UNDOLOG); + LWLockInitialize(&bank[i].discard_lock, LWTRANCHE_UNDODISCARD); + LWLockInitialize(&bank[i].rewind_lock, LWTRANCHE_REWIND); + } +} + +/* + * Create shared memory space for a given undo log number, if it doesn't exist + * already. + */ +static void +ensure_undo_log_number(UndoLogNumber logno) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + int bankno = UndoLogNoGetBankNo(logno); + + /* In single-user mode, we have to use backend-private memory. */ + if (!IsUnderPostmaster) + { + if (MyUndoLogState.banks[bankno] == NULL) + { + size_t size; + + size = sizeof(UndoLogControl) * (1 << UndoLogBankBits); + MyUndoLogState.banks[bankno] = + MemoryContextAllocZero(TopMemoryContext, size); + initialize_undo_log_bank(bankno, MyUndoLogState.banks[bankno]); + } + return; + } + + /* Do we need to create a bank in shared memory for this undo log number? */ + if (shared->banks[bankno] == DSM_HANDLE_INVALID) + { + dsm_segment *segment; + size_t size; + + size = sizeof(UndoLogControl) * (1 << UndoLogBankBits); + segment = dsm_create(size, 0); + dsm_pin_mapping(segment); + dsm_pin_segment(segment); + memset(dsm_segment_address(segment), 0, size); + shared->banks[bankno] = dsm_segment_handle(segment); + MyUndoLogState.banks[bankno] = dsm_segment_address(segment); + initialize_undo_log_bank(bankno, MyUndoLogState.banks[bankno]); + } +} + +/* + * Attach to an undo log, possibly creating or recycling one. + */ +static void +attach_undo_log(UndoPersistence persistence, Oid tablespace) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + UndoLogControl *log = NULL; + UndoLogNumber logno; + UndoLogNumber *place; + + Assert(!InRecovery); + Assert(MyUndoLogState.logs[persistence] == NULL); + + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + + /* + * For now we have a simple linked list of unattached undo logs for each + * persistence level. We'll grovel though it to find something for the + * tablespace you asked for. If you're not using multiple tablespaces + * it'll be able to pop one off the front. We might need a hash table + * keyed by tablespace if this simple scheme turns out to be too slow when + * using many tablespaces and many undo logs, but that seems like an + * unusual use case not worth optimizing for. 
+ */ + place = &shared->free_lists[persistence]; + while (*place != InvalidUndoLogNumber) + { + UndoLogControl *candidate = get_undo_log_by_number(*place); + + if (candidate == NULL) + elog(ERROR, "corrupted undo log freelist"); + if (candidate->meta.tablespace == tablespace) + { + logno = *place; + log = candidate; + *place = candidate->next_free; + break; + } + place = &candidate->next_free; + } + + /* + * All existing undo logs for this tablespace and persistence level are + * busy, so we'll have to create a new one. + */ + if (log == NULL) + { + if (shared->high_logno > (1 << UndoLogNumberBits)) + { + /* + * You've used up all 16 exabytes of undo log addressing space. + * This is a difficult state to reach using only 16 exabytes of + * WAL. + */ + elog(ERROR, "cannot create new undo log"); + } + + logno = shared->high_logno; + ensure_undo_log_number(logno); + + /* Get new zero-filled UndoLogControl object. */ + log = get_undo_log_by_number(logno); + + Assert(log->meta.persistence == 0); + Assert(log->meta.tablespace == InvalidOid); + Assert(log->meta.discard == 0); + Assert(log->meta.insert == 0); + Assert(log->meta.end == 0); + Assert(log->pid == 0); + Assert(log->xid == 0); + + /* + * The insert and discard pointers start after the first block's + * header. XXX That means that insert is > end for a short time in a + * newly created undo log. Is there any problem with that? + */ + log->meta.insert = UndoLogBlockHeaderSize; + log->meta.discard = UndoLogBlockHeaderSize; + + log->meta.tablespace = tablespace; + log->meta.persistence = persistence; + log->meta.status = UNDO_LOG_STATUS_ACTIVE; + + /* Move the high log number pointer past this one. */ + ++shared->high_logno; + + /* WAL-log the creation of this new undo log. */ + { + xl_undolog_create xlrec; + + xlrec.logno = logno; + xlrec.tablespace = log->meta.tablespace; + xlrec.persistence = log->meta.persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE); + } + + /* + * This undo log has no segments. UndoLogAllocate will create the + * first one on demand. + */ + } + LWLockRelease(UndoLogLock); + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->pid = MyProcPid; + log->xid = InvalidTransactionId; + log->need_attach_wal_record = true; + LWLockRelease(&log->mutex); + + MyUndoLogState.logs[persistence] = log; +} + +/* + * Free chunks of the xid/undo log map that relate to transactions that are no + * longer running. This is run at each checkpoint. + */ +static void +undolog_xid_map_gc(void) +{ + UndoLogNumber **xid_map = MyUndoLogState.xid_map; + TransactionId oldest_xid; + uint16 new_oldest_chunk; + uint16 oldest_chunk; + + if (xid_map == NULL) + return; + + /* + * During crash recovery, it may not be possible to call GetOldestXmin() + * yet because latestCompletedXid is invalid. + */ + if (!TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid)) + return; + + oldest_xid = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT); + new_oldest_chunk = UndoLogGetXidHigh(oldest_xid); + oldest_chunk = MyUndoLogState.xid_map_oldest_chunk; + + while (oldest_chunk != new_oldest_chunk) + { + if (xid_map[oldest_chunk]) + { + pfree(xid_map[oldest_chunk]); + xid_map[oldest_chunk] = NULL; + } + oldest_chunk = (oldest_chunk + 1) % (1 << UndoLogXidHighBits); + } + MyUndoLogState.xid_map_oldest_chunk = new_oldest_chunk; +} + +/* + * Detach from shared memory banks that are no longer needed because they hold + * undo logs that are entirely discarded. 
This should ideally be called + * periodically in any backend that accesses undo data, so that they have a + * chance to detach from DSM segments that hold banks of entirely discarded + * undo log control objects. + */ +static void +undolog_bank_gc(void) +{ + UndoLogSharedData *shared = MyUndoLogState.shared; + UndoLogNumber low_logno = shared->low_logno; + + if (unlikely(MyUndoLogState.low_logno < low_logno)) + { + int low_bank = UndoLogNoGetBankNo(low_logno); + int bank = UndoLogNoGetBankNo(MyUndoLogState.low_logno); + + while (bank < low_bank) + { + Assert(shared->banks[bank] == DSM_HANDLE_INVALID); + if (MyUndoLogState.banks[bank] != NULL) + { + dsm_detach(MyUndoLogState.bank_segments[bank]); + MyUndoLogState.bank_segments[bank] = NULL; + MyUndoLogState.banks[bank] = NULL; + } + ++bank; + } + } + + MyUndoLogState.low_logno = low_logno; +} + +/* + * Associate a xid with an undo log, during recovery. In a primary server, + * this isn't necessary because backends know which undo log they're attached + * to. During recovery, the natural association between backends and xids is + * lost, so we need to manage that explicitly. + */ +static void +undolog_xid_map_add(TransactionId xid, UndoLogNumber logno) +{ + uint16 high_bits; + uint16 low_bits; + + high_bits = UndoLogGetXidHigh(xid); + low_bits = UndoLogGetXidLow(xid); + + if (unlikely(MyUndoLogState.xid_map == NULL)) + { + /* First time through. Create mapping array. */ + MyUndoLogState.xid_map = + MemoryContextAllocZero(TopMemoryContext, + sizeof(UndoLogNumber *) * + (1 << (32 - UndoLogXidLowBits))); + MyUndoLogState.xid_map_oldest_chunk = high_bits; + } + + if (unlikely(MyUndoLogState.xid_map[high_bits] == NULL)) + { + /* This bank of mappings doesn't exist yet. Create it. */ + MyUndoLogState.xid_map[high_bits] = + MemoryContextAllocZero(TopMemoryContext, + sizeof(UndoLogNumber) * + (1 << UndoLogXidLowBits)); + } + + /* Associate this xid with this undo log number. */ + MyUndoLogState.xid_map[high_bits][low_bits] = logno; +} + +/* check_hook: validate new undo_tablespaces */ +bool +check_undo_tablespaces(char **newval, void **extra, GucSource source) +{ + char *rawname; + List *namelist; + + /* Need a modifiable copy of string */ + rawname = pstrdup(*newval); + + /* + * Parse string into list of identifiers, just to check for + * well-formedness (unfortunateley we can't validate the names in the + * catalog yet). + */ + if (!SplitIdentifierString(rawname, ',', &namelist)) + { + /* syntax error in name list */ + GUC_check_errdetail("List syntax is invalid."); + pfree(rawname); + list_free(namelist); + return false; + } + + /* + * Make sure we aren't already in a transaction that has been assigned an + * XID. This ensures we don't detach from an undo log that we might have + * started writing undo data into for this transaction. 
+ */
+ if (GetTopTransactionIdIfAny() != InvalidTransactionId)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ (errmsg("undo_tablespaces cannot be changed while a transaction is in progress"))));
+ list_free(namelist);
+
+ return true;
+}
+
+/* assign_hook: do extra actions as needed */
+void
+assign_undo_tablespaces(const char *newval, void *extra)
+{
+ /*
+ * This is normally called only when GetTopTransactionIdIfAny() ==
+ * InvalidTransactionId (because you can't change undo_tablespaces in the
+ * middle of a transaction that's been assigned an xid), but we can't
+ * assert that because it's also called at the end of a transaction that's
+ * rolling back, to reset the GUC if it was set inside the transaction.
+ */
+
+ /* Tell UndoLogAllocate() to reexamine undo_tablespaces. */
+ MyUndoLogState.need_to_choose_tablespace = true;
+}
+
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+ char *rawname;
+ List *namelist;
+ bool need_to_unlock;
+ int length;
+ int i;
+
+ /* We need a modifiable copy of the string. */
+ rawname = pstrdup(undo_tablespaces);
+
+ /* Break the string into a list of identifiers. */
+ if (!SplitIdentifierString(rawname, ',', &namelist))
+ elog(ERROR, "undo_tablespaces is unexpectedly malformed");
+
+ length = list_length(namelist);
+ if (length == 0 ||
+ (length == 1 && ((char *) linitial(namelist))[0] == '\0'))
+ {
+ /*
+ * If it's an empty string, then we'll use the default tablespace. No
+ * locking is required because it can't be dropped.
+ */
+ *tablespace = DEFAULTTABLESPACE_OID;
+ need_to_unlock = false;
+ }
+ else
+ {
+ /*
+ * Choose an OID using our pid, so that if several backends have the
+ * same multi-tablespace setting they'll spread out. We could easily
+ * do better than this if more serious load balancing is judged
+ * useful.
+ */
+ int index = MyProcPid % length;
+ int first_index = index;
+ Oid oid = InvalidOid;
+
+ /*
+ * Take the tablespace create/drop lock while we look the name up.
+ * This prevents the tablespace from being dropped while we're trying
+ * to resolve the name, or while the caller is trying to create an
+ * undo log in it. The caller will have to release this lock.
+ */
+ LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+ for (;;)
+ {
+ const char *name = list_nth(namelist, index);
+
+ oid = get_tablespace_oid(name, true);
+ if (oid == InvalidOid)
+ {
+ /* Unknown tablespace, try the next one. */
+ index = (index + 1) % length;
+ /*
+ * But if we've tried them all, it's time to complain. We'll
+ * arbitrarily complain about the last one we tried in the
+ * error message.
+ */
+ if (index == first_index)
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("tablespace \"%s\" does not exist", name),
+ errhint("Create the tablespace or set undo_tablespaces to a valid or empty list.")));
+ continue;
+ }
+ if (oid == GLOBALTABLESPACE_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("undo logs cannot be placed in pg_global tablespace")));
+ /* If we got here we succeeded in finding one. */
+ break;
+ }
+
+ Assert(oid != InvalidOid);
+ *tablespace = oid;
+ need_to_unlock = true;
+ }
+
+ /*
+ * If we came here because the user changed undo_tablespaces, then detach
+ * from any undo logs we happen to be attached to.
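+ *
+ * (To illustrate the pid-based choice above with hypothetical values: with
+ * undo_tablespaces = 'ts1, ts2, ts3', a backend with pid 12345 starts its
+ * search at index 12345 % 3 = 0, i.e. ts1, and only moves on to ts2/ts3 if
+ * an earlier name cannot be resolved.)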
+ */ + if (force_detach) + { + for (i = 0; i < UndoPersistenceLevels; ++i) + { + UndoLogControl *log = MyUndoLogState.logs[i]; + UndoLogSharedData *shared = MyUndoLogState.shared; + + if (log != NULL) + { + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->pid = InvalidPid; + log->xid = InvalidTransactionId; + LWLockRelease(&log->mutex); + + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + log->next_free = shared->free_lists[i]; + shared->free_lists[i] = log->logno; + LWLockRelease(UndoLogLock); + + MyUndoLogState.logs[i] = NULL; + } + } + } + + return need_to_unlock; +} + +bool +DropUndoLogsInTablespace(Oid tablespace) +{ + DIR *dir; + char undo_path[MAXPGPATH]; + UndoLogNumber low_logno; + UndoLogNumber high_logno; + UndoLogNumber logno; + UndoLogSharedData *shared = MyUndoLogState.shared; + int i; + + Assert(LWLockHeldByMe(TablespaceCreateLock)); + Assert(tablespace != DEFAULTTABLESPACE_OID); + + LWLockAcquire(UndoLogLock, LW_SHARED); + low_logno = shared->low_logno; + high_logno = shared->high_logno; + LWLockRelease(UndoLogLock); + + /* First, try to kick everyone off any undo logs in this tablespace. */ + for (logno = low_logno; logno < high_logno; ++logno) + { + UndoLogControl *log = get_undo_log_by_number(logno); + bool ok; + bool return_to_freelist = false; + + /* Skip undo logs in other tablespaces. */ + if (log->meta.tablespace != tablespace) + continue; + + /* Check if this undo log can be forcibly detached. */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + if (log->meta.discard == log->meta.insert && + (log->xid == InvalidTransactionId || + !TransactionIdIsInProgress(log->xid))) + { + log->xid = InvalidTransactionId; + if (log->pid != InvalidPid) + { + log->pid = InvalidPid; + return_to_freelist = true; + } + ok = true; + } + else + { + /* + * There is data we need in this undo log. We can't force it to + * be detached. + */ + ok = false; + } + LWLockRelease(&log->mutex); + + /* If we failed, then give up now and report failure. */ + if (!ok) + return false; + + /* + * Put this undo log back on the appropriate free-list. No one can + * attach to it while we hold TablespaceCreateLock, but if we return + * earlier in a future go around this loop, we need the undo log to + * remain usable. We'll remove all appropriate logs from the + * free-lists in a separate step below. + */ + if (return_to_freelist) + { + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + log->next_free = shared->free_lists[log->meta.persistence]; + shared->free_lists[log->meta.persistence] = logno; + LWLockRelease(UndoLogLock); + } + } + + /* + * We detached all backends from undo logs in this tablespace, and no one + * can attach to any non-default-tablespace undo logs while we hold + * TablespaceCreateLock. We can now drop the undo logs. + */ + for (logno = low_logno; logno < high_logno; ++logno) + { + UndoLogControl *log = get_undo_log_by_number(logno); + + /* Skip undo logs in other tablespaces. */ + if (log->meta.tablespace != tablespace) + continue; + + /* + * Make sure no buffers remain. When that is done by UndoDiscard(), + * the final page is left in shared_buffers because it may contain + * data, or at least be needed again very soon. Here we need to drop + * even that page from the buffer pool. + */ + forget_undo_buffers(logno, log->meta.discard, log->meta.discard, true); + + /* + * TODO: For now we drop the undo log, meaning that it will never be + * used again. That wastes the rest of its address space. 
Instead, + * we should put it onto a special list of 'offline' undo logs, ready + * to be reactivated in some other tablespace. Then we can keep the + * unused portion of its address space. + */ + + /* Log the dropping operation. TODO: WAL */ + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.status = UNDO_LOG_STATUS_DISCARDED; + LWLockRelease(&log->mutex); + } + + /* TODO: flush WAL? revisit */ + /* Unlink all undo segment files in this tablespace. */ + UndoLogDirectory(tablespace, undo_path); + + dir = AllocateDir(undo_path); + if (dir != NULL) + { + struct dirent *de; + + while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL) + { + char segment_path[MAXPGPATH]; + + if (strcmp(de->d_name, ".") == 0 || + strcmp(de->d_name, "..") == 0) + continue; + snprintf(segment_path, sizeof(segment_path), "%s/%s", + undo_path, de->d_name); + if (unlink(segment_path) < 0) + elog(LOG, "couldn't unlink file \"%s\": %m", segment_path); + } + FreeDir(dir); + } + + /* Remove all dropped undo logs from the free-lists. */ + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + for (i = 0; i < UndoPersistenceLevels; ++i) + { + UndoLogControl *log; + UndoLogNumber *place; + + place = &shared->free_lists[i]; + while (*place != InvalidUndoLogNumber) + { + log = get_undo_log_by_number(*place); + if (log->meta.status == UNDO_LOG_STATUS_DISCARDED) + *place = log->next_free; + else + place = &log->next_free; + } + } + LWLockRelease(UndoLogLock); + + return true; +} + +void +ResetUndoLogs(UndoPersistence persistence) +{ + UndoLogNumber low_logno; + UndoLogNumber high_logno; + UndoLogNumber logno; + UndoLogSharedData *shared = MyUndoLogState.shared; + + LWLockAcquire(UndoLogLock, LW_SHARED); + low_logno = shared->low_logno; + high_logno = shared->high_logno; + LWLockRelease(UndoLogLock); + + /* TODO: figure out if locking is needed here */ + + for (logno = low_logno; logno < high_logno; ++logno) + { + UndoLogControl *log = get_undo_log_by_number(logno); + DIR *dir; + struct dirent *de; + char undo_path[MAXPGPATH]; + char segment_prefix[MAXPGPATH]; + size_t segment_prefix_size; + + if (log->meta.persistence != persistence) + continue; + + /* Scan the directory for files belonging to this undo log. */ + snprintf(segment_prefix, sizeof(segment_prefix), "%06X.", logno); + segment_prefix_size = strlen(segment_prefix); + UndoLogDirectory(log->meta.tablespace, undo_path); + dir = AllocateDir(undo_path); + if (dir == NULL) + continue; + while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL) + { + char segment_path[MAXPGPATH]; + + if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0) + continue; + snprintf(segment_path, sizeof(segment_path), "%s/%s", + undo_path, de->d_name); + elog(LOG, "unlinked undo segment \"%s\"", segment_path); /* XXX: remove me */ + if (unlink(segment_path) < 0) + elog(LOG, "couldn't unlink file \"%s\": %m", segment_path); + } + FreeDir(dir); + + /* + * We have no segment files. Set the pointers to indicate that there + * is no data. The discard and insert pointers point to the first + * usable byte in the segment we will create when we next try to + * allocate. This is a bit strange, because it means that they are + * past the end pointer. That's the same as when new undo logs are + * created. + * + * TODO: Should we rewind to zero instead, so we can reuse that (now) + * unreferenced address space? 
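+ *
+ * (Illustration with a hypothetical value: if meta.end is E when we get
+ * here, both insert and discard become E + UndoLogBlockHeaderSize, while
+ * end itself stays at E until the next allocation extends the log and
+ * creates the segment starting at E.)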
+ */ + log->meta.insert = log->meta.discard = log->meta.end + + UndoLogBlockHeaderSize; + + /* + * TODO: Here we need to call forget_undo_buffers() to nuke anything + * in shared buffers that might have resulted from replaying WAL, + * which will cause later checkpoints to fail when they can't find a + * file to write buffers to. But we can't, because we don't know the + * true discard and end pointers here. Ahh, that's not right. There + * can be no such WAL, because unlogged relations shouldn't be logging + * anything. So the fact that they are is a bug elsewhere in zheap + * code? + */ + } +} + +Datum +pg_stat_get_undo_logs(PG_FUNCTION_ARGS) +{ +#define PG_STAT_GET_UNDO_LOGS_COLS 9 + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + UndoLogNumber low_logno; + UndoLogNumber high_logno; + UndoLogNumber logno; + UndoLogSharedData *shared = MyUndoLogState.shared; + char *tablespace_name = NULL; + Oid last_tablespace = InvalidOid; + + /* check to see if caller supports us returning a tuplestore */ + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + oldcontext = MemoryContextSwitchTo(per_query_ctx); + + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + /* Find the range of active log numbers. */ + LWLockAcquire(UndoLogLock, LW_SHARED); + low_logno = shared->low_logno; + high_logno = shared->high_logno; + LWLockRelease(UndoLogLock); + + /* Scan all undo logs to build the results. */ + for (logno = low_logno; logno < high_logno; ++logno) + { + UndoLogControl *log = get_undo_log_by_number(logno); + char buffer[17]; + Datum values[PG_STAT_GET_UNDO_LOGS_COLS]; + bool nulls[PG_STAT_GET_UNDO_LOGS_COLS] = { false }; + Oid tablespace; + + if (log == NULL) + continue; + + /* + * This won't be a consistent result overall, but the values for each + * log will be consistent because we'll take the per-log lock while + * copying them. + */ + LWLockAcquire(&log->mutex, LW_SHARED); + + if (log->meta.status == UNDO_LOG_STATUS_DISCARDED) + { + LWLockRelease(&log->mutex); + continue; + } + + values[0] = ObjectIdGetDatum((Oid) logno); + values[1] = CStringGetTextDatum( + log->meta.persistence == UNDO_PERMANENT ? "permanent" : + log->meta.persistence == UNDO_UNLOGGED ? "unlogged" : + log->meta.persistence == UNDO_TEMP ? 
"temporary" : ""); + tablespace = log->meta.tablespace; + + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, + MakeUndoRecPtr(logno, log->meta.discard)); + values[3] = CStringGetTextDatum(buffer); + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, + MakeUndoRecPtr(logno, log->meta.insert)); + values[4] = CStringGetTextDatum(buffer); + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, + MakeUndoRecPtr(logno, log->meta.end)); + values[5] = CStringGetTextDatum(buffer); + if (log->xid == InvalidTransactionId) + nulls[6] = true; + else + values[6] = TransactionIdGetDatum(log->xid); + if (log->pid == InvalidPid) + nulls[7] = true; + else + values[7] = Int32GetDatum((int64) log->pid); + LWLockRelease(&log->mutex); + + /* + * Deal with potentially slow tablespace name lookup without the lock. + * Avoid making multiple calls to that expensive function for the + * common case of repeating tablespace. + */ + if (tablespace != last_tablespace) + { + if (tablespace_name) + pfree(tablespace_name); + tablespace_name = get_tablespace_name(tablespace); + last_tablespace = tablespace; + } + if (tablespace_name) + { + values[2] = CStringGetTextDatum(tablespace_name); + nulls[2] = false; + } + else + nulls[2] = true; + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + if (tablespace_name) + pfree(tablespace_name); + tuplestore_donestoring(tupstore); + + return (Datum) 0; +} + +/* + * replay the creation of a new undo log + */ +static void +undolog_xlog_create(XLogReaderState *record) +{ + xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record); + UndoLogControl *log; + UndoLogSharedData *shared = MyUndoLogState.shared; + + /* Create meta-data space in shared memory. */ + ensure_undo_log_number(xlrec->logno); + + log = get_undo_log_by_number(xlrec->logno); + log->meta.status = UNDO_LOG_STATUS_ACTIVE; + log->meta.persistence = xlrec->persistence; + log->meta.tablespace = xlrec->tablespace; + log->meta.insert = UndoLogBlockHeaderSize; + log->meta.discard = UndoLogBlockHeaderSize; + + LWLockAcquire(UndoLogLock, LW_SHARED); + shared->high_logno = Max(xlrec->logno + 1, shared->high_logno); + LWLockRelease(UndoLogLock); +} + +/* + * replay the addition of a new segment to an undo log + */ +static void +undolog_xlog_extend(XLogReaderState *record) +{ + xl_undolog_extend *xlrec = (xl_undolog_extend *) XLogRecGetData(record); + + /* Extend exactly as we would during DO phase. */ + extend_undo_log(xlrec->logno, xlrec->end); +} + +/* + * replay the association of an xid with a specific undo log + */ +static void +undolog_xlog_attach(XLogReaderState *record) +{ + xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record); + UndoLogControl *log; + + undolog_xid_map_add(xlrec->xid, xlrec->logno); + + /* Restore current dbid */ + MyUndoLogState.dbid = xlrec->dbid; + + /* + * Whatever follows is the first record for this transaction. Zheap will + * use this to add UREC_INFO_TRANSACTION. + */ + log = get_undo_log_by_number(xlrec->logno); + log->meta.is_first_rec = true; + log->xid = xlrec->xid; +} + +/* + * replay undo log meta-data image + */ +static void +undolog_xlog_meta(XLogReaderState *record) +{ + xl_undolog_meta *xlrec = (xl_undolog_meta *) XLogRecGetData(record); + UndoLogControl *log; + + undolog_xid_map_add(xlrec->xid, xlrec->logno); + + log = get_undo_log_by_number(xlrec->logno); + if (log == NULL) + elog(ERROR, "cannot attach to unknown undo log %u", xlrec->logno); + + /* + * Update the insertion point. 
While this races against a checkpoint, + * XLOG_UNDOLOG_META always wins because it must be correct for any + * subsequent data appended by this transaction, so we can simply + * overwrite it here. + */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta = xlrec->meta; + log->xid = xlrec->xid; + log->pid = MyProcPid; /* show as recovery process */ + LWLockRelease(&log->mutex); +} + +/* + * Drop all buffers for the given undo log, from the old_discard to up + * new_discard. If drop_tail is true, also drop the buffer that holds + * new_discard; this is used when dropping undo logs completely via DROP + * TABLESPACE. If it is false, then the final buffer is not dropped because + * it may contain data. + * + */ +static void +forget_undo_buffers(int logno, UndoLogOffset old_discard, + UndoLogOffset new_discard, bool drop_tail) +{ + BlockNumber old_blockno; + BlockNumber new_blockno; + RelFileNode rnode; + + UndoRecPtrAssignRelFileNode(rnode, MakeUndoRecPtr(logno, old_discard)); + old_blockno = old_discard / BLCKSZ; + new_blockno = new_discard / BLCKSZ; + if (drop_tail) + ++new_blockno; + while (old_blockno < new_blockno) + { + ForgetBuffer(rnode, UndoLogForkNum, old_blockno); + ForgetLocalBuffer(rnode, UndoLogForkNum, old_blockno++); + } +} +/* + * replay an undo segment discard record + */ +static void +undolog_xlog_discard(XLogReaderState *record) +{ + xl_undolog_discard *xlrec = (xl_undolog_discard *) XLogRecGetData(record); + UndoLogControl *log; + UndoLogOffset discard; + UndoLogOffset end; + UndoLogOffset old_segment_begin; + UndoLogOffset new_segment_begin; + RelFileNode rnode = {0}; + char dir[MAXPGPATH]; + + log = get_undo_log_by_number(xlrec->logno); + if (log == NULL) + elog(ERROR, "unknown undo log %d", xlrec->logno); + + /* + * We're about to discard undologs. In Hot Standby mode, ensure that + * there's no queries running which need to get tuple from discarded undo. + * + * XXX we are passing empty rnode to the conflict function so that it can + * check conflict in all the backend regardless of which database the + * backend is connected. + */ + if (InHotStandby && TransactionIdIsValid(xlrec->latestxid)) + ResolveRecoveryConflictWithSnapshot(xlrec->latestxid, rnode); + + /* + * See if we need to unlink or rename any files, but don't consider it an + * error if we find that files are missing. Since UndoLogDiscard() + * performs filesystem operations before WAL logging or updating shmem + * which could be checkpointed, a crash could have left files already + * deleted, but we could replay WAL that expects the files to be there. + */ + + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + discard = log->meta.discard; + end = log->meta.end; + LWLockRelease(&log->mutex); + + /* Drop buffers before we remove/recycle any files. */ + forget_undo_buffers(xlrec->logno, discard, xlrec->discard, false); + + /* Rewind to the start of the segment. */ + old_segment_begin = discard - discard % UndoLogSegmentSize; + new_segment_begin = xlrec->discard - xlrec->discard % UndoLogSegmentSize; + + /* Unlink or rename segments that are no longer in range. */ + while (old_segment_begin < new_segment_begin) + { + char discard_path[MAXPGPATH]; + + /* + * Before removing the file, make sure that undofile_sync knows that + * it might be missing. + */ + undofile_forgetsync(log->logno, + log->meta.tablespace, + old_segment_begin / UndoLogSegmentSize); + + UndoLogSegmentPath(xlrec->logno, old_segment_begin / UndoLogSegmentSize, + log->meta.tablespace, discard_path); + + /* Can we recycle the oldest segment? 
*/ + if (end < xlrec->end) + { + char recycle_path[MAXPGPATH]; + + UndoLogSegmentPath(xlrec->logno, end / UndoLogSegmentSize, + log->meta.tablespace, recycle_path); + if (rename(discard_path, recycle_path) == 0) + { + elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */ + end += UndoLogSegmentSize; + } + else + { + elog(LOG, "could not rename \"%s\" to \"%s\": %m", + discard_path, recycle_path); + } + } + else + { + if (unlink(discard_path) == 0) + elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */ + else + elog(LOG, "could not unlink \"%s\": %m", discard_path); + } + old_segment_begin += UndoLogSegmentSize; + } + + /* Create any further new segments that are needed the slow way. */ + while (end < xlrec->end) + { + allocate_empty_undo_segment(xlrec->logno, log->meta.tablespace, end); + end += UndoLogSegmentSize; + } + + /* Flush the directory entries. */ + UndoLogDirectory(log->meta.tablespace, dir); + fsync_fname(dir, true); + + /* Update shmem. */ + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); + log->meta.discard = xlrec->discard; + log->meta.end = end; + LWLockRelease(&log->mutex); +} + +/* + * replay the rewind of a undo log + */ +static void +undolog_xlog_rewind(XLogReaderState *record) +{ + xl_undolog_rewind *xlrec = (xl_undolog_rewind *) XLogRecGetData(record); + UndoLogControl *log; + + log = get_undo_log_by_number(xlrec->logno); + log->meta.insert = xlrec->insert; + log->meta.prevlen = xlrec->prevlen; +} + +void +undolog_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info) + { + case XLOG_UNDOLOG_CREATE: + undolog_xlog_create(record); + break; + case XLOG_UNDOLOG_EXTEND: + undolog_xlog_extend(record); + break; + case XLOG_UNDOLOG_ATTACH: + undolog_xlog_attach(record); + break; + case XLOG_UNDOLOG_DISCARD: + undolog_xlog_discard(record); + break; + case XLOG_UNDOLOG_REWIND: + undolog_xlog_rewind(record); + break; + case XLOG_UNDOLOG_META: + undolog_xlog_meta(record); + break; + default: + elog(PANIC, "undo_redo: unknown op code %u", info); + } +} + +/* + * For assertions only. + */ +bool +AmAttachedToUndoLog(UndoLogControl *log) +{ + int i; + + for (i = 0; i < UndoPersistenceLevels; ++i) + { + if (MyUndoLogState.logs[i] == log) + return true; + } + return false; +} + +/* + * Fetch database id from the undo log state + */ +Oid +UndoLogStateGetDatabaseId() +{ + Assert(InRecovery); + return MyUndoLogState.dbid; +} diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c new file mode 100644 index 0000000000..73076dc5f4 --- /dev/null +++ b/src/backend/access/undo/undorecord.c @@ -0,0 +1,451 @@ +/*------------------------------------------------------------------------- + * + * undorecord.c + * encode and decode undo records + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/backend/access/undo/undorecord.c + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/subtrans.h" +#include "access/undorecord.h" +#include "catalog/pg_tablespace.h" +#include "storage/block.h" + +/* Workspace for InsertUndoRecord and UnpackUndoRecord. 
*/ +static UndoRecordHeader work_hdr; +static UndoRecordRelationDetails work_rd; +static UndoRecordBlock work_blk; +static UndoRecordTransaction work_txn; +static UndoRecordPayload work_payload; + +/* Prototypes for static functions. */ +static bool InsertUndoBytes(char *sourceptr, int sourcelen, + char **writeptr, char *endptr, + int *my_bytes_written, int *total_bytes_written); +static bool ReadUndoBytes(char *destptr, int readlen, + char **readptr, char *endptr, + int *my_bytes_read, int *total_bytes_read, bool nocopy); + +/* + * Compute and return the expected size of an undo record. + */ +Size +UndoRecordExpectedSize(UnpackedUndoRecord *uur) +{ + Size size; + + size = SizeOfUndoRecordHeader; + if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0) + size += SizeOfUndoRecordRelationDetails; + if ((uur->uur_info & UREC_INFO_BLOCK) != 0) + size += SizeOfUndoRecordBlock; + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) + size += SizeOfUndoRecordTransaction; + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) + { + size += SizeOfUndoRecordPayload; + size += uur->uur_payload.len; + size += uur->uur_tuple.len; + } + + return size; +} + +/* + * Insert as much of an undo record as will fit in the given page. + * starting_byte is the byte within the give page at which to begin + * writing, while *already_written is the number of bytes written to + * previous pages. Returns true if the remainder of the record was + * written and false if more bytes remain to be written; in either + * case, *already_written is set to the number of bytes written thus + * far. + * + * This function assumes that if *already_written is non-zero on entry, + * the same UnpackedUndoRecord is passed each time. It also assumes + * that UnpackUndoRecord is not called between successive calls to + * InsertUndoRecord for the same UnpackedUndoRecord. + */ +bool +InsertUndoRecord(UnpackedUndoRecord *uur, Page page, + int starting_byte, int *already_written, bool header_only) +{ + char *writeptr = (char *) page + starting_byte; + char *endptr = (char *) page + BLCKSZ; + int my_bytes_written = *already_written; + + /* The undo record must contain a valid information. */ + Assert(uur->uur_info != 0); + + /* + * If this is the first call, copy the UnpackedUndoRecord into the + * temporary variables of the types that will actually be stored in the + * undo pages. We just initialize everything here, on the assumption that + * it's not worth adding branches to save a handful of assignments. + */ + if (*already_written == 0) + { + work_hdr.urec_type = uur->uur_type; + work_hdr.urec_info = uur->uur_info; + work_hdr.urec_prevlen = uur->uur_prevlen; + work_hdr.urec_reloid = uur->uur_reloid; + work_hdr.urec_prevxid = uur->uur_prevxid; + work_hdr.urec_xid = uur->uur_xid; + work_hdr.urec_cid = uur->uur_cid; + work_rd.urec_fork = uur->uur_fork; + work_blk.urec_blkprev = uur->uur_blkprev; + work_blk.urec_block = uur->uur_block; + work_blk.urec_offset = uur->uur_offset; + work_txn.urec_next = uur->uur_next; + work_txn.urec_xidepoch = uur->uur_xidepoch; + work_txn.urec_progress = uur->uur_progress; + work_txn.urec_dbid = uur->uur_dbid; + work_payload.urec_payload_len = uur->uur_payload.len; + work_payload.urec_tuple_len = uur->uur_tuple.len; + } + else + { + /* + * We should have been passed the same record descriptor as before, or + * caller has messed up. 
+ */ + Assert(work_hdr.urec_type == uur->uur_type); + Assert(work_hdr.urec_info == uur->uur_info); + Assert(work_hdr.urec_prevlen == uur->uur_prevlen); + Assert(work_hdr.urec_reloid == uur->uur_reloid); + Assert(work_hdr.urec_prevxid == uur->uur_prevxid); + Assert(work_hdr.urec_xid == uur->uur_xid); + Assert(work_hdr.urec_cid == uur->uur_cid); + Assert(work_rd.urec_fork == uur->uur_fork); + Assert(work_blk.urec_blkprev == uur->uur_blkprev); + Assert(work_blk.urec_block == uur->uur_block); + Assert(work_blk.urec_offset == uur->uur_offset); + Assert(work_txn.urec_next == uur->uur_next); + Assert(work_txn.urec_progress == uur->uur_progress); + Assert(work_txn.urec_dbid == uur->uur_dbid); + Assert(work_payload.urec_payload_len == uur->uur_payload.len); + Assert(work_payload.urec_tuple_len == uur->uur_tuple.len); + } + + /* Write header (if not already done). */ + if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + /* Write relation details (if needed and not already done). */ + if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 && + !InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + /* Write block information (if needed and not already done). */ + if ((uur->uur_info & UREC_INFO_BLOCK) != 0 && + !InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + /* Write transaction information (if needed and not already done). */ + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 && + !InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + if (header_only) + return true; + + /* Write payload information (if needed and not already done). */ + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) + { + /* Payload header. */ + if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + /* Payload bytes. */ + if (uur->uur_payload.len > 0 && + !InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + + /* Tuple bytes. */ + if (uur->uur_tuple.len > 0 && + !InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len, + &writeptr, endptr, + &my_bytes_written, already_written)) + return false; + } + + /* Hooray! */ + return true; +} + +/* + * Write undo bytes from a particular source, but only to the extent that + * they weren't written previously and will fit. + * + * 'sourceptr' points to the source data, and 'sourcelen' is the length of + * that data in bytes. + * + * 'writeptr' points to the insertion point for these bytes, and is updated + * for whatever we write. The insertion point must not pass 'endptr', which + * represents the end of the buffer into which we are writing. + * + * 'my_bytes_written' is a pointer to the count of previous-written bytes + * from this and following structures in this undo record; that is, any + * bytes that are part of previous structures in the record have already + * been subtracted out. We must update it for the bytes we write. + * + * 'total_bytes_written' points to the count of all previously-written bytes, + * and must likewise be updated for the bytes we write. 
+ * + * The return value is false if we ran out of space before writing all + * the bytes, and otherwise true. + */ +static bool +InsertUndoBytes(char *sourceptr, int sourcelen, + char **writeptr, char *endptr, + int *my_bytes_written, int *total_bytes_written) +{ + int can_write; + int remaining; + + /* + * If we've previously written all of these bytes, there's nothing to do + * except update *my_bytes_written, which we must do to ensure that the + * next call to this function gets the right starting value. + */ + if (*my_bytes_written >= sourcelen) + { + *my_bytes_written -= sourcelen; + return true; + } + + /* Compute number of bytes we can write. */ + remaining = sourcelen - *my_bytes_written; + can_write = Min(remaining, endptr - *writeptr); + + /* Bail out if no bytes can be written. */ + if (can_write == 0) + return false; + + /* Copy the bytes we can write. */ + memcpy(*writeptr, sourceptr + *my_bytes_written, can_write); + + /* Update bookkeeeping infrormation. */ + *writeptr += can_write; + *total_bytes_written += can_write; + *my_bytes_written = 0; + + /* Return true only if we wrote the whole thing. */ + return (can_write == remaining); +} + +/* + * Call UnpackUndoRecord() one or more times to unpack an undo record. For + * the first call, starting_byte should be set to the beginning of the undo + * record within the specified page, and *already_decoded should be set to 0; + * the function will update it based on the number of bytes decoded. The + * return value is true if the entire record was unpacked and false if the + * record continues on the next page. In the latter case, the function + * should be called again with the next page, passing starting_byte as the + * sizeof(PageHeaderData). + */ +bool +UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte, + int *already_decoded, bool header_only) +{ + char *readptr = (char *) page + starting_byte; + char *endptr = (char *) page + BLCKSZ; + int my_bytes_decoded = *already_decoded; + bool is_undo_splited = (my_bytes_decoded > 0) ? true : false; + + /* Decode header (if not already done). */ + if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader, + &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + uur->uur_type = work_hdr.urec_type; + uur->uur_info = work_hdr.urec_info; + uur->uur_prevlen = work_hdr.urec_prevlen; + uur->uur_reloid = work_hdr.urec_reloid; + uur->uur_prevxid = work_hdr.urec_prevxid; + uur->uur_xid = work_hdr.urec_xid; + uur->uur_cid = work_hdr.urec_cid; + + if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0) + { + /* Decode header (if not already done). 
*/ + if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails, + &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + uur->uur_fork = work_rd.urec_fork; + } + + if ((uur->uur_info & UREC_INFO_BLOCK) != 0) + { + if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock, + &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + uur->uur_blkprev = work_blk.urec_blkprev; + uur->uur_block = work_blk.urec_block; + uur->uur_offset = work_blk.urec_offset; + } + + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) + { + if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction, + &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + uur->uur_next = work_txn.urec_next; + uur->uur_xidepoch = work_txn.urec_xidepoch; + uur->uur_progress = work_txn.urec_progress; + uur->uur_dbid = work_txn.urec_dbid; + } + + if (header_only) + return true; + + /* Read payload information (if needed and not already done). */ + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) + { + if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload, + &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + uur->uur_payload.len = work_payload.urec_payload_len; + uur->uur_tuple.len = work_payload.urec_tuple_len; + + /* + * If we can read the complete record from a single page then just + * point payload data and tuple data into the page otherwise allocate + * the memory. + * + * XXX There is possibility of optimization that instead of always + * allocating the memory whenever tuple is split we can check if any + * of the payload or tuple data falling into the same page then don't + * allocate the memory for that. + */ + if (!is_undo_splited && + uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr)) + { + uur->uur_payload.data = readptr; + readptr += uur->uur_payload.len; + + uur->uur_tuple.data = readptr; + } + else + { + if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL) + uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len); + + if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL) + uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len); + + if (!ReadUndoBytes((char *) uur->uur_payload.data, + uur->uur_payload.len, &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + + if (!ReadUndoBytes((char *) uur->uur_tuple.data, + uur->uur_tuple.len, &readptr, endptr, + &my_bytes_decoded, already_decoded, false)) + return false; + } + } + + return true; +} + +/* + * Read undo bytes into a particular destination, + * + * 'destptr' points to the source data, and 'readlen' is the length of + * that data to be read in bytes. + * + * 'readptr' points to the read point for these bytes, and is updated + * for how much we read. The read point must not pass 'endptr', which + * represents the end of the buffer from which we are reading. + * + * 'my_bytes_read' is a pointer to the count of previous-read bytes + * from this and following structures in this undo record; that is, any + * bytes that are part of previous structures in the record have already + * been subtracted out. We must update it for the bytes we read. + * + * 'total_bytes_read' points to the count of all previously-read bytes, + * and must likewise be updated for the bytes we read. + * + * nocopy if this flag is set true then it will just skip the readlen + * size in undo but it will not copy into the buffer. 
+ * + * The return value is false if we ran out of space before read all + * the bytes, and otherwise true. + */ +static bool +ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr, + int *my_bytes_read, int *total_bytes_read, bool nocopy) +{ + int can_read; + int remaining; + + if (*my_bytes_read >= readlen) + { + *my_bytes_read -= readlen; + return true; + } + + /* Compute number of bytes we can read. */ + remaining = readlen - *my_bytes_read; + can_read = Min(remaining, endptr - *readptr); + + /* Bail out if no bytes can be read. */ + if (can_read == 0) + return false; + + /* Copy the bytes we can read. */ + if (!nocopy) + memcpy(destptr + *my_bytes_read, *readptr, can_read); + + /* Update bookkeeping information. */ + *readptr += can_read; + *total_bytes_read += can_read; + *my_bytes_read = 0; + + /* Return true only if we wrote the whole thing. */ + return (can_read == remaining); +} + +/* + * Set uur_info for an UnpackedUndoRecord appropriately based on which + * other fields are set. + */ +void +UndoRecordSetInfo(UnpackedUndoRecord *uur) +{ + if (uur->uur_fork != MAIN_FORKNUM) + uur->uur_info |= UREC_INFO_RELATION_DETAILS; + if (uur->uur_block != InvalidBlockNumber) + uur->uur_info |= UREC_INFO_BLOCK; + if (uur->uur_next != InvalidUndoRecPtr) + uur->uur_info |= UREC_INFO_TRANSACTION; + if (uur->uur_payload.len || uur->uur_tuple.len) + uur->uur_info |= UREC_INFO_PAYLOAD; +} diff --git a/src/backend/access/zheap/Makefile b/src/backend/access/zheap/Makefile new file mode 100644 index 0000000000..c997807d74 --- /dev/null +++ b/src/backend/access/zheap/Makefile @@ -0,0 +1,19 @@ +#------------------------------------------------------------------------- +# +# Makefile-- +# Makefile for access/zheap +# +# IDENTIFICATION +# src/backend/access/zheap/Makefile +# +#------------------------------------------------------------------------- + +subdir = src/backend/access/zheap +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global + +OBJS = prunetpd.o prunezheap.o rewritezheap.o tpd.o tpdxlog.o zheapam.o \ + zheapam_handler.o zheapamutils.o zheapamxlog.o zhio.o zmultilocker.o \ + ztqual.o zvacuumlazy.o ztuptoaster.o + +include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/zheap/README b/src/backend/access/zheap/README new file mode 100644 index 0000000000..0dce27092a --- /dev/null +++ b/src/backend/access/zheap/README @@ -0,0 +1,602 @@ +src/backend/access/zheap/README + +Zheap +===== + +The main purpose of this README is to provide an overview of the current +design of zheap, a new storage format for PostgreSQL. This project has three +major objectives: + +1. Provide better control over bloat. In the existing heap, we always create +a new version of tuple when it is updated. These new versions are later +removed by vacuum or hot-pruning, but this only frees up space for reuse by +future inserts or updates; nothing is returned to the operating system. A +similar problem occurs for tuples that are deleted. zheap will prevent bloat +(a) by allowing in-place updates in common cases and (b) by reusing space as +soon as a transaction that has performed a delete or non-in-place-update has +committed. In short, with this new storage, whenever possible, we’ll avoid +creating bloat in the first place. + +2. Reduce write amplification both by avoiding rewrites of heap pages and by +making it possible to do an update that touches indexed columns without +updating every index. + +3. 
Reduce the tuple size by (a) shrinking the tuple header and
+(b) eliminating most alignment padding.
+
+In-place updates will be supported except when (a) the new tuple is larger
+than the old tuple and the increase in size makes it impossible to fit the
+larger tuple onto the same page or (b) some column is modified which is
+covered by an index that has not been modified to support “delete-marking”.
+We have not begun work on delete-marking support for indexes yet, but intend
+to support it at least for btree indexes.
+
+General idea of zheap with undo
+--------------------------------
+Each backend is attached to a separate undo log to which it writes undo
+records. Each undo record is identified by a 64-bit undo record pointer of
+which the first 24 bits are used for the log number and the remaining 40 bits
+are used for an offset within that undo log. Only one transaction at a time
+can write to any given undo log, so the undo records for any given
+transaction are always consecutive.
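
The 24/40-bit split just described can be made concrete with a pair of helper
definitions. This is only an illustrative reconstruction based on the
preceding paragraph and on the MakeUndoRecPtr() calls visible in undolog.c
earlier in this patch; the authoritative definitions live in the undo log
headers, and every name here other than MakeUndoRecPtr() is an assumption.

    /* Illustrative only -- not the patch's actual definitions. */
    typedef uint64 UndoRecPtr;      /* 24-bit log number + 40-bit offset */
    typedef int    UndoLogNumber;
    typedef uint64 UndoLogOffset;

    #define UndoLogOffsetBits 40

    /* Pack a log number and an offset into one 64-bit pointer. */
    #define MakeUndoRecPtr(logno, offset) \
        (((UndoRecPtr) (logno) << UndoLogOffsetBits) | (offset))

    /* Extract the two components again. */
    #define UndoRecPtrGetLogNo(urp) \
        ((UndoLogNumber) ((urp) >> UndoLogOffsetBits))
    #define UndoRecPtrGetOffset(urp) \
        ((UndoLogOffset) ((urp) & ((UINT64CONST(1) << UndoLogOffsetBits) - 1)))
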
+
+Each zheap page has a fixed set of transaction slots, each of which contains
+the transaction information (transaction id and epoch) and the latest undo
+record pointer for that transaction. As of now, we have four transaction
+slots per page, but this can be changed. Currently, this is a compile-time
+option; we can decide later whether such an option is desirable in general
+for users. Each transaction slot occupies 16 bytes. We allow the transaction
+slots to be reused after the transaction is committed, which allows us to
+operate without needing too many slots. We can allow slots to be reused after
+a transaction abort as well, once undo actions are complete. We have observed
+that smaller tables, say those having very few pages, typically need more
+slots; for larger tables, four slots are enough. In our internal testing, we
+have found that 16 slots give very good performance, but more tests are
+needed to identify the right number of slots. The one known problem with the
+fixed number of slots is that it can lead to deadlock, so we are planning to
+add a mechanism to allow the array of transaction slots to be continued on a
+separate overflow page. We also need such a mechanism to support cases where
+a large number of transactions acquire SHARE or KEY SHARE locks on a single
+page. The overflow pages will be stored in the zheap itself, interleaved with
+regular pages. These overflow pages will be marked in such a way that
+sequential scans will ignore them. We will have a meta page in zheap from
+which all overflow pages will be tracked.
+
+Typically, each zheap operation that modifies a page needs to first allocate
+a transaction slot on that page and then prepare an undo record for the
+operation. Then, in a critical section, it must write the undo record,
+perform the operation on the heap page, update the transaction slot in the
+page, and finally write a WAL record for the operation. What we write as part
+of the undo record and WAL depends on the operation.
+
+Insert: Apart from the generic info, we write the TID (block number and
+offset number) of the tuple in the undo record to identify the record during
+undo replay. In WAL, we write the offset number and the tuple, plus some
+minimal information which will be needed to regenerate the undo record during
+replay.
+
+Delete: We write the complete tuple in the undo record even though we could
+get away with just writing the TID as we do for an insert operation. This
+allows us to reuse the space occupied by the deleted record as soon as the
+transaction that has performed the operation commits. In WAL, we need to
+write the tuple only if full page writes are not enabled. If full page writes
+are enabled, we can rely on the page state to be the same during recovery as
+it is during the actual operation, so we can retrieve the tuple from the page
+to copy it into the undo record.
+
+Update: For in-place updates, we have to write the old tuple in the undo log
+and the new tuple in the zheap. We could optimize and write the diff tuple
+instead of the complete tuple in undo, but as of now, we are writing the
+complete tuple. For non-in-place updates, we write the old tuple and the new
+TID in undo; essentially this is equivalent to DELETE+INSERT. As for DELETE,
+this allows space to be recycled as soon as the updating transaction commits.
+In the WAL, we write a copy of the old tuple only if full page writes are
+off, and we write a diff tuple for the new tuple (irrespective of the value
+of full-page writes) as we do in the current heap. In the case where a
+non-in-place update happens to insert the new tuple on a separate page, we
+write two undo records, one for the old page and another for the new page.
+One can imagine that writing one undo record would be sufficient as we can
+generally reach the new tuple from the old tuple if required, but we want to
+maintain a separate undo chain for each page.
+
+Select .. For [Key] Share/Update
+Tuple locking will work much like a DML operation: reserve a transaction
+slot, update the tuple header with the lock information, write UNDO and WAL
+for the operation. To detect conflicts, we sometimes need to traverse the
+undo chains of all the active transactions on a page. We will always mark the
+tuple with the strongest lock mode that might be present, just as is done in
+the current heap, so that we can cheaply detect whether there is a potential
+conflict. If there is, we must get information about all the locks from undo
+in order to decide whether there is an actual conflict. The tuple will always
+contain either the strongest locker information or, if all the lockers are of
+the same strength, the latest locker information. Whenever there is more than
+one locker operating on a tuple, we set the multi-locker bit on the tuple to
+indicate that the tuple has multiple lockers. Note that we clear the
+multi-locker bit lazily, which means we do so only when we decide to wait for
+all the lockers to go away and no locker remains alive on the tuple. During a
+Rollback operation, we retain the strongest locker information on the tuple
+if there are multiple lockers on a tuple. This is because the conflict
+detection mechanism works based on the strongest locker. Now, even if we
+wanted to remove the strongest locker information, we don't have the
+second-strongest locker information handy.
+
+Copy: Similar to insert, we need to store the corresponding TID (block
+number, offset number) for a tuple in undo to identify it during undo replay.
+But we can minimize the number of undo records written for a page. First, we
+identify the unused offset ranges for a page, then insert one undo record for
+each offset range. For example, if we’re about to insert in offsets
+(2,3,5,9,10,11), we insert three undo records covering offset ranges (2,3),
+(5,5), and (9,11), respectively. For recovery, we insert a single WAL record
+containing the above-mentioned offset ranges along with some minimal
+information to regenerate the undo records and tuples.
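
To make the range batching above concrete, here is a hedged sketch of how a
sorted array of target offsets could be folded into contiguous ranges, one
undo record per range. Only the grouping logic is the point; the function
name is made up, and where the real code would prepare an undo record this
sketch merely logs the range.

    #include "postgres.h"
    #include "storage/off.h"

    /*
     * Group sorted offsets into contiguous ranges, e.g.
     * (2,3,5,9,10,11) -> (2,3), (5,5), (9,11).  Illustrative only.
     */
    static void
    report_copy_undo_ranges(OffsetNumber *offsets, int noffsets)
    {
        int     i = 0;

        while (i < noffsets)
        {
            OffsetNumber start = offsets[i];
            OffsetNumber end = start;

            /* Extend the range while the next offset is consecutive. */
            while (i + 1 < noffsets && offsets[i + 1] == end + 1)
                end = offsets[++i];

            /* The real code would prepare one undo record for [start, end]. */
            elog(LOG, "undo record covers offsets (%u,%u)",
                 (unsigned) start, (unsigned) end);

            i++;
        }
    }
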
+
+Scans: During scans, we need to make a copy of the tuple instead of just
+holding the pin on a page. In the current heap, holding a pin on the buffer
+containing the tuple is sufficient because operations like vacuum which can
+rearrange the page always take a cleanup lock on a buffer. In zheap, however,
+in-place updates work with just an exclusive lock on a buffer, so a tuple to
+which we hold a pointer might be updated under us.
+
+Insert .. On Conflict: The design is similar to the current heap in that we
+use the speculative token to detect conflicts. We store the speculative token
+in undo instead of in the tuple header (CTID) simply because zheap’s tuple
+header doesn’t have CTID. Additionally, we set a bit in the tuple header to
+indicate speculative insertion. The ZheapTupleSatisfiesDirty routine checks
+this bit and fetches the speculative token from undo.
+
+Toast Tables: Toast tables can use zheap, too. Since zheap uses shorter tuple
+headers, this saves space. In the future, someone might want to support
+in-place updates for toast table data instead of doing delete+insert as we do
+today.
+
+SQL Operations: All SQL operations that either need to interact with a heap
+(scans, ALTER TABLE, etc.) or require a HeapTuple (like joins, ORDER BY,
+ANALYZE, COPY, etc.) need to be changed to interact with zheap pages or zheap
+tuples. For now, we have taken the approach of writing converter functions
+for tuples (i.e. zheap_to_heap, heap_to_zheap) to avoid changing the whole
+backend to accept zheap pages and tuples. Operations which need to access
+pages still need to be modified. For all performance-critical operations, we
+operate directly on zheap pages and zheap tuples to avoid the cost of
+conversion. We think that some of this needs to be changed in response to
+whatever conclusions are reached regarding the proposed storage API.
+
+Transaction slot reuse
+-----------------------
+Transaction slots can be freely reused if the transaction is committed and
+all-visible, or if the transaction is aborted and undo actions for that
+transaction, at least relating to that page, have been performed. If the
+transaction is committed but not yet all-visible, we can reuse the slot after
+writing an additional, special undo record that lets us make subsequent tuple
+visibility decisions correctly.
+
+For committed transactions, there are two possibilities. If the transaction
+slot is not referenced by any tuple in the page, we simply clear the xid from
+the transaction slot. The undo record pointer is kept as it is to ensure that
+we don't break the undo chain for that slot. Otherwise, we write an undo
+record for each tuple that points to one of the committed transactions. We
+also mark the tuple indicating that the associated slot has been reused. In
+such a case, it is quite possible that the tuple has not been modified, but
+it is still pointing to a transaction slot that has been reused for a new
+transaction which is not yet all-visible. During the visibility check for
+such a tuple, it might appear that the tuple was modified by a current
+transaction, which is clearly wrong and can lead to wrong results.
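
For reference, the per-page transaction slot being recycled here can be
pictured as roughly the following 16-byte structure, holding the epoch, the
XID, and the latest undo record pointer described earlier. The struct and
field names are an illustrative reconstruction, not the patch's actual
definition.

    /* Illustrative layout of one zheap transaction slot (16 bytes). */
    typedef struct ZHeapPageTransSlot
    {
        uint32          xid_epoch;  /* epoch of the slot's transaction */
        TransactionId   xid;        /* transaction id (4 bytes) */
        UndoRecPtr      urec_ptr;   /* latest undo record pointer (8 bytes) */
    } ZHeapPageTransSlot;
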
+
+Subtransactions
+----------------
+zheap only uses the toplevel transaction ID; subtransactions that modify a
+zheap do not need separate transaction IDs. In the regular heap, when
+subtransactions are present, the subtransaction’s XID is used to make tuple
+visibility decisions correctly. In a zheap, subtransaction abort is instead
+handled by using undo to reverse changes to the zheap pages. This design
+minimizes consumption of transaction slots and pg_xact space, and ensures
+that all undo records for a toplevel transaction remain consecutive in the
+undo log.
+
+Reclaiming space within a page
+-------------------------------
+Space can be reclaimed within a page after (a) a delete, (b) a non-in-place
+update, or (c) an in-place update that reduces the width of the tuple. We can
+reuse the space as soon as the transaction that has performed the operation
+has committed. We can also reclaim space after inserts or non-in-place
+updates have been undone. There is some difference between the way space is
+reclaimed for transactions that are committed and all-visible vs. the
+transactions that are committed but still not all-visible. In the former
+case, we can just indicate in the line pointer that the corresponding item is
+dead, whereas for the latter we need the capability to fetch the prior
+version of a tuple for transactions to which the delete is not visible. To
+allow that, we copy the transaction slot information into the line pointer so
+that we can easily reach the prior version of the tuple. As a net result, the
+space for a deleted tuple can be reclaimed immediately after the delete
+commits, but the space consumed by the line pointer can only be freed once we
+delete the corresponding index tuples. For an aborted transaction, space can
+be reclaimed once undo is complete. We set the prune xid in the page header
+during delete or update operations and during rollback of inserts to permit
+pruning to happen only when there is a possible benefit. When we try to
+prune, we first check if the prune xid is in progress; only if not will we
+attempt to prune the page.
+
+Pruning will be attempted when an update operation lands on a page where
+there is not enough space to accommodate a new tuple. We can also allow
+pruning to occur when we evict the page from shared buffers or read the page
+from disk, as those are I/O intensive operations, so doing some CPU intensive
+work there doesn't cost much.
+
+With the above idea, it is quite possible that sometimes we try to prune the
+page when there is no immediate benefit of doing so. For example, even after
+pruning, the page might still not have enough space to accommodate the new
+tuple. One idea is to track the space at the transaction slot level, so that
+we can know exactly how much space can be freed in the page after pruning,
+but that will lead to an increase in the space used by each transaction slot.
+
+We can also reuse space if a transaction frees up space on the page (e.g. by
+delete) and then tries to use additional space (e.g. by a subsequent insert).
+We can’t in general reuse space freed up by a transaction until it commits,
+because if it aborts we’ll need that space during undo; but an insert or
+update could reuse space freed up by earlier operations in the same
+transaction, since all or none of them will roll back. This is a good
+optimization, but this needs some more thought.
+
+Free Space Map
+---------------
+We can optimistically update the freespace map when we remove the tuples from
+a page in the hope that eventually most of the transactions will commit and
+space will be available. Additionally, we might want to update the FSM during
+aborts when space-consuming actions like inserts are rolled back. When
+requesting free space, we would need to adjust things so that we continue the
+search from the previous block instead of repeatedly returning the same
+block.
+
+I think updating it on every such operation can be costly, so we can perform
+it only after some threshold number of operations; later we might want to add
+a facility to track potentially available free space and merge it into the
+main data structure. We also want to make the FSM crash-safe, since we can’t
+count on VACUUM to recover free space that we neglect to record.
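
The existing free space map API already covers both halves of what this
section asks for; the sketch below only shows where such calls could sit, not
how zheap actually wires them up (the function name and the call site are
assumptions).

    #include "postgres.h"
    #include "storage/freespace.h"

    /*
     * Illustrative only: optimistically advertise space we expect to become
     * free on 'blkno' and pick the next target block.
     */
    static BlockNumber
    zheap_advertise_and_pick_block(Relation rel, BlockNumber blkno,
                                   Size freed, Size needed)
    {
        /*
         * RecordAndGetPageWithFreeSpace() records the (optimistic) free space
         * on 'blkno' and continues the FSM search from that block, which is
         * the "don't keep returning the same block" behaviour asked for
         * above.  When only the optimistic update is wanted (e.g. at abort),
         * RecordPageWithFreeSpace(rel, blkno, freed) would be used instead.
         */
        return RecordAndGetPageWithFreeSpace(rel, blkno, freed, needed);
    }
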
+
+Page format
+------------
+zheap uses a standard page header and stores transaction slots in the special
+space.
+
+Tuple format
+-------------
+The tuple header is reduced from 24 bytes to 5 bytes (8 bytes with
+alignment): 2 bytes each for infomask and infomask2, and one byte for t_hoff.
+I think we might be able to squeeze some space from t_infomask, but for now,
+I have kept it as two bytes. All transactional information is stored in undo,
+so fields that store such information are not needed here.
+
+The idea is that we occupy somewhat more space at the page level, but save
+much more at the tuple level, so we come out ahead overall.
+
+Alignment padding
+------------------
+We omit all alignment padding for pass-by-value types. Even in the current
+heap, we never point directly to such values, so the alignment padding
+doesn’t help much; it lets us fetch the value using a single instruction, but
+that is all. Pass-by-reference types will work as they do in the heap. Many
+pass-by-reference data types will be varlena data types (typlen = -1) with
+short varlena headers, so no alignment padding will be introduced in that
+case anyway, but if we have varlenas with 4-byte headers or if we have
+fixed-length pass-by-reference types (e.g. interval, box) then we'll still
+end up with padding. We can't directly access unaligned values; instead, we
+need to use memcpy. We believe that the space savings will more than pay for
+the additional CPU costs.
+
+We don’t need alignment padding between the tuple header and the tuple data,
+as we always make a copy of the tuple to support in-place updates. Likewise,
+we ideally don't need any alignment padding between tuples. However, there
+are places in the zheap code where we access the tuple header directly from
+the page (e.g. zheap_delete, zheap_update, etc.), for which we need them to
+be aligned at a two-byte boundary.
+
+Undo chain
+-----------
+Each undo record header contains the previous undo record pointer of the
+transaction that is performing the operation. For example, if transaction T1
+has updated the tuple two times, the undo record for the last update will
+have a link to the undo record of the previous update. Thus, the undo records
+for a particular page in a particular transaction form a single, linked
+chain.
+
+Snapshots and visibility
+-------------------------
+Given a TID and a snapshot, there are three possibilities: (a) the tuple
+currently stored at the given TID might be visible; (b) some tuple previously
+stored at the given TID and subsequently written to the undo log might be
+visible; or (c) there might be nothing visible at all. To check the
+visibility of a tuple, we fetch the transaction slot number stored in the
+tuple header, and then get the transaction id and undo record pointer from
+the transaction slot. Next, we check the current tuple’s visibility based on
+the transaction id fetched from the transaction slot and the last operation
+performed on the tuple. For example, if the last operation on the tuple is a
+delete and the xid is visible to our snapshot, then we return NULL,
+indicating no visible tuple. But if the xid that last operated on the tuple
+is not visible to the snapshot, then we use the undo record pointer to fetch
+the prior tuple from undo and similarly check its visibility. The only
+difference in checking the visibility of the undo tuple is that the xid that
+previously operated on the undo tuple is present in the undo record, so we
+can use that instead of relying on the transaction slot. If the tuple from
+undo is also not visible, then we fetch the prior tuple from the undo chain.
+We need to traverse undo chains until we find a visible tuple or reach the
+initially inserted tuple; if that is also not visible, we can return NULL.
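
The traversal just described boils down to the following loop. This is
deliberately simplified pseudocode in C form: the ZHeapTuple type is assumed,
the check_version() and fetch_prior_version_from_undo() helpers are
hypothetical names, and the real code (ztqual.c in this patch) additionally
has to cope with buffer locks, reused slots, and discarded undo.

    /* Hypothetical outcome of checking one tuple version against a snapshot. */
    typedef enum
    {
        ZVERSION_VISIBLE,       /* this version is visible; return it */
        ZVERSION_NONE,          /* e.g. a delete that the snapshot can see */
        ZVERSION_CHECK_PRIOR    /* modifying xid not visible; look in undo */
    } ZVersionResult;

    static ZHeapTuple
    zheap_fetch_visible_version(ZHeapTuple tuple, Snapshot snapshot,
                                TransactionId xid, UndoRecPtr urec_ptr)
    {
        for (;;)
        {
            switch (check_version(tuple, xid, snapshot))    /* hypothetical */
            {
                case ZVERSION_VISIBLE:
                    return tuple;
                case ZVERSION_NONE:
                    return NULL;
                case ZVERSION_CHECK_PRIOR:
                    break;
            }

            /* Chain exhausted: even the initial insert is not visible. */
            if (urec_ptr == InvalidUndoRecPtr)
                return NULL;

            /*
             * The undo record carries the xid (and cid) that created the
             * prior version, so from here on we no longer depend on the
             * page's transaction slot.
             */
            tuple = fetch_prior_version_from_undo(urec_ptr, &xid, &urec_ptr);
        }
    }
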
+
+During visibility checking of a tuple in a zheap page or an undo chain, if we
+find that the tuple’s transaction slot has been reused, we retrieve the
+transaction information (the xid and cid that have modified the tuple) of
+that tuple from undo.
+
+EvalPlanQual mechanism
+-----------------------
+This works in basically the same way as for the existing heap. The only
+special consideration is that the updated tuple could have the same TID as
+the original one if it was updated in place, so we might want to optimize
+such that we need not release the buffer lock and refetch the tuple.
+However, at this stage, we are not sure if there is any big advantage in such
+an optimization.
+
+64-bit transaction ids
+-----------------------
+Transaction slots in zheap pages store both the epoch and the XID; this
+eliminates the confusion between a use of a given XID in the current epoch
+and a use in some previous epoch, which means that we never need to freeze
+tuples. The difference between the oldest running XID and the newest XID is
+still limited to 2 billion because of the way that snapshots work. Moreover,
+the oldest XID that still has undo must have an XID age less than 2 billion:
+among other problems, this is currently the limit for how long commit status
+data can be retained, and it would be bad if we had undo data but didn’t know
+whether or not to apply the undo actions. Currently, this limitation is
+enforced by piggybacking on the existing wraparound machinery.
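
The "epoch plus XID" comparison is essentially a single 64-bit value, which
is also what the MakeEpochXid() call in prunetpd.c later in this patch
suggests. The definition below is only a sketch of that idea, not the patch's
actual implementation.

    /*
     * Illustrative: combine a 32-bit epoch and a 32-bit TransactionId into a
     * single monotonically increasing 64-bit value, so XIDs from different
     * epochs compare correctly and tuples never need freezing.
     */
    static inline uint64
    make_epoch_xid(uint32 epoch, TransactionId xid)
    {
        return ((uint64) epoch << 32) | (uint64) xid;
    }
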
+
+Indexing
+---------
+Current index AMs are not prepared to cope with multiple tuples at the same
+TID with different values stored in the index column. We plan to introduce
+special index AM support for in-place updates; when an index lacks such
+support, any modification to the value stored in a column covered by that
+index will prevent the use of in-place update. Additionally, indexes lacking
+such support will still require routine vacuuming, which we believe can be
+avoided when such support is present.
+
+The basic idea is that we need to delete-mark index entries when they might
+no longer be valid, either because of a delete or because of an update
+affecting the indexed column. An in-place update that does not modify the
+indexed column need not delete-mark the corresponding index entries. Note
+that an entry which is delete-marked might still be valid for some snapshots;
+once no relevant snapshots remain, we can remove the entry. In some cases, we
+may remove a delete-mark from an entry rather than removing the entry, either
+because the transaction which applied the delete-mark has rolled back, or
+because the indexed column was changed from value A to value B and then
+eventually back to value A.
+It is very desirable for performance reasons to be able to distinguish from
+the index page whether or not the corresponding heap tuple is definitely
+all-visible, but the delete-marking approach is not quite sufficient for this
+purpose unless recently-inserted tuples are also delete-marked -- and that is
+undesirable, since the delete-markings would have to be cleared after the
+inserting transaction committed, which might end up dirtying many or all
+index pages. An alternative approach is to write undo for index insertions;
+then, the undo pointers in the index page tell us whether any index entries
+on that page may be recently-inserted, and the presence or absence of a
+delete-mark tells us whether any index entries on that page may no longer be
+valid. We intend to adopt this approach; it should allow index-only scans in
+most cases without the need for a separately-maintained visibility map.
+
+With this approach, an in-place update touches each index whose indexed
+columns are modified twice -- once to delete-mark the old entry (or entries)
+and once to insert the new entries. In some use cases, this will compare
+favorably with the existing approach, which touches every index exactly once.
+Specifically, it should reduce write amplification and index bloat when only
+one or a few indexed columns are updated at a time.
+
+Indexes that don't have delete-marking
+---------------------------------------
+Although indexes which lack delete-marking support still require vacuum, we
+can use undo to reduce the current three-pass approach to just two passes,
+avoiding the final heap scan. When a row is deleted, vacuum will directly
+mark the line pointer as unused, writing an undo record as it does, and then
+mark the corresponding index entries as dead. If vacuum fails midway through,
+the undo can ensure that changes to the heap page are rolled back. If the
+vacuum goes on to commit, we don't need to revisit the heap page after index
+cleanup.
+
+We must be careful about TID reuse: we will only allow a TID to be reused
+when the transaction that has marked it as unused has committed. At that
+point, we can be assured that all the index entries corresponding to dead
+tuples will be marked as dead.
+
+Undo actions
+-------------
+We need to apply undo actions during explicit ROLLBACK or ROLLBACK TO
+SAVEPOINT operations and when an error causes a transaction or subtransaction
+abort. These actions reverse whatever work was done when the operation was
+performed; for example, if an update aborts, we must restore the old version
+of the tuple. During an explicit ROLLBACK or ROLLBACK TO SAVEPOINT, the
+transaction is in a good state and we have relevant locks on objects, so
+applying undo actions is straightforward, but the same is not true in error
+paths. In the case of a subtransaction abort, undo actions are performed
+after rolling back the subtransaction; the parent transaction is still good.
+In the case of a top-level abort, we begin an entirely new transaction to
+perform the undo actions. If this new transaction aborts, it can be retried
+later. For short transactions (say, one which generates only a few kB of undo
+data), it is okay to apply the actions in the foreground, but for longer
+transactions it is advisable to delegate the work to an undo worker running
+in the background. The user is provided with a knob to control this behavior.
+
+Just like the DML operations to which they correspond, undo actions require
+us to write WAL.
+Otherwise, we would be unable to recover after a crash, and standby servers
+would not be properly updated.
+
+Applying undo actions
+----------------------
+In many cases, the same page will be modified multiple times by the same
+transaction. We can save locking and reduce WAL generation by collecting all
+of the undo records for a given page and then applying them all at once.
+However, it’s difficult to collect all of the records that might apply to a
+page from an arbitrarily large undo log in an efficient manner; in
+particular, we want to avoid rereading the same undo pages multiple times.
+Currently, we collect all consecutive records which apply to the same page
+and then apply them in one shot. This will cover the cases where most of the
+changes to heap pages are performed together. This algorithm could be
+improved. For example, we could do something like this (a sketch of the
+per-block application step follows the list):
+
+1. Read the last 32MB of undo for the transaction being undone (or all of the
+undo for the transaction, if there is less than 32MB).
+2. For each block that is touched by at least one record in the 32MB chunk,
+consolidate all records from this chunk that apply to that block.
+3. Sort the blocks by buffertag and apply the changes in ascending
+block-number order within each relation. Do this even for incomplete chains,
+so nothing is saved for later.
+4. Go to step 1.
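
The per-block application step, whether reached through the current
"consecutive records" strategy or the batching proposed above, can be
sketched roughly as follows. The undo_apply_one_record() and
log_zheap_undo_apply() helpers are hypothetical names, LSN stamping of the
page is elided, and the real code must also verify the page's transaction
slot before applying anything.

    #include "postgres.h"
    #include "access/undorecord.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Apply a batch of undo records that all touch the same block.  Sketch. */
    static void
    zheap_undo_one_block(Relation rel, BlockNumber blkno,
                         UnpackedUndoRecord **records, int nrecords)
    {
        Buffer      buf = ReadBuffer(rel, blkno);
        int         i;

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

        START_CRIT_SECTION();

        /* Undo newest-to-oldest so the page steps back through its history. */
        for (i = 0; i < nrecords; i++)
            undo_apply_one_record(buf, records[i]);         /* hypothetical */

        MarkBufferDirty(buf);

        /* One WAL record for the whole batch, mirroring the DO path. */
        if (RelationNeedsWAL(rel))
            log_zheap_undo_apply(buf, records, nrecords);   /* hypothetical */

        END_CRIT_SECTION();

        UnlockReleaseBuffer(buf);
    }
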
+
+After applying undo actions for a page, we clear the transaction slot on the
+page if the oldest undo record we applied is the oldest undo record for that
+block generated by that transaction. Otherwise, we rewind the undo pointer in
+the page slot to the last record for that block that precedes the last undo
+record we applied. Because applying undo also always updates the transaction
+slot on the page, either rewinding it or clearing it completely, we can
+always skip applying undo if we find that it’s already been applied
+previously. This could happen if the application of undo for a given
+transaction is interrupted by a crash, or if it fails for some reason and is
+retried later.
+
+This also prevents us from getting confused when the relation is (a) dropped,
+(b) rewritten using a new relfilenode, or (c) truncated to a shorter length
+(and perhaps subsequently re-extended). We apply the undo action only if the
+page contains the effect of the transaction for which we are applying undo
+actions, which can always be determined by examining the undo pointer in the
+transaction slot. If there is no transaction slot for the current transaction
+or if it is present but the undo record pointer in the slot is less than the
+undo record pointer of the undo record under consideration, the undo record
+can be ignored; it has already been applied or is no longer relevant. After a
+toplevel transaction abort, undo space is not recycled. However, after a
+subtransaction abort, we rewind the insert pointer to wherever it was at the
+start of the subtransaction, so that the undo for the toplevel transaction
+remains contiguous. We can’t do the same for toplevel aborts, as the undo
+might contain special undo records related to transaction slots that were
+reused, and we can’t afford to lose those. We write these special undo
+records only for a toplevel transaction when it doesn’t find any free
+transaction slot or there is no transaction slot that contains an all-visible
+transaction. In such cases, we reuse the committed transaction slots and
+write an undo record that contains the transaction information for them, as
+we might need that information for a transaction which still can’t see the
+committed transaction. We mark all such slots (that belong to committed
+transactions) as available for reuse in one shot, as doing it one slot at a
+time is quite costly. Since we might still need the special undo records for
+the transaction slots other than the current transaction, we can’t simply
+rewind the insert pointer. Note that we do this only for toplevel
+transactions; if we need a new slot while in a subtransaction, we reclaim
+only a single transaction slot.
+
+WAL consideration
+------------------
+Undo records are critical data and must be protected via WAL. Because an undo
+record must be written if and only if a page modification occurs, the undo
+record and the record for the page modification must be one and the same.
+Moreover, it is very important not to duplicate any information or store any
+unnecessary information, since WAL volume has a significant impact on overall
+system performance. In particular, there is no need to log the undo record
+pointer. We only need to ensure that after crash recovery the undo record
+pointer is set correctly for each of the undo logs. To ensure that, we log a
+WAL record after an XID change or at the first operation on an undo log after
+a checkpoint. The WAL record contains the insert point, the log number, and
+the XID. This is enough to form an XID->(log number + log insertion point)
+map which will be used to calculate the location of undo insertion during
+recovery.
+
+Another important consideration is that we don't need to have full page
+images for data in undo logs. Because the undo logs are always written
+serially, torn pages are not an issue. Suppose that some block in one of the
+undo logs is half filled and synced properly to disk; now, a checkpoint
+occurs. Next, we add some more data to the block. During the following
+checkpoint, the system crashes while flushing the block. The block could be
+in a condition such that the first few bytes of it, say 512 bytes, are
+flushed properly and the rest are old, but this won't cause a problem because
+the old bytes will be intact and we can always start inserting new records at
+the insert location reconstructed during recovery.
+
+Undo Worker
+------------
+Currently, we have one background undo worker which performs undo actions as
+required and discards undo logs when they are no longer needed. Typically, it
+performs undo actions in response to a notification from a backend that has
+just aborted a transaction, but it will eventually detect and perform undo
+actions for any aborted transaction that does not otherwise get cleaned up.
+
+We allow the undo worker to hibernate when there is no activity in the
+system. It hibernates for a minimum of 100ms and a maximum of 10s, based on
+the time the system has remained idle. The undo worker mechanism will be
+extended to multiple undo workers to perform various jobs related to undo
+logs. For example, if there are many pending rollback requests, then we can
+spawn a new undo worker which can help in processing the requests.
+
+The UndoDiscard routine will be called by the undo worker for discarding the
+old undo records. UndoDiscard will process all the active undo logs.
+It reads each undo log and checks whether the undo for the first transaction
+in the log can be discarded (committed and all-visible, or aborted and undo
+already applied). If so, it moves to the next transaction in that undo log
+and continues in the same way. When it finds the first transaction whose undo
+can't be discarded yet, it first discards the undo log prior to that point
+and then remembers the transaction ID and undo location in shared memory. We
+consider undo for a transaction to be discardable once its XID is smaller
+than oldestXmin.
+
+Ideally, for an aborted transaction we should be able to discard its undo
+once the undo actions are replayed; however, it might contain the undo
+records for reused transaction slots, so we can’t discard them until its XID
+becomes smaller than oldestXmin. Also, we can’t discard the undo for the
+aborted transaction if there is a preceding transaction which is committed
+and not all-visible. We can allow undo for aborted transactions to be
+discarded immediately if we remember in the first undo record of the
+transaction whether it contains undo for a reused transaction slot. This will
+help the cases where the aborted transaction is the last transaction in the
+undo log which is smaller than oldestXmin.
+
+In Hot Standby mode, undo is discarded via WAL replay. Before discarding
+undo, we ensure that there are no queries running which need to get tuples
+from the discarded undo. If there are any, a recovery conflict will occur,
+similar to what happens in other cases where a resource held by a particular
+backend prevents replay from advancing.
+
+For each undo log, the undo discard module maintains an in-memory array to
+hold the latest undiscarded xid and its start undo record pointer. The first
+XID in the undo log will be compared against GlobalXmin; if the xid is
+greater than GlobalXmin, then nothing can be discarded, otherwise we scan the
+undo log starting with the oldest transaction it contains. To avoid
+processing every record in the undo log, we maintain a transaction start
+header in the first undo record written by any given transaction, with space
+to store a pointer to the next transaction start undo record in that same
+undo log. This allows us to read an undo log transaction by transaction. When
+discarding undo, the background worker will read all active undo logs
+transaction by transaction until it finds a transaction with an XID greater
+than or equal to GlobalXmin. Once it finds such a transaction, it will
+discard all earlier undo records in that undo log, without even writing
+unflushed buffers to disk.
+
+Avoid fetching discarded undo record
+-------------------------------------
+The system must never attempt to fetch undo records which have already been
+discarded. Undo is generally discarded in the background by the undo worker,
+so we must account for the possibility that undo could be discarded at any
+time. We do maintain the oldest xid that has undo (oldestXidHavingUndo). The
+undo worker updates the value of oldestXidHavingUndo after discarding all the
+undo. Backends consider all transactions that precede oldestXidHavingUndo as
+all-visible, so they normally don’t try to fetch undo that has already been
+discarded. However, there is a race condition where a backend decides that
+the transaction is greater than oldestXidHavingUndo and needs to fetch the
+undo record, and in the meantime the undo worker discards the corresponding
+undo record. To handle such race conditions, we need to maintain some
+synchronization between backends and the undo worker so that backends don’t
+try to access already discarded undo. So whenever an undo fetch is trying to
+read an undo record from an undo log, it first needs to acquire that log's
+discard_lock in SHARED mode and check that the undo record pointer is not
+less than log->oldest_data; if it is, we don't fetch that undo record and
+return NULL (which means the previous version is all-visible). The undo
+worker takes log->discard_lock in EXCLUSIVE mode to update log->oldest_data.
+We hold this lock just to update the value in shared memory; the actual
+discard happens outside this lock.
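
The reader side of this protocol follows directly from the description above:
take the per-log discard_lock in shared mode, compare the target pointer
against oldest_data, and only then fetch. The discard_lock and oldest_data
members are named in the text; the fetch helper below is a hypothetical
stand-in.

    /* Sketch of the backend-side check before fetching an undo record. */
    static UnpackedUndoRecord *
    fetch_undo_record_guarded(UndoLogControl *log, UndoRecPtr urp)
    {
        UnpackedUndoRecord *uur;

        LWLockAcquire(&log->discard_lock, LW_SHARED);

        if (urp < log->oldest_data)
        {
            /* Already discarded: the previous version is all-visible. */
            LWLockRelease(&log->discard_lock);
            return NULL;
        }

        /* Hold the lock across the fetch so the discard worker waits. */
        uur = fetch_undo_record(urp);       /* hypothetical helper */

        LWLockRelease(&log->discard_lock);

        return uur;
    }
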
+To handle such race conditions, we need to maintain some synchronization +between backends and undo worker so that backends don’t try to access already +discarded undo. So whenever undo fetch is trying to read a undo record from +an undo log, first it needs to acquire a log->discard_lock in SHARED mode for +the undo log and check that the undo record pointer is not less than +log->oldest_data, if so, then don't fetch that undo record and return +NULL (that means the previous version is all visible). And undo worker will +take log->discard_lock in EXCLUSIVE mode for updating the +log->oldest_data. We hold this lock just to update the value in shared +memory, the actual discard happens outside this lock. + +Undo Log Storage +----------------- +This subsystem is responsible for lifecycle management of undo logs and +backing files, associating undo logs with backends, allocating and managing +space within undo logs. It provides access to undo log contents via shared +buffers. The list of available undo logs is maintained in shared memory. +Whenever a backend request for undo log allocation, it attaches a first free +undo log to a backend, and if all existing undo logs are busy, it will create +a new one. A set of APIs is provided by this subsystem to efficiently allocate +and discard undo logs. + +During a checkpoint, all the undo segment files and undo metadata files will +be flushed to the disk. diff --git a/src/backend/access/zheap/prunetpd.c b/src/backend/access/zheap/prunetpd.c new file mode 100644 index 0000000000..ce7c8576a5 --- /dev/null +++ b/src/backend/access/zheap/prunetpd.c @@ -0,0 +1,507 @@ +/*------------------------------------------------------------------------- + * + * prunetpd.c + * TPD page pruning + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/zheap/prunetpd.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/tpd_xlog.h" +#include "miscadmin.h" +#include "storage/bufpage.h" +#include "storage/freespace.h" +#include "storage/proc.h" + +typedef struct TPDPruneState +{ + int nunused; + OffsetNumber nowunused[MaxTPDTuplesPerPage]; +} TPDPruneState; + +static void TPDEntryPrune(Buffer buf, OffsetNumber offnum, TPDPruneState *prstate, + Size *space_freed); +static XLogRecPtr LogTPDClean(Relation rel, Buffer tpdbuf, + OffsetNumber *nowunused, int nunused, + OffsetNumber target_offnum, Size space_required); +static int TPDPruneEntirePage(Relation rel, Buffer tpdbuf); + +/* + * TPDPagePrune - Prune the TPD page. + * + * Process all the TPD entries in the page and remove the old entries which + * are all-visible. We first collect all such entries and then process them + * in one-shot. + * + * We expect caller must have an exclusive lock on the page. + * + * Returns the number of entries pruned. + */ +int +TPDPagePrune(Relation rel, Buffer tpdbuf, BufferAccessStrategy strategy, + OffsetNumber target_offnum, Size space_required, bool can_free, + bool *update_tpd_inplace, bool *tpd_e_pruned) +{ + Page tpdpage, tmppage = NULL; + TPDPageOpaque tpdopaque; + TPDPruneState prstate; + OffsetNumber offnum, maxoff; + ItemId itemId; + uint64 epoch_xid; + uint64 epoch; + Size space_freed; + + prstate.nunused = 0; + tpdpage = BufferGetPage(tpdbuf); + + /* Initialise the out variables. 
*/ + if (update_tpd_inplace) + *update_tpd_inplace = false; + if (tpd_e_pruned) + *tpd_e_pruned = false; + + /* Can we prune the entire page? */ + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + epoch = tpdopaque->tpd_latest_xid_epoch; + epoch_xid = MakeEpochXid(epoch, tpdopaque->tpd_latest_xid); + if (epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + { + prstate.nunused = TPDPruneEntirePage(rel, tpdbuf); + goto free_tpd_page; + } + + /* initialize the space_free with already existing free space in page */ + space_freed = PageGetExactFreeSpace(tpdpage); + + /* Scan the page */ + maxoff = PageGetMaxOffsetNumber(tpdpage); + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + itemId = PageGetItemId(tpdpage, offnum); + + /* Nothing to do if slot is empty. */ + if (!ItemIdIsUsed(itemId)) + continue; + + TPDEntryPrune(tpdbuf, offnum, &prstate, &space_freed); + } + + /* + * There is not much advantage in continuing, if we can't free the space + * required by the caller or we are not asked to forcefully prune the + * page. + * + * XXX - In theory, we can still continue and perform pruning in the hope + * that some future update in this page will be able to use that space. + * However, it will lead to additional writes without any guaranteed + * benefit, so we skip the pruning for now. + */ + if (space_freed < space_required) + return 0; + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + */ + if (prstate.nunused > 0 || OffsetNumberIsValid(target_offnum)) + tmppage = PageGetTempPageCopy(tpdpage); + + /* Any error while applying the changes is critical */ + START_CRIT_SECTION(); + + /* + * Have we found any prunable items or caller has asked us to make space + * next to target_offnum? + */ + if (prstate.nunused > 0 || OffsetNumberIsValid(target_offnum)) + { + /* + * Apply the planned item changes, then repair page fragmentation, and + * update the page's hint bit about whether it has free line pointers. + */ + TPDPagePruneExecute(tpdbuf, prstate.nowunused, prstate.nunused); + + /* + * Finally, repair any fragmentation, and update the page's hint bit about + * whether it has free pointers. It is quite possible that there are no + * prunable items on the page in which case it will rearrange the page to + * make the space at the required offset. + */ + TPDPageRepairFragmentation(tpdpage, tmppage, target_offnum, + space_required); + + MarkBufferDirty(tpdbuf); + + /* + * Emit a WAL TPD_CLEAN record showing what we did. + * + * XXX Unlike heap pruning, we don't need to remember latestRemovedXid + * for the purpose of generating conflicts on standby. We use + * oldestXidHavingUndo as the horizon to prune the TPD entries which + * means all the prior undo must have discarded and during undo discard + * we already generate such xid (see undolog_xlog_discard) which should + * serve our purpose as this WAL must reach after that. + */ + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + + recptr = LogTPDClean(rel, tpdbuf, prstate.nowunused, + prstate.nunused, target_offnum, + space_required); + + PageSetLSN(tpdpage, recptr); + } + + if (update_tpd_inplace) + *update_tpd_inplace = true; + } + + END_CRIT_SECTION(); + + /* be tidy. */ + if (tmppage) + pfree(tmppage); + +free_tpd_page: + if (can_free && PageIsEmpty(tpdpage)) + { + Size freespace; + + /* + * If the page is empty, we have certainly pruned all the tpd + * entries. 
+ */ + if (tpd_e_pruned) + *tpd_e_pruned = true; + /* + * We can reuse empty page as either a heap page or a TPD + * page, so no need to consider opaque space. + */ + freespace = BLCKSZ - SizeOfPageHeaderData; + + /* + * TPD page is empty, remove it from TPD used page list and + * record it in FSM. + */ + if (TPDFreePage(rel, tpdbuf, strategy)) + RecordPageWithFreeSpace(rel, BufferGetBlockNumber(tpdbuf), + freespace); + } + + return prstate.nunused; +} + +/* + * TPDEntryPrune - Check whether the TPD entry is prunable. + * + * Process all the transaction slots of a TPD entry present at a given offset. + * TPD entry will be considered prunable, if all the transaction slots either + * contains transaction that is older than oldestXidHavingUndo or + * doesn't have a valid transaction. + */ +static void +TPDEntryPrune(Buffer tpdbuf, OffsetNumber offnum, TPDPruneState *prstate, + Size *space_freed) +{ + Page tpdpage; + TPDEntryHeaderData tpd_e_hdr; + TransInfo *trans_slots; + ItemId itemId; + Size size_tpd_e_slots, size_tpd_e_map; + Size size_tpd_entry; + int num_trans_slots, slot_no; + int loc_trans_slots; + uint16 tpd_e_offset; + bool prune_entry = true; + + tpdpage = BufferGetPage(tpdbuf); + itemId = PageGetItemId(tpdpage, offnum); + tpd_e_offset = ItemIdGetOffset(itemId); + size_tpd_entry = ItemIdGetLength(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* + * We can prune the deleted entries as no one will be referring to such + * entries. + */ + if (TPDEntryIsDeleted(tpd_e_hdr)) + goto prune_tpd_entry; + + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint8); + else + { + Assert(tpd_e_hdr.tpe_flags & TPE_FOUR_BYTE); + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint32); + } + + num_trans_slots = tpd_e_hdr.tpe_num_slots; + size_tpd_e_slots = num_trans_slots * sizeof(TransInfo); + loc_trans_slots = tpd_e_offset + SizeofTPDEntryHeader + size_tpd_e_map; + + trans_slots = (TransInfo *) palloc(size_tpd_e_slots); + memcpy((char *) trans_slots, tpdpage + loc_trans_slots, size_tpd_e_slots); + + for (slot_no = 0; slot_no < num_trans_slots; slot_no++) + { + uint64 epoch_xid; + TransactionId xid; + uint64 epoch; + UndoRecPtr urec_ptr = trans_slots[slot_no].urec_ptr; + + epoch = trans_slots[slot_no].xid_epoch; + xid = trans_slots[slot_no].xid; + epoch_xid = MakeEpochXid(epoch, xid); + /* + * Check whether transaction slot can be considered frozen? + * If both transaction id and undo record pointer are invalid or + * xid is invalid and its undo has been discarded or xid is older than + * the oldest xid with undo. + */ + if ((!TransactionIdIsValid(xid) && + (!UndoRecPtrIsValid(urec_ptr) || UndoLogIsDiscarded(urec_ptr))) || + (TransactionIdIsValid(xid) && + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo))) + continue; + else + { + prune_entry = false; + break; + } + } + + pfree(trans_slots); + +prune_tpd_entry: + if (prune_entry) + { + Assert (prstate->nunused < MaxTPDTuplesPerPage); + prstate->nowunused[prstate->nunused] = offnum; + prstate->nunused++; + + *space_freed += size_tpd_entry; + } +} + +/* + * TPDPagePruneExecute - Guts of the TPD page pruning. + * + * Here, we mark all the entries that can be pruned as unused and then call page + * repair fragmentation to compact the page. 
+ */ +void +TPDPagePruneExecute(Buffer tpdbuf, OffsetNumber *nowunused, int nunused) +{ + Page tpdpage; + OffsetNumber *offnum; + int i; + + tpdpage = BufferGetPage(tpdbuf); + + /* Update all now-unused line pointers */ + offnum = nowunused; + for (i = 0; i < nunused; i++) + { + OffsetNumber off = *offnum++; + ItemId lp = PageGetItemId(tpdpage, off); + + ItemIdSetUnused(lp); + } +} + +/* + * TPDPageRepairFragmentation - Frees fragmented space on a tpd page. + * + * It doesn't remove unused line pointers because some heappage might + * still point to the line pointer. If we remove the line pointer, then + * the same space could be occupied by actual TPD entry in which case somebody + * trying to access that line pointer will get unpredictable behavior. + */ +void +TPDPageRepairFragmentation(Page page, Page tmppage, OffsetNumber target_offnum, + Size space_required) +{ + Offset pd_lower = ((PageHeader) page)->pd_lower; + Offset pd_upper = ((PageHeader) page)->pd_upper; + Offset pd_special = ((PageHeader) page)->pd_special; + itemIdSortData itemidbase[MaxTPDTuplesPerPage]; + itemIdSort itemidptr; + ItemId lp; + int nline, + nstorage, + nunused; + int i; + Size totallen; + + /* + * It's worth the trouble to be more paranoid here than in most places, + * because we are about to reshuffle data in (what is usually) a shared + * disk buffer. If we aren't careful then corrupted pointers, lengths, + * etc could cause us to clobber adjacent disk buffers, spreading the data + * loss further. So, check everything. + */ + if (pd_lower < SizeOfPageHeaderData || + pd_lower > pd_upper || + pd_upper > pd_special || + pd_special > BLCKSZ) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u", + pd_lower, pd_upper, pd_special))); + + /* + * Run through the line pointer array and collect data about live items. + */ + nline = PageGetMaxOffsetNumber(page); + itemidptr = itemidbase; + nunused = totallen = 0; + for (i = FirstOffsetNumber; i <= nline; i++) + { + lp = PageGetItemId(page, i); + if (ItemIdIsUsed(lp)) + { + if (ItemIdHasStorage(lp)) + { + itemidptr->offsetindex = i - 1; + itemidptr->itemoff = ItemIdGetOffset(lp); + if (unlikely(itemidptr->itemoff < (int) pd_upper || + itemidptr->itemoff >= (int) pd_special)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted item pointer: %u", + itemidptr->itemoff))); + if (i == target_offnum) + itemidptr->alignedlen = ItemIdGetLength(lp) + + space_required; + else + itemidptr->alignedlen = ItemIdGetLength(lp); + totallen += itemidptr->alignedlen; + itemidptr++; + } + } + else + { + /* Unused entries should have lp_len = 0, but make sure */ + ItemIdSetUnused(lp); + nunused++; + } + } + + nstorage = itemidptr - itemidbase; + if (nstorage == 0) + { + /* Page is completely empty, so just reset it quickly */ + ((PageHeader) page)->pd_lower = SizeOfPageHeaderData; + ((PageHeader) page)->pd_upper = pd_special; + } + else + { + /* Need to compact the page the hard way */ + if (totallen > (Size) (pd_special - pd_lower)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted item lengths: total %u, available space %u", + (unsigned int) totallen, pd_special - pd_lower))); + + compactify_ztuples(itemidbase, nstorage, page, tmppage); + } + + /* Set hint bit for TPDPageAddEntry */ + if (nunused > 0) + PageSetHasFreeLinePointers(page); + else + PageClearHasFreeLinePointers(page); +} + +/* + * LogTPDClean - Write WAL for TPD entries that can be pruned. 
+ */ +XLogRecPtr +LogTPDClean(Relation rel, Buffer tpdbuf, + OffsetNumber *nowunused, int nunused, + OffsetNumber target_offnum, Size space_required) +{ + XLogRecPtr recptr; + xl_tpd_clean xl_rec; + + /* Caller should not call me on a non-WAL-logged relation */ + Assert(RelationNeedsWAL(rel)); + + xl_rec.flags = 0; + XLogBeginInsert(); + + if (target_offnum != InvalidOffsetNumber) + xl_rec.flags |= XL_TPD_CONTAINS_OFFSET; + XLogRegisterData((char *) &xl_rec, SizeOfTPDClean); + + /* Register the offset information. */ + if (target_offnum != InvalidOffsetNumber) + { + XLogRegisterData((char *) &target_offnum, sizeof(OffsetNumber)); + XLogRegisterData((char *) &space_required, sizeof(space_required)); + } + + XLogRegisterBuffer(0, tpdbuf, REGBUF_STANDARD); + + /* + * The OffsetNumber array is not actually in the buffer, but we pretend + * it is. When XLogInsert stores the whole buffer, the offset array need + * not be stored too. Note that even if the array is empty, we want to + * expose the buffer as a candidate for whole-page storage, since this + * record type implies a defragmentation operation even if no item pointers + * changed state. + */ + if (nunused > 0) + XLogRegisterBufData(0, (char *) nowunused, + nunused * sizeof(OffsetNumber)); + + recptr = XLogInsert(RM_TPD_ID, XLOG_TPD_CLEAN); + + return recptr; +} + +/* + * TPDPruneEntirePage + */ +static int +TPDPruneEntirePage(Relation rel, Buffer tpdbuf) +{ + Page page = BufferGetPage(tpdbuf); + int entries_removed = PageGetMaxOffsetNumber(page); + + START_CRIT_SECTION(); + + /* Page is completely empty, so just reset it quickly */ + ((PageHeader) page)->pd_lower = SizeOfPageHeaderData; + ((PageHeader) page)->pd_upper = ((PageHeader) page)->pd_special; + + MarkBufferDirty(tpdbuf); + + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + + XLogBeginInsert(); + + XLogRegisterBuffer(0, tpdbuf, REGBUF_STANDARD); + + recptr = XLogInsert(RM_TPD_ID, XLOG_TPD_CLEAN_ALL_ENTRIES); + + PageSetLSN(BufferGetPage(tpdbuf), recptr); + } + + END_CRIT_SECTION(); + + return entries_removed; +} diff --git a/src/backend/access/zheap/prunezheap.c b/src/backend/access/zheap/prunezheap.c new file mode 100644 index 0000000000..cff20551e8 --- /dev/null +++ b/src/backend/access/zheap/prunezheap.c @@ -0,0 +1,943 @@ +/*------------------------------------------------------------------------- + * + * prunezheap.c + * zheap page pruning + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/heap/prunezheap.c + * + * In Zheap, we can reclaim space on following operations + * a. non-inplace updates, when committed or rolled back. + * b. inplace updates that reduces the tuple length, when commited. + * c. deletes, when committed. + * d. inserts, when rolled back. + * + * Since we only store xid which changed the page in pd_prune_xid, to prune + * the page, we can check if pd_prune_xid is in progress. This can sometimes + * lead to unwanted page pruning calls as a side effect, example in case of + * rolled back deletes. If there is nothing to prune, then the call to prune + * is cheap, so we don't want to optimize it at this stage. 
+ *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/zheap.h" +#include "access/zheapam_xlog.h" +#include "access/zheaputils.h" +#include "utils/ztqual.h" +#include "catalog/catalog.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "storage/bufmgr.h" +#include "storage/procarray.h" + +/* Working data for zheap_page_prune and subroutines */ +typedef struct +{ + TransactionId new_prune_xid; /* new prune hint value for page */ + TransactionId latestRemovedXid; /* latest xid to be removed by this prune */ + int ndeleted; /* numbers of entries in arrays below */ + int ndead; + int nunused; + /* arrays that accumulate indexes of items to be changed */ + + /* + * Fixme - arrays must use MaxZHeapTuplesPerPage, once we have constant + * value for the same. + */ + OffsetNumber nowdeleted[MaxZHeapTuplesPerPage]; + OffsetNumber nowdead[MaxZHeapTuplesPerPage]; + OffsetNumber nowunused[MaxZHeapTuplesPerPage]; + /* marked[i] is TRUE if item i is entered in one of the above arrays */ + bool marked[MaxZHeapTuplesPerPage + 1]; +} ZPruneState; + +static int zheap_prune_item(Relation relation, Buffer buffer, + OffsetNumber rootoffnum, TransactionId OldestXmin, + ZPruneState *prstate, int *space_freed); +static void zheap_prune_record_prunable(ZPruneState * prstate, + TransactionId xid); +static void zheap_prune_record_dead(ZPruneState * prstate, OffsetNumber offnum); +static void zheap_prune_record_deleted(ZPruneState * prstate, + OffsetNumber offnum); + +/* + * Optionally prune and repair fragmentation in the specified page. + * + * Caller must have exclusive lock on the page. + * + * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD + * or RECENTLY_DEAD (see ZHeapTupleSatisfiesOldestXmin). + * + * This is an opportunistic function. It will perform housekeeping only if + * the page has effect of transaction thas has modified data which can be + * pruned. + * + * Note: This is called only when we need some space in page to perform the + * action which otherwise would need a different page. It is called when an + * update statement has to update the existing tuple such that new tuple is + * bigger than old tuple and the same can't fit on page. + * + * Returns true, if we are able to free up the space such that the new tuple + * can fit into same page, otherwise, false. + */ +bool +zheap_page_prune_opt(Relation relation, Buffer buffer, + OffsetNumber offnum, Size space_required) +{ + Page page; + TransactionId OldestXmin; + TransactionId ignore = InvalidTransactionId; + Size pagefree; + bool force_prune = false; + bool pruned; + + page = BufferGetPage(buffer); + + /* + * We can't write WAL in recovery mode, so there's no point trying to + * clean the page. The master will likely issue a cleaning WAL record soon + * anyway, so this is no particular loss. + */ + if (RecoveryInProgress()) + return false; + + /* + * Use the appropriate xmin horizon for this relation. If it's a proper + * catalog relation or a user defined, additional, catalog relation, we + * need to use the horizon that includes slots, otherwise the data-only + * horizon can be used. Note that the toast relation of user defined + * relations are *not* considered catalog relations. 
+ * + * It is OK to apply the old snapshot limit before acquiring the cleanup + * lock because the worst that can happen is that we are not quite as + * aggressive about the cleanup (by however many transaction IDs are + * consumed between this point and acquiring the lock). This allows us to + * save significant overhead in the case where the page is found not to be + * prunable. + */ + if (IsCatalogRelation(relation) || + RelationIsAccessibleInLogicalDecoding(relation)) + OldestXmin = RecentGlobalXmin; + else + OldestXmin = RecentGlobalDataXmin; + + Assert(TransactionIdIsValid(OldestXmin)); + + if (OffsetNumberIsValid(offnum)) + { + pagefree = PageGetExactFreeSpace(page); + + /* + * We want to forcefully prune the page if we are sure that the + * required space is available. This will help in rearranging the + * page such that we will be able to make space adjacent to required + * offset number. + */ + if (space_required < pagefree) + force_prune = true; + } + + + /* + * Let's see if we really need pruning. + * + * Forget it if page is not hinted to contain something prunable that's + * committed and we don't want to forcefully prune the page. + */ + if (!ZPageIsPrunable(page) && !force_prune) + return false; + + zheap_page_prune_guts(relation, buffer, OldestXmin, offnum, + space_required, true, force_prune, + &ignore, &pruned); + if (pruned) + return true; + + return false; +} + +/* + * Prune and repair fragmentation in the specified page. + * + * Caller must have pin and buffer cleanup lock on the page. + * + * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD + * or RECENTLY_DEAD (see ZHeapTupleSatisfiesVacuum). + * + * To perform pruning, we make the copy of the page. We don't scribble on + * that copy, rather it is only used during repair fragmentation to copy + * the tuples. So, we need to ensure that after making the copy, we operate + * on tuples, otherwise, the temporary copy will become useless. It is okay + * scribble on itemid's or special space of page. + * + * If report_stats is true then we send the number of reclaimed tuples to + * pgstats. (This must be false during vacuum, since vacuum will send its own + * own new total to pgstats, and we don't want this delta applied on top of + * that.) + * + * Returns the number of tuples deleted from the page and sets + * latestRemovedXid. It returns 0, when removed the dead tuples can't free up + * the space required. + */ +int +zheap_page_prune_guts(Relation relation, Buffer buffer, + TransactionId OldestXmin, OffsetNumber target_offnum, + Size space_required, bool report_stats, + bool force_prune, TransactionId *latestRemovedXid, + bool *pruned) +{ + int ndeleted = 0; + int space_freed = 0; + Page page = BufferGetPage(buffer); + Page tmppage = NULL; + OffsetNumber offnum, + maxoff; + ZPruneState prstate; + bool execute_pruning = false; + + if (pruned) + *pruned = false; + + /* initialize the space_free with already existing free space in page */ + space_freed = PageGetExactFreeSpace(page); + + /* + * Our strategy is to scan the page and make lists of items to change, + * then apply the changes within a critical section. This keeps as much + * logic as possible out of the critical section, and also ensures that + * WAL replay will work the same as the normal case. + * + * First, initialize the new pd_prune_xid value to zero (indicating no + * prunable tuples). If we find any tuples which may soon become + * prunable, we will save the lowest relevant XID in new_prune_xid. 
Also + * initialize the rest of our working state. + */ + prstate.new_prune_xid = InvalidTransactionId; + prstate.latestRemovedXid = *latestRemovedXid; + prstate.ndeleted = prstate.ndead = prstate.nunused = 0; + memset(prstate.marked, 0, sizeof(prstate.marked)); + + /* + * If caller has asked to rearrange the page and page is not marked for + * pruning, then skip scanning the page. + * + * XXX We might want to remove this check once we have some optimal + * strategy to rearrange the page where we anyway need to traverse all + * rows. + */ + if (force_prune && !ZPageIsPrunable(page)) + { + ; /* no need to scan */ + } + else + { + /* Scan the page */ + maxoff = PageGetMaxOffsetNumber(page); + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemid; + + /* Ignore items already processed as part of an earlier chain */ + if (prstate.marked[offnum]) + continue; + + /* Nothing to do if slot is empty, already dead or marked as deleted */ + itemid = PageGetItemId(page, offnum); + if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) || + ItemIdIsDeleted(itemid)) + continue; + + /* Process this item */ + ndeleted += zheap_prune_item(relation, buffer, offnum, + OldestXmin, + &prstate, + &space_freed); + } + } + + /* + * There is not much advantage in continuing, if we can't free the space + * required by the caller or we are not asked to forcefully prune the + * page. + * + * XXX - In theory, we can still continue and perform pruning in the hope + * that some future update in this page will be able to use that space. + * However, it will lead to additional writes without any guaranteed + * benefit, so we skip the pruning for now. + */ + if (space_freed < space_required) + return 0; + + /* Do we want to prune? */ + if (prstate.ndeleted > 0 || prstate.ndead > 0 || + prstate.nunused > 0 || force_prune) + { + PageHeader phdr; + + execute_pruning = true; + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + */ + tmppage = PageGetTempPageCopy(page); + + /* + * Lock the TPD page before starting critical section. We might need + * to access it during page repair fragmentation. + */ + phdr = (PageHeader) page; + if (ZHeapPageHasTPDSlot(phdr)) + TPDPageLock(relation, buffer); + } + + /* Any error while applying the changes is critical */ + START_CRIT_SECTION(); + + if (execute_pruning) + { + bool has_pruned = false; + + /* + * Apply the planned item changes, then repair page fragmentation, and + * update the page's hint bit about whether it has free line pointers. + */ + zheap_page_prune_execute(buffer, target_offnum, + prstate.nowdeleted, prstate.ndeleted, + prstate.nowdead, prstate.ndead, + prstate.nowunused, prstate.nunused); + + /* + * Finally, repair any fragmentation, and update the page's hint bit about + * whether it has free pointers. + */ + ZPageRepairFragmentation(buffer, tmppage, target_offnum, + space_required, false, &has_pruned); + + /* + * Update the page's pd_prune_xid field to either zero, or the lowest + * XID of any soon-prunable tuple. + */ + ((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid; + + /* + * Also clear the "page is full" flag, since there's no point in + * repeating the prune/defrag process until something else happens to + * the page. 
+ */ + PageClearFull(page); + + MarkBufferDirty(buffer); + + /* + * Emit a WAL ZHEAP_CLEAN record showing what we did + */ + if (RelationNeedsWAL(relation)) + { + XLogRecPtr recptr; + + recptr = log_zheap_clean(relation, buffer, target_offnum, + space_required, prstate.nowdeleted, + prstate.ndeleted, prstate.nowdead, + prstate.ndead, prstate.nowunused, + prstate.nunused, + prstate.latestRemovedXid, has_pruned); + + PageSetLSN(BufferGetPage(buffer), recptr); + } + + if (pruned) + *pruned = has_pruned; + } + else + { + /* + * If we didn't prune anything, but have found a new value for the + * pd_prune_xid field, update it and mark the buffer dirty. This is + * treated as a non-WAL-logged hint. + * + * Also clear the "page is full" flag if it is set, since there's no + * point in repeating the prune/defrag process until something else + * happens to the page. + */ + if (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid || + PageIsFull(page)) + { + ((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid; + PageClearFull(page); + MarkBufferDirtyHint(buffer, true); + } + } + + END_CRIT_SECTION(); + + /* + * Report the number of tuples reclaimed to pgstats. This is ndeleted + * minus ndead, because we don't want to count a now-DEAD item or a + * now-DELETED item as a deletion for this purpose. + */ + if (report_stats && ndeleted > (prstate.ndead + prstate.ndeleted)) + pgstat_update_heap_dead_tuples(relation, ndeleted - (prstate.ndead + prstate.ndeleted)); + + *latestRemovedXid = prstate.latestRemovedXid; + + /* be tidy. */ + if (tmppage) + pfree(tmppage); + UnlockReleaseTPDBuffers(); + + /* + * XXX Should we update FSM information for this? Not doing so will + * increase the chances of in-place updates. See heap_page_prune for a + * detailed reason. + */ + + return ndeleted; +} + +/* + * Perform the actual page changes needed by zheap_page_prune_guts. + * It is expected that the caller has suitable pin and lock on the + * buffer, and is inside a critical section. + */ +void +zheap_page_prune_execute(Buffer buffer, OffsetNumber target_offnum, + OffsetNumber *deleted, int ndeleted, + OffsetNumber *nowdead, int ndead, + OffsetNumber *nowunused, int nunused) +{ + Page page = (Page) BufferGetPage(buffer); + OffsetNumber *offnum; + int i; + + /* Update all deleted line pointers */ + offnum = deleted; + for (i = 0; i < ndeleted; i++) + { + ZHeapTupleHeader tup; + int trans_slot; + uint8 vis_info = 0; + OffsetNumber off = *offnum++; + ItemId lp; + + /* The target offset must not be deleted. */ + Assert(target_offnum != off); + + lp = PageGetItemId(page, off); + + tup = (ZHeapTupleHeader) PageGetItem(page, lp); + trans_slot = ZHeapTupleHeaderGetXactSlot(tup); + + /* + * The frozen slot indicates tuple is dead, so we must not see them in + * the array of tuples to be marked as deleted. + */ + Assert(trans_slot != ZHTUP_SLOT_FROZEN); + + if (ZHeapTupleDeleted(tup)) + vis_info = ITEMID_DELETED; + if (ZHeapTupleHasInvalidXact(tup->t_infomask)) + vis_info |= ITEMID_XACT_INVALID; + + /* + * Mark the Item as deleted and copy the visibility info and + * transaction slot information from tuple to ItemId. + */ + ItemIdSetDeleted(lp, trans_slot, vis_info); + } + + /* Update all now-dead line pointers */ + offnum = nowdead; + for (i = 0; i < ndead; i++) + { + OffsetNumber off = *offnum++; + ItemId lp; + + /* The target offset must not be dead. 
*/ + Assert(target_offnum != off); + + lp = PageGetItemId(page, off); + + ItemIdSetDead(lp); + } + + /* Update all now-unused line pointers */ + offnum = nowunused; + for (i = 0; i < nunused; i++) + { + OffsetNumber off = *offnum++; + ItemId lp; + + /* The target offset must not be unused. */ + Assert(target_offnum != off); + + lp = PageGetItemId(page, off); + + ItemIdSetUnused(lp); + } +} + +/* + * Prune specified item pointer. + * + * OldestXmin is the cutoff XID used to identify dead tuples. + * + * We don't actually change the page here. We just add entries to the arrays in + * prstate showing the changes to be made. Items to be set to LP_DEAD state are + * added to nowdead[]; items to be set to LP_DELETED are added to nowdeleted[]; + * and items to be set to LP_UNUSED state are added to nowunused[]. + * + * Returns the number of tuples (to be) deleted from the page. + */ +static int +zheap_prune_item(Relation relation, Buffer buffer, OffsetNumber offnum, + TransactionId OldestXmin, ZPruneState *prstate, + int *space_freed) +{ + ZHeapTupleData tup; + ItemId lp; + Page dp = (Page) BufferGetPage(buffer); + int ndeleted = 0; + TransactionId xid; + bool tupdead, + recent_dead; + + lp = PageGetItemId(dp, offnum); + + Assert(ItemIdIsNormal(lp)); + + tup.t_data = (ZHeapTupleHeader) PageGetItem(dp, lp); + tup.t_len = ItemIdGetLength(lp); + ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum); + tup.t_tableOid = RelationGetRelid(relation); + + /* + * Check tuple's visibility status. + */ + tupdead = recent_dead = false; + + switch (ZHeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer, &xid)) + { + case ZHEAPTUPLE_DEAD: + tupdead = true; + break; + + case ZHEAPTUPLE_RECENTLY_DEAD: + recent_dead = true; + break; + + case ZHEAPTUPLE_DELETE_IN_PROGRESS: + + /* + * This tuple may soon become DEAD. Update the hint field so that + * the page is reconsidered for pruning in future. + */ + zheap_prune_record_prunable(prstate, xid); + break; + + case ZHEAPTUPLE_LIVE: + case ZHEAPTUPLE_INSERT_IN_PROGRESS: + + /* + * If we wanted to optimize for aborts, we might consider marking + * the page prunable when we see INSERT_IN_PROGRESS. But we don't. + * See related decisions about when to mark the page prunable in + * heapam.c. + */ + break; + + case ZHEAPTUPLE_ABORT_IN_PROGRESS: + /* + * We can simply skip the tuple if it has inserted/operated by + * some aborted transaction and its rollback is still pending. It'll + * be taken care of by future prune calls. + */ + break; + default: + elog(ERROR, "unexpected ZHeapTupleSatisfiesVacuum result"); + break; + } + + if (tupdead) + ZHeapTupleHeaderAdvanceLatestRemovedXid(tup.t_data, xid, &prstate->latestRemovedXid); + + if (tupdead || recent_dead) + { + /* + * Count dead or recently dead tuple in result and update the space + * that can be freed. + */ + ndeleted++; + + /* short aligned */ + *space_freed += SHORTALIGN(tup.t_len); + } + + /* Record dead item */ + if (tupdead) + zheap_prune_record_dead(prstate, offnum); + + /* Record deleted item */ + if (recent_dead) + zheap_prune_record_deleted(prstate, offnum); + + return ndeleted; +} + +/* Record lowest soon-prunable XID */ +static void +zheap_prune_record_prunable(ZPruneState * prstate, TransactionId xid) +{ + /* + * This should exactly match the PageSetPrunable macro. We can't store + * directly into the page header yet, so we update working state. 
+ */ + Assert(TransactionIdIsNormal(xid)); + if (!TransactionIdIsValid(prstate->new_prune_xid) || + TransactionIdPrecedes(xid, prstate->new_prune_xid)) + prstate->new_prune_xid = xid; +} + +/* Record item pointer to be marked dead */ +static void +zheap_prune_record_dead(ZPruneState * prstate, OffsetNumber offnum) +{ + Assert(prstate->ndead < MaxZHeapTuplesPerPage); + prstate->nowdead[prstate->ndead] = offnum; + prstate->ndead++; + Assert(!prstate->marked[offnum]); + prstate->marked[offnum] = true; +} + +/* Record item pointer to be deleted */ +static void +zheap_prune_record_deleted(ZPruneState * prstate, OffsetNumber offnum) +{ + Assert(prstate->ndead < MaxZHeapTuplesPerPage); + prstate->nowdeleted[prstate->ndeleted] = offnum; + prstate->ndeleted++; + Assert(!prstate->marked[offnum]); + prstate->marked[offnum] = true; +} + +/* + * log_zheap_clean - Perform XLogInsert for a zheap-clean operation. + * + * Caller must already have modified the buffer and marked it dirty. + * + * We also include latestRemovedXid, which is the greatest XID present in + * the removed tuples. That allows recovery processing to cancel or wait + * for long standby queries that can still see these tuples. + */ +XLogRecPtr +log_zheap_clean(Relation reln, Buffer buffer, OffsetNumber target_offnum, + Size space_required, OffsetNumber *nowdeleted, int ndeleted, + OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, + int nunused, TransactionId latestRemovedXid, bool pruned) +{ + XLogRecPtr recptr; + xl_zheap_clean xl_rec; + + /* Caller should not call me on a non-WAL-logged relation */ + Assert(RelationNeedsWAL(reln)); + + xl_rec.latestRemovedXid = latestRemovedXid; + xl_rec.ndeleted = ndeleted; + xl_rec.ndead = ndead; + xl_rec.flags = 0; + XLogBeginInsert(); + + if (pruned) + xl_rec.flags |= XLZ_CLEAN_ALLOW_PRUNING; + XLogRegisterData((char *) &xl_rec, SizeOfZHeapClean); + + /* Register the offset information. */ + if (target_offnum != InvalidOffsetNumber) + { + xl_rec.flags |= XLZ_CLEAN_CONTAINS_OFFSET; + XLogRegisterData((char *) &target_offnum, sizeof(OffsetNumber)); + XLogRegisterData((char *) &space_required, sizeof(space_required)); + } + + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + /* + * The OffsetNumber arrays are not actually in the buffer, but we pretend + * that they are. When XLogInsert stores the whole buffer, the offset + * arrays need not be stored too. Note that even if all three arrays are + * empty, we want to expose the buffer as a candidate for whole-page + * storage, since this record type implies a defragmentation operation + * even if no item pointers changed state. + */ + if (ndeleted > 0) + XLogRegisterBufData(0, (char *) nowdeleted, + ndeleted * sizeof(OffsetNumber) * 2); + + if (ndead > 0) + XLogRegisterBufData(0, (char *) nowdead, + ndead * sizeof(OffsetNumber)); + + if (nunused > 0) + XLogRegisterBufData(0, (char *) nowunused, + nunused * sizeof(OffsetNumber)); + + recptr = XLogInsert(RM_ZHEAP_ID, XLOG_ZHEAP_CLEAN); + + return recptr; +} + +/* + * After removing or marking some line pointers unused, move the tuples to + * remove the gaps caused by the removed items. Here, we are rearranging + * the page such that tuples will be placed in itemid order. It will help + * in the speedup of future sequential scans. + * + * Note that we use the temporary copy of the page to copy the tuples as + * writing in itemid order will overwrite some tuples. 
+ */ +void +compactify_ztuples(itemIdSort itemidbase, int nitems, Page page, Page tmppage) +{ + PageHeader phdr = (PageHeader) page; + Offset upper; + int i; + + Assert(PageIsValid(tmppage)); + upper = phdr->pd_special; + for (i = nitems - 1; i >= 0; i--) + { + itemIdSort itemidptr = &itemidbase[i]; + ItemId lp; + + lp = PageGetItemId(page, itemidptr->offsetindex + 1); + upper -= itemidptr->alignedlen; + memcpy((char *) page + upper, + (char *) tmppage + itemidptr->itemoff, + lp->lp_len); + lp->lp_off = upper; + } + + phdr->pd_upper = upper; +} + +/* + * ZPageRepairFragmentation + * + * Frees fragmented space on a page. + * + * The basic idea is same as PageRepairFragmentation, but here we additionally + * deal with unused items that can't be immediately reclaimed. We don't allow + * page to be pruned, if there is an inplace update from an open transaction. + * The reason is that we don't know the size of previous row in undo which + * could be bigger in which case we might not be able to perform rollback once + * the page is repaired. Now, we can always traverse the undo chain to find + * the size of largest tuple in the chain, but we don't do that for now as it + * can take time especially if there are many such tuples on the page. + */ +void +ZPageRepairFragmentation(Buffer buffer, Page tmppage, + OffsetNumber target_offnum, Size space_required, + bool NoTPDBufLock, bool *pruned) +{ + Page page = BufferGetPage(buffer); + Offset pd_lower = ((PageHeader)page)->pd_lower; + Offset pd_upper = ((PageHeader)page)->pd_upper; + Offset pd_special = ((PageHeader)page)->pd_special; + itemIdSortData itemidbase[MaxZHeapTuplesPerPage]; + itemIdSort itemidptr; + ItemId lp; + TransactionId xid; + uint32 epoch; + UndoRecPtr urec_ptr; + int nline, + nstorage, + nunused; + int i; + Size totallen; + + /* + * It's worth the trouble to be more paranoid here than in most places, + * because we are about to reshuffle data in (what is usually) a shared + * disk buffer. If we aren't careful then corrupted pointers, lengths, + * etc could cause us to clobber adjacent disk buffers, spreading the data + * loss further. So, check everything. + */ + if (pd_lower < SizeOfPageHeaderData || + pd_lower > pd_upper || + pd_upper > pd_special || + pd_special > BLCKSZ || + pd_special != MAXALIGN(pd_special)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u", + pd_lower, pd_upper, pd_special))); + + nline = PageGetMaxOffsetNumber(page); + + /* + * If there are any tuples which are inplace updated by any open + * transactions we shall not compactify the page contents, otherwise, + * rollback of those transactions will not be possible. There could be + * a case, where within a transaction tuple is first inplace updated + * and then, either updated or deleted. So for now avoid compaction if + * there are any tuples which are marked inplace updated, updated or + * deleted by an open transaction. 
+ */ + for (i = FirstOffsetNumber; i <= nline; i++) + { + lp = PageGetItemId(page, i); + if (ItemIdIsUsed(lp) && ItemIdHasStorage(lp)) + { + ZHeapTupleHeader tup; + + tup = (ZHeapTupleHeader) PageGetItem(page, lp); + + if (!(tup->t_infomask & (ZHEAP_INPLACE_UPDATED | + ZHEAP_UPDATED | ZHEAP_DELETED))) + continue; + + if (!ZHeapTupleHasInvalidXact(tup->t_infomask)) + { + int trans_slot; + + trans_slot = ZHeapTupleHeaderGetXactSlot(tup); + if (trans_slot == ZHTUP_SLOT_FROZEN) + continue; + + /* + * XXX There is possibility that the updater's slot got reused by a + * locker in such a case the INVALID_XACT will be moved to lockers + * undo. Now, we will find that the tuple has in-place update flag + * but it doesn't have INVALID_XACT flag and the slot transaction is + * also running, in such case we will not prune this page. Ideally + * if the multi-locker is set we can get the actual transaction and + * check the status of the transaction. + */ + trans_slot = GetTransactionSlotInfo(buffer, i, trans_slot, + &epoch, &xid, &urec_ptr, + NoTPDBufLock, false); + /* + * It is quite possible that the item is showing some + * valid transaction slot, but actual slot has been frozen. + * This can happen when the slot belongs to TPD entry and + * the corresponding TPD entry is pruned. + */ + if (trans_slot == ZHTUP_SLOT_FROZEN) + continue; + + if (!TransactionIdDidCommit(xid)) + return; + } + } + } + + /* + * Run through the line pointer array and collect data about live items. + */ + itemidptr = itemidbase; + nunused = totallen = 0; + for (i = FirstOffsetNumber; i <= nline; i++) + { + lp = PageGetItemId(page, i); + if (ItemIdIsUsed(lp)) + { + if (ItemIdHasStorage(lp)) + { + itemidptr->offsetindex = i - 1; + itemidptr->itemoff = ItemIdGetOffset(lp); + if (unlikely(itemidptr->itemoff < (int)pd_upper || + itemidptr->itemoff >= (int)pd_special)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted item pointer: %u", + itemidptr->itemoff))); + /* + * We need to save additional space for the target offset, so + * that we can save the space for new tuple. + */ + if (i == target_offnum) + itemidptr->alignedlen = SHORTALIGN(ItemIdGetLength(lp) + space_required); + else + itemidptr->alignedlen = SHORTALIGN(ItemIdGetLength(lp)); + totallen += itemidptr->alignedlen; + itemidptr++; + } + } + else + { + nunused++; + + /* + * We allow Unused entries to be reused only if there is no + * transaction information for the entry or the transaction + * is committed. + */ + if (ItemIdHasPendingXact(lp)) + { + int trans_slot = ItemIdGetTransactionSlot(lp); + + /* + * Here, we are relying on the transaction information in + * slot as if the corresponding slot has been reused, then + * transaction information from the entry would have been + * cleared. See PageFreezeTransSlots. + */ + if (trans_slot != ZHTUP_SLOT_FROZEN) + { + trans_slot = GetTransactionSlotInfo(buffer, i, trans_slot, + &epoch, &xid, + &urec_ptr, NoTPDBufLock, + false); + /* + * It is quite possible that the item is showing some + * valid transaction slot, but actual slot has been frozen. + * This can happen when the slot belongs to TPD entry and + * the corresponding TPD entry is pruned. 
+ */ + if (trans_slot != ZHTUP_SLOT_FROZEN && + !TransactionIdDidCommit(xid)) + continue; + } + } + + /* Unused entries should have lp_len = 0, but make sure */ + ItemIdSetUnused(lp); + } + } + + nstorage = itemidptr - itemidbase; + if (nstorage == 0) + { + /* Page is completely empty, so just reset it quickly */ + ((PageHeader)page)->pd_upper = pd_special; + } + else + { + /* Need to compact the page the hard way */ + if (totallen > (Size)(pd_special - pd_lower)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted item lengths: total %u, available space %u", + (unsigned int)totallen, pd_special - pd_lower))); + + compactify_ztuples(itemidbase, nstorage, page, tmppage); + } + + /* Set hint bit for PageAddItem */ + if (nunused > 0) + PageSetHasFreeLinePointers(page); + else + PageClearHasFreeLinePointers(page); + + /* indicate that the page has been pruned */ + if (pruned) + *pruned = true; +} diff --git a/src/backend/access/zheap/rewritezheap.c b/src/backend/access/zheap/rewritezheap.c new file mode 100644 index 0000000000..4f3dac46c5 --- /dev/null +++ b/src/backend/access/zheap/rewritezheap.c @@ -0,0 +1,373 @@ +/*------------------------------------------------------------------------- + * + * rewritezheap.c + * Support functions to rewrite zheap tables. + * + * These functions provide a facility to completely rewrite a heap. + * + * INTERFACE + * + * The caller is responsible for creating the new heap, all catalog + * changes, supplying the tuples to be written to the new heap, and + * rebuilding indexes. The caller must hold AccessExclusiveLock on the + * target table, because we assume no one else is writing into it. + * + * To use the facility: + * + * begin_heap_rewrite + * while (fetch next tuple) + * { + * if (tuple is dead) + * rewrite_heap_dead_tuple + * else + * { + * // do any transformations here if required + * rewrite_heap_tuple + * } + * } + * end_zheap_rewrite + * + * The contents of the new relation shouldn't be relied on until after + * end_zheap_rewrite is called. + * + * + * IMPLEMENTATION + * + * As of now, this layer gets only LIVE tuples and we freeze them before + * storing in new heap. This is not a good idea as we lose all the + * visibility information of tuples, but OTOH, the same can't be copied + * from the original tuple as that is maintained in undo and we don't have + * facility to modify undorecords. + * + * One idea to capture the visibility information is that we should write a + * special undo record such that it stores previous version's visibility + * information and later if the current version is not visible as per latest + * xid (which is of cluster/vacuum full command), then we should get previous + * xid information from undo. It seems along with previous versions xid, we + * need to write previous version tuples as well and somehow need to fix the + * ctid information in the undo records. + * + * We can't use the normal zheap_insert function to insert into the new + * heap, because heap_insert overwrites the visibility information and + * it uses buffer management layer to process the tuples which is bit + * slower. We use a special-purpose raw_zheap_insert function instead, which + * is optimized for bulk inserting a lot of tuples, knowing that we have + * exclusive access to the heap. raw_zheap_insert builds new pages in + * local storage. When a page is full, or at the end of the process, + * we insert it to WAL as a single record and then write it to disk + * directly through smgr. 
Note, however, that any data sent to the new + * heap's TOAST table will go through the normal bufmgr. + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994-5, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/zheap/rewritezheap.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include +#include + +#include "access/rewritezheap.h" +#include "access/tuptoaster.h" +#include "miscadmin.h" +#include "storage/bufmgr.h" +#include "storage/smgr.h" +#include "storage/procarray.h" +#include "utils/memutils.h" + + +/* + * State associated with a rewrite operation. This is opaque to the user + * of the rewrite facility. + */ +typedef struct RewriteZheapStateData +{ + Relation rs_new_rel; /* destination heap */ + Page rs_buffer; /* page currently being built */ + BlockNumber rs_blockno; /* block where page will go */ + bool rs_buffer_valid; /* T if any tuples in buffer */ + bool rs_use_wal; /* must we WAL-log inserts? */ + MemoryContext rs_cxt; /* for hash tables and entries and tuples in + * them */ +} RewriteZheapStateData; + + +/* prototypes for internal functions */ +static void raw_zheap_insert(RewriteZheapState state, ZHeapTuple tup); + +/* + * Begin a rewrite of a table + * + * old_heap old, locked heap relation tuples will be read from + * new_heap new, locked heap relation to insert tuples to + * oldest_xmin xid used by the caller to determine which tuples are dead + * freeze_xid this is kept for API compatability with heap, it's value will + * be InvalidTransactionId. + * min_multi this is kept for API compatability with heap, it's value will + * will be InvalidMultiXactId + * use_wal should the inserts to the new heap be WAL-logged? + * + * Returns an opaque RewriteState, allocated in current memory context, + * to be used in subsequent calls to the other functions. + */ +RewriteZheapState +begin_zheap_rewrite(Relation old_heap, Relation new_heap, + TransactionId oldest_xmin, TransactionId freeze_xid, + MultiXactId cutoff_multi, bool use_wal) +{ + RewriteZheapState state; + MemoryContext rw_cxt; + MemoryContext old_cxt; + + /* + * To ease cleanup, make a separate context that will contain the + * RewriteState struct itself plus all subsidiary data. + */ + rw_cxt = AllocSetContextCreate(CurrentMemoryContext, + "Table rewrite", + ALLOCSET_DEFAULT_SIZES); + old_cxt = MemoryContextSwitchTo(rw_cxt); + + /* Create and fill in the state struct */ + state = palloc0(sizeof(RewriteZheapStateData)); + + state->rs_new_rel = new_heap; + state->rs_buffer = (Page) palloc(BLCKSZ); + /* new_heap needn't be empty, just locked */ + state->rs_blockno = RelationGetNumberOfBlocks(new_heap); + state->rs_buffer_valid = false; + state->rs_use_wal = use_wal; + state->rs_cxt = rw_cxt; + + MemoryContextSwitchTo(old_cxt); + + return state; +} + +/* + * End a rewrite. + * + * state and any other resources are freed. + */ +void +end_zheap_rewrite(RewriteZheapState state) +{ + /* Write the last page, if any */ + if (state->rs_buffer_valid) + { + if (state->rs_use_wal) + log_newpage(&state->rs_new_rel->rd_node, + MAIN_FORKNUM, + state->rs_blockno, + state->rs_buffer, + true); + RelationOpenSmgr(state->rs_new_rel); + + PageSetChecksumInplace(state->rs_buffer, state->rs_blockno); + + smgrextend(state->rs_new_rel->rd_smgr, MAIN_FORKNUM, state->rs_blockno, + (char *) state->rs_buffer, true); + } + + /* + * If the rel is WAL-logged, must fsync before commit. 
We use heap_sync + * to ensure that the toast table gets fsync'd too. + * + * It's obvious that we must do this when not WAL-logging. It's less + * obvious that we have to do it even if we did WAL-log the pages. The + * reason is the same as in tablecmds.c's copy_relation_data(): we're + * writing data that's not in shared buffers, and so a CHECKPOINT + * occurring during the rewriteheap operation won't have fsync'd data we + * wrote before the checkpoint. + */ + if (RelationNeedsWAL(state->rs_new_rel)) + heap_sync(state->rs_new_rel); + + /* Deleting the context frees everything */ + MemoryContextDelete(state->rs_cxt); +} + +/* + * Reconstruct and rewrite the given tuple + * + * We cannot simply copy the tuple as-is, see reform_and_rewrite_tuple for + * reasons. + */ +void +reform_and_rewrite_ztuple(ZHeapTuple tuple, TupleDesc oldTupDesc, + TupleDesc newTupDesc, Datum *values, bool *isnull, + RewriteZheapState rwstate) +{ + ZHeapTuple copiedTuple; + int i; + + zheap_deform_tuple(tuple, oldTupDesc, values, isnull); + + /* Be sure to null out any dropped columns */ + for (i = 0; i < newTupDesc->natts; i++) + { + if (TupleDescAttr(newTupDesc, i)->attisdropped) + isnull[i] = true; + } + + copiedTuple = zheap_form_tuple(newTupDesc, values, isnull); + + rewrite_zheap_tuple(rwstate, tuple, copiedTuple); + + zheap_freetuple(copiedTuple); +} + +/* + * Add a tuple to the new heap. + * + * Maintaining previous version's visibility information needs much more work + * (see atop of this file), so for now, we freeze all the tuples. We only get + * LIVE versions of the tuple as input. + * + * Note that since we scribble on new_tuple, it had better be temp storage + * not a pointer to the original tuple. + * + * state opaque state as returned by begin_heap_rewrite + * old_tuple original tuple in the old heap + * new_tuple new, rewritten tuple to be inserted to new heap + */ +void +rewrite_zheap_tuple(RewriteZheapState state, ZHeapTuple old_tuple, + ZHeapTuple new_tuple) +{ + MemoryContext old_cxt; + + old_cxt = MemoryContextSwitchTo(state->rs_cxt); + + /* + * As of now, we copy only LIVE tuples in zheap, so we can mark them as + * frozen. + */ + new_tuple->t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + new_tuple->t_data->t_infomask2 &= ~ZHEAP_XACT_SLOT; + ZHeapTupleHeaderSetXactSlot(new_tuple->t_data, ZHTUP_SLOT_FROZEN); + + raw_zheap_insert(state, new_tuple); + + MemoryContextSwitchTo(old_cxt); +} + +/* + * Insert a tuple to the new relation. This has to track zheap_insert + * and its subsidiary functions! + * + * t_self of the tuple is set to the new TID of the tuple. + */ +static void +raw_zheap_insert(RewriteZheapState state, ZHeapTuple tup) +{ + Page page = state->rs_buffer; + Size pageFreeSpace, + saveFreeSpace; + Size len; + OffsetNumber newoff; + ZHeapTuple heaptup; + + /* + * If the new tuple is too big for storage or contains already toasted + * out-of-line attributes from some other relation, invoke the toaster. + * + * Note: below this point, heaptup is the data we actually intend to store + * into the relation; tup is the caller's original untoasted data. + */ + if (state->rs_new_rel->rd_rel->relkind == RELKIND_TOASTVALUE) + { + /* toast table entries should never be recursively toasted */ + Assert(!ZHeapTupleHasExternal(tup)); + heaptup = tup; + } + else if (ZHeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD) + { + /* + * As of now, we copy only LIVE tuples in zheap, so we can mark them as + * frozen. 
+ */ + heaptup = ztoast_insert_or_update(state->rs_new_rel, tup, NULL, + HEAP_INSERT_FROZEN | + HEAP_INSERT_SKIP_FSM | + (state->rs_use_wal ? + 0 : HEAP_INSERT_SKIP_WAL)); + } + else + heaptup = tup; + + len = SHORTALIGN(heaptup->t_len); + + /* + * If we're gonna fail for oversize tuple, do it right away + */ + if (len > MaxZHeapTupleSize) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("row is too big: size %zu, maximum size %zu", + len, MaxZHeapTupleSize))); + + /* Compute desired extra freespace due to fillfactor option */ + saveFreeSpace = RelationGetTargetPageFreeSpace(state->rs_new_rel, + HEAP_DEFAULT_FILLFACTOR); + + /* Now we can check to see if there's enough free space already. */ + if (state->rs_buffer_valid) + { + pageFreeSpace = PageGetHeapFreeSpace(page); + + if (len + saveFreeSpace > pageFreeSpace) + { + /* Doesn't fit, so write out the existing page */ + + /* XLOG stuff */ + if (state->rs_use_wal) + log_newpage(&state->rs_new_rel->rd_node, + MAIN_FORKNUM, + state->rs_blockno, + page, + true); + + /* + * Now write the page. We say isTemp = true even if it's not a + * temp table, because there's no need for smgr to schedule an + * fsync for this write; we'll do it ourselves in + * end_zheap_rewrite. + */ + RelationOpenSmgr(state->rs_new_rel); + + PageSetChecksumInplace(page, state->rs_blockno); + + smgrextend(state->rs_new_rel->rd_smgr, MAIN_FORKNUM, + state->rs_blockno, (char *) page, true); + + state->rs_blockno++; + state->rs_buffer_valid = false; + } + } + + if (!state->rs_buffer_valid) + { + /* Initialize a new empty page */ + ZheapInitPage(page, BLCKSZ); + state->rs_buffer_valid = true; + } + + /* And now we can insert the tuple into the page */ + newoff = ZPageAddItem(InvalidBuffer, page, (Item) heaptup->t_data, + heaptup->t_len, InvalidOffsetNumber, false, true, + true); + if (newoff == InvalidOffsetNumber) + elog(ERROR, "failed to add tuple"); + + /* Update caller's t_self to the actual position where it was stored */ + ItemPointerSet(&(tup->t_self), state->rs_blockno, newoff); + + /* If heaptup is a private copy, release it. */ + if (heaptup != tup) + zheap_freetuple(heaptup); +} diff --git a/src/backend/access/zheap/tpd.c b/src/backend/access/zheap/tpd.c new file mode 100644 index 0000000000..99fb59dc37 --- /dev/null +++ b/src/backend/access/zheap/tpd.c @@ -0,0 +1,3148 @@ +/*------------------------------------------------------------------------- + * + * tpd.c + * zheap transaction overflow pages code + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * TPD is nothing but temporary data page consisting of extended transaction + * slots from heap pages. There are two primary reasons for having TPD (a) In + * the heap page, we have fixed number of transaction slots which can lead to + * deadlock, (b) To support cases where a large number of transactions acquire + * SHARE or KEY SHARE locks on a single page. + * + * The TPD overflow pages will be stored in the zheap itself, interleaved with + * regular pages. We have a meta page in zheap from which all overflow pages + * are tracked. + * + * TPD Entry acts like an extension of the transaction slot array in heap + * page. Tuple headers normally point to the transaction slot responsible for + * the last modification, but since there aren't enough bits available to do + * this in the case where a TPD is used, an offset -> slot mapping is stored + * in the TPD entry itself. 
This array can be used to get the slot for tuples + * in heap page, but for undo tuples we can't use it because we can't track + * multiple slots that have updated the same tuple. So for undo records, we + * record the TPD transaction slot number along with the undo record. + * + * IDENTIFICATION + * src/backend/access/zheap/tpd.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/tpd_xlog.h" +#include "access/zheap.h" +#include "access/zheapam_xlog.h" +#include "miscadmin.h" +#include "storage/bufmgr.h" +#include "storage/buf_internals.h" +#include "storage/lmgr.h" +#include "storage/proc.h" +#include "utils/lsyscache.h" +#include "utils/relfilenodemap.h" + +/* + * We never need more than two TPD buffers per zheap page, so the maximum + * number of TPD buffers required will be four. This can happen for + * non-inplace updates that insert new record to a different zheap page. In + * general, we require one tpd page for zheap page, but for the cases when + * we need to extend the tpd entry to a different page, we will operate on + * two tpd buffers. + */ +#define MAX_TPD_BUFFERS 4 + +/* Undo block number to buffer mapping. */ +typedef struct TPDBuffers +{ + BlockNumber blk; /* block number */ + Buffer buf; /* buffer allocated for the block */ +} TPDBuffers; + +/* + * GetTPDBuffer operations + * + * TPD_BUF_FIND - Find the buffer in existing array of tpd buffers. + * TPD_BUF_FIND_OR_ENTER - Like previous, but if not found then allocate a new + * buffer and add it to tpd buffers array for future use. + * TPD_BUF_FIND_OR_KNOWN_ENTER - Like TPD_BUF_FIND, but if not found, then add + * the already known buffer to tpd buffers array for future use. + * TPD_BUF_ENTER - Allocate a new TPD buffer and add it to tpd buffers array + * for future use. + */ +typedef enum +{ + TPD_BUF_FIND, + TPD_BUF_FIND_OR_ENTER, + TPD_BUF_FIND_OR_KNOWN_ENTER, + TPD_BUF_ENTER +} TPDACTION; + +static Buffer registered_tpd_buffers[MAX_TPD_BUFFERS]; +static TPDBuffers tpd_buffers[MAX_TPD_BUFFERS]; +static int tpd_buf_idx; +static int registered_tpd_buf_idx; +static int GetTPDBuffer(Relation rel, BlockNumber blk, Buffer tpd_buf, + TPDACTION tpd_action, bool *already_exists); +static void TPDEntryUpdate(Relation relation, Buffer tpd_buf, + uint16 tpd_e_offset, OffsetNumber tpd_item_off, + char *tpd_entry, Size size_tpd_entry); +static void TPDAllocatePageAndAddEntry(Relation relation, Buffer metabuf, + Buffer pagebuf, Buffer old_tpd_buf, + OffsetNumber old_off_num, char *tpd_entry, + Size size_tpd_entry, bool add_new_tpd_page, + bool delete_old_entry); +static bool TPDBufferAlreadyRegistered(Buffer tpd_buf); +static void ReleaseLastTPDBuffer(Buffer buf); +static void LogAndClearTPDLocation(Relation relation, Buffer heapbuf, + bool *tpd_e_pruned); +void +ResetRegisteredTPDBuffers() +{ + registered_tpd_buf_idx = 0; +} + +/* + * GetTPDBuffer - Get the tpd buffer corresponding to give block number. + * + * Returns -1, if the tpd_action is TPD_BUF_FIND and buffer for the required + * block is not present in tpd buffers array, otherwise returns the index of + * buffer in the array. + * + * rel can be NULL, if user intends to just search for existing buffer. + */ +static int +GetTPDBuffer(Relation rel, BlockNumber blk, Buffer tpd_buf, + TPDACTION tpd_action, bool *already_exists) +{ + int i; + Buffer buf; + + /* The number of active TPD buffers must be less than MAX_TPD_BUFFERS. 
*/ + Assert(tpd_buf_idx <= MAX_TPD_BUFFERS); + *already_exists = false; + + /* + * If new block needs to be allocated, then we don't need to search + * existing set of buffers. + */ + if (tpd_action != TPD_BUF_ENTER) + { + /* + * Don't do anything, if we already have a buffer pinned for the required + * block. + */ + for (i = 0; i < tpd_buf_idx; i++) + { + if (blk == tpd_buffers[i].blk) + { + *already_exists = true; + return i; + } + } + } + else + i = tpd_buf_idx; + + /* + * If the buffer doesn't exist and caller doesn't intend to allocate new + * buffer, then we are done. + */ + if (tpd_action == TPD_BUF_FIND && !(*already_exists)) + return -1; + + if (tpd_action == TPD_BUF_FIND_OR_KNOWN_ENTER) + { + Assert (i == tpd_buf_idx); + Assert (BufferIsValid(tpd_buf)); + + tpd_buffers[tpd_buf_idx].blk = BufferGetBlockNumber(tpd_buf); + tpd_buffers[tpd_buf_idx].buf = tpd_buf; + tpd_buf_idx++; + + return i; + } + + /* + * Caller must have passed relation, if it intends to read a block that is + * not already read. + */ + Assert(rel != NULL); + + /* + * We don't have the required buffer, so read it and remember in the TPD + * buffer array. + */ + if (i == tpd_buf_idx) + { + buf = ReadBuffer(rel, blk); + tpd_buffers[tpd_buf_idx].blk = BufferGetBlockNumber(buf); + tpd_buffers[tpd_buf_idx].buf = buf; + tpd_buf_idx++; + } + + return i; +} + +/* + * TPDBufferAlreadyRegistered - Check whether the buffer is already registered. + * + * Returns true if the buffer is already registered, otherwise add it to the + * registered buffer array and return false. + */ +static bool +TPDBufferAlreadyRegistered(Buffer tpd_buf) +{ + int i; + + for (i = 0; i < registered_tpd_buf_idx; i++) + { + if (tpd_buf == registered_tpd_buffers[i]) + return true; + } + + registered_tpd_buffers[registered_tpd_buf_idx++] = tpd_buf; + + return false; +} + +/* + * ReleaseLastTPDBuffer - Release last tpd buffer + */ +static void +ReleaseLastTPDBuffer(Buffer buf) +{ + Buffer last_tpd_buf PG_USED_FOR_ASSERTS_ONLY; + + last_tpd_buf = tpd_buffers[tpd_buf_idx - 1].buf; + Assert(buf == last_tpd_buf); + UnlockReleaseBuffer(buf); + tpd_buffers[tpd_buf_idx - 1].buf = InvalidBuffer; + tpd_buffers[tpd_buf_idx - 1].blk = InvalidBlockNumber; + tpd_buf_idx--; +} + +/* + * AllocateAndFormTPDEntry - Allocate and form the new TPD entry. + * + * We initialize the TPD entry and also move the last transaction slot + * information from heap page to first slot in TPD entry. + * + * reserved_slot - returns the first available slot. 
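+ * + * The returned entry is a palloc'd chunk laid out as a TPDEntryHeaderData, followed by the one-byte offset map, followed by the transaction slot array; the caller is responsible for freeing it.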
+ */ +static char * +AllocateAndFormTPDEntry(Buffer buf, OffsetNumber offset, + Size *size_tpd_entry, int *reserved_slot) +{ + Size size_tpd_e_map; + Size size_tpd_e_slots; + int i; + OffsetNumber offnum, max_required_offset; + char *tpd_entry; + char *tpd_entry_data; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + TransInfo *tpd_e_trans_slots; + Page page; + TPDEntryHeaderData tpe_header; + uint16 num_map_entries; + + page = BufferGetPage(buf); + if (OffsetNumberIsValid(offset)) + max_required_offset = offset; + else + max_required_offset = PageGetMaxOffsetNumber(page); + + num_map_entries = max_required_offset + ADDITIONAL_MAP_ELEM_IN_TPD_ENTRY; + + /* form tpd entry header */ + tpe_header.blkno = BufferGetBlockNumber(buf); + tpe_header.tpe_num_map_entries = num_map_entries; + tpe_header.tpe_num_slots = INITIAL_TRANS_SLOTS_IN_TPD_ENTRY; + tpe_header.tpe_flags = TPE_ONE_BYTE; + + size_tpd_e_map = num_map_entries * sizeof(uint8); + size_tpd_e_slots = INITIAL_TRANS_SLOTS_IN_TPD_ENTRY * sizeof(TransInfo); + + /* form transaction slots for tpd entry */ + tpd_e_trans_slots = (TransInfo *) palloc(size_tpd_e_slots); + + for (i = 0; i < INITIAL_TRANS_SLOTS_IN_TPD_ENTRY; i++) + { + tpd_e_trans_slots[i].xid_epoch = 0; + tpd_e_trans_slots[i].xid = InvalidTransactionId; + tpd_e_trans_slots[i].urec_ptr = InvalidUndoRecPtr; + } + + /* + * Move the last transaction slot information from heap page to first + * transaction slot in TPD entry. + */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpd_e_trans_slots[0].xid_epoch = last_trans_slot_info.xid_epoch; + tpd_e_trans_slots[0].xid = last_trans_slot_info.xid; + tpd_e_trans_slots[0].urec_ptr = last_trans_slot_info.urec_ptr; + + /* form tpd entry */ + *size_tpd_entry = SizeofTPDEntryHeader + size_tpd_e_map + + size_tpd_e_slots; + + tpd_entry = (char *) palloc0(*size_tpd_entry); + + memcpy(tpd_entry, (char *) &tpe_header, SizeofTPDEntryHeader); + + tpd_entry_data = tpd_entry + SizeofTPDEntryHeader; + + /* + * Update the itemid to slot map for all the itemid's that point to last + * transaction slot in the heap page. + */ + for (offnum = FirstOffsetNumber; + offnum <= PageGetMaxOffsetNumber(page); + offnum = OffsetNumberNext(offnum)) + { + ZHeapTupleHeader tup_hdr; + ItemId itemid; + int trans_slot; + + itemid = PageGetItemId(page, offnum); + + if (ItemIdIsDead(itemid)) + continue; + + if (!ItemIdIsUsed(itemid)) + { + if (!ItemIdHasPendingXact(itemid)) + continue; + trans_slot = ItemIdGetTransactionSlot(itemid); + } + else if (ItemIdIsDeleted(itemid)) + { + trans_slot = ItemIdGetTransactionSlot(itemid); + } + else + { + tup_hdr = (ZHeapTupleHeader) PageGetItem(page, itemid); + trans_slot = ZHeapTupleHeaderGetXactSlot(tup_hdr); + } + + /* + * Update the itemid to slot map in tpd entry such that all of the + * offsets corresponding to tuples that were pointing to last slot in + * heap page will now point to first slot in TPD entry. + */ + if (trans_slot == ZHEAP_PAGE_TRANS_SLOTS) + { + uint8 offset_tpd_e_loc; + + offset_tpd_e_loc = ZHEAP_PAGE_TRANS_SLOTS + 1; + + /* + * One byte access shouldn't cause unaligned access, but using memcpy + * for the sake of consistency. 
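+			 * For example, a tuple at offset number 1 that was using the page's last slot gets map index 0 set to ZHEAP_PAGE_TRANS_SLOTS + 1, the number by which callers refer to the first TPD transaction slot.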
+ */ + memcpy(tpd_entry_data + (offnum - 1), (char *) &offset_tpd_e_loc, + sizeof(uint8)); + } + } + + memcpy(tpd_entry + SizeofTPDEntryHeader + size_tpd_e_map, + (char *) tpd_e_trans_slots, size_tpd_e_slots); + + /* + * The first slot location has been already assigned to last slot moved + * from heap page. We can safely reserve the second slot location in new + * TPD entry. + */ + *reserved_slot = ZHEAP_PAGE_TRANS_SLOTS + 2; + + /* be tidy */ + pfree(tpd_e_trans_slots); + + return tpd_entry; +} + +/* + * ExtendTPDEntry - Allocate bigger TPD entry and copy the contents of old TPD + * entry to new TPD entry. + * + * We are quite conservative in extending the TPD entry because the bigger the + * entry more is the chance of space wastage. OTOH, it might have some + * performance impact because smaller the entry more is the chance of getting + * a request for extension. However, we feel that as we have a mechanism to + * reuse the transaction slots, we shouldn't get the frequent requests for + * extending the entry, at the very least not in performance critical paths. + */ +static void +ExtendTPDEntry(Relation relation, Buffer heapbuf, TransInfo *trans_slots, + OffsetNumber offnum, int buf_idx, int old_num_map_entries, + int old_num_slots, int *reserved_slot_no, UndoRecPtr *urecptr, + bool *tpd_e_pruned) +{ + TPDEntryHeaderData old_tpd_e_header, tpd_e_header; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + Page old_tpd_page; + Page heappage; + Buffer old_tpd_buf; + Buffer metabuf = InvalidBuffer; + BlockNumber tpdblk; + OffsetNumber max_page_offnum; + Size tpdpageFreeSpace; + Size new_size_tpd_entry, + old_size_tpd_entry, + new_size_tpd_e_map, + new_size_tpd_e_slots, + old_size_tpd_e_map, + old_size_tpd_e_slots; + ItemId itemId; + OffsetNumber tpdItemOff; + int old_loc_tpd_e_map, + old_loc_trans_slots; + int max_reqd_map_entries; + int max_reqd_slots = 0; + int num_free_slots = 0; + int slot_no; + int entries_removed; + uint16 tpd_e_offset; + char *tpd_entry; + bool already_exists; + bool allocate_new_tpd_page = false; + bool update_tpd_inplace, + tpd_pruned; + + heappage = BufferGetPage(heapbuf); + max_page_offnum = PageGetMaxOffsetNumber(heappage); + + /* + * Select the maximum among required offset num, current map + * entries, and highest page offset as the number of offset-map + * entries for a new TPD entry. We do allocate few additional map + * entries so that we don't need to allocate new TPD entry soon. + * Also, we ensure that we don't try to allocate more than + * MaxZHeapTuplesPerPage offset-map entries. + */ + max_reqd_map_entries = Max(offnum, + Max(old_num_map_entries, max_page_offnum)); + max_reqd_map_entries += ADDITIONAL_MAP_ELEM_IN_TPD_ENTRY; + max_reqd_map_entries = Min(max_reqd_map_entries, + MaxZHeapTuplesPerPage); + + /* + * If there are more than fifty percent of empty slots available, + * then we don't extend the number of transaction slots in new TPD + * entry. Otherwise also, we extend the slots quite conservately + * to avoid space wastage. + */ + if (*reserved_slot_no != InvalidXactSlotId) + { + for (slot_no = 0; slot_no < old_num_slots; slot_no++) + { + /* + * Check for the number of unreserved transaction slots in + * the TPD entry. 
+ */ + if (trans_slots[slot_no].xid == InvalidTransactionId) + num_free_slots++; + } + + if (num_free_slots >= old_num_slots / 2) + max_reqd_slots = old_num_slots; + } + + if (max_reqd_slots <= 0) + max_reqd_slots = old_num_slots + INITIAL_TRANS_SLOTS_IN_TPD_ENTRY; + + /* + * The transaction slots in TPD entry are in addition to the + * maximum slots in the heap page. The one-byte offset-map can + * store maximum upto 255 transaction slot number. + */ + if (max_reqd_slots + ZHEAP_PAGE_TRANS_SLOTS < 256) + new_size_tpd_e_map = max_reqd_map_entries * sizeof(uint8); + else + new_size_tpd_e_map = max_reqd_map_entries * sizeof(uint32); + new_size_tpd_e_slots = max_reqd_slots * sizeof(TransInfo); + new_size_tpd_entry = SizeofTPDEntryHeader + new_size_tpd_e_map + + new_size_tpd_e_slots; + + /* TPD entries can't span in multiple blocks. */ + if (new_size_tpd_entry > MaxTPDEntrySize) + { + /* + * FIXME: what we should do if TPD entry can not fit in one page? + * currently we are forcing it to retry. + */ + elog(LOG, "TPD entry size (%lu) cannot be greater than \ + MaxTPDEntrySize (%u)", new_size_tpd_entry, MaxTPDEntrySize); + + *reserved_slot_no = InvalidXactSlotId; + return; + } + + if (buf_idx != -1) + old_tpd_buf = tpd_buffers[buf_idx].buf; + else + { + /* + * The last slot in page has the address of the required TPD + * entry. + */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + buf_idx = GetTPDBuffer(relation, tpdblk, InvalidBuffer, + TPD_BUF_FIND_OR_ENTER, &already_exists); + old_tpd_buf = tpd_buffers[buf_idx].buf; + + /* + * The tpd buffer must already exists as before reaching here + * we must have called TPDPageGetTransactionSlots which would + * have read the required buffer. + */ + Assert(already_exists); + } + + /* The last slot in page has the address of the required TPD entry. */ + old_tpd_page = BufferGetPage(old_tpd_buf); + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(BufferGetPage(heapbuf)); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + itemId = PageGetItemId(old_tpd_page, tpdItemOff); + old_size_tpd_entry = ItemIdGetLength(itemId); + + /* We have a lock on tpd page, so nobody can prune our tpd entry. */ + Assert(ItemIdIsUsed(itemId)); + + tpdpageFreeSpace = PageGetTPDFreeSpace(old_tpd_page); + + /* + * Call TPDPagePrune to ensure that it will create a space adjacent to + * current offset for the new (bigger) TPD entry, if possible. + */ + entries_removed = TPDPagePrune(relation, old_tpd_buf, NULL, tpdItemOff, + (new_size_tpd_entry - old_size_tpd_entry), + true, &update_tpd_inplace, &tpd_pruned); + /* + * If the item got pruned, then clear the TPD slot from the page and + * return. The entry can be pruned by ourselves or by anyone else + * as we release the lock during pruning if the page is empty. 
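+	 * The caller then sees *tpd_e_pruned set to true along with InvalidXactSlotId; TPDPageReserveTransSlot uses that to fall back to the now-free last transaction slot on the heap page.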
+ */ + if (PageIsEmpty(old_tpd_page) || + !ItemIdIsUsed(itemId) || + tpd_pruned) + { + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + *reserved_slot_no = InvalidXactSlotId; + *tpd_e_pruned = true; + if (metabuf != InvalidBuffer) + ReleaseBuffer(metabuf); + return; + } + + if (!update_tpd_inplace) + { + if (entries_removed > 0) + tpdpageFreeSpace = PageGetTPDFreeSpace(old_tpd_page); + + if (tpdpageFreeSpace < new_size_tpd_entry) + { + /* + * XXX Here, we can have an optimization such that instead of + * allocating a new page, we can search other TPD pages starting + * from the first_used_tpd_page till we reach last_used_tpd_page. + * It is not clear whether such an optimization can help because + * checking all the TPD pages isn't free either. + */ + metabuf = ReadBuffer(relation, ZHEAP_METAPAGE); + allocate_new_tpd_page = true; + } + else + { + /* + * We must not reach here because if the new tpd entry can fit on the same + * page, then update_tpd_inplace would have been set by TPDPagePrune. + */ + Assert(false); + } + } + + /* form tpd entry header */ + tpd_e_header.blkno = BufferGetBlockNumber(heapbuf); + tpd_e_header.tpe_num_map_entries = max_reqd_map_entries; + tpd_e_header.tpe_num_slots = max_reqd_slots; + + /* + * The transaction slots in TPD entry are in addition to the + * maximum slots in the heap page. The one-byte offset-map can + * store maximum upto 255 transaction slot number. + */ + if (max_reqd_slots + ZHEAP_PAGE_TRANS_SLOTS < 256) + tpd_e_header.tpe_flags = TPE_ONE_BYTE; + else + tpd_e_header.tpe_flags = TPE_FOUR_BYTE; + + /* + * If we reach here, then the page must be a TPD page. + */ + Assert(PageGetSpecialSize(old_tpd_page) == MAXALIGN(sizeof(TPDPageOpaqueData))); + + /* TPD entry isn't pruned */ + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &old_tpd_e_header, old_tpd_page + tpd_e_offset, SizeofTPDEntryHeader); + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(old_tpd_e_header)); + + /* This TPD entry can't be for some other block. */ + Assert(old_tpd_e_header.blkno == BufferGetBlockNumber(heapbuf)); + + if (old_tpd_e_header.tpe_flags & TPE_ONE_BYTE) + old_size_tpd_e_map = old_tpd_e_header.tpe_num_map_entries * sizeof(uint8); + else + { + Assert(old_tpd_e_header.tpe_flags & TPE_FOUR_BYTE); + old_size_tpd_e_map = old_tpd_e_header.tpe_num_map_entries * sizeof(uint32); + } + + old_size_tpd_e_slots = old_tpd_e_header.tpe_num_slots * sizeof(TransInfo); + old_loc_tpd_e_map = tpd_e_offset + SizeofTPDEntryHeader; + old_loc_trans_slots = tpd_e_offset + SizeofTPDEntryHeader + old_size_tpd_e_map; + + /* Form new TPD entry. Whatever be the case, header will remain same. */ + tpd_entry = (char *) palloc0(new_size_tpd_entry); + memcpy(tpd_entry, (char *) &tpd_e_header, SizeofTPDEntryHeader); + + if (tpd_e_header.tpe_flags & TPE_ONE_BYTE || + (tpd_e_header.tpe_flags & TPE_FOUR_BYTE && + old_tpd_e_header.tpe_flags & TPE_FOUR_BYTE)) + { + /* + * Caller must try to extend the TPD entry iff either there is a + * need of more offset-map entries or transaction slots. + */ + Assert(tpd_e_header.tpe_num_map_entries >= old_num_map_entries); + Assert(tpd_e_header.tpe_num_slots >= old_num_slots); + + /* + * In this case we can copy the contents of old offset-map and + * old transaction slots as it is. 
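+		 * The transaction slot array is copied to its position after the (possibly larger) new offset map, and any additional map entries remain zero from the palloc0 above.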
+ */ + memcpy(tpd_entry + SizeofTPDEntryHeader, + old_tpd_page + old_loc_tpd_e_map, + old_size_tpd_e_map); + memcpy(tpd_entry + SizeofTPDEntryHeader + new_size_tpd_e_map, + old_tpd_page + old_loc_trans_slots, + old_size_tpd_e_slots); + } + else if (tpd_e_header.tpe_flags & TPE_FOUR_BYTE && + old_tpd_e_header.tpe_flags & TPE_ONE_BYTE) + { + int i; + char *new_start_loc, + *old_start_loc; + + /* + * Here, we can't directly copy the offset-map because we are + * expanding it from one byte to four-bytes. We need to perform + * byte-by-byte copy for the offset-map. However, transaction + * slots can be directly copied as the size for each slot still + * remains same. + */ + Assert(old_tpd_e_header.tpe_num_map_entries == old_num_map_entries); + + new_start_loc = tpd_entry + SizeofTPDEntryHeader; + old_start_loc = old_tpd_page + old_loc_tpd_e_map; + + for (i = 0; i < old_num_map_entries; i++) + { + memcpy(new_start_loc, old_start_loc, sizeof(uint8)); + old_start_loc += sizeof(uint8); + new_start_loc += sizeof(uint32); + } + + memcpy(tpd_entry + SizeofTPDEntryHeader + new_size_tpd_e_map, + old_tpd_page + old_loc_trans_slots, + old_size_tpd_e_slots); + } + else + { + /* All the valid cases should have been dealt above. */ + Assert(false); + } + + + if (update_tpd_inplace) + { + TPDEntryUpdate(relation, old_tpd_buf, tpd_e_offset, tpdItemOff, + tpd_entry, new_size_tpd_entry); + } + else + { + /* + * Note that if we have to allocate a new page, we must delete the + * old tpd entry in old tpd buffer. + */ + TPDAllocatePageAndAddEntry(relation, metabuf, heapbuf, old_tpd_buf, + tpdItemOff, tpd_entry, new_size_tpd_entry, + allocate_new_tpd_page, + allocate_new_tpd_page); + } + + /* Release the meta buffer. */ + if (metabuf != InvalidBuffer) + ReleaseBuffer(metabuf); + + if (*reserved_slot_no == InvalidXactSlotId) + { + int slot_no; + + trans_slots = (TransInfo *) (tpd_entry + SizeofTPDEntryHeader + new_size_tpd_e_map); + + for (slot_no = 0; slot_no < tpd_e_header.tpe_num_slots; slot_no++) + { + /* Check for an unreserved transaction slot in the TPD entry */ + if (trans_slots[slot_no].xid == InvalidTransactionId) + { + *reserved_slot_no = slot_no; + break; + } + } + } + + if (*reserved_slot_no != InvalidXactSlotId) + *urecptr = trans_slots[*reserved_slot_no].urec_ptr; + + pfree(tpd_entry); + + return; +} + +/* + * TPDPageAddEntry - Add the given to TPD entry on the page and + * move the upper to point to the next free location. + * + * Return value is the offset at which it was inserted, or InvalidOffsetNumber + * if the item is not inserted for any reason. A WARNING is issued indicating + * the reason for the refusal. + * + * This function is same as PageAddItemExtended, but has different + * alignment requirements. We might want to deal with that by passing + * additional argument to PageAddItemExtended, but for now we have kept + * it as a separate function. 
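+ * + * In particular, the entry size is not MAXALIGNed here; TPD entries are stored unaligned and are read back with memcpy (see TPDPageGetTransactionSlots).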
+ */ +OffsetNumber +TPDPageAddEntry(Page tpdpage, char *tpd_entry, Size size, + OffsetNumber offnum) +{ + PageHeader phdr = (PageHeader) tpdpage; + OffsetNumber limit; + ItemId itemId; + uint16 lower; + uint16 upper; + + /* + * Be wary about corrupted page pointers + */ + if (phdr->pd_lower < SizeOfPageHeaderData || + phdr->pd_lower > phdr->pd_upper || + phdr->pd_upper > phdr->pd_special || + phdr->pd_special > BLCKSZ) + ereport(PANIC, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u", + phdr->pd_lower, phdr->pd_upper, phdr->pd_special))); + + /* + * Select offsetNumber to place the new item at + */ + limit = OffsetNumberNext(PageGetMaxOffsetNumber(tpdpage)); + + lower = phdr->pd_lower + sizeof(ItemIdData); + + if (OffsetNumberIsValid(offnum)) + { + /* + * In TPD, we send valid offset number only during recovery. Hence, + * we don't need to shuffle the offsets as well. + */ + Assert(InRecovery); + if (offnum < limit) + { + itemId = PageGetItemId(phdr, offnum); + if (ItemIdIsUsed(itemId) || ItemIdHasStorage(itemId)) + { + elog(WARNING, "will not overwrite a used ItemId"); + return InvalidOffsetNumber; + } + } + } + else + { + /* offsetNumber was not passed in, so find a free slot */ + /* if no free slot, we'll put it at limit (1st open slot) */ + if (PageHasFreeLinePointers(phdr)) + { + /* + * Look for "recyclable" (unused) ItemId. We check for no storage + * as well, just to be paranoid --- unused items should never have + * storage. + */ + for (offnum = 1; offnum < limit; offnum++) + { + itemId = PageGetItemId(phdr, offnum); + if (!ItemIdIsUsed(itemId) && !ItemIdHasStorage(itemId)) + break; + } + if (offnum >= limit) + { + /* the hint is wrong, so reset it */ + PageClearHasFreeLinePointers(phdr); + } + } + else + { + offnum = limit; + } + } + + /* Reject placing items beyond the first unused line pointer */ + if (offnum > limit) + { + elog(WARNING, "specified item offset is too large"); + return InvalidOffsetNumber; + } + + /* Reject placing items beyond tpd boundary */ + if (offnum > MaxTPDTuplesPerPage) + { + elog(WARNING, "can't put more than MaxTPDTuplesPerPage items in a tpd page"); + return InvalidOffsetNumber; + } + + /* + * Compute new lower and upper pointers for page, see if it'll fit. + * + * Note: do arithmetic as signed ints, to avoid mistakes if, say, + * alignedSize > pd_upper. + */ + if (offnum == limit) + lower = phdr->pd_lower + sizeof(ItemIdData); + else + lower = phdr->pd_lower; + + upper = (int) phdr->pd_upper - (int) size; + + if (lower > upper) + return InvalidOffsetNumber; + + /* OK to insert the item. */ + itemId = PageGetItemId(phdr, offnum); + + /* set the item pointer */ + ItemIdSetNormal(itemId, upper, size); + + /* copy the item's data onto the page */ + memcpy((char *) tpdpage + upper, tpd_entry, size); + + phdr->pd_lower = (LocationIndex) lower; + phdr->pd_upper = (LocationIndex) upper; + + return offnum; +} + +/* + * SetTPDLocation - Set TPD entry location in the last transaction slot of + * heap page and indicate the same in page. 
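+ * + * The TPD block number is stored in the slot's xid_epoch field and the entry's offset number in the OFFSET_MASK bits of its xid field; PD_PAGE_HAS_TPD_SLOT is set in the page header so that readers know to interpret the last slot this way.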
+ */ +void +SetTPDLocation(Buffer heapbuffer, Buffer tpdbuffer, OffsetNumber offset) +{ + Page heappage; + PageHeader phdr; + ZHeapPageOpaque opaque; + + heappage = BufferGetPage(heapbuffer); + phdr = (PageHeader) heappage; + + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + + /* clear the last transaction slot info */ + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid_epoch = 0; + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid = + InvalidTransactionId; + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].urec_ptr = + InvalidUndoRecPtr; + /* set TPD location in last transaction slot */ + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid_epoch = + BufferGetBlockNumber(tpdbuffer); + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid = + (opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid & ~OFFSET_MASK) | offset; + + phdr->pd_flags |= PD_PAGE_HAS_TPD_SLOT; +} + +/* + * ClearTPDLocation - Clear TPD entry location in the last transaction slot of + * heap page and indicate the same in page. + */ +void +ClearTPDLocation(Buffer heapbuf) +{ + PageHeader phdr; + ZHeapPageOpaque opaque; + Page heappage; + int frozen_slots = ZHEAP_PAGE_TRANS_SLOTS - 1; + + heappage = BufferGetPage(heapbuf); + phdr = (PageHeader) heappage; + + /* + * Before clearing the TPD slot, mark all the tuples pointing to TPD slot + * as frozen. + */ + zheap_freeze_or_invalidate_tuples(heapbuf, 1, &frozen_slots, + true, false); + + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + + /* clear the last transaction slot info */ + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid_epoch = 0; + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid = + InvalidTransactionId; + opaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].urec_ptr = + InvalidUndoRecPtr; + + phdr->pd_flags &= ~PD_PAGE_HAS_TPD_SLOT; +} + +/* + * LogClearTPDLocation - Write a WAL record for clearing TPD location. + */ +static void +LogClearTPDLocation(Buffer buffer) +{ + XLogRecPtr recptr; + + XLogBeginInsert(); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + recptr = XLogInsert(RM_TPD_ID, XLOG_TPD_CLEAR_LOCATION); + + PageSetLSN(BufferGetPage(buffer), recptr); +} + +/* + * LogAndClearTPDLocation - Clear the TPD location from heap page and WAL log + * it. + */ +static void +LogAndClearTPDLocation(Relation relation, Buffer heapbuf, bool *tpd_e_pruned) +{ + START_CRIT_SECTION(); + + ClearTPDLocation(heapbuf); + MarkBufferDirty(heapbuf); + if (RelationNeedsWAL(relation)) + LogClearTPDLocation(heapbuf); + + END_CRIT_SECTION(); + + if (tpd_e_pruned) + *tpd_e_pruned = true; +} + +/* + * TPDInitPage - Initialize the TPD page. + */ +void +TPDInitPage(Page page, Size pageSize) +{ + TPDPageOpaque tpdopaque; + + PageInit(page, pageSize, sizeof(TPDPageOpaqueData)); + + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(page); + tpdopaque->tpd_prevblkno = InvalidBlockNumber; + tpdopaque->tpd_nextblkno = InvalidBlockNumber; + tpdopaque->tpd_latest_xid_epoch = 0; + tpdopaque->tpd_latest_xid = InvalidTransactionId; +} + +/* + * TPDFreePage - Remove the TPD page from the chain. + * + * Initialize the empty page and remove it from the chain. This function + * ensures that the buffers are locked such that the block that exists prior + * in chain gets locked first and meta page is locked at end after which no + * existing page is locked. This is to avoid deadlocks, see comments atop + * function TPDAllocatePageAndAddEntry. + * + * We expect that the caller must have acquired EXCLUSIVE lock on the current + * buffer (buf) and will be responsible for releasing the same. 
+ * + * Returns true, if we are able to successfully remove the page from chain, + * false, otherwise. + */ +bool +TPDFreePage(Relation rel, Buffer buf, BufferAccessStrategy bstrategy) +{ + TPDPageOpaque tpdopaque, + prevtpdopaque, + nexttpdopaque; + ZHeapMetaPage metapage; + Page page = NULL, + prevpage = NULL, + nextpage = NULL; + BlockNumber curblkno PG_USED_FOR_ASSERTS_ONLY = InvalidBlockNumber; + BlockNumber prevblkno = InvalidBlockNumber; + BlockNumber nextblkno = InvalidBlockNumber; + Buffer prevbuf = InvalidBuffer; + Buffer nextbuf = InvalidBuffer; + Buffer metabuf = InvalidBuffer; + bool update_meta = false; + + page = BufferGetPage(buf); + curblkno = BufferGetBlockNumber(buf); + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(page); + + prevblkno = tpdopaque->tpd_prevblkno; + + if (BlockNumberIsValid(prevblkno)) + { + /* + * Before taking the lock on previous block, we need to release the + * lock on the current buffer. This is to ensure that we always lock + * the buffers in the order in which they are present in list. This + * avoids the deadlock risks. See atop TPDAllocatePageAndAddEntry. + */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM, prevblkno, RBM_NORMAL, + bstrategy); + LockBuffer(prevbuf, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + /* + * After reaquiring the lock, check whether page is still empty, if + * not, then we don't need to do anything. As of now, there is no + * possiblity that the empty page in the chain can be reused, however, + * in future, we can use it. + */ + page = BufferGetPage(buf); + if (!PageIsEmpty(page)) + { + UnlockReleaseBuffer(prevbuf); + return false; + } + tpdopaque = (TPDPageOpaque)PageGetSpecialPointer(page); + } + + nextblkno = tpdopaque->tpd_nextblkno; + + if (BlockNumberIsValid(nextblkno)) + { + nextbuf = ReadBufferExtended(rel, MAIN_FORKNUM, nextblkno, RBM_NORMAL, + bstrategy); + LockBuffer(nextbuf, BUFFER_LOCK_EXCLUSIVE); + } + + metabuf = ReadBufferExtended(rel, MAIN_FORKNUM, ZHEAP_METAPAGE, + RBM_NORMAL, bstrategy); + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + metapage = ZHeapPageGetMeta(BufferGetPage(metabuf)); + Assert(metapage->zhm_magic == ZHEAP_MAGIC); + + START_CRIT_SECTION(); + + /* Update the current page. */ + tpdopaque->tpd_prevblkno = InvalidBlockNumber; + tpdopaque->tpd_nextblkno = InvalidBlockNumber; + tpdopaque->tpd_latest_xid_epoch = 0; + tpdopaque->tpd_latest_xid = InvalidTransactionId; + + MarkBufferDirty(buf); + + /* Update the previous page. */ + if (BufferIsValid(prevbuf)) + { + prevpage = BufferGetPage(prevbuf); + prevtpdopaque = (TPDPageOpaque) PageGetSpecialPointer(prevpage); + + prevtpdopaque->tpd_nextblkno = nextblkno; + MarkBufferDirty(prevbuf); + } + /* Update the next page. */ + if (BufferIsValid(nextbuf)) + { + nextpage = BufferGetPage(nextbuf); + nexttpdopaque = (TPDPageOpaque) PageGetSpecialPointer(nextpage); + + nexttpdopaque->tpd_prevblkno = prevblkno; + MarkBufferDirty(nextbuf); + } + + /* + * Update the metapage. If the previous or next block is invalid, the + * page to be removed could be first or last page in the chain in which + * case we need to update the metapage accordingly. + */ + if (!BlockNumberIsValid(prevblkno) || + !BlockNumberIsValid(nextblkno)) + { + if (!BlockNumberIsValid(prevblkno) && !BlockNumberIsValid(nextblkno)) + { + /* + * If there is no prevblock and nextblock, then the current page + * must be the first and the last page. 
+ */ + Assert(metapage->zhm_first_used_tpd_page == curblkno); + Assert(metapage->zhm_last_used_tpd_page == curblkno); + metapage->zhm_first_used_tpd_page = InvalidBlockNumber; + metapage->zhm_last_used_tpd_page = InvalidBlockNumber; + } + else if (!BlockNumberIsValid(prevblkno)) + { + /* + * If there is no prevblock, then the current block must be first + * used page. + */ + Assert(BlockNumberIsValid(nextblkno)); + metapage->zhm_first_used_tpd_page = nextblkno; + } + else if (!BlockNumberIsValid(nextblkno)) + { + /* + * If next block is invalid, then the current block must be last + * used page. + */ + Assert(metapage->zhm_last_used_tpd_page == curblkno); + metapage->zhm_last_used_tpd_page = prevblkno; + } + else + { + /* one of the above two conditions must be satisfied. */ + Assert(false); + } + + MarkBufferDirty(metabuf); + update_meta = true; + } + else + { + /* + * If next block is a valid block then the last used page can't be the + * current page being removed. + */ + Assert(metapage->zhm_last_used_tpd_page != curblkno); + } + + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + xl_tpd_free_page xlrec; + uint8 info = XLOG_TPD_FREE_PAGE; + + xlrec.prevblkno = prevblkno; + xlrec.nextblkno = nextblkno; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfTPDFreePage); + if (BufferIsValid(prevbuf)) + XLogRegisterBuffer(0, prevbuf, REGBUF_STANDARD); + XLogRegisterBuffer(1, buf, REGBUF_STANDARD); + if (BufferIsValid(nextbuf)) + XLogRegisterBuffer(2, nextbuf, REGBUF_STANDARD); + if (update_meta) + { + xl_zheap_metadata xl_meta; + + info |= XLOG_TPD_INIT_PAGE; + xl_meta.first_used_tpd_page = metapage->zhm_first_used_tpd_page; + xl_meta.last_used_tpd_page = metapage->zhm_last_used_tpd_page; + XLogRegisterBuffer(3, metabuf, REGBUF_STANDARD | REGBUF_WILL_INIT); + XLogRegisterBufData(3, (char *) &xl_meta, SizeOfMetaData); + } + + recptr = XLogInsert(RM_TPD_ID, info); + + if (BufferIsValid(prevbuf)) + PageSetLSN(prevpage, recptr); + PageSetLSN(page, recptr); + if (BufferIsValid(nextbuf)) + PageSetLSN(nextpage, recptr); + if (update_meta) + PageSetLSN(BufferGetPage(metabuf), recptr); + } + + END_CRIT_SECTION(); + + if (BufferIsValid(prevbuf)) + UnlockReleaseBuffer(prevbuf); + if (BufferIsValid(nextbuf)) + UnlockReleaseBuffer(nextbuf); + UnlockReleaseBuffer(metabuf); + + return true; +} + +/* + * TPDEntryUpdate - Update the TPD entry inplace and write a WAL record for + * the same. + */ +static void +TPDEntryUpdate(Relation relation, Buffer tpd_buf, uint16 tpd_e_offset, + OffsetNumber tpd_item_off, char *tpd_entry, + Size size_tpd_entry) +{ + Page tpd_page = BufferGetPage(tpd_buf); + ItemId itemId = PageGetItemId(tpd_page, tpd_item_off); + + START_CRIT_SECTION(); + + memcpy((char *) (tpd_page + tpd_e_offset), + tpd_entry, + size_tpd_entry); + ItemIdChangeLen(itemId, size_tpd_entry); + + MarkBufferDirty(tpd_buf); + + if (RelationNeedsWAL(relation)) + { + XLogRecPtr recptr; + + XLogBeginInsert(); + XLogRegisterBuffer(0, tpd_buf, REGBUF_STANDARD); + XLogRegisterBufData(0, (char *) &tpd_item_off, sizeof(OffsetNumber)); + XLogRegisterBufData(0, (char *) tpd_entry, size_tpd_entry); + + recptr = XLogInsert(RM_TPD_ID, XLOG_INPLACE_UPDATE_TPD_ENTRY); + + PageSetLSN(tpd_page, recptr); + } + + END_CRIT_SECTION(); +} + +/* + * TPDAllocatePageAndAddEntry - Allocates a new tpd page if required and adds + * tpd entry. + * + * This function takes care of inserting the new tpd entry to a page and + * allows to mark old entry as deleted when requested. 
The typical actions + * performed in this function are (a) add a TPD entry in the newly allocated + * or an existing TPD page, (b) update the metapage to indicate the addion of + * a new page (if allocated) and for updating zhm_last_used_tpd_page, (c) mark + * the old TPD entry as prunable, (c) update the new offset number of TPD + * entry in heap page. Finally write a WAL entry and corresponding replay + * routine to cover all these operations and release all the buffers. + * + * The other aspect this function needs to ensure is the buffer locking order + * to avoid deadlocks. We operate on four buffers: metapage buffer, old tpd + * page buffer, last used tpd page buffer and new tpd page buffer. The old + * buffer is always locked by the caller and we ensure that this function first + * locks the last used tpd page buffer, then locks the metapage buffer and then + * the newly allocated page buffer. This locking can never lead to deadlock as + * old buffer block will always be lesser (or equal) than last buffer block. + * However, if anytime, we change our startegy such that after acquiring + * metapage lock, we try to acquire lock on any existing page, then we might + * need to reconsider our locking order. + */ +static void +TPDAllocatePageAndAddEntry(Relation relation, Buffer metabuf, Buffer pagebuf, + Buffer old_tpd_buf, OffsetNumber old_off_num, + char *tpd_entry, Size size_tpd_entry, + bool add_new_tpd_page, bool delete_old_entry) +{ + ZHeapMetaPage metapage = NULL; + TPDPageOpaque tpdopaque, last_tpdopaque; + TPDEntryHeader old_tpd_entry; + Buffer last_used_tpd_buf = InvalidBuffer; + Buffer tpd_buf; + Page tpdpage; + BlockNumber prevblk = InvalidBlockNumber; + BlockNumber nextblk = InvalidBlockNumber; + BlockNumber last_used_tpd_page; + OffsetNumber offset_num; + bool free_last_used_tpd_buf = false; + + if (add_new_tpd_page) + { + BlockNumber targetBlock = InvalidBlockNumber; + Size len = MaxTPDEntrySize; + int buf_idx; + bool needLock; + bool already_exists; + + /* + * While adding a new page, if we've to delete the old entry, + * the old buffer must be valid. Else, it should be invalid. + */ + Assert(!delete_old_entry || BufferIsValid(old_tpd_buf)); + Assert(delete_old_entry || !BufferIsValid(old_tpd_buf)); + + /* Before extending the relation, check the FSM for free page. */ + targetBlock = GetPageWithFreeSpace(relation, len); + + while (targetBlock != InvalidBlockNumber) + { + Page page; + Size pageFreeSpace; + + tpd_buf = ReadBuffer(relation, targetBlock); + + /* + * We need to take the lock on meta page before new page to avoid + * deadlocks. See comments atop of function. + */ + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + /* + * It's possible that FSM returns a zheap page on which the current + * backend already holds a lock in exclusive mode. Hence, try using + * conditional lock. If it can't get the lock immediately, extend + * the relation and allocate a new TPD block. + */ + if (ConditionalLockBuffer(tpd_buf)) + { + page = BufferGetPage(tpd_buf); + + if (PageIsEmpty(page)) + { + GetTPDBuffer(relation, targetBlock, tpd_buf, + TPD_BUF_FIND_OR_KNOWN_ENTER, &already_exists); + break; + } + + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + + if (PageGetSpecialSize(page) == MAXALIGN(sizeof(TPDPageOpaqueData))) + pageFreeSpace = PageGetTPDFreeSpace(page); + else + pageFreeSpace = PageGetZHeapFreeSpace(page); + + /* + * Update FSM as to condition of this page, and ask for another page + * to try. 
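+				 * Note that the FSM may have handed us a regular zheap page rather than a TPD page, which is why the free space above is computed with either PageGetTPDFreeSpace or PageGetZHeapFreeSpace depending on the page's special-space size.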
+ */ + targetBlock = RecordAndGetPageWithFreeSpace(relation, + targetBlock, + pageFreeSpace, + len); + UnlockReleaseBuffer(tpd_buf); + } + else + { + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + ReleaseBuffer(tpd_buf); + targetBlock = InvalidBlockNumber; + } + } + + /* Extend the relation, if required? */ + if (targetBlock == InvalidBlockNumber) + { + /* Acquire the extension lock, if extension is required. */ + needLock = !RELATION_IS_LOCAL(relation); + if (needLock) + LockRelationForExtension(relation, ExclusiveLock); + + buf_idx = GetTPDBuffer(relation, P_NEW, InvalidBuffer, + TPD_BUF_ENTER, &already_exists); + /* This must be a new buffer. */ + Assert(!already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(tpd_buf, BUFFER_LOCK_EXCLUSIVE); + + if (needLock) + UnlockRelationForExtension(relation, ExclusiveLock); + } + + /* + * Lock the last tpd page in list, so that we can append new page to + * it. + */ + metapage = ZHeapPageGetMeta(BufferGetPage(metabuf)); + Assert(metapage->zhm_magic == ZHEAP_MAGIC); + +recheck_meta: + last_used_tpd_page = metapage->zhm_last_used_tpd_page; + if (metapage->zhm_last_used_tpd_page != InvalidBlockNumber) + { + last_used_tpd_page = metapage->zhm_last_used_tpd_page; + buf_idx = GetTPDBuffer(relation, last_used_tpd_page, InvalidBuffer, + TPD_BUF_FIND, &already_exists); + + if (buf_idx == -1) + { + last_used_tpd_buf = ReadBuffer(relation, + metapage->zhm_last_used_tpd_page); + /* + * To avoid deadlock, ensure that we never acquire lock on any existing + * block after acquiring meta page lock. See comments atop function. + */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + LockBuffer(last_used_tpd_buf, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + if (metapage->zhm_last_used_tpd_page != last_used_tpd_page) + { + UnlockReleaseBuffer(last_used_tpd_buf); + goto recheck_meta; + } + + free_last_used_tpd_buf = true; + } + else + { + /* We don't need to lock the buffer, if it is already locked */ + last_used_tpd_buf = tpd_buffers[buf_idx].buf; + } + } + } + else + { + /* old buffer must be valid */ + Assert(BufferIsValid(old_tpd_buf)); + tpd_buf = old_tpd_buf; + } + + /* NO EREPORT(ERROR) from here till changes are logged */ + START_CRIT_SECTION(); + + tpdpage = BufferGetPage(tpd_buf); + + /* Update metapage and add the new TPD page in the TPD page list. */ + if (add_new_tpd_page) + { + BlockNumber tpdblkno; + + /* Page must be new or empty. */ + Assert(PageIsEmpty(tpdpage) || PageIsNew(tpdpage)); + + TPDInitPage(tpdpage, BufferGetPageSize(tpd_buf)); + tpdblkno = BufferGetBlockNumber(tpd_buf); + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + + if (metapage->zhm_first_used_tpd_page == InvalidBlockNumber) + metapage->zhm_first_used_tpd_page = tpdblkno; + else + { + Assert(BufferIsValid(last_used_tpd_buf)); + + /* Add the new TPD page at the end of the TPD page list. */ + last_tpdopaque = (TPDPageOpaque) + PageGetSpecialPointer(BufferGetPage(last_used_tpd_buf)); + prevblk = tpdopaque->tpd_prevblkno = metapage->zhm_last_used_tpd_page; + nextblk = last_tpdopaque->tpd_nextblkno = tpdblkno; + + MarkBufferDirty(last_used_tpd_buf); + } + + metapage->zhm_last_used_tpd_page = tpdblkno; + + MarkBufferDirty(metabuf); + } + else + { + /* + * TPD chain should remain unchanged. + */ + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + prevblk = tpdopaque->tpd_prevblkno; + nextblk = tpdopaque->tpd_nextblkno; + } + + /* Mark the old tpd entry as dead before adding new entry. 
*/ + if (delete_old_entry) + { + Page otpdpage; + ItemId old_item_id; + + /* We must be adding new TPD entry into a new page. */ + Assert(add_new_tpd_page); + Assert(old_tpd_buf != tpd_buf); + + otpdpage = BufferGetPage(old_tpd_buf); + old_item_id = PageGetItemId(otpdpage, old_off_num); + old_tpd_entry = (TPDEntryHeader) PageGetItem(otpdpage, old_item_id); + old_tpd_entry->tpe_flags |= TPE_DELETED; + MarkBufferDirty(old_tpd_buf); + } + + /* Add tpd entry to page */ + offset_num = TPDPageAddEntry(tpdpage, tpd_entry, size_tpd_entry, + InvalidOffsetNumber); + if (offset_num == InvalidOffsetNumber) + elog(PANIC, "failed to add TPD entry"); + + MarkBufferDirty(tpd_buf); + + /* + * Now that the last transaction slot from heap page has moved to TPD, + * we need to assign TPD location in the last transaction slot of heap. + */ + SetTPDLocation(pagebuf, tpd_buf, offset_num); + MarkBufferDirty(pagebuf); + + /* XLOG stuff */ + if (RelationNeedsWAL(relation)) + { + XLogRecPtr recptr; + xl_tpd_allocate_entry xlrec; + xl_zheap_metadata metadata; + int bufflags = 0; + uint8 info = XLOG_ALLOCATE_TPD_ENTRY; + + xlrec.offnum = offset_num; + xlrec.prevblk = prevblk; + xlrec.nextblk = nextblk; + xlrec.flags = 0; + + /* + * If we are adding TPD entry to a new page, we will reinit the page + * during replay. + */ + if (add_new_tpd_page) + { + info |= XLOG_TPD_INIT_PAGE; + bufflags |= REGBUF_WILL_INIT; + } + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfTPDAllocateEntry); + XLogRegisterBuffer(0, tpd_buf, REGBUF_STANDARD | bufflags); + XLogRegisterBufData(0, (char *) tpd_entry, size_tpd_entry); + XLogRegisterBuffer(1, pagebuf, REGBUF_STANDARD); + if (add_new_tpd_page) + { + XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD); + metadata.first_used_tpd_page = metapage->zhm_first_used_tpd_page; + metadata.last_used_tpd_page = metapage->zhm_last_used_tpd_page; + XLogRegisterBufData(2, (char *) &metadata, SizeOfMetaData); + + if (BufferIsValid(last_used_tpd_buf)) + XLogRegisterBuffer(3, last_used_tpd_buf, REGBUF_STANDARD); + + /* The old entry is deleted only when new page is allocated. */ + if (delete_old_entry) + { + /* + * If the last tpd buffer and the old tpd buffer are same, we + * don't need to register old_tpd_buf. + */ + if (last_used_tpd_buf == old_tpd_buf) + { + xlrec.flags = XLOG_OLD_TPD_BUF_EQ_LAST_TPD_BUF; + XLogRegisterBufData(3, (char *) &old_off_num, sizeof(OffsetNumber)); + } + else + { + XLogRegisterBuffer(4, old_tpd_buf, REGBUF_STANDARD); + XLogRegisterBufData(4, (char *) &old_off_num, sizeof(OffsetNumber)); + } + } + } + + recptr = XLogInsert(RM_TPD_ID, info); + + PageSetLSN(tpdpage, recptr); + PageSetLSN(BufferGetPage(pagebuf), recptr); + if (add_new_tpd_page) + { + PageSetLSN(BufferGetPage(metabuf), recptr); + if (BufferIsValid(last_used_tpd_buf)) + PageSetLSN(BufferGetPage(last_used_tpd_buf), recptr); + if (delete_old_entry) + PageSetLSN(BufferGetPage(old_tpd_buf), recptr); + } + } + + END_CRIT_SECTION(); + + if (add_new_tpd_page) + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + if (free_last_used_tpd_buf) + { + Assert (last_used_tpd_buf != tpd_buf); + UnlockReleaseBuffer(last_used_tpd_buf); + } +} + +/* + * TPDAllocateAndReserveTransSlot - Allocates a new TPD entry and reserve a + * transaction slot in that entry. 
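+ * The first slot of the new TPD entry takes over the transaction information moved from the heap page's last slot, so the slot reserved for the caller is the second one, returned as ZHEAP_PAGE_TRANS_SLOTS + 2 (TPD slots are numbered after the heap page's own slots).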
+ * + * To allocate a new TPD entry, we first check if there is a space in any + * existing TPD page starting from the last used TPD page and incase we + * don't find any such page, then allocate a new TPD page and add it to the + * existing list of TPD pages. + * + * We intentionally don't release the TPD buffer here as that will be + * released once we have updated the transaction slot with required + * information. Caller must call UnlockReleaseTPDBuffers after doing + * necessary updates. + * + * pagebuf - Caller must have an exclusive lock on this buffer. + */ +int +TPDAllocateAndReserveTransSlot(Relation relation, Buffer pagebuf, + OffsetNumber offnum, UndoRecPtr *urec_ptr) +{ + ZHeapMetaPage metapage; + Buffer metabuf; + Buffer tpd_buf = InvalidBuffer; + Page heappage; + uint32 first_used_tpd_page; + uint32 last_used_tpd_page; + char *tpd_entry; + Size size_tpd_entry; + int reserved_slot = InvalidXactSlotId; + int buf_idx; + bool allocate_new_tpd_page = false; + bool update_meta = false; + bool already_exists; + + metabuf = ReadBuffer(relation, ZHEAP_METAPAGE); + LockBuffer(metabuf, BUFFER_LOCK_SHARE); + metapage = ZHeapPageGetMeta(BufferGetPage(metabuf)); + Assert(metapage->zhm_magic == ZHEAP_MAGIC); + + first_used_tpd_page = metapage->zhm_first_used_tpd_page; + last_used_tpd_page = metapage->zhm_last_used_tpd_page; + + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + + heappage = BufferGetPage(pagebuf); + + if (last_used_tpd_page != InvalidBlockNumber) + { + Size tpdpageFreeSpace; + Size size_tpd_e_map, size_tpd_entry, size_tpd_e_slots; + uint16 num_map_entries; + OffsetNumber max_required_offset; + + if (OffsetNumberIsValid(offnum)) + max_required_offset = offnum; + else + max_required_offset = PageGetMaxOffsetNumber(heappage); + num_map_entries = max_required_offset + + ADDITIONAL_MAP_ELEM_IN_TPD_ENTRY; + + size_tpd_e_map = num_map_entries * sizeof(uint8); + size_tpd_e_slots = INITIAL_TRANS_SLOTS_IN_TPD_ENTRY * sizeof(TransInfo); + size_tpd_entry = SizeofTPDEntryHeader + size_tpd_e_map + + size_tpd_e_slots; + + buf_idx = GetTPDBuffer(relation, last_used_tpd_page, InvalidBuffer, + TPD_BUF_FIND_OR_ENTER, &already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + /* We don't need to lock the buffer, if it is already locked */ + if (!already_exists) + LockBuffer(tpd_buf, BUFFER_LOCK_EXCLUSIVE); + tpdpageFreeSpace = PageGetTPDFreeSpace(BufferGetPage(tpd_buf)); + + if (tpdpageFreeSpace < size_tpd_entry) + { + int entries_removed; + + /* + * Prune the TPD page to make space for new TPD entries. After + * pruning, check again to see if the TPD entry can be accomodated + * on the page. We can't afford to free the page while pruning as + * we need to use it to insert the TPD entry. + */ + entries_removed = TPDPagePrune(relation, tpd_buf, NULL, + InvalidOffsetNumber, 0, false, NULL, + NULL); + + if (entries_removed > 0) + tpdpageFreeSpace = PageGetTPDFreeSpace(BufferGetPage(tpd_buf)); + + if (tpdpageFreeSpace < size_tpd_entry) + { + /* + * XXX Here, we can have an optimization such that instead of + * allocating a new page, we can search other TPD pages starting + * from the first_used_tpd_page till we reach last_used_tpd_page. + * It is not clear whether such an optimization can help because + * checking all the TPD pages isn't free either. 
+ */ + if (!already_exists) + ReleaseLastTPDBuffer(tpd_buf); + allocate_new_tpd_page = true; + } + } + } + + if (allocate_new_tpd_page || + (last_used_tpd_page == InvalidBlockNumber && + first_used_tpd_page == InvalidBlockNumber)) + { + tpd_buf = InvalidBuffer; + update_meta = true; + } + + /* Allocate a new TPD entry */ + tpd_entry = AllocateAndFormTPDEntry(pagebuf, offnum, &size_tpd_entry, + &reserved_slot); + Assert (tpd_entry != NULL); + + TPDAllocatePageAndAddEntry(relation, metabuf, pagebuf, tpd_buf, + InvalidOffsetNumber, tpd_entry, size_tpd_entry, + update_meta, false); + + ReleaseBuffer(metabuf); + + /* + * Here, we don't release the tpdbuffer in which we have added the newly + * allocated TPD entry as that will be relased once we update the required + * trasaction slot info in it. The caller will later call TPDPageSetUndo + * to update the required information. + */ + + pfree(tpd_entry); + + /* + * As this is always a fresh transaction slot, so we can assume that + * there is no preexisting undo record pointer. + */ + *urec_ptr = InvalidUndoRecPtr; + + return reserved_slot; +} + +/* + * TPDPageGetTransactionSlots - Get the transaction slots array stored in TPD + * entry. This is a helper routine for TPDPageReserveTransSlot and + * TPDPageGetSlotIfExists. + * + * The tpd entries are stored unaligned, so we need to be careful to read + * them. We use memcpy to avoid unaligned reads. + * + * It is quite possible that the TPD entry containing required transaction slot + * information got pruned away (as all the transaction entries are all-visible) + * by the time caller tries to enquire about it. See atop + * TPDPageGetTransactionSlotInfo for more details on how we deal with pruned + * TPD entries. + * + * This function returns a pointer to an array of transaction slots, it is the + * responsibility of the caller to free it. + */ +TransInfo * +TPDPageGetTransactionSlots(Relation relation, Buffer heapbuf, + OffsetNumber offnum, bool keepTPDBufLock, + bool checkOffset, int *num_map_entries, + int *num_trans_slots, int *tpd_buf_id, + bool *tpd_e_pruned, bool *alloc_bigger_map) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + Page heappage = BufferGetPage(heapbuf); + ZHeapPageOpaque zopaque; + TransInfo *trans_slots = NULL; + TransInfo last_trans_slot_info; + Buffer tpd_buf; + Page tpdpage; + BlockNumber tpdblk; + BlockNumber lastblock; + TPDEntryHeaderData tpd_e_hdr; + Size size_tpd_e_map; + Size size_tpd_e_slots; + int loc_trans_slots; + int buf_idx; + OffsetNumber tpdItemOff; + ItemId itemId; + uint16 tpd_e_offset; + bool already_exists; + + phdr = (PageHeader) heappage; + + if (tpd_buf_id) + *tpd_buf_id = -1; + if (num_map_entries) + *num_map_entries = 0; + if (num_trans_slots) + *num_trans_slots = 0; + if (tpd_e_pruned) + *tpd_e_pruned = false; + if (alloc_bigger_map) + *alloc_bigger_map = false; + + /* Heap page must have TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + /* The last slot in page has the address of the required TPD entry. */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + + if (!InRecovery) + { + lastblock = RelationGetNumberOfBlocks(relation); + + if (lastblock <= tpdblk) + { + /* + * The required TPD block has been pruned and then truncated away + * which means all transaction slots on that page are older than + * oldestXidHavingUndo. 
So, we can assume the transaction slot is + * frozen aka transaction is all-visible and can clear the slot from + * heap tuples. + */ + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + goto failed_and_buf_not_locked; + } + } + + /* + * Fetch the required TPD entry. We need to lock the buffer in exclusive + * mode as we later want to set the values in one of the transaction slot. + */ + buf_idx = GetTPDBuffer(relation, tpdblk, InvalidBuffer, + TPD_BUF_FIND_OR_ENTER, &already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + /* We don't need to lock the buffer, if it is already locked */ + if (!already_exists) + { + LockBuffer(tpd_buf, BUFFER_LOCK_EXCLUSIVE); + if (tpd_buf_id) + *tpd_buf_id = buf_idx; + } + + tpdpage = BufferGetPage(tpd_buf); + + /* Check whether TPD entry can exist on page? */ + if (PageIsEmpty(tpdpage)) + { + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + goto failed; + } + if (PageGetSpecialSize(tpdpage) != MAXALIGN(sizeof(TPDPageOpaqueData))) + { + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + goto failed; + } + + itemId = PageGetItemId(tpdpage, tpdItemOff); + + /* TPD entry has been pruned */ + if (!ItemIdIsUsed(itemId)) + { + BufferDesc *bufHdr = GetBufferDescriptor(heapbuf - 1); + + if (BufferIsLocal(heapbuf) || + LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), + LW_EXCLUSIVE)) + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + goto failed; + } + + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* + * This TPD entry is for some other block, so we can't continue. This + * indicates that the TPD entry corresponding to heap block has been + * pruned and some other TPD entry has been moved at its location. + */ + if (tpd_e_hdr.blkno != BufferGetBlockNumber(heapbuf)) + { + BufferDesc *bufHdr = GetBufferDescriptor(heapbuf - 1); + + if (BufferIsLocal(heapbuf) || + LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), + LW_EXCLUSIVE)) + LogAndClearTPDLocation(relation, heapbuf, tpd_e_pruned); + goto failed; + } + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(tpd_e_hdr)); + + /* Allow caller to allocate a bigger TPD entry instead. */ + if (checkOffset && offnum > tpd_e_hdr.tpe_num_map_entries) + { + /* + * If the caller has requested to check offset, it must be prepared to + * allocate a TPD entry. + */ + Assert(alloc_bigger_map); + *alloc_bigger_map = true; + } + + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint8); + else + { + Assert(tpd_e_hdr.tpe_flags & TPE_FOUR_BYTE); + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint32); + } + + if (num_map_entries) + *num_map_entries = tpd_e_hdr.tpe_num_map_entries; + if (num_trans_slots) + *num_trans_slots = tpd_e_hdr.tpe_num_slots; + size_tpd_e_slots = tpd_e_hdr.tpe_num_slots * sizeof(TransInfo); + loc_trans_slots = tpd_e_offset + SizeofTPDEntryHeader + size_tpd_e_map; + + trans_slots = (TransInfo *) palloc(size_tpd_e_slots); + memcpy((char *) trans_slots, tpdpage + loc_trans_slots, size_tpd_e_slots); + +failed: + if (!keepTPDBufLock) + { + /* + * If we don't want to retain the buffer lock, it must have been taken + * now. We can't release the already existing lock taken. 
+ */ + Assert(!already_exists); + ReleaseLastTPDBuffer(tpd_buf); + + if (tpd_buf_id) + *tpd_buf_id = -1; + } + +failed_and_buf_not_locked: + return trans_slots; +} + +/* + * TPDPageReserveTransSlot - Reserve the available transaction in current TPD + * entry if any, otherwise, return InvalidXactSlotId. + * + * We intentionally don't release the TPD buffer here as that will be + * released once we have updated the transaction slot with required + * information. However, if no free slot is available, then we release the + * buffer. Caller must call UnlockReleaseTPDBuffers after doing necessary + * updates if it is able to reserve a slot. + */ +int +TPDPageReserveTransSlot(Relation relation, Buffer buf, OffsetNumber offnum, + UndoRecPtr *urec_ptr, bool *lock_reacquired) +{ + TransInfo *trans_slots; + int slot_no; + int num_map_entries; + int num_slots; + int result_slot_no = InvalidXactSlotId; + int buf_idx; + bool tpd_e_pruned; + bool alloc_bigger_map; + + trans_slots = TPDPageGetTransactionSlots(relation, buf, offnum, + true, true, &num_map_entries, + &num_slots, &buf_idx, + &tpd_e_pruned, &alloc_bigger_map); + if (tpd_e_pruned) + { + Assert(trans_slots == NULL); + Assert(num_slots == 0); + } + + for (slot_no = 0; slot_no < num_slots; slot_no++) + { + /* Check for an unreserved transaction slot in the TPD entry */ + if (trans_slots[slot_no].xid == InvalidTransactionId) + { + result_slot_no = slot_no; + *urec_ptr = trans_slots[slot_no].urec_ptr; + goto extend_entry_if_required; + } + } + + /* no transaction slot available, try to reuse some existing slot */ + if (num_slots > 0 && + PageFreezeTransSlots(relation, buf, lock_reacquired, trans_slots, num_slots)) + { + pfree(trans_slots); + + /* + * If the lock is re-acquired inside, then the callers must recheck + * that whether they can still perform the required operation. + */ + if (*lock_reacquired) + return InvalidXactSlotId; + + trans_slots = TPDPageGetTransactionSlots(relation, buf, offnum, true, + true, &num_map_entries, + &num_slots, &buf_idx, + &tpd_e_pruned, &alloc_bigger_map); + /* + * We are already holding TPD buffer lock so the TPD entry can not be + * pruned away. + */ + Assert(!tpd_e_pruned); + + for (slot_no = 0; slot_no < num_slots; slot_no++) + { + if (trans_slots[slot_no].xid == InvalidTransactionId) + { + *urec_ptr = trans_slots[slot_no].urec_ptr; + result_slot_no = slot_no; + goto extend_entry_if_required; + } + } + + /* + * After freezing transaction slots, we should get at least one free + * slot. + */ + Assert(result_slot_no != InvalidXactSlotId); + } + +extend_entry_if_required: + + /* + * Allocate a bigger TPD entry if either we need a bigger offset-map + * or there is no unreserved slot available provided TPD entry is not + * pruned in which case we can use last slot on the heap page. + */ + if (!tpd_e_pruned && + (alloc_bigger_map || result_slot_no == InvalidXactSlotId)) + { + ExtendTPDEntry(relation, buf, trans_slots, offnum, buf_idx, + num_map_entries, num_slots, &result_slot_no, urec_ptr, + &tpd_e_pruned); + } + + /* be tidy */ + if (trans_slots != NULL) + pfree(trans_slots); + + /* + * The transaction slots in TPD entry are in addition to the maximum slots + * in the heap page. + */ + if (result_slot_no != InvalidXactSlotId) + result_slot_no += (ZHEAP_PAGE_TRANS_SLOTS + 1); + else if (buf_idx != -1) + ReleaseLastTPDBuffer(tpd_buffers[buf_idx].buf); + + /* + * As TPD entry is pruned, so last transaction slot must be free on the + * heap page. 
+ */ + if (tpd_e_pruned) + { + Assert(result_slot_no == InvalidXactSlotId); + result_slot_no = ZHEAP_PAGE_TRANS_SLOTS; + *urec_ptr = InvalidUndoRecPtr; + } + + return result_slot_no; +} + +/* + * TPDPageGetSlotIfExists - Get the existing slot for the required transaction + * if exists, otherwise, return InvalidXactSlotId. + * + * This is similar to the TPDPageReserveTransSlot except that here we find the + * exisiting transaction slot instead of reserving a new one. + * + * keepTPDBufLock - This indicates whether we need to retain the lock on TPD + * buffer if we are able to reserve a transaction slot. + */ +int +TPDPageGetSlotIfExists(Relation relation, Buffer heapbuf, OffsetNumber offnum, + uint32 epoch, TransactionId xid, UndoRecPtr *urec_ptr, + bool keepTPDBufLock, bool checkOffset) +{ + TransInfo *trans_slots; + int slot_no; + int num_map_entries; + int num_slots; + int result_slot_no = InvalidXactSlotId; + int buf_idx; + bool tpd_e_pruned; + bool alloc_bigger_map; + + trans_slots = TPDPageGetTransactionSlots(relation, + heapbuf, + offnum, + keepTPDBufLock, + checkOffset, + &num_map_entries, + &num_slots, + &buf_idx, + &tpd_e_pruned, + &alloc_bigger_map); + if (tpd_e_pruned) + { + Assert(trans_slots == NULL); + Assert(num_slots == 0); + } + + for (slot_no = 0; slot_no < num_slots; slot_no++) + { + /* Check if already have a slot in the TPD entry */ + if (trans_slots[slot_no].xid_epoch == epoch && + trans_slots[slot_no].xid == xid) + { + result_slot_no = slot_no; + *urec_ptr = trans_slots[slot_no].urec_ptr; + break; + } + } + + /* + * Allocate a bigger TPD entry if we get the required slot in TPD entry, + * but it requires a bigger offset-map. + */ + if (result_slot_no != InvalidXactSlotId && alloc_bigger_map) + { + ExtendTPDEntry(relation, heapbuf, trans_slots, offnum, buf_idx, + num_map_entries, num_slots, &result_slot_no, urec_ptr, + &tpd_e_pruned); + } + + /* be tidy */ + if (trans_slots) + pfree(trans_slots); + + /* + * The transaction slots in TPD entry are in addition to the maximum slots + * in the heap page. + */ + if (result_slot_no != InvalidXactSlotId) + result_slot_no += (ZHEAP_PAGE_TRANS_SLOTS + 1); + else if (buf_idx != -1) + ReleaseLastTPDBuffer(tpd_buffers[buf_idx].buf); + + return result_slot_no; +} + +/* + * TPDPageGetTransactionSlotInfo - Get the required transaction information from + * heap page's TPD entry. + * + * It is quite possible that the TPD entry containing required transaction slot + * information got pruned away (as all the transaction entries are all-visible) + * by the time caller tries to enquire about it. One might expect that if the + * TPD entry is pruned, the corresponding affected tuples should be updated to + * reflect the same, however, we don't do that due to multiple reasons (a) we + * don't access heap pages from TPD layer, it can lead to deadlock, (b) it + * might lead to dirtying a lot of pages and random I/O. However, the first + * time we detect it and we have exclusive lock on page, we update the + * corresponding heap page. + * + * We can consider TPD entry to be pruned under following conditions: (a) the + * tpd block doesn't exist (pruned and truncated by vacuum), (b) the tpd block + * is empty which means all the entries in it are pruned, (c) the tpd block + * has been reused as a heap page, (d) the corresponding TPD entry has been + * pruned away and either the itemid is unused or is reused for some other + * block's TPD entry. 
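+ *
+ * A rough caller-side sketch of how the pruned case surfaces to callers
+ * (variable names here are only illustrative):
+ *
+ * slot_id = TPDPageGetTransactionSlotInfo(heapbuf, trans_slot, offnum,
+ *                                         &epoch, &xid, &urec_ptr,
+ *                                         true, false);
+ * if (slot_id == ZHTUP_SLOT_FROZEN)
+ *     ... treat the tuple's transaction as all-visible ...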
+ * + * NoTPDBufLock - This indicates that caller doesn't have lock on required tpd + * buffer in which case we need to read and lock the required buffer. + */ +int +TPDPageGetTransactionSlotInfo(Buffer heapbuf, int trans_slot, + OffsetNumber offset, uint32 *epoch, + TransactionId *xid, UndoRecPtr *urec_ptr, + bool NoTPDBufLock, bool keepTPDBufLock) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + ZHeapPageOpaque zopaque; + TransInfo trans_slot_info, last_trans_slot_info; + RelFileNode rnode; + Buffer tpdbuffer; + Page tpdpage; + Page heappage; + BlockNumber tpdblk, heapblk; + ForkNumber forknum; + TPDEntryHeaderData tpd_e_hdr; + Size size_tpd_e_map; + uint32 tpd_e_num_map_entries; + int trans_slot_loc; + int trans_slot_id = trans_slot; + char *tpd_entry_data; + OffsetNumber tpdItemOff; + ItemId itemId; + uint16 tpd_e_offset; + char relpersistence; + + heappage = BufferGetPage(heapbuf); + phdr = (PageHeader) heappage; + + /* Heap page must have a TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + + if (NoTPDBufLock) + { + SMgrRelation smgr; + BlockNumber lastblock; + + BufferGetTag(heapbuf, &rnode, &forknum, &heapblk); + + if (InRecovery) + relpersistence = RELPERSISTENCE_PERMANENT; + else + { + Oid reloid; + + reloid = RelidByRelfilenode(rnode.spcNode, rnode.relNode); + relpersistence = get_rel_persistence(reloid); + } + + smgr = smgropen(rnode, + relpersistence == RELPERSISTENCE_TEMP ? + MyBackendId : InvalidBackendId); + + lastblock = smgrnblocks(smgr, forknum); + + /* required block exists? */ + if (tpdblk < lastblock) + { + tpdbuffer = ReadBufferWithoutRelcache(rnode, forknum, tpdblk, RBM_NORMAL, + NULL, relpersistence); + if (keepTPDBufLock) + LockBuffer(tpdbuffer, BUFFER_LOCK_EXCLUSIVE); + else + LockBuffer(tpdbuffer, BUFFER_LOCK_SHARE); + } + else + { + /* + * The required TPD block has been pruned and then truncated away + * which means all transaction slots on that page are older than + * oldestXidHavingUndo. So, we can assume the transaction slot is + * frozen aka transaction is all-visible. + */ + goto slot_is_frozen_and_buf_not_locked; + } + } + else + { + int buf_idx; + bool already_exists PG_USED_FOR_ASSERTS_ONLY; + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpdbuffer = tpd_buffers[buf_idx].buf; + } + + tpdpage = BufferGetPage(tpdbuffer); + + /* Check whether TPD entry can exist on page? */ + if (PageIsEmpty(tpdpage)) + goto slot_is_frozen; + if (PageGetSpecialSize(tpdpage) != MAXALIGN(sizeof(TPDPageOpaqueData))) + goto slot_is_frozen; + + itemId = PageGetItemId(tpdpage, tpdItemOff); + + /* TPD entry has been pruned */ + if (!ItemIdIsUsed(itemId)) + { + /* + * Ideally, we can clear the TPD location from heap page, but for that + * we need to have an exclusive lock on the heap page. As this API + * can be called with shared lock on a heap page, we can't perform + * that action. + * + * XXX If it ever turns out to be a performance problem, we can + * release the current lock and acuire the exclusive lock on heap + * page. 
Also we need to ensure that the lock on TPD page also needs + * to be released and reacquired as we always follow the protocol of + * acquiring the lock on heap page first and then on TPD page, doing + * it otherway can lead to undetected deadlock. + */ + goto slot_is_frozen; + } + + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* + * This TPD entry is for some other block, so we can't continue. This + * indicates that the TPD entry corresponding to heap block has been + * pruned and some other TPD entry has been moved at its location. + */ + if (tpd_e_hdr.blkno != BufferGetBlockNumber(heapbuf)) + goto slot_is_frozen; + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(tpd_e_hdr)); + + tpd_e_num_map_entries = tpd_e_hdr.tpe_num_map_entries; + tpd_entry_data = tpdpage + tpd_e_offset + SizeofTPDEntryHeader; + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + size_tpd_e_map = tpd_e_num_map_entries * sizeof(uint8); + else + { + Assert(tpd_e_hdr.tpe_flags & TPE_FOUR_BYTE); + size_tpd_e_map = tpd_e_num_map_entries * sizeof(uint32); + } + + /* + * If the caller has passed transaction slot number that belongs to TPD + * entry, then we directly go and fetch the required info from the slot. + */ + if (offset != InvalidOffsetNumber) + { + /* + * The item for which we want to get the transaction slot information + * must be present in this TPD entry. + */ + Assert (offset <= tpd_e_num_map_entries); + + /* Get TPD entry map */ + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + { + uint8 offset_tpd_e_loc; + + /* + * One byte access shouldn't cause unaligned access, but using memcpy + * for the sake of consistency. + */ + memcpy((char *) &offset_tpd_e_loc, tpd_entry_data + (offset - 1), + sizeof(uint8)); + trans_slot_id = offset_tpd_e_loc; + } + else + { + uint32 offset_tpd_e_loc; + + memcpy((char *) &offset_tpd_e_loc, + tpd_entry_data + (sizeof(uint32) * (offset - 1)), + sizeof(uint32)); + trans_slot_id = offset_tpd_e_loc; + } + } + + /* Transaction must belong to TPD entry. */ + Assert(trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS); + + /* Get the required transaction slot information. */ + trans_slot_loc = (trans_slot_id - ZHEAP_PAGE_TRANS_SLOTS - 1) * + sizeof(TransInfo); + memcpy((char *) &trans_slot_info, + tpd_entry_data + size_tpd_e_map + trans_slot_loc, + sizeof(TransInfo)); + + /* Update the required output */ + if (epoch) + *epoch = trans_slot_info.xid_epoch; + if (xid) + *xid = trans_slot_info.xid; + if (urec_ptr) + *urec_ptr = trans_slot_info.urec_ptr; + + if (NoTPDBufLock && !keepTPDBufLock) + UnlockReleaseBuffer(tpdbuffer); + + return trans_slot_id; + +slot_is_frozen: + if (NoTPDBufLock && !keepTPDBufLock) + UnlockReleaseBuffer(tpdbuffer); + +slot_is_frozen_and_buf_not_locked: + trans_slot_id = ZHTUP_SLOT_FROZEN; + if (epoch) + *epoch = 0; + if (xid) + *xid = InvalidTransactionId; + if (urec_ptr) + *urec_ptr = InvalidUndoRecPtr; + + return trans_slot_id; +} + +/* + * TPDPageSetTransactionSlotInfo - Set the transaction information for a given + * transaction slot in the TPD entry. + * + * Caller must ensure that it has required lock on tpd buffer which is going to + * be updated here. We can't lock the buffer here as this API is supposed to + * be called from critical section and lock acquisition can fail. 
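+ *
+ * The "latest xid" bookkeeping below compares 64-bit (epoch, xid) values; a
+ * minimal sketch of the intended ordering, assuming MakeEpochXid packs the
+ * epoch into the high 32 bits:
+ *
+ * full_xid = ((uint64) epoch << 32) | xid;
+ *
+ * so tpd_latest_xid only advances when the new transaction is logically
+ * newer, even across xid wraparound.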
+ */ +void +TPDPageSetTransactionSlotInfo(Buffer heapbuf, int trans_slot_id, + uint32 epoch, TransactionId xid, + UndoRecPtr urec_ptr) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + ZHeapPageOpaque zopaque; + TransInfo trans_slot_info, last_trans_slot_info; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + Buffer tpd_buf; + Page tpdpage; + Page heappage; + BlockNumber tpdblk; + TPDEntryHeaderData tpd_e_hdr; + TPDPageOpaque tpdopaque; + uint64 tpd_latest_xid_epoch, current_xid_epoch; + Size size_tpd_e_map; + int trans_slot_loc; + int buf_idx; + char *tpd_entry_data; + OffsetNumber tpdItemOff; + ItemId itemId; + uint16 tpd_e_offset; + bool already_exists PG_USED_FOR_ASSERTS_ONLY; + + heappage = BufferGetPage(heapbuf); + phdr = (PageHeader) heappage; + + /* Heap page must have a TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + Assert(BufferIsValid(tpd_buf)); + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + Assert(BufferGetBlockNumber(tpd_buf) == tpdblk); + + tpdpage = BufferGetPage(tpd_buf); + itemId = PageGetItemId(tpdpage, tpdItemOff); + + /* + * TPD entry can't go away as we acquire the lock while reserving the slot + * from TPD entry and keep it till we set the required transaction + * information in the slot. + */ + Assert(ItemIdIsUsed(itemId)); + + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* TPD entry can't be pruned. */ + Assert(tpd_e_hdr.blkno == BufferGetBlockNumber(heapbuf)); + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(tpd_e_hdr)); + + tpd_entry_data = tpdpage + tpd_e_offset + SizeofTPDEntryHeader; + + /* Get TPD entry map */ + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint8); + else + size_tpd_e_map = tpd_e_hdr.tpe_num_map_entries * sizeof(uint32); + + /* Set the required transaction slot information. */ + trans_slot_loc = (trans_slot_id - ZHEAP_PAGE_TRANS_SLOTS - 1) * + sizeof(TransInfo); + trans_slot_info.xid_epoch = epoch; + trans_slot_info.xid = xid; + trans_slot_info.urec_ptr = urec_ptr; + + memcpy(tpd_entry_data + size_tpd_e_map + trans_slot_loc, + (char *) &trans_slot_info, + sizeof(TransInfo)); + + /* Update latest transaction information on the page. */ + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + tpd_latest_xid_epoch = (uint64) tpdopaque->tpd_latest_xid_epoch; + tpd_latest_xid_epoch = MakeEpochXid(tpd_latest_xid_epoch, + tpdopaque->tpd_latest_xid); + current_xid_epoch = (uint64) epoch; + current_xid_epoch = MakeEpochXid(current_xid_epoch, xid); + if (tpd_latest_xid_epoch < current_xid_epoch) + { + tpdopaque->tpd_latest_xid_epoch = epoch; + tpdopaque->tpd_latest_xid = xid; + } + + MarkBufferDirty(tpd_buf); +} + +/* + * GetTPDEntryData - Helper function for TPDPageGetOffsetMap and + * TPDPageSetOffsetMap. + * + * Caller must ensure that it has acquired lock on the TPD buffer. 
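+ *
+ * For reference, the on-page layout of a TPD entry decoded here (and by the
+ * routines above) is:
+ *
+ * [TPDEntryHeaderData]
+ * [offset map: tpe_num_map_entries entries, 1 byte each for TPE_ONE_BYTE
+ *  entries or 4 bytes each for TPE_FOUR_BYTE entries]
+ * [transaction slots: tpe_num_slots * sizeof(TransInfo)]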
+ */ +static char * +GetTPDEntryData(Buffer heapbuf, int *num_entries, int *entry_size, + Buffer *tpd_buffer) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + Buffer tpd_buf; + Page tpdpage; + Page heappage; + BlockNumber tpdblk; + TPDEntryHeaderData tpd_e_hdr; + int buf_idx; + char *tpd_entry_data; + OffsetNumber tpdItemOff; + ItemId itemId; + uint16 tpd_e_offset; + bool already_exists PG_USED_FOR_ASSERTS_ONLY; + + heappage = BufferGetPage(heapbuf); + phdr = (PageHeader) heappage; + + /* Heap page must have a TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + + /* + * Here we don't need to check if the tpd block is pruned and truncated + * away because the tpd buffer must be locked before. + */ + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + Assert(BufferIsValid(tpd_buf)); + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + Assert(BufferGetBlockNumber(tpd_buf) == tpdblk); + + tpdpage = BufferGetPage(tpd_buf); + + /* Check whether TPD entry can exist on page? */ + if (PageIsEmpty(tpdpage)) + return NULL; + if (PageGetSpecialSize(tpdpage) != MAXALIGN(sizeof(TPDPageOpaqueData))) + return NULL; + + itemId = PageGetItemId(tpdpage, tpdItemOff); + + /* TPD entry is already pruned away. */ + if (!ItemIdIsUsed(itemId)) + return NULL; + + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* TPD entry is pruned away. */ + if (tpd_e_hdr.blkno != BufferGetBlockNumber(heapbuf)) + return NULL; + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(tpd_e_hdr)); + + tpd_entry_data = tpdpage + tpd_e_offset + SizeofTPDEntryHeader; + *num_entries = tpd_e_hdr.tpe_num_map_entries; + + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + *entry_size = sizeof(uint8); + else + *entry_size = sizeof(uint32); + + if (tpd_buffer) + *tpd_buffer = tpd_buf; + + return tpd_entry_data; +} + +/* + * TPDPageSetOffsetMapSlot - Set the transaction slot for given offset in TPD + * offset map. + * + * Caller must ensure that it has required lock on tpd buffer which is going to + * be updated here. We can't lock the buffer here as this API is supposed to + * be called from critical section and lock acquisition can fail. + */ +void +TPDPageSetOffsetMapSlot(Buffer heapbuf, int trans_slot_id, + OffsetNumber offset) +{ + char *tpd_entry_data; + int num_entries = 0, + entry_size = 0; + Buffer tpd_buf = InvalidBuffer; + + tpd_entry_data = GetTPDEntryData(heapbuf, &num_entries, &entry_size, + &tpd_buf); + + /* + * Caller would have checked that the entry is not pruned after taking + * lock on the tpd page. + */ + Assert(tpd_entry_data); + + Assert (offset <= num_entries); + + if (entry_size == sizeof(uint8)) + { + uint8 offset_tpd_e_loc = trans_slot_id; + + /* + * One byte access shouldn't cause unaligned access, but using memcpy + * for the sake of consistency. 
+ */ + memcpy(tpd_entry_data + (offset - 1), + (char *) &offset_tpd_e_loc, + sizeof(uint8)); + } + else + { + uint32 offset_tpd_e_loc; + + offset_tpd_e_loc = trans_slot_id; + memcpy(tpd_entry_data + (sizeof(uint32) * (offset - 1)), + (char *) &offset_tpd_e_loc, + sizeof(uint32)); + } + + MarkBufferDirty(tpd_buf); +} + +/* + * TPDPageGetOffsetMap - Get the Offset map array of the TPD entry. + * + * This function copy the offset map into tpd_offset_map array allocated by the + * caller. + */ +void +TPDPageGetOffsetMap(Buffer heapbuf, char *tpd_offset_map, int map_size) +{ + char *tpd_entry_data; + int num_entries, entry_size; + + tpd_entry_data = GetTPDEntryData(heapbuf, &num_entries, &entry_size, NULL); + + /* + * Caller would have checked that the entry is not pruned after taking + * lock on the tpd page. + */ + Assert(tpd_entry_data); + + Assert(map_size == num_entries * entry_size); + + memcpy(tpd_offset_map, tpd_entry_data, map_size); +} + +/* + * TPDPageGetOffsetMapSize - Get the Offset map size of the TPD entry. + * + * Caller must ensure that it has acquired lock on tpd buffer corresponding to + * passed heap buffer. + * + * Returns 0, if the tpd entry gets pruned away, otherwise, return the size of + * TPD offset-map. + */ +int +TPDPageGetOffsetMapSize(Buffer heapbuf) +{ + int num_entries, entry_size; + + if (GetTPDEntryData(heapbuf, &num_entries, &entry_size, NULL) == NULL) + return 0; + + return (num_entries * entry_size); +} + +/* + * TPDPageSetOffsetMap - Overwrite TPD offset map array with input offset map + * array. + * + * This function returns a pointer to an array of offset map, it is the + * responsibility of the caller to free it. + * + * Caller must ensure that it has acquired lock on the TPD buffer which is + * going to be updated here. + */ +void +TPDPageSetOffsetMap(Buffer heapbuf, char *tpd_offset_map) +{ + char *tpd_entry_data; + int num_entries = 0, + entry_size = 0; + Buffer tpd_buf = InvalidBuffer; + + /* This function should only be called during recovery. */ + Assert(InRecovery); + + tpd_entry_data = GetTPDEntryData(heapbuf, &num_entries, &entry_size, + &tpd_buf); + + /* Entry can't be pruned during recovery. */ + Assert(tpd_entry_data); + + memcpy(tpd_entry_data, tpd_offset_map, num_entries * entry_size); + + MarkBufferDirty(tpd_buf); +} + +/* + * TPDPageSetUndo - Set the transaction information for a given transaction + * slot in the TPD entry. The difference between this function and + * TPDPageSetTransactionSlotInfo is that here along with transaction + * info, we update the offset to transaction slot map in the TPD entry as + * well. + * + * Caller is responsible for WAL logging this operation and release the TPD + * buffers. We have thought of WAL logging this as a separate operation, but + * that won't work as the undorecord pointer can be bogus during WAL replay; + * that is because we regenerate the undo during WAL replay and it is quite + * possible that the system crashes after flushing this WAL record but before + * flushing WAL of actual heap operation. Similarly, doing it after heap + * operation is not feasible as in that case the tuple's transaction + * information can get lost. 
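+ *
+ * A minimal sketch of the expected calling sequence (the real callers live
+ * in the zheap layer and may differ in detail):
+ *
+ * START_CRIT_SECTION();
+ * TPDPageSetUndo(buf, trans_slot_id, ...);
+ * ... register the TPD buffer via RegisterTPDBuffer(), insert the WAL
+ *     record for the heap operation, stamp the TPD page via
+ *     TPDPageSetLSN() ...
+ * END_CRIT_SECTION();
+ * UnlockReleaseTPDBuffers();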
+ */ +void +TPDPageSetUndo(Buffer heapbuf, int trans_slot_id, bool set_tpd_map_slot, + uint32 epoch, TransactionId xid, UndoRecPtr urec_ptr, + OffsetNumber *usedoff, int ucnt) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + Page heappage = BufferGetPage(heapbuf); + ZHeapPageOpaque zopaque; + TransInfo trans_slot_info, last_trans_slot_info; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + Buffer tpd_buf; + Page tpdpage; + BlockNumber tpdblk; + TPDEntryHeaderData tpd_e_hdr; + TPDPageOpaque tpdopaque; + uint64 tpd_latest_xid_epoch, current_xid_epoch; + Size size_tpd_e_map; + uint32 tpd_e_num_map_entries; + int trans_slot_loc; + int buf_idx; + int i; + char *tpd_entry_data; + OffsetNumber tpdItemOff; + ItemId itemId; + uint16 tpd_e_offset; + bool already_exists; + + phdr = (PageHeader) heappage; + + /* Heap page must have TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + tpdItemOff = last_trans_slot_info.xid & OFFSET_MASK; + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + + /* + * Fetch the required TPD entry. Ensure that we are operating on the + * right buffer. + */ + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + Assert(BufferIsValid(tpd_buf)); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + Assert(BufferGetBlockNumber(tpd_buf) == tpdblk); + + tpdpage = BufferGetPage(tpd_buf); + itemId = PageGetItemId(tpdpage, tpdItemOff); + + /* + * TPD entry can't go away as we acquire the lock while reserving the slot + * from TPD entry and keep it till we set the required transaction + * information in the slot. + */ + Assert(ItemIdIsUsed(itemId)); + + tpd_e_offset = ItemIdGetOffset(itemId); + + memcpy((char *) &tpd_e_hdr, tpdpage + tpd_e_offset, SizeofTPDEntryHeader); + + /* TPD entry can't be pruned. */ + Assert(tpd_e_hdr.blkno == BufferGetBlockNumber(heapbuf)); + + /* We should never access deleted entry. */ + Assert(!TPDEntryIsDeleted(tpd_e_hdr)); + + tpd_e_num_map_entries = tpd_e_hdr.tpe_num_map_entries; + tpd_entry_data = tpdpage + tpd_e_offset + SizeofTPDEntryHeader; + + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + size_tpd_e_map = tpd_e_num_map_entries * sizeof(uint8); + else + size_tpd_e_map = tpd_e_num_map_entries * sizeof(uint32); + + /* + * Update TPD entry map for all the modified offsets if we + * have asked to do so. + */ + if (set_tpd_map_slot) + { + /* */ + if (tpd_e_hdr.tpe_flags & TPE_ONE_BYTE) + { + uint8 offset_tpd_e_loc; + + offset_tpd_e_loc = (uint8) trans_slot_id; + + for (i = 0; i < ucnt; i++) + { + /* + * The item for which we want to update the transaction slot information + * must be present in this TPD entry. + */ + Assert (usedoff[i] <= tpd_e_num_map_entries); + /* + * One byte access shouldn't cause unaligned access, but using memcpy + * for the sake of consistency. + */ + memcpy(tpd_entry_data + (usedoff[i] - 1), + (char *) &offset_tpd_e_loc, + sizeof(uint8)); + } + } + else + { + uint32 offset_tpd_e_loc; + + Assert(tpd_e_hdr.tpe_flags & TPE_FOUR_BYTE); + + offset_tpd_e_loc = trans_slot_id; + for (i = 0; i < ucnt; i++) + { + /* + * The item for which we want to update the transaction slot + * information must be present in this TPD entry. 
+ */ + Assert (usedoff[i] <= tpd_e_num_map_entries); + memcpy(tpd_entry_data + (sizeof(uint32) * (usedoff[i] - 1)), + (char *) &offset_tpd_e_loc, + sizeof(uint32)); + } + } + } + + /* Update the required transaction slot information. */ + trans_slot_loc = (trans_slot_id - ZHEAP_PAGE_TRANS_SLOTS - 1) * + sizeof(TransInfo); + trans_slot_info.xid_epoch = epoch; + trans_slot_info.xid = xid; + trans_slot_info.urec_ptr = urec_ptr; + memcpy(tpd_entry_data + size_tpd_e_map + trans_slot_loc, + (char *) &trans_slot_info, + sizeof(TransInfo)); + /* Update latest transaction information on the page. */ + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + tpd_latest_xid_epoch = (uint64) tpdopaque->tpd_latest_xid_epoch; + tpd_latest_xid_epoch = MakeEpochXid(tpd_latest_xid_epoch, + tpdopaque->tpd_latest_xid); + current_xid_epoch = (uint64) epoch; + current_xid_epoch = MakeEpochXid(current_xid_epoch, xid); + if (tpd_latest_xid_epoch < current_xid_epoch) + { + tpdopaque->tpd_latest_xid_epoch = epoch; + tpdopaque->tpd_latest_xid = xid; + } + + MarkBufferDirty(tpd_buf); +} + +/* + * TPDPageLock - Routine to lock the TPD page corresponding to heap page + * + * Caller should not already hold the lock. + * + * Returns false, if couldn't acquire lock because the page is pruned, + * otherwise, true. + */ +bool +TPDPageLock(Relation relation, Buffer heapbuf) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + Page heappage = BufferGetPage(heapbuf); + Page tpdpage; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + Buffer tpd_buf; + BlockNumber tpdblk, + lastblock; + int buf_idx; + bool already_exists; + + phdr = (PageHeader) heappage; + + /* Heap page must have TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + /* The last in page has the address of the required TPD entry. */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + tpdblk = last_trans_slot_info.xid_epoch; + + lastblock = RelationGetNumberOfBlocks(relation); + + if (lastblock <= tpdblk) + { + /* + * The required TPD block has been pruned and then truncated away + * which means all transaction slots on that page are older than + * oldestXidHavingUndo. So, we can't lock the page. + */ + goto failed; + } + + /* + * Fetch the required TPD entry. We need to lock the buffer in exclusive + * mode as we later want to set the values in one of the transaction slot. + */ + buf_idx = GetTPDBuffer(relation, tpdblk, InvalidBuffer, + TPD_BUF_FIND_OR_ENTER, &already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + tpdpage = BufferGetPage(tpd_buf); + + Assert(!already_exists); + LockBuffer(tpd_buf, BUFFER_LOCK_EXCLUSIVE); + + /* Check whether TPD entry can exist on page? */ + if (PageIsEmpty(tpdpage)) + { + ReleaseLastTPDBuffer(tpd_buf); + goto failed; + } + else if (PageGetSpecialSize(tpdpage) != MAXALIGN(sizeof(TPDPageOpaqueData))) + { + ReleaseLastTPDBuffer(tpd_buf); + goto failed; + } + + return true; + +failed: + /* + * The required TPD block has been pruned which means all transaction slots + * on that page are older than oldestXidHavingUndo. So, we can assume the + * TPD transaction slots are frozen aka transactions are all-visible and + * can clear the TPD slots from heap tuples. + */ + LogAndClearTPDLocation(relation, heapbuf, NULL); + return false; +} + +/* + * XLogReadTPDBuffer - Read the TPD buffer. 
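+ *
+ * A rough redo-side usage sketch (the block id is hypothetical):
+ *
+ * if (XLogReadTPDBuffer(record, 2) == BLK_NEEDS_REDO)
+ * {
+ *     ... apply the logged change to the TPD page ...
+ * }
+ * ...
+ * UnlockReleaseTPDBuffers();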
+ */ +XLogRedoAction +XLogReadTPDBuffer(XLogReaderState *record, uint8 block_id) +{ + Buffer tpd_buf; + XLogRedoAction action; + bool already_exists; + + action = XLogReadBufferForRedo(record, block_id, &tpd_buf); + + /* + * Remember the buffer, so that it can be release later via + * UnlockReleaseTPDBuffers. + */ + GetTPDBuffer(NULL, BufferGetBlockNumber(tpd_buf), tpd_buf, + TPD_BUF_FIND_OR_KNOWN_ENTER, &already_exists); + + return action; +} + +/* + * RegisterTPDBuffer - Register the TPD buffer + * + * returns the block_id that can be used to register additional buffers in the + * caller. + */ +uint8 +RegisterTPDBuffer(Page heappage, uint8 block_id) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + Buffer tpd_buf; + BlockNumber tpdblk; + int buf_idx; + bool already_exists; + + phdr = (PageHeader) heappage; + + /* Heap page must have TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + /* Get the tpd block number from last transaction slot in heap page. */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + tpdblk = last_trans_slot_info.xid_epoch; + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + + /* Return same block id if this buffer is already registered. */ + if (TPDBufferAlreadyRegistered(tpd_buf)) + return block_id; + + /* We must be in critical section to perform this action. */ + Assert(CritSectionCount > 0); + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + /* The TPD buffer must be valid and locked by me. */ + Assert(BufferIsValid(tpd_buf)); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + + XLogRegisterBuffer(block_id++, tpd_buf, REGBUF_STANDARD); + + return block_id; +} + +/* + * TPDPageSetLSN - Set LSN on TPD pages. + */ +void +TPDPageSetLSN(Page heappage, XLogRecPtr recptr) +{ + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + Buffer tpd_buf; + BlockNumber tpdblk; + int buf_idx; + bool already_exists; + + phdr = (PageHeader) heappage; + + /* Heap page must have TPD entry. */ + Assert(phdr->pd_flags & PD_PAGE_HAS_TPD_SLOT); + + /* Get the tpd block number from last transaction slot in heap page. */ + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(heappage); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + tpdblk = last_trans_slot_info.xid_epoch; + + buf_idx = GetTPDBuffer(NULL, tpdblk, InvalidBuffer, TPD_BUF_FIND, + &already_exists); + + /* We must get a valid buffer. */ + Assert(buf_idx != -1); + Assert(already_exists); + tpd_buf = tpd_buffers[buf_idx].buf; + + /* Reset the registered buffer index. */ + registered_tpd_buf_idx = 0; + + /* + * Before recording the LSN, ensure that the TPD buffer must be valid and + * locked by me. + */ + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + Assert(BufferIsValid(tpd_buf)); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + Assert(BufferGetBlockNumber(tpd_buf) == tpdblk); + + PageSetLSN(BufferGetPage(tpd_buf), recptr); +} + +/* + * ResetTPDBuffers - Reset TPD buffer index. Required at the time of + * transaction abort or release TPD buffers. 
+ */ +void +ResetTPDBuffers(void) +{ + int i; + + for (i = 0; i < tpd_buf_idx; i++) + { + tpd_buffers[i].buf = InvalidBuffer; + tpd_buffers[i].blk = InvalidBlockNumber; + } + + tpd_buf_idx = 0; +} +/* + * UnlockReleaseTPDBuffers - Release all the TPD buffers locked by me. + */ +void +UnlockReleaseTPDBuffers(void) +{ + Buffer tpd_buf; + BufferDesc *tpdbufhdr PG_USED_FOR_ASSERTS_ONLY; + int i; + + for (i = 0; i < tpd_buf_idx; i++) + { + tpd_buf = tpd_buffers[i].buf; + Assert(BufferIsValid(tpd_buf)); + tpdbufhdr = GetBufferDescriptor(tpd_buf - 1); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(tpdbufhdr), + LW_EXCLUSIVE)); + UnlockReleaseBuffer(tpd_buf); + } + + ResetTPDBuffers(); +} + +/* + * PageGetTPDFreeSpace + * Returns the size of the free (allocatable) space on a page. + * + * As of now, this is just a wrapper over PageGetFreeSpace, however in future, + * the space management in TPD pages could be different. + */ +Size +PageGetTPDFreeSpace(Page page) +{ + int space; + + /* + * Use signed arithmetic here so that we behave sensibly if pd_lower > + * pd_upper. + */ + space = PageGetFreeSpace(page); + + return (Size) space; +} diff --git a/src/backend/access/zheap/tpdxlog.c b/src/backend/access/zheap/tpdxlog.c new file mode 100644 index 0000000000..7b69ac47d3 --- /dev/null +++ b/src/backend/access/zheap/tpdxlog.c @@ -0,0 +1,522 @@ +/*------------------------------------------------------------------------- + * + * tpdxlog.c + * WAL replay logic for tpd. + * + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/zheap/tpdxlog.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/tpd_xlog.h" +#include "access/xlogutils.h" +#include "access/zheapam_xlog.h" + +/* + * replay of tpd entry allocation + */ +static void +tpd_xlog_allocate_entry(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_tpd_allocate_entry *xlrec; + Buffer tpdbuffer; + Buffer heap_page_buffer; + Buffer metabuf = InvalidBuffer; + Buffer last_used_buf = InvalidBuffer; + Buffer old_tpd_buf = InvalidBuffer; + Page tpdpage; + TPDPageOpaque tpdopaque; + XLogRedoAction action; + + xlrec = (xl_tpd_allocate_entry *) XLogRecGetData(record); + + /* + * If we inserted the first and only tpd entry on the page, re-initialize + * the page from scratch. + */ + if (XLogRecGetInfo(record) & XLOG_TPD_INIT_PAGE) + { + tpdbuffer = XLogInitBufferForRedo(record, 0); + tpdpage = BufferGetPage(tpdbuffer); + TPDInitPage(tpdpage, BufferGetPageSize(tpdbuffer)); + action = BLK_NEEDS_REDO; + } + else + action = XLogReadBufferForRedo(record, 0, &tpdbuffer); + if (action == BLK_NEEDS_REDO) + { + char *tpd_entry; + Size size_tpd_entry; + OffsetNumber offnum; + + tpd_entry = XLogRecGetBlockData(record, 0, &size_tpd_entry); + tpdpage = BufferGetPage(tpdbuffer); + offnum = TPDPageAddEntry(tpdpage, tpd_entry, size_tpd_entry, + xlrec->offnum); + if (offnum == InvalidOffsetNumber) + elog(PANIC, "failed to add TPD entry"); + MarkBufferDirty(tpdbuffer); + PageSetLSN(tpdpage, lsn); + + /* The TPD entry must be added at the provided offset. 
*/ + Assert(offnum == xlrec->offnum); + + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + tpdopaque->tpd_prevblkno = xlrec->prevblk; + + MarkBufferDirty(tpdbuffer); + PageSetLSN(tpdpage, lsn); + } + else if (action == BLK_RESTORED) + { + /* + * Note that we still update the page even if it was restored from a full + * page image, because the special space is not included in the image. + */ + tpdpage = BufferGetPage(tpdbuffer); + + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(tpdpage); + tpdopaque->tpd_prevblkno = xlrec->prevblk; + + MarkBufferDirty(tpdbuffer); + PageSetLSN(tpdpage, lsn); + } + + if (XLogReadBufferForRedo(record, 1, &heap_page_buffer) == BLK_NEEDS_REDO) + { + /* Set the TPD location in last transaction slot of heap page. */ + SetTPDLocation(heap_page_buffer, tpdbuffer, xlrec->offnum); + MarkBufferDirty(heap_page_buffer); + + PageSetLSN(BufferGetPage(heap_page_buffer), lsn); + } + + /* replay the record for meta page */ + if (XLogRecHasBlockRef(record, 2)) + { + xl_zheap_metadata *xlrecmeta; + char *ptr; + Size len; + + metabuf = XLogInitBufferForRedo(record, 2); + ptr = XLogRecGetBlockData(record, 2, &len); + + Assert(len == SizeOfMetaData); + Assert(BufferGetBlockNumber(metabuf) == ZHEAP_METAPAGE); + xlrecmeta = (xl_zheap_metadata *) ptr; + + zheap_init_meta_page(metabuf, xlrecmeta->first_used_tpd_page, + xlrecmeta->last_used_tpd_page); + MarkBufferDirty(metabuf); + PageSetLSN(BufferGetPage(metabuf), lsn); + + /* + * We can have reference of block 3, iff we have reference for block + * 2. + */ + if (XLogRecHasBlockRef(record, 3)) + { + action = XLogReadBufferForRedo(record, 3, &last_used_buf); + /* + * Note that we still update the page even if it was restored from a full + * page image, because the special space is not included in the image. + */ + if (action == BLK_NEEDS_REDO || action == BLK_RESTORED) + { + Page last_used_page; + TPDPageOpaque last_tpdopaque; + + last_used_page = BufferGetPage(last_used_buf); + last_tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(last_used_page); + last_tpdopaque->tpd_nextblkno = xlrec->nextblk; + + /* old and last tpd buffer are same. */ + if (xlrec->flags & XLOG_OLD_TPD_BUF_EQ_LAST_TPD_BUF) + { + TPDEntryHeader old_tpd_entry; + Page otpdpage; + char *data; + OffsetNumber *off_num; + Size datalen PG_USED_FOR_ASSERTS_ONLY; + ItemId old_item_id; + + if (action == BLK_NEEDS_REDO) + { + data = XLogRecGetBlockData(record, 3, &datalen); + + off_num = (OffsetNumber *)data; + Assert(datalen == sizeof(OffsetNumber)); + + otpdpage = BufferGetPage(last_used_buf); + old_item_id = PageGetItemId(otpdpage, *off_num); + old_tpd_entry = (TPDEntryHeader)PageGetItem(otpdpage, old_item_id); + old_tpd_entry->tpe_flags |= TPE_DELETED; + } + + /* We can't have a separate reference for old tpd buffer. */ + Assert(!XLogRecHasBlockRef(record, 4)); + } + + MarkBufferDirty(last_used_buf); + PageSetLSN(last_used_page, lsn); + } + } + + /* + * We can have reference of block 4, iff we have reference for block + * 2. 
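+ *
+ * For reference, the block references used by this record are: 0 = the
+ * TPD page receiving the new entry, 1 = the heap page whose last
+ * transaction slot points at it, 2 = the zheap meta page, 3 = the
+ * previously last-used TPD page, and 4 = the old TPD page whose entry
+ * gets marked as deleted.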
+ */ + if (XLogRecHasBlockRef(record, 4)) + { + TPDEntryHeader old_tpd_entry; + Page otpdpage; + char *data; + OffsetNumber *off_num; + Size datalen PG_USED_FOR_ASSERTS_ONLY; + ItemId old_item_id; + + action = XLogReadBufferForRedo(record, 4, &old_tpd_buf); + + if (action == BLK_NEEDS_REDO) + { + data = XLogRecGetBlockData(record, 4, &datalen); + + off_num = (OffsetNumber *) data; + Assert(datalen == sizeof(OffsetNumber)); + + otpdpage = BufferGetPage(old_tpd_buf); + old_item_id = PageGetItemId(otpdpage, *off_num); + old_tpd_entry = (TPDEntryHeader) PageGetItem(otpdpage, old_item_id); + old_tpd_entry->tpe_flags |= TPE_DELETED; + + MarkBufferDirty(old_tpd_buf); + PageSetLSN(BufferGetPage(old_tpd_buf), lsn); + } + } + } + + if (BufferIsValid(tpdbuffer)) + UnlockReleaseBuffer(tpdbuffer); + if (BufferIsValid(heap_page_buffer)) + UnlockReleaseBuffer(heap_page_buffer); + if (BufferIsValid(metabuf)) + UnlockReleaseBuffer(metabuf); + if (BufferIsValid(last_used_buf)) + UnlockReleaseBuffer(last_used_buf); + if (BufferIsValid(old_tpd_buf)) + UnlockReleaseBuffer(old_tpd_buf); +} + +/* + * replay inplace update of TPD entry + */ +static void +tpd_xlog_inplace_update_entry(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer tpdbuf; + XLogRedoAction action; + + /* + * If we have a full-page image, restore it (using a cleanup lock) and + * we're done. + */ + action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, + &tpdbuf); + if (action == BLK_NEEDS_REDO) + { + Page tpdpage = (Page) BufferGetPage(tpdbuf); + ItemId item_id; + OffsetNumber *off_num; + char *data; + char *new_tpd_entry; + Size datalen, + size_new_tpd_entry; + uint16 tpd_e_offset; + + data = XLogRecGetBlockData(record, 0, &datalen); + off_num = (OffsetNumber *) data; + new_tpd_entry = (char *) ((char *) data + sizeof(OffsetNumber)); + size_new_tpd_entry = datalen - sizeof(OffsetNumber); + + item_id = PageGetItemId(tpdpage, *off_num); + tpd_e_offset = ItemIdGetOffset(item_id); + memcpy((char *) (tpdpage + tpd_e_offset), + new_tpd_entry, + size_new_tpd_entry); + ItemIdChangeLen(item_id, size_new_tpd_entry); + + MarkBufferDirty(tpdbuf); + PageSetLSN(tpdpage, lsn); + } + if (BufferIsValid(tpdbuf)) + UnlockReleaseBuffer(tpdbuf); +} + +/* + * replay of pruning tpd page + */ +static void +tpd_xlog_clean(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_tpd_clean *xlrec = (xl_tpd_clean *) XLogRecGetData(record); + Buffer tpdbuf; + XLogRedoAction action; + + /* + * If we have a full-page image, restore it (using a cleanup lock) and + * we're done. 
+ */ + action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, + &tpdbuf); + if (action == BLK_NEEDS_REDO) + { + Page tpdpage = (Page) BufferGetPage(tpdbuf); + Page tmppage; + OffsetNumber *end; + OffsetNumber *nowunused; + OffsetNumber *target_offnum; + OffsetNumber tmp_target_off; + Size *space_required; + Size tmp_spc_rqd; + Size datalen; + int nunused; + + if (xlrec->flags & XLZ_CLEAN_CONTAINS_OFFSET) + { + target_offnum = (OffsetNumber *) ((char *) xlrec + SizeOfTPDClean); + space_required = (Size *) ((char *) target_offnum + sizeof(OffsetNumber)); + } + else + { + target_offnum = &tmp_target_off; + *target_offnum = InvalidOffsetNumber; + space_required = &tmp_spc_rqd; + *space_required = 0; + } + + nowunused = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen); + end = (OffsetNumber *) ((char *) nowunused + datalen); + nunused = (end - nowunused); + + if (nunused >= 0) + { + /* Update all item pointers per the record, and repair fragmentation */ + TPDPagePruneExecute(tpdbuf, nowunused, nunused); + } + + tmppage = PageGetTempPageCopy(tpdpage); + TPDPageRepairFragmentation(tpdpage, tmppage, *target_offnum, + *space_required); + + /* + * Note: we don't worry about updating the page's prunability hints. + * At worst this will cause an extra prune cycle to occur soon. + */ + + MarkBufferDirty(tpdbuf); + PageSetLSN(tpdpage, lsn); + + pfree(tmppage); + } + if (BufferIsValid(tpdbuf)) + UnlockReleaseBuffer(tpdbuf); +} + +/* + * replay for clearing tpd location from heap page. + */ +static void +tpd_xlog_clear_location(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer buffer; + + if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO) + { + Page page = (Page) BufferGetPage(buffer); + + ClearTPDLocation(buffer); + MarkBufferDirty(buffer); + PageSetLSN(page, lsn); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); +} + +/* + * replay for freeing tpd page. + */ +static void +tpd_xlog_free_page(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + RelFileNode rnode; + xl_tpd_free_page *xlrec = (xl_tpd_free_page *) XLogRecGetData(record); + Buffer buffer = InvalidBuffer, + prevbuf = InvalidBuffer, + nextbuf = InvalidBuffer, + metabuf = InvalidBuffer; + BlockNumber blkno; + Page page; + XLogRedoAction action; + Size freespace; + + if (XLogRecHasBlockRef(record, 0)) + { + action = XLogReadBufferForRedo(record, 0, &prevbuf); + + /* + * Note that we still update the page even if it was restored from a full + * page image, because the special space is not included in the image. + */ + if (action == BLK_NEEDS_REDO || action == BLK_RESTORED) + { + TPDPageOpaque prevtpdopaque; + Page prevpage = (Page) BufferGetPage(prevbuf); + + prevtpdopaque = (TPDPageOpaque) PageGetSpecialPointer(prevpage); + prevtpdopaque->tpd_nextblkno = xlrec->nextblkno; + + MarkBufferDirty(prevbuf); + PageSetLSN(prevpage, lsn); + } + } + + XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno); + action = XLogReadBufferForRedo(record, 1, &buffer); + page = (Page) BufferGetPage(buffer); + + /* + * Note that we still update the page even if it was restored from a full + * page image, because the special space is not included in the image. 
+ */ + if (action == BLK_NEEDS_REDO || action == BLK_RESTORED) + { + TPDPageOpaque tpdopaque; + + tpdopaque = (TPDPageOpaque) PageGetSpecialPointer(page); + + tpdopaque->tpd_prevblkno = InvalidBlockNumber; + tpdopaque->tpd_nextblkno = InvalidBlockNumber; + tpdopaque->tpd_latest_xid_epoch = 0; + tpdopaque->tpd_latest_xid = InvalidTransactionId; + + MarkBufferDirty(buffer); + PageSetLSN(page, lsn); + } + + Assert(PageIsEmpty(page)); + Assert(blkno == BufferGetBlockNumber(buffer)); + freespace = PageGetTPDFreeSpace(page); + + if (XLogRecHasBlockRef(record, 2)) + { + action = XLogReadBufferForRedo(record, 2, &nextbuf); + + if (action == BLK_NEEDS_REDO || action == BLK_RESTORED) + { + TPDPageOpaque nexttpdopaque; + Page nextpage = (Page) BufferGetPage(nextbuf); + + nexttpdopaque = (TPDPageOpaque) PageGetSpecialPointer(nextpage); + nexttpdopaque->tpd_prevblkno = xlrec->prevblkno; + + MarkBufferDirty(nextbuf); + PageSetLSN(nextpage, lsn); + } + } + + if (XLogRecHasBlockRef(record, 3)) + { + xl_zheap_metadata *xlrecmeta; + char *ptr; + Size len; + + metabuf = XLogInitBufferForRedo(record, 3); + ptr = XLogRecGetBlockData(record, 3, &len); + + Assert(len == SizeOfMetaData); + Assert(BufferGetBlockNumber(metabuf) == ZHEAP_METAPAGE); + xlrecmeta = (xl_zheap_metadata *) ptr; + + zheap_init_meta_page(metabuf, xlrecmeta->first_used_tpd_page, + xlrecmeta->last_used_tpd_page); + MarkBufferDirty(metabuf); + PageSetLSN(BufferGetPage(metabuf), lsn); + } + + if (BufferIsValid(prevbuf)) + UnlockReleaseBuffer(prevbuf); + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + if (BufferIsValid(nextbuf)) + UnlockReleaseBuffer(nextbuf); + if (BufferIsValid(metabuf)) + UnlockReleaseBuffer(metabuf); + + /* Record the empty page in FSM. */ + XLogRecordPageWithFreeSpace(rnode, blkno, freespace); +} + +/* + * replay of pruning all the entries in tpd page. 
+ */ +static void +tpd_xlog_clean_all_entries(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer buffer; + + if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO) + { + Page page = (Page) BufferGetPage(buffer); + + ((PageHeader) page)->pd_lower = SizeOfPageHeaderData; + ((PageHeader) page)->pd_upper = ((PageHeader) page)->pd_special; + + MarkBufferDirty(buffer); + PageSetLSN(page, lsn); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); +} + +void +tpd_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info & XLOG_TPD_OPMASK) + { + case XLOG_ALLOCATE_TPD_ENTRY: + tpd_xlog_allocate_entry(record); + break; + case XLOG_INPLACE_UPDATE_TPD_ENTRY: + tpd_xlog_inplace_update_entry(record); + break; + case XLOG_TPD_CLEAN: + tpd_xlog_clean(record); + break; + case XLOG_TPD_CLEAR_LOCATION: + tpd_xlog_clear_location(record); + break; + case XLOG_TPD_FREE_PAGE: + tpd_xlog_free_page(record); + break; + case XLOG_TPD_CLEAN_ALL_ENTRIES: + tpd_xlog_clean_all_entries(record); + break; + default: + elog(PANIC, "tpd_redo: unknown op code %u", info); + } +} diff --git a/src/backend/access/zheap/zheapam.c b/src/backend/access/zheap/zheapam.c new file mode 100644 index 0000000000..c916bde3fb --- /dev/null +++ b/src/backend/access/zheap/zheapam.c @@ -0,0 +1,11877 @@ +/*------------------------------------------------------------------------- + * + * zheapam.c + * zheap access method code + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/heap/zheapam.c + * + * + * INTERFACE ROUTINES + * zheap_insert - insert zheap tuple into a relation + * + * NOTES + * This file contains the zheap_ routines which implement + * the POSTGRES zheap access method used for relations backed + * by undo storage. + * + * In zheap, we never generate subtransaction id and rather always use top + * transaction id. The sub-transaction id is mainly required to detect the + * visibility of tuple when the sub-transaction state is different from + * main transaction state, say due to Rollback To SavePoint. In zheap, we + * always perform undo actions to make sure that the tuple state reaches to + * the state where it is at the start of subtransaction in such a case. + * This will also help in avoiding the transaction slots to grow inside a + * page and will have lesser clog entries. Another advantage is that it + * will help us retaining the undo records for one transaction together + * in undo log instead of those being interleaved which will avoid having + * more undo records that have UREC_INFO_TRANSACTION. 
+ * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/bufmask.h" +#include "access/htup_details.h" +#include "access/parallel.h" +#include "access/relscan.h" +#include "access/sysattr.h" +#include "access/xact.h" +#include "access/relscan.h" +#include "access/tableam.h" +#include "access/tpd.h" +#include "access/tuptoaster.h" +#include "access/undoinsert.h" +#include "access/undolog.h" +#include "access/undolog_xlog.h" +#include "access/undorecord.h" +#include "access/visibilitymap.h" +#include "access/zheap.h" +#include "access/zhio.h" +#include "access/zhtup.h" +#include "access/zheapam_xlog.h" +#include "access/zheap.h" +#include "access/zheapscan.h" +#include "access/zmultilocker.h" +#include "catalog/catalog.h" +#include "executor/tuptable.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "postmaster/undoloop.h" +#include "storage/bufmgr.h" +#include "storage/lmgr.h" +#include "storage/predicate.h" +#include "storage/procarray.h" +#include "utils/datum.h" +#include "utils/expandeddatum.h" +#include "utils/inval.h" +#include "utils/memdebug.h" +#include "utils/rel.h" +#include "utils/tqual.h" + + /* + * Possible lock modes for a tuple. + */ +typedef enum LockOper +{ + /* SELECT FOR 'KEY SHARE/SHARE/NO KEY UPDATE/UPDATE' */ + LockOnly, + /* Via EvalPlanQual where after locking we will update it */ + LockForUpdate, + /* Update/Delete */ + ForUpdate +} LockOper; + +extern bool synchronize_seqscans; + +static ZHeapTuple zheap_prepare_insert(Relation relation, ZHeapTuple tup, + int options); +static Bitmapset * +ZHeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols, + ZHeapTuple oldtup, ZHeapTuple newtup); + +static void RelationPutZHeapTuple(Relation relation, Buffer buffer, + ZHeapTuple tuple); +static void log_zheap_update(Relation reln, UnpackedUndoRecord undorecord, + UnpackedUndoRecord newundorecord, UndoRecPtr urecptr, + UndoRecPtr newurecptr, Buffer oldbuf, Buffer newbuf, + ZHeapTuple oldtup, ZHeapTuple newtup, + int old_tup_trans_slot_id, int trans_slot_id, + int new_trans_slot_id, bool inplace_update, + bool all_visible_cleared, bool new_all_visible_cleared, + xl_undolog_meta *undometa); +static HTSU_Result +zheap_lock_updated_tuple(Relation rel, ZHeapTuple tuple, ItemPointer ctid, + TransactionId xid, LockTupleMode mode, LockOper lockopr, + CommandId cid, bool *rollback_and_relocked); +static void zheap_lock_tuple_guts(Relation rel, Buffer buf, ZHeapTuple zhtup, + TransactionId tup_xid, TransactionId xid, + LockTupleMode mode, LockOper lockopr, uint32 epoch, + int tup_trans_slot_id, int trans_slot_id, + TransactionId single_locker_xid, int single_locker_trans_slot, + UndoRecPtr prev_urecptr, CommandId cid, + bool any_multi_locker_member_alive); +static void compute_new_xid_infomask(ZHeapTuple zhtup, Buffer buf, + TransactionId tup_xid, int tup_trans_slot, + uint16 old_infomask, TransactionId add_to_xid, + int trans_slot, TransactionId single_locker_xid, + LockTupleMode mode, LockOper lockoper, + uint16 *result_infomask, int *result_trans_slot); +static ZHeapFreeOffsetRanges * +ZHeapGetUsableOffsetRanges(Buffer buffer, ZHeapTuple *tuples, int ntuples, + Size saveFreeSpace); +static inline void CheckAndLockTPDPage(Relation relation, int new_trans_slot_id, + int old_trans_slot_id, Buffer newbuf, + Buffer oldbuf); + +/* + * zheap_compute_data_size + * Determine size of the data area of a tuple to be constructed. 
+ * + * We can't start with zero offset for first attribute as that has a + * hidden assumption that tuple header is MAXALIGNED which is not true + * for zheap. For example, if the first attribute requires alignment + * (say it is four-byte varlena), then the code would assume the offset + * is aligned incase we start with zero offset for first attribute. So, + * always start with the actual byte from where the first attribute starts. + */ +Size +zheap_compute_data_size(TupleDesc tupleDesc, Datum *values, bool *isnull, + int t_hoff) +{ + Size data_length = t_hoff; + int i; + int numberOfAttributes = tupleDesc->natts; + + for (i = 0; i < numberOfAttributes; i++) + { + Datum val; + Form_pg_attribute atti; + + if (isnull[i]) + continue; + + val = values[i]; + atti = TupleDescAttr(tupleDesc, i); + + if (atti->attbyval) + { + /* byval attributes are stored unaligned in zheap. */ + data_length += atti->attlen; + } + else if (ATT_IS_PACKABLE(atti) && + VARATT_CAN_MAKE_SHORT(DatumGetPointer(val))) + { + /* + * we're anticipating converting to a short varlena header, so + * adjust length and don't count any alignment + */ + data_length += VARATT_CONVERTED_SHORT_SIZE(DatumGetPointer(val)); + } + else if (atti->attlen == -1 && + VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val))) + { + /* + * we want to flatten the expanded value so that the constructed + * tuple doesn't depend on it + */ + data_length = att_align_nominal(data_length, atti->attalign); + data_length += EOH_get_flat_size(DatumGetEOHP(val)); + } + else + { + data_length = att_align_datum(data_length, atti->attalign, + atti->attlen, val); + data_length = att_addlength_datum(data_length, atti->attlen, + val); + } + } + + return data_length - t_hoff; +} + +/* + * zheap_fill_tuple + * Load data portion of a tuple from values/isnull arrays + * + * We also fill the null bitmap (if any) and set the infomask bits + * that reflect the tuple's data contents. + * + * This function is same as heap_fill_tuple except for datatype of infomask + * parameter. + * + * NOTE: it is now REQUIRED that the caller have pre-zeroed the data area. + */ +void +zheap_fill_tuple(TupleDesc tupleDesc, + Datum *values, bool *isnull, + char *data, Size data_size, + uint16 *infomask, bits8 *bit) +{ + bits8 *bitP; + int bitmask; + int i; + int numberOfAttributes = tupleDesc->natts; + +#ifdef USE_ASSERT_CHECKING + char *start = data; +#endif + + if (bit != NULL) + { + bitP = &bit[-1]; + bitmask = HIGHBIT; + } + else + { + /* just to keep compiler quiet */ + bitP = NULL; + bitmask = 0; + } + + *infomask &= ~(ZHEAP_HASNULL | ZHEAP_HASVARWIDTH | ZHEAP_HASEXTERNAL); + + for (i = 0; i < numberOfAttributes; i++) + { + Form_pg_attribute att = TupleDescAttr(tupleDesc, i); + Size data_length; + + if (bit != NULL) + { + if (bitmask != HIGHBIT) + bitmask <<= 1; + else + { + bitP += 1; + *bitP = 0x0; + bitmask = 1; + } + + if (isnull[i]) + { + *infomask |= ZHEAP_HASNULL; + continue; + } + + *bitP |= bitmask; + } + + /* + * XXX we use the att_align macros on the pointer value itself, not on + * an offset. This is a bit of a hack. 
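+ *
+ * As a worked example of the resulting layout (attribute types chosen only
+ * for illustration): a tuple of (int4, int8) is filled as 4 + 8 = 12
+ * contiguous payload bytes with no padding in between, whereas the heap
+ * format would align the int8 column to an 8-byte boundary.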
+ */ + + if (att->attbyval) + { + /* pass-by-value */ + //data = (char *) att_align_nominal(data, att->attalign); + //store_att_byval(data, values[i], att->attlen); + memcpy(data, (char *) &values[i], att->attlen); + data_length = att->attlen; + } + else if (att->attlen == -1) + { + /* varlena */ + Pointer val = DatumGetPointer(values[i]); + + *infomask |= ZHEAP_HASVARWIDTH; + if (VARATT_IS_EXTERNAL(val)) + { + if (VARATT_IS_EXTERNAL_EXPANDED(val)) + { + /* + * we want to flatten the expanded value so that the + * constructed tuple doesn't depend on it + */ + ExpandedObjectHeader *eoh = DatumGetEOHP(values[i]); + + data = (char *) att_align_nominal(data, + att->attalign); + data_length = EOH_get_flat_size(eoh); + EOH_flatten_into(eoh, data, data_length); + } + else + { + *infomask |= ZHEAP_HASEXTERNAL; + /* no alignment, since it's short by definition */ + data_length = VARSIZE_EXTERNAL(val); + memcpy(data, val, data_length); + } + } + else if (VARATT_IS_SHORT(val)) + { + /* no alignment for short varlenas */ + data_length = VARSIZE_SHORT(val); + memcpy(data, val, data_length); + } + else if (VARLENA_ATT_IS_PACKABLE(att) && + VARATT_CAN_MAKE_SHORT(val)) + { + /* convert to short varlena -- no alignment */ + data_length = VARATT_CONVERTED_SHORT_SIZE(val); + SET_VARSIZE_SHORT(data, data_length); + memcpy(data + 1, VARDATA(val), data_length - 1); + } + else + { + /* full 4-byte header varlena */ + data = (char *) att_align_nominal(data, + att->attalign); + data_length = VARSIZE(val); + memcpy(data, val, data_length); + } + } + else if (att->attlen == -2) + { + /* cstring ... never needs alignment */ + *infomask |= ZHEAP_HASVARWIDTH; + Assert(att->attalign == 'c'); + data_length = strlen(DatumGetCString(values[i])) + 1; + memcpy(data, DatumGetPointer(values[i]), data_length); + } + else + { + /* fixed-length pass-by-reference */ + data = (char *) att_align_nominal(data, att->attalign); + Assert(att->attlen > 0); + data_length = att->attlen; + memcpy(data, DatumGetPointer(values[i]), data_length); + } + + data += data_length; + } + + Assert((data - start) == data_size); +} + +/* + * zheap_form_tuple + * construct a tuple from the given values[] and isnull[] arrays. + * + * This is similar to heap_form_tuple except for tuple header. Currently, + * we don't do anything special for Datum tuples, but eventually we need + * to do something about it. + */ +ZHeapTuple +zheap_form_tuple(TupleDesc tupleDescriptor, + Datum *values, + bool *isnull) +{ + ZHeapTuple tuple; /* return tuple */ + ZHeapTupleHeader td; /* tuple data */ + Size len, + data_len; + int hoff; + bool hasnull = false; + int numberOfAttributes = tupleDescriptor->natts; + int i; + + if (numberOfAttributes > MaxTupleAttributeNumber) + ereport(ERROR, + (errcode(ERRCODE_TOO_MANY_COLUMNS), + errmsg("number of columns (%d) exceeds limit (%d)", + numberOfAttributes, MaxTupleAttributeNumber))); + + /* + * Check for nulls + */ + for (i = 0; i < numberOfAttributes; i++) + { + if (isnull[i]) + { + hasnull = true; + break; + } + } + + /* + * Determine total space needed + */ + len = offsetof(ZHeapTupleHeaderData, t_bits); + + if (hasnull) + len += BITMAPLEN(numberOfAttributes); + + /* + * We don't MAXALIGN the tuple headers as we always make the copy of tuple + * to support in-place updates. + */ + hoff = len; + + data_len = zheap_compute_data_size(tupleDescriptor, values, isnull, hoff); + + len += data_len; + + /* + * Allocate and zero the space needed. 
Note that the tuple body and + * ZHeapTupleData management structure are allocated in one chunk. + */ + tuple = MemoryContextAllocExtended(CurrentMemoryContext, + ZHEAPTUPLESIZE + len, + MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO); + tuple->t_data = td = (ZHeapTupleHeader) ((char *) tuple + ZHEAPTUPLESIZE); + + /* + * And fill in the information. Note we fill the Datum fields even though + * this tuple may never become a Datum. This lets HeapTupleHeaderGetDatum + * identify the tuple type if needed. + */ + tuple->t_len = len; + ItemPointerSetInvalid(&(tuple->t_self)); + tuple->t_tableOid = InvalidOid; + + ZHeapTupleHeaderSetNatts(td, numberOfAttributes); + td->t_hoff = hoff; + + zheap_fill_tuple(tupleDescriptor, + values, + isnull, + (char *) td + hoff, + data_len, + &td->t_infomask, + (hasnull ? td->t_bits : NULL)); + + return tuple; +} + +/* + * zheap_deform_tuple - similar to heap_deform_tuple, but for zheap tuples. + * + * Note that for zheap, cached offsets are not used and we always start + * deforming with the actual byte from where the first attribute starts. See + * atop zheap_compute_data_size. + */ +void +zheap_deform_tuple(ZHeapTuple tuple, TupleDesc tupleDesc, + Datum *values, bool *isnull) +{ + ZHeapTupleHeader tup = tuple->t_data; + bool hasnulls = ZHeapTupleHasNulls(tuple); + int tdesc_natts = tupleDesc->natts; + int natts; /* number of atts to extract */ + int attnum; + char *tp; /* ptr to tuple data */ + long off; /* offset in tuple data */ + bits8 *bp = tup->t_bits; /* ptr to null bitmap in tuple */ + + natts = ZHeapTupleHeaderGetNatts(tup); + + /* + * In inheritance situations, it is possible that the given tuple actually + * has more fields than the caller is expecting. Don't run off the end of + * the caller's arrays. + */ + natts = Min(natts, tdesc_natts); + + tp = (char *) tup; + + off = tup->t_hoff; + + for (attnum = 0; attnum < natts; attnum++) + { + Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum); + + if (hasnulls && att_isnull(attnum, bp)) + { + values[attnum] = (Datum) 0; + isnull[attnum] = true; + continue; + } + + isnull[attnum] = false; + + if (thisatt->attlen == -1) + { + off = att_align_pointer(off, thisatt->attalign, -1, + tp + off); + } + else if (!thisatt->attbyval) + { + /* not varlena, so safe to use att_align_nominal */ + off = att_align_nominal(off, thisatt->attalign); + } + + /* + * Support fetching attributes for zheap. The main difference as + * compare to heap tuples is that we don't align passbyval attributes. + * To compensate that we use memcpy to fetch passbyval attributes. + */ + if (thisatt->attbyval) + memcpy(&values[attnum], tp + off, thisatt->attlen); + else + values[attnum] = PointerGetDatum((char *) (tp + off)); + + off = att_addlength_pointer(off, thisatt->attlen, tp + off); + } + + /* + * If tuple doesn't have all the atts indicated by tupleDesc, read the + * rest as nulls or missing values as appropriate. 
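+ *
+ * As in heap_deform_tuple, the trailing attributes are filled from the
+ * tuple descriptor's missing-attribute data (e.g. columns added later
+ * with ALTER TABLE ... ADD COLUMN ... DEFAULT), or with NULLs when no
+ * missing value is recorded.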
+ */ + for (; attnum < tdesc_natts; attnum++) + values[attnum] = getmissingattr(tupleDesc, attnum + 1, &isnull[attnum]); +} + +void +slot_deform_ztuple(TupleTableSlot *slot, ZHeapTuple tuple, uint32 *offp, int natts) +{ + TupleDesc tupleDesc = slot->tts_tupleDescriptor; + Datum *values = slot->tts_values; + bool *isnull = slot->tts_isnull; + ZHeapTupleHeader tup = tuple->t_data; + bool hasnulls = ZHeapTupleHasNulls(tuple); + int attnum; + char *tp; /* ptr to tuple data */ + uint32 off; /* offset in tuple data */ + bits8 *bp = tup->t_bits; /* ptr to null bitmap in tuple */ + + /* We can only fetch as many attributes as the tuple has. */ + natts = Min(HeapTupleHeaderGetNatts(tuple->t_data), natts); + + /* + * Check whether the first call for this tuple, and initialize or restore + * loop state. + */ + attnum = slot->tts_nvalid; + if (attnum == 0) + off = 0; /* Start from the first attribute */ + else + off = *offp; /* Restore state from previous execution */ + + tp = (char *) tup + tup->t_hoff + off; + + for (; attnum < natts; attnum++) + { + Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum); + + if (hasnulls && att_isnull(attnum, bp)) + { + values[attnum] = (Datum) 0; + isnull[attnum] = true; + continue; + } + + isnull[attnum] = false; + + if (thisatt->attlen == -1) + { + tp = (char *) att_align_pointer(tp, thisatt->attalign, -1, + tp); + } + else if (!thisatt->attbyval) + { + /* not varlena, so safe to use att_align_nominal */ + tp = (char *) att_align_nominal(tp, thisatt->attalign); + } + /* XXX: We don't align for byval attributes in zheap. */ + + /* + * Support fetching attributes for zheap. The main difference as + * compare to heap tuples is that we don't align passbyval attributes. + * To compensate that we use memcpy to fetch the source of passbyval + * attributes. + */ + if (thisatt->attbyval) + { + Datum datum; + + memcpy(&datum, tp, thisatt->attlen); + values[attnum] = fetch_att(&datum, true, thisatt->attlen); + } + else + values[attnum] = PointerGetDatum(tp); + + tp = att_addlength_pointer(tp, thisatt->attlen, tp); + } + + /* + * Save state for next execution + */ + slot->tts_nvalid = attnum; + *offp = tp - ((char *) tup + tup->t_hoff); + /* For zheap, cached offsets are not used. */ + /* ZBORKED: should just stop setting this */ + slot->tts_flags |= TTS_FLAG_SLOW; +} + +/* + * Subroutine for zheap_insert(). Prepares a tuple for insertion. + * + * This is similar to heap_prepare_insert except that we don't set + * information in tuple header as that needs to be either set in + * TPD entry or undorecord for this tuple. + */ +static ZHeapTuple +zheap_prepare_insert(Relation relation, ZHeapTuple tup, int options) +{ + + /* + * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL. + * If we skip writing/using WAL, we must force the relation down to disk + * (using heap_sync) before it's safe to commit the transaction. This + * requires writing out any dirty buffers of that relation and then doing + * a forced fsync. For zheap, we've to fsync the corresponding undo buffers + * as well. It is difficult to keep track of dirty undo buffers and fsync + * them at end of the operation in some function similar to heap_sync. + * But, if we're freezing the tuple during insertion, we can use the + * HEAP_INSERT_SKIP_WAL optimization since we don't write undo for the same. + */ + Assert(!(options & HEAP_INSERT_SKIP_WAL) || (options & HEAP_INSERT_FROZEN)); + + /* + * Parallel operations are required to be strictly read-only in a parallel + * worker. 
Parallel inserts are not safe even in the leader in the + * general case, because group locking means that heavyweight locks for + * relation extension or GIN page locks will not conflict between members + * of a lock group, but we don't prohibit that case here because there are + * useful special cases that we can safely allow, such as CREATE TABLE AS. + */ + if (IsParallelWorker()) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TRANSACTION_STATE), + errmsg("cannot insert tuples in a parallel worker"))); + + tup->t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + tup->t_data->t_infomask2 &= ~ZHEAP_XACT_SLOT; + + if (options & HEAP_INSERT_FROZEN) + ZHeapTupleHeaderSetXactSlot(tup->t_data, ZHTUP_SLOT_FROZEN); + tup->t_tableOid = RelationGetRelid(relation); + + /* + * If the new tuple is too big for storage or contains already toasted + * out-of-line attributes from some other relation, invoke the toaster. + */ + if (relation->rd_rel->relkind != RELKIND_RELATION && + relation->rd_rel->relkind != RELKIND_MATVIEW) + { + /* toast table entries should never be recursively toasted */ + Assert(!ZHeapTupleHasExternal(tup)); + return tup; + } + else if (ZHeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD) + return ztoast_insert_or_update(relation, tup, NULL, options); + else + return tup; +} + +/* + * xid_infomask_changed - It checks whether the relevant status for a tuple + * xid has changed. + * + * Note the Xid field itself must be compared separately. + */ +static inline bool +xid_infomask_changed(uint16 new_infomask, uint16 old_infomask) +{ + const uint16 interesting = ZHEAP_XID_LOCK_ONLY; + + if ((new_infomask & interesting) != (old_infomask & interesting)) + return true; + + return false; +} + +/* + * zheap_exec_pending_rollback - Execute pending rollback actions for the + * given buffer (page). + * + * This function expects that the input buffer is locked. We will release and + * reacquire the buffer lock in this function, the same can be done in all the + * callers of this function, but that is just a code duplication, so we instead + * do it here. + */ +bool +zheap_exec_pending_rollback(Relation rel, Buffer buffer, int slot_no, + TransactionId xwait) +{ + UndoRecPtr urec_ptr; + TransactionId xid; + uint32 epoch; + int out_slot_no PG_USED_FOR_ASSERTS_ONLY; + + out_slot_no = GetTransactionSlotInfo(buffer, + InvalidOffsetNumber, + slot_no, + &epoch, + &xid, + &urec_ptr, + true, + true); + + /* As the rollback is pending, the slot can't be frozen. */ + Assert(out_slot_no != ZHTUP_SLOT_FROZEN); + + if (xwait != xid) + return false; + + /* + * Release buffer lock before applying undo actions. + */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + process_and_execute_undo_actions_page(urec_ptr, rel, buffer, epoch, xid, slot_no); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + return true; +} + +/* + * zbuffer_exec_pending_rollback - apply any pending rollback on the input buffer + * + * This method traverses all the transaction slots of the current page including + * tpd slots and applies any pending aborts on the page. + * + * It expects the caller has an exclusive lock on the relation. It also returns + * the corresponding TPD block number in case it has rolled back any transactions + * from the corresponding TPD page, if any. 
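+ *
+ * An illustrative caller pattern (exact call sites may differ), assuming
+ * the caller already holds the required exclusive lock on the relation:
+ *
+ *     BlockNumber tpd_blkno;
+ *
+ *     zbuffer_exec_pending_rollback(rel, buf, &tpd_blkno);
+ *     if (BlockNumberIsValid(tpd_blkno))
+ *         ... pending aborts from a TPD slot on tpd_blkno were applied ...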
+ */ +void +zbuffer_exec_pending_rollback(Relation rel, Buffer buf, BlockNumber *tpd_blkno) +{ + int slot_no; + int total_trans_slots = 0; + uint64 epoch; + TransactionId xid; + UndoRecPtr urec_ptr; + TransInfo *trans_slots = NULL; + bool any_tpd_slot_rolled_back = false; + + Assert(tpd_blkno != NULL); + + /* + * Fetch all the transaction information from the page and its corresponding + * TPD page. + */ + trans_slots = GetTransactionsSlotsForPage(rel, buf, &total_trans_slots, tpd_blkno); + + for (slot_no = 0; slot_no < total_trans_slots; slot_no++) + { + epoch = trans_slots[slot_no].xid_epoch; + xid = trans_slots[slot_no].xid; + urec_ptr = trans_slots[slot_no].urec_ptr; + + /* + * There shouldn't be any other in-progress transaction as we hold an + * exclusive lock on the relation. + */ + Assert(TransactionIdIsCurrentTransactionId(xid) || + !TransactionIdIsInProgress(xid)); + + /* If the transaction is aborted, apply undo actions */ + if (TransactionIdIsValid(xid) && TransactionIdDidAbort(xid)) + { + /* Remember if we've rolled back a transactio from a TPD-slot. */ + if ((slot_no >= ZHEAP_PAGE_TRANS_SLOTS - 1) && + BlockNumberIsValid(*tpd_blkno)) + any_tpd_slot_rolled_back = true; + process_and_execute_undo_actions_page(urec_ptr, rel, buf, epoch, + xid, slot_no); + } + } + + /* + * If we've not rolled back anything from TPD slot, there is no + * need set the TPD buffer. + */ + if (!any_tpd_slot_rolled_back) + *tpd_blkno = InvalidBlockNumber; + + /* be tidy */ + pfree(trans_slots); +} + +/* + * zheap_insert - insert tuple into a zheap + * + * The functionality related to heap is quite similar to heap_insert, + * additionaly this function inserts an undo record and updates the undo + * pointer in page header or in TPD entry for this page. + * + * XXX - Visibility map and page is all visible checks are required to support + * index-only scans on zheap. + */ +void +zheap_insert(Relation relation, ZHeapTuple tup, CommandId cid, + int options, BulkInsertState bistate) +{ + TransactionId xid = InvalidTransactionId; + uint32 epoch = 0; + ZHeapTuple zheaptup; + UnpackedUndoRecord undorecord; + Buffer buffer; + Buffer vmbuffer = InvalidBuffer; + bool all_visible_cleared = false; + int trans_slot_id = InvalidXactSlotId; + Page page; + UndoRecPtr urecptr = InvalidUndoRecPtr, + prev_urecptr = InvalidUndoRecPtr; + xl_undolog_meta undometa; + uint8 vm_status = 0; + bool lock_reacquired; + bool skip_undo; + + /* + * We can skip inserting undo records if the tuples are to be marked + * as frozen. + */ + skip_undo = (options & HEAP_INSERT_FROZEN); + + if (!skip_undo) + { + /* We don't need a transaction id if we are skipping undo */ + xid = GetTopTransactionId(); + epoch = GetEpochForXid(xid); + } + + /* + * Assign an OID, and toast the tuple if necessary. + * + * Note: below this point, heaptup is the data we actually intend to store + * into the relation; tup is the caller's original untoasted data. + */ + zheaptup = zheap_prepare_insert(relation, tup, options); + +reacquire_buffer: + /* + * Find buffer to insert this tuple into. If the page is all visible, + * this will also pin the requisite visibility map page. 
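+ *
+ * We can come back here (via the reacquire_buffer label) whenever
+ * reserving a transaction slot below forces us to release the buffer
+ * lock, so any visibility map page pinned on an earlier attempt is
+ * released before retrying.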
+ */ + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + + buffer = RelationGetBufferForZTuple(relation, zheaptup->t_len, + InvalidBuffer, options, bistate, + &vmbuffer, NULL); + page = BufferGetPage(buffer); + + if (!skip_undo) + { + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. + * + * We don't yet know the offset number of the inserting tuple so just pass + * the max offset number + 1 so that if it need to get slot from the TPD + * it can ensure that the TPD has sufficient map entries. + */ + trans_slot_id = PageReserveTransactionSlot(relation, + buffer, + PageGetMaxOffsetNumber(page) + 1, + epoch, + xid, + &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + { + UnlockReleaseBuffer(buffer); + goto reacquire_buffer; + } + + if (trans_slot_id == InvalidXactSlotId) + { + UnlockReleaseBuffer(buffer); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + goto reacquire_buffer; + } + + /* transaction slot must be reserved before adding tuple to page */ + Assert(trans_slot_id != InvalidXactSlotId); + } + + if (options & HEAP_INSERT_SPECULATIVE) + { + /* + * We can't skip writing undo speculative insertions as we have to + * write the token in undo. + */ + Assert(!skip_undo); + + /* Mark the tuple as speculatively inserted tuple. */ + zheaptup->t_data->t_infomask |= ZHEAP_SPECULATIVE_INSERT; + } + + /* + * See heap_insert to know why checking conflicts is important + * before actually inserting the tuple. + */ + CheckForSerializableConflictIn(relation, NULL, InvalidBuffer); + + if (!skip_undo) + { + /* + * Prepare an undo record. Unlike other operations, insert operation + * doesn't have a prior version to store in undo, so ideally, we don't + * need to store any additional information like + * UREC_INFO_PAYLOAD_CONTAINS_SLOT for TPD entries. However, for the sake + * of consistency with inserts via non-inplace updates, we keep the + * additional information in this operation. Also, we need such an + * information in future where we need to know more information for undo + * tuples and it would be good for forensic purpose as well. + */ + undorecord.uur_type = UNDO_INSERT; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = relation->rd_id; + undorecord.uur_prevxid = FrozenTransactionId; + undorecord.uur_xid = xid; + undorecord.uur_cid = cid; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = BufferGetBlockNumber(buffer); + undorecord.uur_tuple.len = 0; + + /* + * Store the speculative insertion token in undo, so that we can retrieve + * it during visibility check of the speculatively inserted tuples. + * + * Note that we don't need to WAL log this value as this is a temporary + * information required only on master node to detect conflicts for + * Insert .. On Conflict. 
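+ *
+ * The token obtained from GetSpeculativeInsertionToken() is appended
+ * below as a uint32 undo payload, which is what the visibility code
+ * reads back when checking for speculative-insertion conflicts.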
+ */ + if (options & HEAP_INSERT_SPECULATIVE) + { + uint32 specToken; + + undorecord.uur_payload.len = sizeof(uint32); + specToken = GetSpeculativeInsertionToken(); + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *)&specToken, + sizeof(uint32)); + } + else + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + &undometa); + } + + /* + * If there is a valid vmbuffer get its status. The vmbuffer will not + * be valid if operated page is newly extended, see + * RelationGetBufferForZTuple. Also, anyway by default vm status + * bits are clear for those pages hence no need to clear it again! + */ + if (BufferIsValid(vmbuffer)) + vm_status = visibilitymap_get_status(relation, + BufferGetBlockNumber(buffer), + &vmbuffer); + + /* + * Lock the TPD page before starting critical section. We might need + * to access it in ZPageAddItemExtended. Note that if the transaction + * slot belongs to TPD entry, then the TPD page must be locked during + * slot reservation. + * + * XXX We can optimize this by avoid taking TPD page lock unless the page + * has some unused item which requires us to fetch the transaction + * information from TPD. + */ + if (trans_slot_id <= ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) page) && + PageHasFreeLinePointers((PageHeader) page)) + TPDPageLock(relation, buffer); + + /* NO EREPORT(ERROR) from here till changes are logged */ + START_CRIT_SECTION(); + + if (!(options & HEAP_INSERT_FROZEN)) + ZHeapTupleHeaderSetXactSlot(zheaptup->t_data, trans_slot_id); + + RelationPutZHeapTuple(relation, buffer, zheaptup); + + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) || + (vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + all_visible_cleared = true; + visibilitymap_clear(relation, + ItemPointerGetBlockNumber(&(zheaptup->t_self)), + vmbuffer, VISIBILITYMAP_VALID_BITS); + } + + if (!skip_undo) + { + Assert(undorecord.uur_block == ItemPointerGetBlockNumber(&(zheaptup->t_self))); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&(zheaptup->t_self)); + InsertPreparedUndo(); + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, xid, + urecptr, NULL, 0); + } + + MarkBufferDirty(buffer); + + /* XLOG stuff */ + if (RelationNeedsWAL(relation)) + { + xl_undo_header xlundohdr; + xl_zheap_insert xlrec; + xl_zheap_header xlhdr; + XLogRecPtr recptr; + Page page = BufferGetPage(buffer); + uint8 info = XLOG_ZHEAP_INSERT; + int bufflags = 0; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + + /* + * If this is a catalog, we need to transmit combocids to properly + * decode, so log that as well. + */ + if (RelationIsAccessibleInLogicalDecoding(relation)) + { + /* + * Fixme: This won't work as it needs to access cmin/cmax which + * we probably needs to retrieve from TPD or UNDO. + */ + /*log_heap_new_cid(relation, zheaptup);*/ + } + + /* + * If this is the single and first tuple on page, we can reinit the + * page instead of restoring the whole thing. Set flag, and hide + * buffer references from XLogInsert. + */ + if (ItemPointerGetOffsetNumber(&(zheaptup->t_self)) == FirstOffsetNumber && + PageGetMaxOffsetNumber(page) == FirstOffsetNumber) + { + info |= XLOG_ZHEAP_INIT_PAGE; + bufflags |= REGBUF_WILL_INIT; + } + + /* + * Store the information required to generate undo record during + * replay. + */ + xlundohdr.reloid = relation->rd_id; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = prev_urecptr; + + /* Heap related part. 
*/ + xlrec.offnum = ItemPointerGetOffsetNumber(&zheaptup->t_self); + xlrec.flags = 0; + + if (all_visible_cleared) + xlrec.flags |= XLZ_INSERT_ALL_VISIBLE_CLEARED; + if (options & HEAP_INSERT_SPECULATIVE) + xlrec.flags |= XLZ_INSERT_IS_SPECULATIVE; + if (skip_undo) + xlrec.flags |= XLZ_INSERT_IS_FROZEN; + Assert(ItemPointerGetBlockNumber(&zheaptup->t_self) == BufferGetBlockNumber(buffer)); + + /* + * For logical decoding, we need the tuple even if we're doing a full + * page write, so make sure it's included even if we take a full-page + * image. (XXX We could alternatively store a pointer into the FPW). + * + * Fixme - Current zheap doesn't support logical decoding, once it is + * supported, we need to test and remove this Fixme. + */ + if (RelationIsLogicallyLogged(relation)) + { + xlrec.flags |= XLZ_INSERT_CONTAINS_NEW_TUPLE; + bufflags |= REGBUF_KEEP_DATA; + } + +prepare_xlog: + if (!skip_undo) + { + /* + * LOG undolog meta if this is the first WAL after the checkpoint. + */ + LogUndoMetaData(&undometa); + } + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + + XLogBeginInsert(); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xlrec, SizeOfZHeapInsert); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + /* + * We can't have a valid transaction slot when we are skipping + * undo. + */ + Assert(!skip_undo); + xlrec.flags |= XLZ_INSERT_CONTAINS_TPD_SLOT; + XLogRegisterData((char *) &trans_slot_id, sizeof(trans_slot_id)); + } + + xlhdr.t_infomask2 = zheaptup->t_data->t_infomask2; + xlhdr.t_infomask = zheaptup->t_data->t_infomask; + xlhdr.t_hoff = zheaptup->t_data->t_hoff; + + /* + * note we mark xlhdr as belonging to buffer; if XLogInsert decides to + * write the whole page to the xlog, we don't need to store + * xl_heap_header in the xlog. + */ + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags); + XLogRegisterBufData(0, (char *) &xlhdr, SizeOfZHeapHeader); + /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */ + XLogRegisterBufData(0, + (char *) zheaptup->t_data + SizeofZHeapTupleHeader, + zheaptup->t_len - SizeofZHeapTupleHeader); + if (xlrec.flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + (void) RegisterTPDBuffer(page, 1); + + /* filtering by origin on a row level is much more efficient */ + XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN); + + recptr = XLogInsertExtended(RM_ZHEAP_ID, info, RedoRecPtr, + doPageWrites); + if (recptr == InvalidXLogRecPtr) + { + ResetRegisteredTPDBuffers(); + goto prepare_xlog; + } + + PageSetLSN(page, recptr); + if (xlrec.flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + UnlockReleaseBuffer(buffer); + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + if (!skip_undo) + { + /* be tidy */ + if (undorecord.uur_payload.len > 0) + pfree(undorecord.uur_payload.data); + UnlockReleaseUndoBuffers(); + } + UnlockReleaseTPDBuffers(); + + /* + * If tuple is cachable, mark it for invalidation from the caches in case + * we abort. Note it is OK to do this after releasing the buffer, because + * the zheaptup data structure is all in local memory, not in the shared + * buffer. + * + * Fixme - Cache invalidation API expects HeapTup, so either we need an + * eqvivalent API for ZHeapTup or need to teach cache invalidation API's + * to work with both the formats. 
+ */ + /* CacheInvalidateHeapTuple(relation, zheaptup, NULL); */ + + /* Note: speculative insertions are counted too, even if aborted later */ + pgstat_count_heap_insert(relation, 1); + + /* + * If zheaptup is a private copy, release it. Don't forget to copy t_self + * back to the caller's image, too. + */ + if (zheaptup != tup) + { + tup->t_self = zheaptup->t_self; + + /* + * Since, in ZHeap we have speculative flag in the tuple header only, + * copy the speculative flag to the new tuple if required. + */ + if (ZHeapTupleHeaderIsSpeculative(zheaptup->t_data)) + tup->t_data->t_infomask |= ZHEAP_SPECULATIVE_INSERT; + + zheap_freetuple(zheaptup); + } +} + +/* + * simple_zheap_delete - delete a zheap tuple + * + * This routine may be used to delete a tuple when concurrent updates of + * the target tuple are not expected (for example, because we have a lock + * on the relation associated with the tuple). Any failure is reported + * via ereport(). + */ +void +simple_zheap_delete(Relation relation, ItemPointer tid, Snapshot snapshot) +{ + HTSU_Result result; + HeapUpdateFailureData hufd; + + result = zheap_delete(relation, tid, + GetCurrentCommandId(true), InvalidSnapshot, snapshot, + true, /* wait for commit */ + &hufd, false /* changingPart */); + switch (result) + { + case HeapTupleSelfUpdated: + /* Tuple was already updated in current command? */ + elog(ERROR, "tuple already updated by self"); + break; + + case HeapTupleMayBeUpdated: + /* done successfully */ + break; + + case HeapTupleUpdated: + elog(ERROR, "tuple concurrently updated"); + break; + + default: + elog(ERROR, "unrecognized zheap_delete status: %u", result); + break; + } +} + +/* + * zheap_delete - delete a tuple + * + * The functionality related to heap is quite similar to heap_delete, + * additionaly this function inserts an undo record and updates the undo + * pointer in page header or in TPD entry for this page. + * + * XXX - Visibility map and page is all visible checks are required to support + * index-only scans on zheap. + */ +HTSU_Result +zheap_delete(Relation relation, ItemPointer tid, + CommandId cid, Snapshot crosscheck, Snapshot snapshot, bool wait, + HeapUpdateFailureData *hufd, bool changingPart) +{ + HTSU_Result result; + TransactionId xid = GetTopTransactionId(); + TransactionId tup_xid, + oldestXidHavingUndo, + single_locker_xid; + SubTransactionId tup_subxid = InvalidSubTransactionId; + CommandId tup_cid; + ItemId lp; + ZHeapTupleData zheaptup; + UnpackedUndoRecord undorecord; + Page page; + BlockNumber blkno; + OffsetNumber offnum; + Buffer buffer; + Buffer vmbuffer = InvalidBuffer; + UndoRecPtr urecptr, prev_urecptr; + ItemPointerData ctid; + uint32 epoch = GetEpochForXid(xid); + int tup_trans_slot_id, + trans_slot_id, + new_trans_slot_id, + single_locker_trans_slot; + uint16 new_infomask, temp_infomask; + bool have_tuple_lock = false; + bool in_place_updated_or_locked = false; + bool all_visible_cleared = false; + bool any_multi_locker_member_alive = false; + bool lock_reacquired; + bool hasSubXactLock = false; + bool hasPayload = false; + xl_undolog_meta undometa; + uint8 vm_status; + + Assert(ItemPointerIsValid(tid)); + + /* + * Forbid this during a parallel operation, lest it allocate a combocid. + * Other workers might need that combocid for visibility checks, and we + * have no provision for broadcasting it to them. 
+ */ + if (IsInParallelMode()) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TRANSACTION_STATE), + errmsg("cannot delete tuples during a parallel operation"))); + + blkno = ItemPointerGetBlockNumber(tid); + buffer = ReadBuffer(relation, blkno); + page = BufferGetPage(buffer); + + /* + * Before locking the buffer, pin the visibility map page mainly to avoid + * doing I/O after locking the buffer. + */ + visibilitymap_pin(relation, blkno, &vmbuffer); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + offnum = ItemPointerGetOffsetNumber(tid); + lp = PageGetItemId(page, offnum); + Assert(ItemIdIsNormal(lp) || ItemIdIsDeleted(lp)); + + /* + * If TID is already delete marked due to pruning, then get new ctid, so + * that we can delete the new tuple. We will get new ctid if the tuple + * was non-inplace-updated otherwise we will get same TID. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + zheaptup.t_tableOid = RelationGetRelid(relation); + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + zheaptup.t_self = *tid; + + ctid = *tid; + +check_tup_satisfies_update: + any_multi_locker_member_alive = true; + result = ZHeapTupleSatisfiesUpdate(relation, &zheaptup, cid, buffer, &ctid, + &tup_trans_slot_id, &tup_xid, &tup_subxid, + &tup_cid, &single_locker_xid, + &single_locker_trans_slot, false, false, + snapshot, &in_place_updated_or_locked); + + if (result == HeapTupleInvisible) + { + UnlockReleaseBuffer(buffer); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("attempted to delete invisible tuple"))); + } + else if ((result == HeapTupleBeingUpdated || + ((result == HeapTupleMayBeUpdated) && + ZHeapTupleHasMultiLockers(zheaptup.t_data->t_infomask))) && + wait) + { + List *mlmembers = NIL; + TransactionId xwait; + SubTransactionId xwait_subxid; + int xwait_trans_slot; + uint16 infomask; + bool isCommitted; + bool can_continue = false; + + lock_reacquired = false; + xwait_subxid = tup_subxid; + + if (TransactionIdIsValid(single_locker_xid)) + { + xwait = single_locker_xid; + xwait_trans_slot = single_locker_trans_slot; + } + else + { + xwait = tup_xid; + xwait_trans_slot = tup_trans_slot_id; + } + + infomask = zheaptup.t_data->t_infomask; + + /* + * Sleep until concurrent transaction ends -- except when there's a + * single locker and it's our own transaction. Note we don't care + * which lock mode the locker has, because we need the strongest one. + * + * Before sleeping, we need to acquire tuple lock to establish our + * priority for the tuple (see zheap_lock_tuple). LockTuple will + * release us when we are next-in-line for the tuple. + * + * If we are forced to "start over" below, we keep the tuple lock; + * this arranges that we stay at the head of the line while rechecking + * tuple state. + */ + if (ZHeapTupleHasMultiLockers(infomask)) + { + LockTupleMode old_lock_mode; + TransactionId update_xact; + bool upd_xact_aborted = false; + + /* + * In ZHeapTupleSatisfiesUpdate, it's not possible to know if current + * transaction has already locked the tuple for update because of + * multilocker flag. In that case, we've to check whether the current + * transaction has already locked the tuple for update. + */ + + /* + * Get the transaction slot and undo record pointer if we are already in a + * transaction. 
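+ *
+ * If this backend already holds a slot on the page, we walk our own
+ * multi-lock members below; finding a member of our own transaction
+ * holding LockTupleExclusive or stronger means the desired lock is
+ * effectively held already, and the delete proceeds without taking
+ * the tuple lock.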
+ */ + trans_slot_id = PageGetTransactionSlotId(relation, buffer, epoch, xid, + &prev_urecptr, false, false, + NULL); + + if (trans_slot_id != InvalidXactSlotId) + { + List *mlmembers; + ListCell *lc; + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. + */ + mlmembers = ZGetMultiLockMembersForCurrentXact(&zheaptup, + trans_slot_id, prev_urecptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. + */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= LockTupleExclusive) + { + result = HeapTupleMayBeUpdated; + /* + * There is no other active locker on the tuple except + * current transaction id, so we can delete the tuple. + */ + goto zheap_tuple_updated; + } + } + + list_free_deep(mlmembers); + } + + old_lock_mode = get_old_lock_mode(infomask); + + /* + * For aborted updates, we must allow to reverify the tuple in + * case it's values got changed. See the similar handling in + * zheap_update. + */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(zheaptup.t_data->t_infomask)) + ZHeapTupleGetTransInfo(&zheaptup, buffer, NULL, NULL, &update_xact, + NULL, NULL, false); + else + update_xact = InvalidTransactionId; + + if (DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_lock_mode), + HWLOCKMODE_from_locktupmode(LockTupleExclusive))) + { + /* + * There is a potential conflict. It is quite possible + * that by this time the locker has already been committed. + * So we need to check for conflict with all the possible + * lockers and wait for each of them after releasing a + * buffer lock and acquiring a lock on a tuple. + */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + mlmembers = ZGetMultiLockMembers(relation, &zheaptup, buffer, + true); + + /* + * If there is no multi-lock members apart from the current transaction + * then no need for tuplock, just go ahead. + */ + if (mlmembers != NIL) + { + heap_acquire_tuplock(relation, &(zheaptup.t_self), LockTupleExclusive, + LockWaitBlock, &have_tuple_lock); + ZMultiLockMembersWait(relation, mlmembers, &zheaptup, buffer, + update_xact, LockTupleExclusive, false, + XLTW_Delete, NULL, &upd_xact_aborted); + } + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * If the aborted xact is for update, then we need to reverify + * the tuple. + */ + if (upd_xact_aborted) + goto check_tup_satisfies_update; + lock_reacquired = true; + + /* + * There was no UPDATE in the Multilockers. No + * TransactionIdIsInProgress() call needed here, since we called + * ZMultiLockMembersWait() above. + */ + if (!TransactionIdIsValid(update_xact)) + can_continue = true; + } + } + else if (!TransactionIdIsCurrentTransactionId(xwait)) + { + /* + * Wait for regular transaction to end; but first, acquire tuple + * lock. 
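+ *
+ * When a valid subtransaction id was recorded for the tuple (fetched
+ * from undo), we wait via SubXactLockTableWait() on that subtransaction
+ * rather than on the top-level transaction.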
+ */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + heap_acquire_tuplock(relation, &(zheaptup.t_self), LockTupleExclusive, + LockWaitBlock, &have_tuple_lock); + if (xwait_subxid != InvalidSubTransactionId) + SubXactLockTableWait(xwait, xwait_subxid, relation, + &zheaptup.t_self, XLTW_Delete); + else + XactLockTableWait(xwait, relation, &zheaptup.t_self, + XLTW_Delete); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + lock_reacquired = true; + } + + if (lock_reacquired) + { + TransactionId current_tup_xid; + + /* + * By the time, we require the lock on buffer, some other xact + * could have updated this tuple. We need take care of the cases + * when page is pruned after we release the buffer lock. For this, + * we check if ItemId is not deleted and refresh the tuple offset + * position in page. If TID is already delete marked due to + * pruning, then get new ctid, so that we can update the new + * tuple. + * + * We also need to ensure that no new lockers have been added in + * the meantime, if there is any new locker, then start again. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + + if (ZHeapTupleHasMultiLockers(infomask)) + { + List *new_mlmembers; + new_mlmembers = ZGetMultiLockMembers(relation, &zheaptup, + buffer, false); + + /* + * Ensure, no new lockers have been added, if so, then start + * again. + */ + if (!ZMultiLockMembersSame(mlmembers, new_mlmembers)) + { + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + goto check_tup_satisfies_update; + } + + any_multi_locker_member_alive = + ZIsAnyMultiLockMemberRunning(new_mlmembers, &zheaptup, + buffer); + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + } + + /* + * xwait is done, but if xwait had just locked the tuple then some + * other xact could update/lock this tuple before we get to this + * point. Check for xid change, and start over if so. We need to + * do some special handling for lockers because their xid is never + * stored on the tuples. If there was a single locker on the + * tuple and that locker is gone and some new locker has locked + * the tuple, we won't be able to identify that by infomask/xid on + * the tuple, rather we need to fetch the locker xid. + */ + ZHeapTupleGetTransInfo(&zheaptup, buffer, NULL, NULL, + ¤t_tup_xid, NULL, NULL, false); + if (xid_infomask_changed(zheaptup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + { + if (ZHEAP_XID_IS_LOCKED_ONLY(zheaptup.t_data->t_infomask) && + !ZHeapTupleHasMultiLockers(zheaptup.t_data->t_infomask) && + TransactionIdIsValid(single_locker_xid)) + { + TransactionId current_single_locker_xid = InvalidTransactionId; + + (void) GetLockerTransInfo(relation, &zheaptup, buffer, NULL, + NULL, ¤t_single_locker_xid, + NULL, NULL); + if (!TransactionIdEquals(single_locker_xid, + current_single_locker_xid)) + goto check_tup_satisfies_update; + + } + else + goto check_tup_satisfies_update; + } + + /* Aborts of multi-lockers are already dealt above. */ + if(!ZHeapTupleHasMultiLockers(infomask)) + { + bool has_update = false; + + if (!ZHEAP_XID_IS_LOCKED_ONLY(zheaptup.t_data->t_infomask)) + has_update = true; + + isCommitted = TransactionIdDidCommit(xwait); + + /* + * For aborted transaction, if the undo actions are not applied + * yet, then apply them before modifying the page. 
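+ *
+ * Note that zheap_exec_pending_rollback() releases and reacquires the
+ * buffer lock while applying the undo actions (see its header comment);
+ * if the aborted transaction was an updater, the rollback restores the
+ * previous tuple contents, which is why we recheck the tuple below.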
+ */ + if (!isCommitted) + zheap_exec_pending_rollback(relation, + buffer, + xwait_trans_slot, + xwait); + + /* + * For aborted updates, we must allow to reverify the tuple in + * case it's values got changed. + */ + if (!isCommitted && has_update) + goto check_tup_satisfies_update; + + if (!has_update) + can_continue = true; + } + } + else + { + /* + * We can proceed with the delete, when there's a single locker + * and it's our own transaction. + */ + if (ZHEAP_XID_IS_LOCKED_ONLY(zheaptup.t_data->t_infomask)) + can_continue = true; + } + + /* + * We may overwrite if previous xid is aborted or committed, but only + * locked the tuple without updating it. + */ + if (result != HeapTupleMayBeUpdated) + result = can_continue ? HeapTupleMayBeUpdated : HeapTupleUpdated; + } + else if (result == HeapTupleUpdated + && ZHeapTupleHasMultiLockers(zheaptup.t_data->t_infomask)) + { + /* + * Get the transaction slot and undo record pointer if we are already in a + * transaction. + */ + trans_slot_id = PageGetTransactionSlotId(relation, buffer, epoch, xid, + &prev_urecptr, false, false, + NULL); + + if (trans_slot_id != InvalidXactSlotId) + { + List *mlmembers; + ListCell *lc; + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. + */ + mlmembers = ZGetMultiLockMembersForCurrentXact(&zheaptup, + trans_slot_id, prev_urecptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. + */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= LockTupleExclusive) + { + result = HeapTupleMayBeUpdated; + /* + * There is no other active locker on the tuple except + * current transaction id, so we can delete the tuple. + */ + break; + } + } + + list_free_deep(mlmembers); + } + + } + + if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated) + { + /* Perform additional check for transaction-snapshot mode RI updates */ + if (!ZHeapTupleSatisfies(&zheaptup, crosscheck, buffer, NULL)) + result = HeapTupleUpdated; + } + +zheap_tuple_updated: + if (result != HeapTupleMayBeUpdated) + { + Assert(result == HeapTupleSelfUpdated || + result == HeapTupleUpdated || + result == HeapTupleBeingUpdated); + Assert(ItemIdIsDeleted(lp) || + IsZHeapTupleModified(zheaptup.t_data->t_infomask)); + + /* If item id is deleted, tuple can't be marked as moved. */ + if (!ItemIdIsDeleted(lp) && + ZHeapTupleIsMoved(zheaptup.t_data->t_infomask)) + ItemPointerSetMovedPartitions(&hufd->ctid); + else + hufd->ctid = ctid; + hufd->xmax = tup_xid; + if (result == HeapTupleSelfUpdated) + hufd->cmax = tup_cid; + else + hufd->cmax = InvalidCommandId; + UnlockReleaseBuffer(buffer); + hufd->in_place_updated_or_locked = in_place_updated_or_locked; + if (have_tuple_lock) + UnlockTupleTuplock(relation, &(zheaptup.t_self), LockTupleExclusive); + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + return result; + } + + /* + * Acquire subtransaction lock, if current transaction is a + * subtransaction. 
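+ *
+ * The subtransaction id is also recorded in the undo payload further
+ * below (UREC_INFO_PAYLOAD_CONTAINS_SUBXACT), so that concurrent
+ * backends can wait on it via SubXactLockTableWait.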
+ */ + if (IsSubTransaction()) + { + SubXactLockTableInsert(GetCurrentSubTransactionId()); + hasSubXactLock = true; + } + + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. + */ + trans_slot_id = PageReserveTransactionSlot(relation, buffer, + PageGetMaxOffsetNumber(page), + epoch, xid, &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + goto check_tup_satisfies_update; + + if (trans_slot_id == InvalidXactSlotId) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release the + * buffer lock. For this we check if ItemId is not deleted and refresh + * the tuple offset position in page. If TID is already delete marked + * due to pruning, then get new ctid, so that we can delete the new + * tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + goto check_tup_satisfies_update; + } + + /* transaction slot must be reserved before adding tuple to page */ + Assert(trans_slot_id != InvalidXactSlotId); + + /* + * It's possible that tuple slot is now marked as frozen. Hence, we refetch + * the tuple here. + */ + Assert(!ItemIdIsDeleted(lp)); + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + + /* + * If the slot is marked as frozen, the latest modifier of the tuple must be + * frozen. + */ + if (ZHeapTupleHeaderGetXactSlot((ZHeapTupleHeader) (zheaptup.t_data)) == ZHTUP_SLOT_FROZEN) + { + tup_trans_slot_id = ZHTUP_SLOT_FROZEN; + tup_xid = InvalidTransactionId; + } + + temp_infomask = zheaptup.t_data->t_infomask; + + /* Compute the new xid and infomask to store into the tuple. */ + compute_new_xid_infomask(&zheaptup, buffer, tup_xid, tup_trans_slot_id, + temp_infomask, xid, trans_slot_id, + single_locker_xid, LockTupleExclusive, ForUpdate, + &new_infomask, &new_trans_slot_id); + /* + * There must not be any stronger locker than the current operation, + * otherwise it would have waited for it to finish. + */ + Assert(new_trans_slot_id == trans_slot_id); + + /* + * If the last transaction that has updated the tuple is already too + * old, then consider it as frozen which means it is all-visible. This + * ensures that we don't need to store epoch in the undo record to check + * if the undo tuple belongs to previous epoch and hence all-visible. See + * comments atop of file ztqual.c. + */ + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + if (TransactionIdPrecedes(tup_xid, oldestXidHavingUndo)) + tup_xid = FrozenTransactionId; + + CheckForSerializableConflictIn(relation, &(zheaptup.t_self), buffer); + + /* + * Prepare an undo record. We need to separately store the latest + * transaction id that has changed the tuple to ensure that we don't + * try to process the tuple in undo chain that is already discarded. + * See GetTupleFromUndo. 
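+ *
+ * The UNDO_DELETE record built below carries the complete old tuple in
+ * uur_tuple, laid out as:
+ *
+ *     uint32 t_len | ItemPointerData t_self | Oid t_tableOid | tuple data
+ *
+ * and, when needed, a payload holding the TPD transaction slot number
+ * and/or the current subtransaction id.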
+ */ + undorecord.uur_type = UNDO_DELETE; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = relation->rd_id; + undorecord.uur_prevxid = tup_xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = cid; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = blkno; + undorecord.uur_offset = offnum; + + initStringInfo(&undorecord.uur_tuple); + + /* + * Copy the entire old tuple including it's header in the undo record. + * We need this to reconstruct the tuple if current tuple is not + * visible to some other transaction. We choose to write the complete + * tuple in undo record for delete operation so that we can reuse the + * space after the transaction performing the operation commits. + */ + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_len, + sizeof(uint32)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_self, + sizeof(ItemPointerData)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_tableOid, + sizeof(Oid)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) zheaptup.t_data, + zheaptup.t_len); + /* + * Store the transaction slot number for undo tuple in undo record, if + * the slot belongs to TPD entry. We can always get the current tuple's + * transaction slot number by referring offset->slot map in TPD entry, + * however that won't be true for tuple in undo. + */ + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + hasPayload = true; + } + + /* + * Store subtransaction id in undo record. See SubXactLockTableWait + * to know why we need to store subtransaction id in undo. + */ + if (hasSubXactLock) + { + SubTransactionId subxid = GetCurrentSubTransactionId(); + + if (!hasPayload) + { + initStringInfo(&undorecord.uur_payload); + hasPayload = true; + } + + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SUBXACT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &subxid, + sizeof(subxid)); + } + + if (!hasPayload) + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + &undometa); + /* We must have a valid vmbuffer. */ + Assert(BufferIsValid(vmbuffer)); + vm_status = visibilitymap_get_status(relation, + BufferGetBlockNumber(buffer), &vmbuffer); + + START_CRIT_SECTION(); + + /* + * If all the members were lockers and are all gone, we can do away + * with the MULTI_LOCKERS bit. + */ + + if (ZHeapTupleHasMultiLockers(zheaptup.t_data->t_infomask) && + !any_multi_locker_member_alive) + zheaptup.t_data->t_infomask &= ~ZHEAP_MULTI_LOCKERS; + + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) || + (vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + all_visible_cleared = true; + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), + vmbuffer, VISIBILITYMAP_VALID_BITS); + } + + InsertPreparedUndo(); + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, xid, + urecptr, NULL, 0); + + /* + * If this transaction commits, the tuple will become DEAD sooner or + * later. If the transaction finally aborts, the subsequent page pruning + * will be a no-op and the hint will be cleared. 
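+ *
+ * ZPageSetPrunable() records the deleting xid as the page's prunable
+ * hint, analogous to PageSetPrunable() in the heap code.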
+ */ + ZPageSetPrunable(page, xid); + + ZHeapTupleHeaderSetXactSlot(zheaptup.t_data, new_trans_slot_id); + zheaptup.t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + zheaptup.t_data->t_infomask |= ZHEAP_DELETED | new_infomask; + + /* Signal that this is actually a move into another partition */ + if (changingPart) + ZHeapTupleHeaderSetMovedPartitions(zheaptup.t_data); + + MarkBufferDirty(buffer); + + /* + * Do xlog stuff + */ + if (RelationNeedsWAL(relation)) + { + ZHeapTupleHeader zhtuphdr = NULL; + xl_undo_header xlundohdr; + xl_zheap_delete xlrec; + xl_zheap_header xlhdr; + XLogRecPtr recptr; + XLogRecPtr RedoRecPtr; + uint32 totalundotuplen = 0; + Size dataoff; + bool doPageWrites; + + /* + * Store the information required to generate undo record during + * replay. + */ + xlundohdr.reloid = undorecord.uur_reloid; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = prev_urecptr; + + xlrec.prevxid = tup_xid; + xlrec.offnum = ItemPointerGetOffsetNumber(&zheaptup.t_self); + xlrec.infomask = zheaptup.t_data->t_infomask; + xlrec.trans_slot_id = trans_slot_id; + xlrec.flags = all_visible_cleared ? XLZ_DELETE_ALL_VISIBLE_CLEARED : 0; + + if (changingPart) + xlrec.flags |= XLZ_DELETE_IS_PARTITION_MOVE; + if (hasSubXactLock) + xlrec.flags |= XLZ_DELETE_CONTAINS_SUBXACT; + + /* + * If full_page_writes is enabled, and the buffer image is not + * included in the WAL then we can rely on the tuple in the page to + * regenerate the undo tuple during recovery as the tuple state must + * be same as now, otherwise we need to store it explicitly. + * + * Since we don't yet have the insert lock, including the page + * image decision could change later and in that case we need prepare + * the WAL record again. + */ +prepare_xlog: + /* LOG undolog meta if this is the first WAL after the checkpoint. 
*/ + LogUndoMetaData(&undometa); + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + if (!doPageWrites || XLogCheckBufferNeedsBackup(buffer)) + { + xlrec.flags |= XLZ_HAS_DELETE_UNDOTUPLE; + + totalundotuplen = *((uint32 *) &undorecord.uur_tuple.data[0]); + dataoff = sizeof(uint32) + sizeof(ItemPointerData) + sizeof(Oid); + zhtuphdr = (ZHeapTupleHeader) &undorecord.uur_tuple.data[dataoff]; + + xlhdr.t_infomask2 = zhtuphdr->t_infomask2; + xlhdr.t_infomask = zhtuphdr->t_infomask; + xlhdr.t_hoff = zhtuphdr->t_hoff; + } + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + xlrec.flags |= XLZ_DELETE_CONTAINS_TPD_SLOT; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xlrec, SizeOfZHeapDelete); + if (xlrec.flags & XLZ_DELETE_CONTAINS_TPD_SLOT) + XLogRegisterData((char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + if (xlrec.flags & XLZ_HAS_DELETE_UNDOTUPLE) + { + XLogRegisterData((char *) &xlhdr, SizeOfZHeapHeader); + /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */ + XLogRegisterData((char *) zhtuphdr + SizeofZHeapTupleHeader, + totalundotuplen - SizeofZHeapTupleHeader); + } + + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + (void) RegisterTPDBuffer(page, 1); + + /* filtering by origin on a row level is much more efficient */ + XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN); + + recptr = XLogInsertExtended(RM_ZHEAP_ID, XLOG_ZHEAP_DELETE, + RedoRecPtr, doPageWrites); + if (recptr == InvalidXLogRecPtr) + { + ResetRegisteredTPDBuffers(); + goto prepare_xlog; + } + PageSetLSN(page, recptr); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + /* be tidy */ + pfree(undorecord.uur_tuple.data); + if (undorecord.uur_payload.len > 0) + pfree(undorecord.uur_payload.data); + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + + UnlockReleaseUndoBuffers(); + /* + * If the tuple has toasted out-of-line attributes, we need to delete + * those items too. We have to do this before releasing the buffer + * because we need to look at the contents of the tuple, but it's OK to + * release the content lock on the buffer first. + */ + if (relation->rd_rel->relkind != RELKIND_RELATION && + relation->rd_rel->relkind != RELKIND_MATVIEW) + { + /* toast table entries should never be recursively toasted */ + Assert(!ZHeapTupleHasExternal(&zheaptup)); + } + else if (ZHeapTupleHasExternal(&zheaptup)) + ztoast_delete(relation, &zheaptup, false); + + /* Now we can release the buffer */ + ReleaseBuffer(buffer); + UnlockReleaseTPDBuffers(); + + /* + * Release the lmgr tuple lock, if we had it. + */ + if (have_tuple_lock) + UnlockTupleTuplock(relation, &(zheaptup.t_self), LockTupleExclusive); + + pgstat_count_heap_delete(relation); + + return HeapTupleMayBeUpdated; +} + +/* + * zheap_update - update a tuple + * + * This function either updates the tuple in-place or it deletes the old + * tuple and new tuple for non-in-place updates. Additionaly this function + * inserts an undo record and updates the undo pointer in page header or in + * TPD entry for this page. + * + * XXX - Visibility map and page is all visible needs to be maintained for + * index-only scans on zheap. + * + * For input and output values, see heap_update. 
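+ *
+ * Roughly, the update is done in place when the SHORTALIGN'd new tuple
+ * length does not exceed the old one, no index column is modified, and
+ * no TOAST work is needed; otherwise the old version is delete-marked
+ * and the new version is inserted, possibly on a different page.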
+ */ +HTSU_Result +zheap_update(Relation relation, ItemPointer otid, ZHeapTuple newtup, + CommandId cid, Snapshot crosscheck, Snapshot snapshot, bool wait, + HeapUpdateFailureData *hufd, LockTupleMode *lockmode) +{ + HTSU_Result result; + TransactionId xid = GetTopTransactionId(); + TransactionId tup_xid, + save_tup_xid, + oldestXidHavingUndo, + single_locker_xid; + SubTransactionId tup_subxid = InvalidSubTransactionId; + CommandId tup_cid; + Bitmapset *inplace_upd_attrs = NULL; + Bitmapset *key_attrs = NULL; + Bitmapset *interesting_attrs = NULL; + Bitmapset *modified_attrs = NULL; + ItemId lp; + ZHeapTupleData oldtup; + ZHeapTuple zheaptup; + UndoRecPtr urecptr, prev_urecptr, new_prev_urecptr; + UndoRecPtr new_urecptr = InvalidUndoRecPtr; + UnpackedUndoRecord undorecord, new_undorecord; + Page page; + BlockNumber block; + ItemPointerData ctid; + Buffer buffer, + newbuf, + vmbuffer = InvalidBuffer, + vmbuffer_new = InvalidBuffer; + Size newtupsize, + oldtupsize, + pagefree; + uint32 epoch = GetEpochForXid(xid); + int tup_trans_slot_id, + trans_slot_id, + new_trans_slot_id, + result_trans_slot_id, + single_locker_trans_slot; + uint16 old_infomask; + uint16 new_infomask, temp_infomask; + uint16 infomask_old_tuple = 0; + uint16 infomask_new_tuple = 0; + OffsetNumber old_offnum, max_offset; + bool all_visible_cleared = false; + bool new_all_visible_cleared = false; + bool have_tuple_lock = false; + bool is_index_updated = false; + bool use_inplace_update = false; + bool in_place_updated_or_locked = false; + bool key_intact = false; + bool checked_lockers = false; + bool locker_remains = false; + bool any_multi_locker_member_alive = false; + bool lock_reacquired; + bool need_toast; + bool hasSubXactLock = false; + xl_undolog_meta undometa; + uint8 vm_status; + uint8 vm_status_new = 0; + + Assert(ItemPointerIsValid(otid)); + + /* + * Forbid this during a parallel operation, lest it allocate a combocid. + * Other workers might need that combocid for visibility checks, and we + * have no provision for broadcasting it to them. + */ + if (IsInParallelMode()) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TRANSACTION_STATE), + errmsg("cannot update tuples during a parallel operation"))); + + /* + * Fetch the list of attributes to be checked for various operations. + * + * For in-place update considerations, this is wasted effort if we fail to + * update or have to put the new tuple on a different page. But we must + * compute the list before obtaining buffer lock --- in the worst case, if + * we are doing an update on one of the relevant system catalogs, we could + * deadlock if we try to fetch the list later. Note, that as of now + * system catalogs are always stored in heap, so we might not hit the + * deadlock case, but it can be supported in future. In any case, the + * relcache caches the data so this is usually pretty cheap. + * + * Note that we get a copy here, so we need not worry about relcache flush + * happening midway through. + */ + inplace_upd_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_HOT); + key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY); + + block = ItemPointerGetBlockNumber(otid); + buffer = ReadBuffer(relation, block); + page = BufferGetPage(buffer); + + interesting_attrs = NULL; + + /* + * Before locking the buffer, pin the visibility map page mainly to avoid + * doing I/O after locking the buffer. 
+ */ + visibilitymap_pin(relation, block, &vmbuffer); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + old_offnum = ItemPointerGetOffsetNumber(otid); + lp = PageGetItemId(page, old_offnum); + Assert(ItemIdIsNormal(lp) || ItemIdIsDeleted(lp)); + + /* + * If TID is already delete marked due to pruning, then get new ctid, so + * that we can update the new tuple. We will get new ctid if the tuple + * was non-inplace-updated otherwise we will get same TID. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *otid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + + /* + * Since tuple data is gone let's be conservative about lock mode. + * + * XXX We could optimize here by checking whether the key column is + * not updated and if so, then use lower lock level, but this case + * should be rare enough that it won't matter. + */ + *lockmode = LockTupleExclusive; + goto zheap_tuple_updated; + } + + /* + * Fill in enough data in oldtup for ZHeapDetermineModifiedColumns to work + * properly. + */ + oldtup.t_tableOid = RelationGetRelid(relation); + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + oldtup.t_self = *otid; + + /* the new tuple is ready, except for this: */ + newtup->t_tableOid = RelationGetRelid(relation); + + interesting_attrs = bms_add_members(interesting_attrs, inplace_upd_attrs); + interesting_attrs = bms_add_members(interesting_attrs, key_attrs); + + /* Determine columns modified by the update. */ + modified_attrs = ZHeapDetermineModifiedColumns(relation, interesting_attrs, + &oldtup, newtup); + + is_index_updated = bms_overlap(modified_attrs, inplace_upd_attrs); + + if (relation->rd_rel->relkind != RELKIND_RELATION && + relation->rd_rel->relkind != RELKIND_MATVIEW) + { + /* toast table entries should never be recursively toasted */ + Assert(!ZHeapTupleHasExternal(&oldtup)); + Assert(!ZHeapTupleHasExternal(newtup)); + need_toast = false; + } + else + need_toast = (newtup->t_len >= TOAST_TUPLE_THRESHOLD || + ZHeapTupleHasExternal(&oldtup) || + ZHeapTupleHasExternal(newtup)); + + oldtupsize = SHORTALIGN(oldtup.t_len); + newtupsize = SHORTALIGN(newtup->t_len); + + /* + * inplace updates can be done only if the length of new tuple is lesser + * than or equal to old tuple and there are no index column updates and + * the tuple does not require TOAST-ing. + */ + if ((newtupsize <= oldtupsize) && !is_index_updated && !need_toast) + use_inplace_update = true; + else + use_inplace_update = false; + + /* + * Similar to heap, if we're not updating any "key" column, we can grab a + * weaker lock type. See heap_update. + */ + if (!bms_overlap(modified_attrs, key_attrs)) + { + *lockmode = LockTupleNoKeyExclusive; + key_intact = true; + } + else + { + *lockmode = LockTupleExclusive; + key_intact = false; + } + + /* + * ctid needs to be fetched from undo chain. You might think that it will + * be always same as the passed in ctid as the old tuple is already visible + * out snapshot. However, it is quite possible that after checking the + * visibility of old tuple, some concurrent session would have performed + * non in-place update and in such a case we need can only get it via + * undo. 
+ */ + ctid = *otid; + +check_tup_satisfies_update: + checked_lockers = false; + locker_remains = false; + any_multi_locker_member_alive = true; + result = ZHeapTupleSatisfiesUpdate(relation, &oldtup, cid, buffer, &ctid, + &tup_trans_slot_id, &tup_xid, &tup_subxid, + &tup_cid, &single_locker_xid, + &single_locker_trans_slot, false, false, + snapshot, &in_place_updated_or_locked); + + if (result == HeapTupleInvisible) + { + UnlockReleaseBuffer(buffer); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("attempted to update invisible tuple"))); + } + else if ((result == HeapTupleBeingUpdated || + ((result == HeapTupleMayBeUpdated) && + ZHeapTupleHasMultiLockers(oldtup.t_data->t_infomask))) && + wait) + { + List *mlmembers; + TransactionId xwait; + SubTransactionId xwait_subxid; + int xwait_trans_slot; + uint16 infomask; + bool can_continue = false; + + xwait_subxid = tup_subxid; + + if (TransactionIdIsValid(single_locker_xid)) + { + xwait = single_locker_xid; + xwait_trans_slot = single_locker_trans_slot; + } + else + { + xwait = tup_xid; + xwait_trans_slot = tup_trans_slot_id; + } + + /* must copy state data before unlocking buffer */ + infomask = oldtup.t_data->t_infomask; + + if (ZHeapTupleHasMultiLockers(infomask)) + { + TransactionId update_xact; + LockTupleMode old_lock_mode; + int remain = 0; + bool isAborted; + bool upd_xact_aborted = false; + + /* + * In ZHeapTupleSatisfiesUpdate, it's not possible to know if current + * transaction has already locked the tuple for update because of + * multilocker flag. In that case, we've to check whether the current + * transaction has already locked the tuple for update. + */ + + /* + * Get the transaction slot and undo record pointer if we are already in a + * transaction. + */ + trans_slot_id = PageGetTransactionSlotId(relation, buffer, epoch, xid, + &prev_urecptr, false, false, + NULL); + + if (trans_slot_id != InvalidXactSlotId) + { + List *mlmembers; + ListCell *lc; + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. + */ + mlmembers = ZGetMultiLockMembersForCurrentXact(&oldtup, + trans_slot_id, prev_urecptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. + */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= *lockmode) + { + result = HeapTupleMayBeUpdated; + + /* + * There is no other active locker on the tuple except + * current transaction id, so we can update the tuple. + * However, we need to propagate lockers information. + */ + checked_lockers = true; + locker_remains = true; + goto zheap_tuple_updated; + } + } + + list_free_deep(mlmembers); + } + + old_lock_mode = get_old_lock_mode(infomask); + + /* + * For the conflicting lockers, we need to be careful about + * applying pending undo actions for aborted transactions; if we + * leave any transaction whether locker or updater, it can lead to + * inconsistency. 
Basically, in such a case after waiting for all + * the conflicting transactions we might clear the multilocker + * flag and proceed with update and it is quite possible that after + * the update, undo worker rollbacks some of the previous locker + * which can overwrite the tuple (Note, till multilocker bit is set, + * the rollback actions won't overwrite the tuple). + * + * OTOH for non-conflicting lockers, as we don't clear the + * multi-locker flag, there is no urgency to perform undo actions + * for aborts of lockers. The work involved in finding and + * aborting lockers is non-trivial (w.r.t performance), so it is + * better to avoid it. + * + * After abort, if it is only a locker, then it will be completely + * gone; but if it is an update, then after applying pending + * actions, the tuple might get changed and we must allow to + * reverify the tuple in case it's values got changed. + */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(oldtup.t_data->t_infomask)) + ZHeapTupleGetTransInfo(&oldtup, buffer, NULL, NULL, &update_xact, + NULL, NULL, false); + else + update_xact = InvalidTransactionId; + + if (DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_lock_mode), + HWLOCKMODE_from_locktupmode(*lockmode))) + { + TransactionId current_tup_xid; + + /* + * There is a potential conflict. It is quite possible + * that by this time the locker has already been committed. + * So we need to check for conflict with all the possible + * lockers and wait for each of them after releasing a + * buffer lock and acquiring a lock on a tuple. + */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + mlmembers = ZGetMultiLockMembers(relation, &oldtup, buffer, + true); + + /* + * If there is no multi-lock members apart from the current transaction + * then no need for tuplock, just go ahead. + */ + if (mlmembers != NIL) + { + heap_acquire_tuplock(relation, &(oldtup.t_self), *lockmode, + LockWaitBlock, &have_tuple_lock); + ZMultiLockMembersWait(relation, mlmembers, &oldtup, buffer, + update_xact, *lockmode, false, + XLTW_Update, &remain, + &upd_xact_aborted); + } + checked_lockers = true; + locker_remains = remain != 0; + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * If the aborted xact is for update, then we need to reverify + * the tuple. + */ + if (upd_xact_aborted) + goto check_tup_satisfies_update; + + /* + * Also take care of cases when page is pruned after we + * release the buffer lock. For this we check if ItemId is not + * deleted and refresh the tuple offset position in page. If + * TID is already delete marked due to pruning, then get new + * ctid, so that we can update the new tuple. + * + * We also need to ensure that no new lockers have been added + * in the meantime, if there is any new locker, then start + * again. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *otid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + + if (ZHeapTupleHasMultiLockers(infomask)) + { + List *new_mlmembers; + new_mlmembers = ZGetMultiLockMembers(relation, &oldtup, + buffer, false); + + /* + * Ensure, no new lockers have been added, if so, then start + * again. 
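+ *
+ * We do that by comparing the member list taken before the wait with
+ * the one taken after reacquiring the buffer lock; any difference
+ * means a new locker sneaked in while the lock was released, and we
+ * restart from check_tup_satisfies_update.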
+ */ + if (!ZMultiLockMembersSame(mlmembers, new_mlmembers)) + { + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + goto check_tup_satisfies_update; + } + + any_multi_locker_member_alive = + ZIsAnyMultiLockMemberRunning(new_mlmembers, &oldtup, + buffer); + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + } + + /* + * xwait is done, but if xwait had just locked the tuple then some + * other xact could update this tuple before we get to this point. + * Check for xid change, and start over if so. + */ + ZHeapTupleGetTransInfo(&oldtup, buffer, NULL, NULL, ¤t_tup_xid, + NULL, NULL, false); + if (xid_infomask_changed(oldtup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + goto check_tup_satisfies_update; + } + else if (TransactionIdIsValid(update_xact)) + { + isAborted = TransactionIdDidAbort(update_xact); + + /* + * For aborted transaction, if the undo actions are not applied + * yet, then apply them before modifying the page. + */ + if (isAborted && + zheap_exec_pending_rollback(relation, buffer, + xwait_trans_slot, xwait)) + goto check_tup_satisfies_update; + } + + /* + * There was no UPDATE in the Multilockers. No + * TransactionIdIsInProgress() call needed here, since we called + * ZMultiLockMembersWait() above. + */ + if (!TransactionIdIsValid(update_xact)) + can_continue = true; + } + else if (TransactionIdIsCurrentTransactionId(xwait)) + { + /* + * The only locker is ourselves; we can avoid grabbing the tuple + * lock here, but must preserve our locking information. + */ + checked_lockers = true; + locker_remains = true; + can_continue = true; + } + else if (ZHEAP_XID_IS_KEYSHR_LOCKED(infomask) && key_intact) + { + /* + * If it's just a key-share locker, and we're not changing the key + * columns, we don't need to wait for it to end; but we need to + * preserve it as locker. + */ + checked_lockers = true; + locker_remains = true; + can_continue = true; + } + else + { + bool isCommitted; + bool has_update = false; + TransactionId current_tup_xid; + + /* + * Wait for regular transaction to end; but first, acquire tuple + * lock. + */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + heap_acquire_tuplock(relation, &(oldtup.t_self), *lockmode, + LockWaitBlock, &have_tuple_lock); + if (xwait_subxid != InvalidSubTransactionId) + SubXactLockTableWait(xwait, xwait_subxid, relation, + &oldtup.t_self, XLTW_Update); + else + XactLockTableWait(xwait, relation, &oldtup.t_self, + XLTW_Update); + checked_lockers = true; + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release the + * buffer lock. For this we check if ItemId is not deleted and refresh + * the tuple offset position in page. If TID is already delete marked + * due to pruning, then get new ctid, so that we can update the new + * tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *otid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + + /* + * xwait is done, but if xwait had just locked the tuple then some + * other xact could update/lock this tuple before we get to this + * point. Check for xid change, and start over if so. We need to + * do some special handling for lockers because their xid is never + * stored on the tuples. 
If there was a single locker on the + * tuple and that locker is gone and some new locker has locked + * the tuple, we won't be able to identify that by infomask/xid on + * the tuple, rather we need to fetch the locker xid. + */ + ZHeapTupleGetTransInfo(&oldtup, buffer, NULL, NULL, + ¤t_tup_xid, NULL, NULL, false); + if (xid_infomask_changed(oldtup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + { + if (ZHEAP_XID_IS_LOCKED_ONLY(oldtup.t_data->t_infomask) && + !ZHeapTupleHasMultiLockers(oldtup.t_data->t_infomask) && + TransactionIdIsValid(single_locker_xid)) + { + TransactionId current_single_locker_xid = InvalidTransactionId; + + (void) GetLockerTransInfo(relation, &oldtup, buffer, NULL, + NULL, ¤t_single_locker_xid, + NULL, NULL); + if (!TransactionIdEquals(single_locker_xid, + current_single_locker_xid)) + goto check_tup_satisfies_update; + + } + else + goto check_tup_satisfies_update; + } + + if (!ZHEAP_XID_IS_LOCKED_ONLY(oldtup.t_data->t_infomask)) + has_update = true; + + /* + * We may overwrite if previous xid is aborted, or if it is committed + * but only locked the tuple without updating it. + */ + isCommitted = TransactionIdDidCommit(xwait); + + /* + * For aborted transaction, if the undo actions are not applied + * yet, then apply them before modifying the page. + */ + if (!isCommitted) + zheap_exec_pending_rollback(relation, buffer, + xwait_trans_slot, xwait); + + /* + * For aborted updates, we must allow to reverify the tuple in + * case it's values got changed. + */ + if (!isCommitted && has_update) + goto check_tup_satisfies_update; + + if (!has_update) + can_continue = true; + } + + /* + * We may overwrite if previous xid is aborted or committed, but only + * locked the tuple without updating it. + */ + if (result != HeapTupleMayBeUpdated) + result = can_continue ? HeapTupleMayBeUpdated : HeapTupleUpdated; + } + else if (result == HeapTupleMayBeUpdated) + { + /* + * There is no active locker on the tuple, so we avoid grabbing + * the lock on new tuple. + */ + checked_lockers = true; + locker_remains = false; + } + else if (result == HeapTupleUpdated && + ZHeapTupleHasMultiLockers(oldtup.t_data->t_infomask)) + { + /* + * If a tuple is updated and is visible to our snapshot, we allow to update + * it; Else, we return HeapTupleUpdated and visit EvalPlanQual path to + * check whether the quals still match. In that path, we also lock the + * tuple so that nobody can update it before us. + * + * In ZHeapTupleSatisfiesUpdate, it's not possible to know if current + * transaction has already locked the tuple for update because of + * multilocker flag. In that case, we've to check whether the current + * transaction has already locked the tuple for update. + */ + + /* + * Get the transaction slot and undo record pointer if we are already in a + * transaction. + */ + trans_slot_id = PageGetTransactionSlotId(relation, buffer, epoch, xid, + &prev_urecptr, false, false, + NULL); + + if (trans_slot_id != InvalidXactSlotId) + { + List *mlmembers; + ListCell *lc; + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. 
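+ *
+ * For example, if a subtransaction of this transaction already holds
+ * LockTupleExclusive and we now need LockTupleNoKeyExclusive, the
+ * check below (mlmember->mode >= *lockmode) succeeds and we proceed
+ * without touching the heavyweight tuple lock.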
+ */ + mlmembers = ZGetMultiLockMembersForCurrentXact(&oldtup, + trans_slot_id, prev_urecptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. + */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= *lockmode) + { + result = HeapTupleMayBeUpdated; + + /* + * There is no other active locker on the tuple except + * current transaction id, so we can update the tuple. + */ + checked_lockers = true; + locker_remains = false; + break; + } + } + + list_free_deep(mlmembers); + } + + } + + if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated) + { + /* Perform additional check for transaction-snapshot mode RI updates */ + if (!ZHeapTupleSatisfies(&oldtup, crosscheck, buffer, NULL)) + result = HeapTupleUpdated; + } + +zheap_tuple_updated: + if (result != HeapTupleMayBeUpdated) + { + Assert(result == HeapTupleSelfUpdated || + result == HeapTupleUpdated || + result == HeapTupleBeingUpdated); + Assert(ItemIdIsDeleted(lp) || + IsZHeapTupleModified(oldtup.t_data->t_infomask)); + + /* If item id is deleted, tuple can't be marked as moved. */ + if (!ItemIdIsDeleted(lp) && + ZHeapTupleIsMoved(oldtup.t_data->t_infomask)) + ItemPointerSetMovedPartitions(&hufd->ctid); + else + hufd->ctid = ctid; + hufd->xmax = tup_xid; + if (result == HeapTupleSelfUpdated) + hufd->cmax = tup_cid; + else + hufd->cmax = InvalidCommandId; + UnlockReleaseBuffer(buffer); + hufd->in_place_updated_or_locked = in_place_updated_or_locked; + if (have_tuple_lock) + UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode); + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + bms_free(inplace_upd_attrs); + bms_free(key_attrs); + return result; + } + + /* Acquire subtransaction lock, if current transaction is a subtransaction. */ + if (IsSubTransaction()) + { + SubXactLockTableInsert(GetCurrentSubTransactionId()); + hasSubXactLock = true; + } + + /* + * If it is a non inplace update then check we have sufficient free space + * to insert in same page. If not try defragmentation and recheck the + * freespace again. + */ + if (!use_inplace_update && !is_index_updated && !need_toast) + { + bool pruned; + + /* Here, we pass delta space required to accomodate the new tuple. */ + pruned = zheap_page_prune_opt(relation, buffer, old_offnum, + (newtupsize - oldtupsize)); + + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + + /* + * Check if the non-inplace update is due to non-index update and we + * are able to perform pruning, then we must be able to perform + * inplace update. + */ + if (pruned) + use_inplace_update = true; + } + + max_offset = PageGetMaxOffsetNumber(BufferGetPage(buffer)); + pagefree = PageGetZHeapFreeSpace(page); + + /* + * Incase of the non in-place update we also need to + * reserve a map for the new tuple. + */ + if (!use_inplace_update) + max_offset += 1; + + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. 
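+ *
+ * If no slot is free right now, the code below releases the buffer
+ * lock, sleeps for 10 ms and revalidates the tuple from
+ * check_tup_satisfies_update before trying again.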
+ */ + trans_slot_id = PageReserveTransactionSlot(relation, buffer, max_offset, + epoch, xid, &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + goto check_tup_satisfies_update; + + if (trans_slot_id == InvalidXactSlotId) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release the + * buffer lock. For this we check if ItemId is not deleted and refresh + * the tuple offset position in page. If TID is already delete marked + * due to pruning, then get new ctid, so that we can update the new + * tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *otid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto zheap_tuple_updated; + } + + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + + goto check_tup_satisfies_update; + } + + /* transaction slot must be reserved before adding tuple to page */ + Assert(trans_slot_id != InvalidXactSlotId); + + /* + * It's possible that tuple slot is now marked as frozen. Hence, we refetch + * the tuple here. + */ + Assert(!ItemIdIsDeleted(lp)); + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + + /* + * If the slot is marked as frozen, the latest modifier of the tuple must be + * frozen. + */ + if (ZHeapTupleHeaderGetXactSlot((ZHeapTupleHeader) (oldtup.t_data)) == ZHTUP_SLOT_FROZEN) + { + tup_trans_slot_id = ZHTUP_SLOT_FROZEN; + tup_xid = InvalidTransactionId; + } + + /* + * Save the xid that has updated the tuple to compute infomask for + * tuple. + */ + save_tup_xid = tup_xid; + + /* + * If the last transaction that has updated the tuple is already too + * old, then consider it as frozen which means it is all-visible. This + * ensures that we don't need to store epoch in the undo record to check + * if the undo tuple belongs to previous epoch and hence all-visible. See + * comments atop of file ztqual.c. + */ + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + if (TransactionIdPrecedes(tup_xid, oldestXidHavingUndo)) + { + tup_xid = FrozenTransactionId; + } + + /* + * updated tuple doesn't fit on current page or the toaster needs + * to be activated + */ + if ((!use_inplace_update && newtupsize > pagefree) || need_toast) + { + uint16 lock_old_infomask; + BlockNumber oldblk, newblk; + int slot_id; + + /* + * To prevent concurrent sessions from updating the tuple, we have to + * temporarily mark it locked, while we release the lock. + */ + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = relation->rd_id; + undorecord.uur_prevxid = tup_xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = cid; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = ItemPointerGetBlockNumber(&(oldtup.t_self)); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&(oldtup.t_self)); + + initStringInfo(&undorecord.uur_tuple); + initStringInfo(&undorecord.uur_payload); + + /* + * Here, we are storing old tuple header which is required to + * reconstruct the old copy of tuple. 
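+ *
+ * Note that only SizeofZHeapTupleHeader bytes go into this record: a
+ * pure lock never changes the tuple data, so only the header needs to
+ * be preserved (contrast with the full-tuple copy made for the update
+ * undo record further below).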
+ */ + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) oldtup.t_data, + SizeofZHeapTupleHeader); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) (lockmode), + sizeof(LockTupleMode)); + /* + * Store the transaction slot number for undo tuple in undo record, if + * the slot belongs to TPD entry. We can always get the current tuple's + * transaction slot number by referring offset->slot map in TPD entry, + * however that won't be true for tuple in undo. + */ + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + } + + /* + * Store subtransaction id in undo record. See SubXactLockTableWait + * to know why we need to store subtransaction id in undo. + */ + if (hasSubXactLock) + { + SubTransactionId subxid = GetCurrentSubTransactionId(); + + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SUBXACT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &subxid, + sizeof(subxid)); + } + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + &undometa); + + temp_infomask = oldtup.t_data->t_infomask; + + /* Compute the new xid and infomask to store into the tuple. */ + compute_new_xid_infomask(&oldtup, buffer, save_tup_xid, + tup_trans_slot_id, temp_infomask, + xid, trans_slot_id, single_locker_xid, + *lockmode, LockForUpdate, &lock_old_infomask, + &result_trans_slot_id); + + if (ZHeapTupleHasMultiLockers(lock_old_infomask)) + undorecord.uur_type = UNDO_XID_MULTI_LOCK_ONLY; + else + undorecord.uur_type = UNDO_XID_LOCK_FOR_UPDATE; + + START_CRIT_SECTION(); + + /* + * If all the members were lockers and are all gone, we can do away + * with the MULTI_LOCKERS bit. + */ + + if (ZHeapTupleHasMultiLockers(oldtup.t_data->t_infomask) && + !any_multi_locker_member_alive) + oldtup.t_data->t_infomask &= ~ZHEAP_MULTI_LOCKERS; + + InsertPreparedUndo(); + + /* + * We never set the locker slot on the tuple, so pass set_tpd_map_slot + * flag as false from the locker. From all other places it should + * always be passed as true so that the proper slot get set in the TPD + * offset map if its a TPD slot. + */ + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, + xid, urecptr, NULL, 0); + + ZHeapTupleHeaderSetXactSlot(oldtup.t_data, result_trans_slot_id); + + oldtup.t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + oldtup.t_data->t_infomask |= lock_old_infomask; + + /* Set prev_urecptr to the latest undo record in the slot. */ + prev_urecptr = urecptr; + + MarkBufferDirty(buffer); + + /* + * Do xlog stuff + */ + if (RelationNeedsWAL(relation)) + { + xl_zheap_lock xlrec; + xl_undo_header xlundohdr; + XLogRecPtr recptr; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + + /* + * Store the information required to generate undo record during + * replay. 
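+ *
+ * That is the reloid, the location of the undo record just prepared
+ * and the previous undo pointer of this block; redo uses these to
+ * regenerate an equivalent undo record.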
+ */ + xlundohdr.reloid = undorecord.uur_reloid; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = undorecord.uur_blkprev; + + xlrec.prev_xid = tup_xid; + xlrec.offnum = ItemPointerGetOffsetNumber(&(oldtup.t_self)); + xlrec.infomask = oldtup.t_data->t_infomask; + xlrec.trans_slot_id = result_trans_slot_id; + xlrec.flags = 0; + + if (result_trans_slot_id != trans_slot_id) + { + Assert(result_trans_slot_id == tup_trans_slot_id); + xlrec.flags |= XLZ_LOCK_TRANS_SLOT_FOR_UREC; + } + else if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + xlrec.flags |= XLZ_LOCK_CONTAINS_TPD_SLOT; + + if (hasSubXactLock) + xlrec.flags |= XLZ_LOCK_CONTAINS_SUBXACT; + if (undorecord.uur_type == UNDO_XID_LOCK_FOR_UPDATE) + xlrec.flags |= XLZ_LOCK_FOR_UPDATE; + +prepare_xlog: + /* LOG undolog meta if this is the first WAL after the checkpoint. */ + LogUndoMetaData(&undometa); + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + + XLogBeginInsert(); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + (void) RegisterTPDBuffer(page, 1); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xlrec, SizeOfZHeapLock); + + /* + * We always include old tuple header for undo in WAL record + * irrespective of full page image is taken or not. This is done + * since savings for not including a zheap tuple header are less + * compared to code complexity. However in future, if required we + * can do it similar to what we have done in zheap_update or + * zheap_delete. + */ + XLogRegisterData((char *) undorecord.uur_tuple.data, + SizeofZHeapTupleHeader); + XLogRegisterData((char *) (lockmode), sizeof(LockTupleMode)); + if (xlrec.flags & XLZ_LOCK_TRANS_SLOT_FOR_UREC) + XLogRegisterData((char *) &trans_slot_id, sizeof(trans_slot_id)); + else if (xlrec.flags & XLZ_LOCK_CONTAINS_TPD_SLOT) + XLogRegisterData((char *) &tup_trans_slot_id, sizeof(tup_trans_slot_id)); + + recptr = XLogInsertExtended(RM_ZHEAP_ID, XLOG_ZHEAP_LOCK, RedoRecPtr, + doPageWrites); + if (recptr == InvalidXLogRecPtr) + { + ResetRegisteredTPDBuffers(); + goto prepare_xlog; + } + + PageSetLSN(page, recptr); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + TPDPageSetLSN(page, recptr); + } + END_CRIT_SECTION(); + + pfree(undorecord.uur_tuple.data); + pfree(undorecord.uur_payload.data); + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + + /* + * Let the toaster do its thing, if needed. + * + * Note: below this point, zheaptup is the data we actually intend to + * store into the relation; newtup is the caller's original untoasted + * data. + */ + if (need_toast) + { + zheaptup = ztoast_insert_or_update(relation, newtup, &oldtup, 0); + newtupsize = SHORTALIGN(zheaptup->t_len); /* short aligned */ + } + else + zheaptup = newtup; +reacquire_buffer: + /* + * Get a new page for inserting tuple. We will need to acquire buffer + * locks on both old and new pages. See heap_update. + */ + if (BufferIsValid(vmbuffer_new)) + { + ReleaseBuffer(vmbuffer_new); + vmbuffer_new = InvalidBuffer; + } + + if (newtupsize > pagefree) + { + newbuf = RelationGetBufferForZTuple(relation, zheaptup->t_len, + buffer, 0, NULL, + &vmbuffer_new, &vmbuffer); + } + else + { + /* Re-acquire the lock on the old tuple's page. */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + /* Re-check using the up-to-date free space */ + pagefree = PageGetZHeapFreeSpace(page); + if (newtupsize > pagefree) + { + /* + * Rats, it doesn't fit anymore. 
We must now unlock and + * relock to avoid deadlock. Fortunately, this path should + * seldom be taken. + */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + newbuf = RelationGetBufferForZTuple(relation, zheaptup->t_len, + buffer, 0, NULL, + &vmbuffer_new, &vmbuffer); + } + else + { + /* OK, it fits here, so we're done. */ + newbuf = buffer; + } + } + + max_offset = PageGetMaxOffsetNumber(BufferGetPage(newbuf)); + oldblk = BufferGetBlockNumber(buffer); + newblk = BufferGetBlockNumber(newbuf); + + /* + * If we have got the new block than reserve the slot in same order in + * which buffers are locked (ascending). + */ + if (oldblk == newblk) + { + new_trans_slot_id = PageReserveTransactionSlot(relation, + newbuf, + max_offset + 1, + epoch, + xid, + &new_prev_urecptr, + &lock_reacquired); + /* + * We should get the same slot what we reserved previously because + * our transaction information should already be there. But, there + * is possibility that our slot might have moved to the TPD in such + * case we should get previous slot_no + 1. + */ + Assert((new_trans_slot_id == trans_slot_id) || + (ZHeapPageHasTPDSlot((PageHeader)page) && + new_trans_slot_id == trans_slot_id + 1)); + + trans_slot_id = new_trans_slot_id; + } + else if (oldblk < newblk) + { + slot_id = PageReserveTransactionSlot(relation, + buffer, + old_offnum, + epoch, + xid, + &prev_urecptr, + &lock_reacquired); + Assert((slot_id == trans_slot_id) || + (ZHeapPageHasTPDSlot((PageHeader)page) && + slot_id == trans_slot_id + 1)); + + trans_slot_id = slot_id; + + /* reserve the transaction slot on a new page */ + new_trans_slot_id = PageReserveTransactionSlot(relation, + newbuf, + max_offset + 1, + epoch, + xid, + &new_prev_urecptr, + &lock_reacquired); + } + else + { + /* reserve the transaction slot on a new page */ + new_trans_slot_id = PageReserveTransactionSlot(relation, + newbuf, + max_offset + 1, + epoch, + xid, + &new_prev_urecptr, + &lock_reacquired); + + /* reserve the transaction slot on a old page */ + slot_id = PageReserveTransactionSlot(relation, + buffer, + old_offnum, + epoch, + xid, + &prev_urecptr, + &lock_reacquired); + Assert((slot_id == trans_slot_id) || + (ZHeapPageHasTPDSlot((PageHeader)page) && + slot_id == trans_slot_id + 1)); + trans_slot_id = slot_id; + } + + if (lock_reacquired) + goto reacquire_buffer; + + if (new_trans_slot_id == InvalidXactSlotId) + { + /* release the new buffer and lock on old buffer */ + UnlockReleaseBuffer(newbuf); + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + UnlockReleaseTPDBuffers(); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + goto reacquire_buffer; + } + + /* + * After we release the lock on page, it could be pruned. As we have + * lock on the tuple, it couldn't be removed underneath us, but its + * position could be changes, so need to refresh the tuple position. + * + * XXX Though the length of the tuple wouldn't have changed, but there + * is no harm in refrehsing it for the sake of consistency of code. + */ + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + oldtup.t_len = ItemIdGetLength(lp); + tup_trans_slot_id = trans_slot_id; + tup_xid = xid; + } + else + { + /* No TOAST work needed, and it'll fit on same page */ + newbuf = buffer; + new_trans_slot_id = trans_slot_id; + zheaptup = newtup; + } + + CheckForSerializableConflictIn(relation, &(oldtup.t_self), buffer); + + /* + * Prepare an undo record for old tuple. 
We need to separately store the + * latest transaction id that has changed the tuple to ensure that we + * don't try to process the tuple in undo chain that is already discarded. + * See GetTupleFromUndo. + */ + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = relation->rd_id; + undorecord.uur_prevxid = tup_xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = cid; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = ItemPointerGetBlockNumber(&(oldtup.t_self)); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&(oldtup.t_self)); + undorecord.uur_payload.len = 0; + + initStringInfo(&undorecord.uur_tuple); + + /* + * Copy the entire old tuple including it's header in the undo record. + * We need this to reconstruct the old tuple if current tuple is not + * visible to some other transaction. We choose to write the complete + * tuple in undo record for update operation so that we can reuse the + * space of old tuples for non-inplace-updates after the transaction + * performing the operation commits. + */ + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &oldtup.t_len, + sizeof(uint32)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &oldtup.t_self, + sizeof(ItemPointerData)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &oldtup.t_tableOid, + sizeof(Oid)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) oldtup.t_data, + oldtup.t_len); + + if (use_inplace_update) + { + bool hasPayload = false; + + undorecord.uur_type = UNDO_INPLACE_UPDATE; + + /* + * Store the transaction slot number for undo tuple in undo record, if + * the slot belongs to TPD entry. We can always get the current tuple's + * transaction slot number by referring offset->slot map in TPD entry, + * however that won't be true for tuple in undo. + */ + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + hasPayload = true; + } + + /* + * Store subtransaction id in undo record. See SubXactLockTableWait + * to know why we need to store subtransaction id in undo. + */ + if (hasSubXactLock) + { + SubTransactionId subxid = GetCurrentSubTransactionId(); + + if (!hasPayload) + { + initStringInfo(&undorecord.uur_payload); + hasPayload = true; + } + + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SUBXACT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &subxid, + sizeof(subxid)); + } + + if (!hasPayload) + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + &undometa); + } + else + { + Size payload_len; + UnpackedUndoRecord undorec[2]; + + undorecord.uur_type = UNDO_UPDATE; + + /* + * we need to initialize the length of payload before actually knowing + * the value to ensure that the required space is reserved in undo. + */ + payload_len = sizeof(ItemPointerData); + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + payload_len += sizeof(tup_trans_slot_id); + } + + /* + * Store subtransaction id in undo record. See SubXactLockTableWait + * to know why we need to store subtransaction id in undo. 
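+ *
+ * In short, a waiter that finds a subtransaction id in the undo can
+ * wait on that subtransaction's lock (SubXactLockTableWait) instead
+ * of the whole top-level transaction, as is done earlier in this
+ * function when xwait_subxid is valid.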
+ */ + if (hasSubXactLock) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SUBXACT; + payload_len += sizeof(SubTransactionId); + } + + undorecord.uur_payload.len = payload_len; + + /* prepare an undo record for new tuple */ + new_undorecord.uur_type = UNDO_INSERT; + new_undorecord.uur_info = 0; + new_undorecord.uur_prevlen = 0; + new_undorecord.uur_reloid = relation->rd_id; + new_undorecord.uur_prevxid = xid; + new_undorecord.uur_xid = xid; + new_undorecord.uur_cid = cid; + new_undorecord.uur_fork = MAIN_FORKNUM; + new_undorecord.uur_block = BufferGetBlockNumber(newbuf); + new_undorecord.uur_payload.len = 0; + new_undorecord.uur_tuple.len = 0; + + if (new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + new_undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + initStringInfo(&new_undorecord.uur_payload); + appendBinaryStringInfo(&new_undorecord.uur_payload, + (char *) &new_trans_slot_id, + sizeof(new_trans_slot_id)); + } + else + new_undorecord.uur_payload.len = 0; + + undorec[0] = undorecord; + undorec[1] = new_undorecord; + UndoSetPrepareSize(undorec, 2, InvalidTransactionId, + UndoPersistenceForRelation(relation), &undometa); + + /* copy updated record (uur_info might got updated )*/ + undorecord = undorec[0]; + new_undorecord = undorec[1]; + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + NULL); + + initStringInfo(&undorecord.uur_payload); + + /* Make more room for tuple location if needed */ + enlargeStringInfo(&undorecord.uur_payload, payload_len); + + if (buffer == newbuf) + new_undorecord.uur_blkprev = urecptr; + else + new_undorecord.uur_blkprev = new_prev_urecptr; + + new_urecptr = PrepareUndoInsert(&new_undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(relation), + NULL); + + /* Check and lock the TPD page before starting critical section. */ + CheckAndLockTPDPage(relation, new_trans_slot_id, trans_slot_id, + newbuf, buffer); + + } + + /* + * We can't rely on any_multi_locker_member_alive to clear the multi locker + * bit, if the the lock on the buffer is released inbetween. + */ + temp_infomask = oldtup.t_data->t_infomask; + + /* Compute the new xid and infomask to store into the tuple. */ + compute_new_xid_infomask(&oldtup, buffer, save_tup_xid, tup_trans_slot_id, + temp_infomask, xid, trans_slot_id, + single_locker_xid, *lockmode, ForUpdate, + &old_infomask, &result_trans_slot_id); + + /* + * There must not be any stronger locker than the current operation, + * otherwise it would have waited for it to finish. + */ + Assert(result_trans_slot_id == trans_slot_id); + + /* + * Propagate the lockers information to the new tuple. Since we're doing + * an update, the only possibility is that the lockers had FOR KEY SHARE + * lock. For in-place updates, we are not creating any new version, so + * we don't need to propagate anything. + */ + if ((checked_lockers && !locker_remains) || use_inplace_update) + new_infomask = 0; + else + { + /* + * We should also set the multilocker flag if it was there previously, + * else, we set the tuple as locked-only. 
+ */ + new_infomask = ZHEAP_XID_KEYSHR_LOCK; + if (ZHeapTupleHasMultiLockers(old_infomask)) + new_infomask |= ZHEAP_MULTI_LOCKERS | ZHEAP_XID_LOCK_ONLY; + else + new_infomask |= ZHEAP_XID_LOCK_ONLY; + } + + if (use_inplace_update) + { + infomask_old_tuple = infomask_new_tuple = + old_infomask | new_infomask | ZHEAP_INPLACE_UPDATED; + } + else + { + infomask_old_tuple = old_infomask | ZHEAP_UPDATED; + infomask_new_tuple = new_infomask; + } + + /* We must have a valid buffer. */ + Assert(BufferIsValid(vmbuffer)); + vm_status = visibilitymap_get_status(relation, + BufferGetBlockNumber(buffer), &vmbuffer); + + /* + * If the page is new, then there will no valid vmbuffer_new and the + * visisbilitymap is reset already, hence, need not to clear anything. + */ + if (newbuf != buffer && BufferIsValid(vmbuffer_new)) + vm_status_new = visibilitymap_get_status(relation, + BufferGetBlockNumber(newbuf), &vmbuffer_new); + + START_CRIT_SECTION(); + + if (buffer == newbuf) + { + /* + * If all the members were lockers and are all gone, we can do away + * with the MULTI_LOCKERS bit. + */ + if (ZHeapTupleHasMultiLockers(oldtup.t_data->t_infomask) && + !any_multi_locker_member_alive) + oldtup.t_data->t_infomask &= ~ZHEAP_MULTI_LOCKERS; + } + + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) || + (vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + all_visible_cleared = true; + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), + vmbuffer, VISIBILITYMAP_VALID_BITS); + } + + if (newbuf != buffer) + { + if ((vm_status_new & VISIBILITYMAP_ALL_VISIBLE) || + (vm_status_new & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + new_all_visible_cleared = true; + visibilitymap_clear(relation, BufferGetBlockNumber(newbuf), + vmbuffer_new, VISIBILITYMAP_VALID_BITS); + } + } + + /* + * A page can be pruned for non-inplace updates or inplace updates that + * results in shorter tuples. If this transaction commits, the tuple will + * become DEAD sooner or later. If the transaction finally aborts, the + * subsequent page pruning will be a no-op and the hint will be cleared. + */ + if (!use_inplace_update || (zheaptup->t_len < oldtup.t_len)) + ZPageSetPrunable(page, xid); + + /* oldtup should be pointing to right place in page */ + Assert(oldtup.t_data == (ZHeapTupleHeader) PageGetItem(page, lp)); + + ZHeapTupleHeaderSetXactSlot(oldtup.t_data, result_trans_slot_id); + oldtup.t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + oldtup.t_data->t_infomask |= infomask_old_tuple; + + /* keep the new tuple copy updated for the caller */ + ZHeapTupleHeaderSetXactSlot(zheaptup->t_data, new_trans_slot_id); + zheaptup->t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + zheaptup->t_data->t_infomask |= infomask_new_tuple; + + if (use_inplace_update) + { + /* + * For inplace updates, we copy the entire data portion including null + * bitmap of new tuple. + * + * For the special case where we are doing inplace updates even when + * the new tuple is bigger, we need to adjust the old tuple's location + * so that new tuple can be copied at that location as it is. + */ + ItemIdChangeLen(lp, zheaptup->t_len); + memcpy((char *) oldtup.t_data + SizeofZHeapTupleHeader, + (char *) zheaptup->t_data + SizeofZHeapTupleHeader, + zheaptup->t_len - SizeofZHeapTupleHeader); + + /* + * Copy everything from new tuple in infomask apart from visibility + * flags. 
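+ *
+ * The visibility bits (ZHEAP_VIS_STATUS_MASK) were computed above as
+ * infomask_old_tuple and must be kept as-is; only the remaining,
+ * data-describing bits are taken from the new tuple here.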
+ */ + oldtup.t_data->t_infomask = oldtup.t_data->t_infomask & + ZHEAP_VIS_STATUS_MASK; + oldtup.t_data->t_infomask |= (zheaptup->t_data->t_infomask & + ~ZHEAP_VIS_STATUS_MASK); + /* Copy number of attributes in infomask2 of new tuple. */ + oldtup.t_data->t_infomask2 &= ~ZHEAP_NATTS_MASK; + oldtup.t_data->t_infomask2 |= + newtup->t_data->t_infomask2 & ZHEAP_NATTS_MASK; + /* also update the tuple length and self pointer */ + oldtup.t_len = zheaptup->t_len; + oldtup.t_data->t_hoff = zheaptup->t_data->t_hoff; + ItemPointerCopy(&oldtup.t_self, &zheaptup->t_self); + } + else + { + /* insert tuple at new location */ + RelationPutZHeapTuple(relation, newbuf, zheaptup); + + /* update new tuple location in undo record */ + appendBinaryStringInfoNoExtend(&undorecord.uur_payload, + (char *) &zheaptup->t_self, + sizeof(ItemPointerData)); + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + appendBinaryStringInfoNoExtend(&undorecord.uur_payload, + (char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + if (hasSubXactLock) + { + SubTransactionId subxid = GetCurrentSubTransactionId(); + + appendBinaryStringInfoNoExtend(&undorecord.uur_payload, + (char *) &subxid, + sizeof(subxid)); + } + + new_undorecord.uur_offset = ItemPointerGetOffsetNumber(&(zheaptup->t_self)); + } + + InsertPreparedUndo(); + if (use_inplace_update) + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, + xid, urecptr, NULL, 0); + else + { + if (newbuf == buffer) + { + OffsetNumber usedoff[2]; + + usedoff[0] = undorecord.uur_offset; + usedoff[1] = new_undorecord.uur_offset; + + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, + xid, new_urecptr, usedoff, 2); + } + else + { + /* set transaction slot information for old page */ + PageSetUNDO(undorecord, buffer, trans_slot_id, true, epoch, + xid, urecptr, NULL, 0); + /* set transaction slot information for new page */ + PageSetUNDO(new_undorecord, + newbuf, + new_trans_slot_id, + true, + epoch, + xid, + new_urecptr, + NULL, + 0); + + MarkBufferDirty(newbuf); + } + } + + MarkBufferDirty(buffer); + + /* XLOG stuff */ + if (RelationNeedsWAL(relation)) + { + /* + * For logical decoding we need combocids to properly decode the + * catalog. + */ + if (RelationIsAccessibleInLogicalDecoding(relation)) + { + /* + * Fixme: This won't work as it needs to access cmin/cmax which + * we probably needs to retrieve from UNDO. + */ + /*log_heap_new_cid(relation, &oldtup); + log_heap_new_cid(relation, heaptup);*/ + } + + log_zheap_update(relation, undorecord, new_undorecord, + urecptr, new_urecptr, buffer, newbuf, + &oldtup, zheaptup, tup_trans_slot_id, + trans_slot_id, new_trans_slot_id, + use_inplace_update, all_visible_cleared, + new_all_visible_cleared, &undometa); + } + + END_CRIT_SECTION(); + + /* be tidy */ + pfree(undorecord.uur_tuple.data); + if (undorecord.uur_payload.len > 0) + pfree(undorecord.uur_payload.data); + + if (!use_inplace_update && new_undorecord.uur_payload.len > 0) + pfree(new_undorecord.uur_payload.data); + + if (newbuf != buffer) + LockBuffer(newbuf, BUFFER_LOCK_UNLOCK); + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + /* + * Fixme - need to support cache invalidation API's for zheaptuples. 
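+ *
+ * The heap equivalent, CacheInvalidateHeapTuple, is shown commented
+ * out just below.  Presumably this only starts to matter once system
+ * catalogs can live in zheap; as noted earlier in this function, they
+ * are always stored in heap for now.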
+ */ + /* CacheInvalidateHeapTuple(relation, &oldtup, heaptup); */ + + if (BufferIsValid(vmbuffer_new)) + ReleaseBuffer(vmbuffer_new); + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + if (newbuf != buffer) + ReleaseBuffer(newbuf); + ReleaseBuffer(buffer); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + + /* + * Release the lmgr tuple lock, if we had it. + */ + if (have_tuple_lock) + UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode); + + /* + * As of now, we only count non-inplace updates as that are required to + * decide whether to trigger autovacuum. + */ + if (!use_inplace_update) + pgstat_count_heap_update(relation, false); + else + pgstat_count_zheap_update(relation); + + /* + * If heaptup is a private copy, release it. Don't forget to copy t_self + * back to the caller's image, too. + */ + if (zheaptup != newtup) + { + newtup->t_self = zheaptup->t_self; + zheap_freetuple(zheaptup); + } + bms_free(inplace_upd_attrs); + bms_free(interesting_attrs); + bms_free(modified_attrs); + + bms_free(key_attrs); + return HeapTupleMayBeUpdated; +} + +/* + * log_zheap_update - Perform XLogInsert for a zheap-update operation. + * + * We need to store enough information in the WAL record so that undo records + * can be regenerated at the WAL replay time. + * + * Caller must already have modified the buffer(s) and marked them dirty. + */ +static void +log_zheap_update(Relation reln, UnpackedUndoRecord undorecord, + UnpackedUndoRecord newundorecord, UndoRecPtr urecptr, + UndoRecPtr newurecptr, Buffer oldbuf, Buffer newbuf, + ZHeapTuple oldtup, ZHeapTuple newtup, + int old_tup_trans_slot_id, int trans_slot_id, + int new_trans_slot_id, bool inplace_update, + bool all_visible_cleared, bool new_all_visible_cleared, + xl_undolog_meta *undometa) +{ + xl_undo_header xlundohdr, + xlnewundohdr; + xl_zheap_header xlundotuphdr, + xlhdr; + xl_zheap_update xlrec; + ZHeapTuple difftup; + ZHeapTupleHeader zhtuphdr; + uint16 prefix_suffix[2]; + uint16 prefixlen = 0, + suffixlen = 0; + XLogRecPtr recptr; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + char *oldp = NULL; + char *newp = NULL; + int oldlen, newlen; + uint32 totalundotuplen; + Size dataoff; + int bufflags = REGBUF_STANDARD; + uint8 info = XLOG_ZHEAP_UPDATE; + + totalundotuplen = *((uint32 *) &undorecord.uur_tuple.data[0]); + dataoff = sizeof(uint32) + sizeof(ItemPointerData) + sizeof(Oid); + zhtuphdr = (ZHeapTupleHeader) &undorecord.uur_tuple.data[dataoff]; + + if (inplace_update) + { + /* + * For inplace updates the old tuple is in undo record and the + * new tuple is replaced in page where old tuple was present. + */ + oldp = (char *) zhtuphdr + zhtuphdr->t_hoff; + oldlen = totalundotuplen - zhtuphdr->t_hoff; + newp = (char *) oldtup->t_data + oldtup->t_data->t_hoff; + newlen = oldtup->t_len - oldtup->t_data->t_hoff; + + difftup = oldtup; + } + else if (oldbuf == newbuf) + { + oldp = (char *) oldtup->t_data + oldtup->t_data->t_hoff; + oldlen = oldtup->t_len - oldtup->t_data->t_hoff; + newp = (char *) newtup->t_data + newtup->t_data->t_hoff; + newlen = newtup->t_len - newtup->t_data->t_hoff; + + difftup = newtup; + } + else + { + difftup = newtup; + } + + /* + * See log_heap_update to know under what some circumstances we can use + * prefix-suffix compression. 
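+ *
+ * The idea is that when the old and new versions live on the same
+ * page and no full-page image will be taken, only the bytes that
+ * actually differ need to be logged.  The loops below compute the
+ * common prefix and suffix and keep them only when they save at
+ * least 3 bytes, since storing each length costs 2 bytes.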
+ */ + if (oldbuf == newbuf && !XLogCheckBufferNeedsBackup(newbuf)) + { + Assert(oldp != NULL && newp != NULL); + + /* Check for common prefix between undo and old tuple */ + for (prefixlen = 0; prefixlen < Min(oldlen, newlen); prefixlen++) + { + if (oldp[prefixlen] != newp[prefixlen]) + break; + } + + /* + * Storing the length of the prefix takes 2 bytes, so we need to save + * at least 3 bytes or there's no point. + */ + if (prefixlen < 3) + prefixlen = 0; + + /* Same for suffix */ + for (suffixlen = 0; suffixlen < Min(oldlen, newlen) - prefixlen; suffixlen++) + { + if (oldp[oldlen - suffixlen - 1] != newp[newlen - suffixlen - 1]) + break; + } + if (suffixlen < 3) + suffixlen = 0; + } + + /* + * Store the information required to generate undo record during + * replay. + */ + xlundohdr.reloid = undorecord.uur_reloid; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = undorecord.uur_blkprev; + + xlrec.prevxid = undorecord.uur_prevxid; + xlrec.old_offnum = ItemPointerGetOffsetNumber(&oldtup->t_self); + xlrec.old_infomask = oldtup->t_data->t_infomask; + xlrec.old_trans_slot_id = trans_slot_id; + xlrec.new_offnum = ItemPointerGetOffsetNumber(&difftup->t_self); + xlrec.flags = 0; + if (all_visible_cleared) + xlrec.flags |= XLZ_UPDATE_OLD_ALL_VISIBLE_CLEARED; + if (new_all_visible_cleared) + xlrec.flags |= XLZ_UPDATE_NEW_ALL_VISIBLE_CLEARED; + if (prefixlen > 0) + xlrec.flags |= XLZ_UPDATE_PREFIX_FROM_OLD; + if (suffixlen > 0) + xlrec.flags |= XLZ_UPDATE_SUFFIX_FROM_OLD; + if (undorecord.uur_info & UREC_INFO_PAYLOAD_CONTAINS_SUBXACT) + xlrec.flags |= XLZ_UPDATE_CONTAINS_SUBXACT; + + if (!inplace_update) + { + Page page = BufferGetPage(newbuf); + + xlrec.flags |= XLZ_NON_INPLACE_UPDATE; + + xlnewundohdr.reloid = newundorecord.uur_reloid; + xlnewundohdr.urec_ptr = newurecptr; + xlnewundohdr.blkprev = newundorecord.uur_blkprev; + + Assert(newtup); + /* If new tuple is the single and first tuple on page... */ + if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber && + PageGetMaxOffsetNumber(page) == FirstOffsetNumber) + { + info |= XLOG_ZHEAP_INIT_PAGE; + bufflags |= REGBUF_WILL_INIT; + } + } + + /* + * If full_page_writes is enabled, and the buffer image is not + * included in the WAL then we can rely on the tuple in the page to + * regenerate the undo tuple during recovery. For detail comments related + * to handling of full_page_writes get changed at run time, refer comments + * in zheap_delete. + */ +prepare_xlog: + /* LOG undolog meta if this is the first WAL after the checkpoint. 
*/ + LogUndoMetaData(undometa); + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + if (!doPageWrites || XLogCheckBufferNeedsBackup(oldbuf)) + { + xlrec.flags |= XLZ_HAS_UPDATE_UNDOTUPLE; + + xlundotuphdr.t_infomask2 = zhtuphdr->t_infomask2; + xlundotuphdr.t_infomask = zhtuphdr->t_infomask; + xlundotuphdr.t_hoff = zhtuphdr->t_hoff; + } + + XLogBeginInsert(); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xlrec, SizeOfZHeapUpdate); + if (old_tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + xlrec.flags |= XLZ_UPDATE_OLD_CONTAINS_TPD_SLOT; + XLogRegisterData((char *) &old_tup_trans_slot_id, + sizeof(old_tup_trans_slot_id)); + } + if (!inplace_update) + { + XLogRegisterData((char *) &xlnewundohdr, SizeOfUndoHeader); + if (new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + xlrec.flags |= XLZ_UPDATE_NEW_CONTAINS_TPD_SLOT; + XLogRegisterData((char *) &new_trans_slot_id, + sizeof(new_trans_slot_id)); + } + } + if (xlrec.flags & XLZ_HAS_UPDATE_UNDOTUPLE) + { + XLogRegisterData((char *) &xlundotuphdr, SizeOfZHeapHeader); + /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */ + XLogRegisterData((char *) zhtuphdr + SizeofZHeapTupleHeader, + totalundotuplen - SizeofZHeapTupleHeader); + } + + XLogRegisterBuffer(0, newbuf, bufflags); + if (oldbuf != newbuf) + { + uint8 block_id; + + XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD); + block_id = 2; + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + block_id = RegisterTPDBuffer(BufferGetPage(oldbuf), block_id); + if (new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + RegisterTPDBuffer(BufferGetPage(newbuf), block_id); + } + else + { + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + /* + * Block id '1' is reserved for oldbuf if that is different from + * newbuf. + */ + RegisterTPDBuffer(BufferGetPage(oldbuf), 2); + } + } + + /* + * Prepare WAL data for the new tuple. + */ + if (prefixlen > 0 || suffixlen > 0) + { + if (prefixlen > 0 && suffixlen > 0) + { + prefix_suffix[0] = prefixlen; + prefix_suffix[1] = suffixlen; + XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2); + } + else if (prefixlen > 0) + { + XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16)); + } + else + { + XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16)); + } + } + + xlhdr.t_infomask2 = difftup->t_data->t_infomask2; + xlhdr.t_infomask = difftup->t_data->t_infomask; + xlhdr.t_hoff = difftup->t_data->t_hoff; + Assert(SizeofZHeapTupleHeader + prefixlen + suffixlen <= difftup->t_len); + + /* + * PG73FORMAT: write bitmap [+ padding] [+ oid] + data + * + * The 'data' doesn't include the common prefix or suffix. + */ + XLogRegisterBufData(0, (char *) &xlhdr, SizeOfZHeapHeader); + if (prefixlen == 0) + { + XLogRegisterBufData(0, + ((char *) difftup->t_data) + SizeofZHeapTupleHeader, + difftup->t_len - SizeofZHeapTupleHeader - suffixlen); + } + else + { + /* + * Have to write the null bitmap and data after the common prefix as + * two separate rdata entries. 
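+ *
+ * The bitmap part (everything between SizeofZHeapTupleHeader and
+ * t_hoff) is registered first, followed by the tuple data that comes
+ * after the common prefix, excluding the common suffix.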
+ */ + /* bitmap [+ padding] [+ oid] */ + if (difftup->t_data->t_hoff - SizeofZHeapTupleHeader > 0) + { + XLogRegisterBufData(0, + ((char *) difftup->t_data) + SizeofZHeapTupleHeader, + difftup->t_data->t_hoff - SizeofZHeapTupleHeader); + } + + /* data after common prefix */ + XLogRegisterBufData(0, + ((char *) difftup->t_data) + difftup->t_data->t_hoff + prefixlen, + difftup->t_len - difftup->t_data->t_hoff - prefixlen - suffixlen); + } + + /* filtering by origin on a row level is much more efficient */ + XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN); + + recptr = XLogInsertExtended(RM_ZHEAP_ID, info, RedoRecPtr, doPageWrites); + if (recptr == InvalidXLogRecPtr) + { + ResetRegisteredTPDBuffers(); + goto prepare_xlog; + } + + if (newbuf != oldbuf) + { + PageSetLSN(BufferGetPage(newbuf), recptr); + if (new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + TPDPageSetLSN(BufferGetPage(newbuf), recptr); + } + PageSetLSN(BufferGetPage(oldbuf), recptr); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + TPDPageSetLSN(BufferGetPage(oldbuf), recptr); +} + +/* + * zheap_lock_tuple - lock a tuple. + * + * The functionality is same as heap_lock_tuple except that here we always + * make a copy of the tuple before returning to the caller. We maintain + * the pin on buffer to keep the specs same as heap_lock_tuple. + * + * eval - indicates whether the tuple will be evaluated to see if it still + * matches the qualification. + * + * XXX - Here, we are purposefully not doing anything for visibility map + * as it is not clear whether we ever need all_frozen kind of concept for + * zheap. + */ +HTSU_Result +zheap_lock_tuple(Relation relation, ItemPointer tid, + CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy, + bool follow_updates, bool eval, Snapshot snapshot, + ZHeapTuple tuple, Buffer *buffer, HeapUpdateFailureData *hufd) +{ + HTSU_Result result; + ZHeapTupleData zhtup; + UndoRecPtr prev_urecptr; + ItemId lp; + Page page; + ItemPointerData ctid; + TransactionId xid, + tup_xid, + single_locker_xid; + SubTransactionId tup_subxid = InvalidSubTransactionId; + CommandId tup_cid; + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + uint32 epoch; + int tup_trans_slot_id, + trans_slot_id, + single_locker_trans_slot; + OffsetNumber offnum; + LockOper lockopr; + bool require_sleep; + bool have_tuple_lock = false; + bool in_place_updated_or_locked = false; + bool any_multi_locker_member_alive = false; + bool lock_reacquired; + bool rollback_and_relocked; + + xid = GetTopTransactionId(); + epoch = GetEpochForXid(xid); + lockopr = eval ? LockForUpdate : LockOnly; + + *buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid)); + + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(*buffer); + offnum = ItemPointerGetOffsetNumber(tid); + lp = PageGetItemId(page, offnum); + Assert(ItemIdIsNormal(lp) || ItemIdIsDeleted(lp)); + + /* + * If TID is already delete marked due to pruning, then get new ctid, so + * that we can lock the new tuple. We will get new ctid if the tuple + * was non-inplace-updated otherwise we will get same TID. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + zhtup.t_tableOid = RelationGetRelid(relation); + zhtup.t_self = *tid; + + /* + * Get the transaction slot and undo record pointer if we are already in a + * transaction. 
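+ *
+ * If this backend already has a slot on the page, urec_ptr should
+ * point at our latest undo record for it; that is what lets the
+ * multi-locker check below recognize locks we already hold.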
+ */ + trans_slot_id = PageGetTransactionSlotId(relation, *buffer, epoch, xid, + &urec_ptr, false, false, NULL); + + /* + * ctid needs to be fetched from undo chain. See zheap_update. + */ + ctid = *tid; + +check_tup_satisfies_update: + any_multi_locker_member_alive = true; + result = ZHeapTupleSatisfiesUpdate(relation, &zhtup, cid, *buffer, &ctid, + &tup_trans_slot_id, &tup_xid, &tup_subxid, + &tup_cid, &single_locker_xid, + &single_locker_trans_slot, false, eval, + snapshot, &in_place_updated_or_locked); + if (result == HeapTupleInvisible) + { + /* ZBORKED? this previously didn't set up a tuple */ + tuple->t_tableOid = RelationGetRelid(relation); + tuple->t_len = zhtup.t_len; + tuple->t_self = zhtup.t_self; + tuple->t_data = palloc0(tuple->t_len); + memcpy(tuple->t_data, zhtup.t_data, zhtup.t_len); + + /* Give caller an opportunity to throw a more specific error. */ + result = HeapTupleInvisible; + goto out_locked; + } + else if (result == HeapTupleBeingUpdated || + result == HeapTupleUpdated || + (result == HeapTupleMayBeUpdated && + ZHeapTupleHasMultiLockers(zhtup.t_data->t_infomask))) + { + TransactionId xwait; + SubTransactionId xwait_subxid; + int xwait_trans_slot; + uint16 infomask; + + xwait_subxid = tup_subxid; + + if (TransactionIdIsValid(single_locker_xid)) + { + xwait = single_locker_xid; + xwait_trans_slot = single_locker_trans_slot; + } + else + { + xwait = tup_xid; + xwait_trans_slot = tup_trans_slot_id; + } + + infomask = zhtup.t_data->t_infomask; + + /* + * make a copy of the tuple before releasing the lock as some other + * backend can perform in-place update this tuple once we release the + * lock. + */ + tuple->t_tableOid = RelationGetRelid(relation); + tuple->t_len = zhtup.t_len; + tuple->t_self = zhtup.t_self; + tuple->t_data = palloc0(tuple->t_len); + memcpy(tuple->t_data, zhtup.t_data, zhtup.t_len); + + LockBuffer(*buffer, BUFFER_LOCK_UNLOCK); + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. + */ + if (ZHeapTupleHasMultiLockers(infomask)) + { + List *mlmembers; + ListCell *lc; + + if (trans_slot_id != InvalidXactSlotId) + { + mlmembers = ZGetMultiLockMembersForCurrentXact(&zhtup, + trans_slot_id, urec_ptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. 
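+ *
+ * ZGetMultiLockMembersForCurrentXact presumably walks only the
+ * current transaction's undo chain starting at urec_ptr, so every
+ * member it returns belongs to us; the Assert below double-checks
+ * that.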
+ */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= mode) + { + list_free_deep(mlmembers); + result = HeapTupleMayBeUpdated; + goto out_unlocked; + } + } + + list_free_deep(mlmembers); + } + } + else if (TransactionIdIsCurrentTransactionId(xwait)) + { + switch (mode) + { + case LockTupleKeyShare: + Assert(ZHEAP_XID_IS_KEYSHR_LOCKED(infomask) || + ZHEAP_XID_IS_SHR_LOCKED(infomask) || + ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(infomask)); + { + result = HeapTupleMayBeUpdated; + goto out_unlocked; + } + break; + case LockTupleShare: + if (ZHEAP_XID_IS_SHR_LOCKED(infomask) || + ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(infomask)) + { + result = HeapTupleMayBeUpdated; + goto out_unlocked; + } + break; + case LockTupleNoKeyExclusive: + if (ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(infomask)) + { + result = HeapTupleMayBeUpdated; + goto out_unlocked; + } + break; + case LockTupleExclusive: + if (ZHEAP_XID_IS_EXCL_LOCKED(infomask)) + { + result = HeapTupleMayBeUpdated; + goto out_unlocked; + } + break; + } + } + + /* + * Initially assume that we will have to wait for the locking + * transaction(s) to finish. We check various cases below in which + * this can be turned off. + */ + require_sleep = true; + if (mode == LockTupleKeyShare) + { + if (!(ZHEAP_XID_IS_EXCL_LOCKED(infomask))) + { + bool updated; + + updated = !ZHEAP_XID_IS_LOCKED_ONLY(infomask); + + /* + * If there are updates, follow the update chain; bail out if + * that cannot be done. + */ + if (follow_updates && updated) + { + if (!ZHeapTupleIsMoved(zhtup.t_data->t_infomask) && + !ItemPointerEquals(&zhtup.t_self, &ctid)) + { + HTSU_Result res; + + res = zheap_lock_updated_tuple(relation, &zhtup, &ctid, + xid, mode, lockopr, cid, + &rollback_and_relocked); + + /* + * If the update was by some aborted transaction and its + * pending undo actions are applied now, then check the + * latest copy of the tuple. + */ + if (rollback_and_relocked) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto check_tup_satisfies_update; + } + else if (res != HeapTupleMayBeUpdated) + { + result = res; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + } + } + + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release + * the buffer lock. For this we check if ItemId is not deleted + * and refresh the tuple offset position in page. If TID is + * already delete marked due to pruning, then get new ctid, so + * that we can lock the new tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + /* + * Make sure it's still an appropriate lock, else start over. + * Also, if it wasn't updated before we released the lock, but + * is updated now, we start over too; the reason is that we + * now need to follow the update chain to lock the new + * versions. + */ + if (!(ZHEAP_XID_IS_LOCKED_ONLY(zhtup.t_data->t_infomask)) && + ((ZHEAP_XID_IS_EXCL_LOCKED(zhtup.t_data->t_infomask)) || + !updated)) + goto check_tup_satisfies_update; + + /* Skip sleeping */ + require_sleep = false; + + /* + * Note we allow Xid to change here; other updaters/lockers + * could have modified it before we grabbed the buffer lock. 
+ * However, this is not a problem, because with the recheck we + * just did we ensure that they still don't conflict with the + * lock we want. + */ + } + } + else if (mode == LockTupleShare) + { + /* + * If we're requesting Share, we can similarly avoid sleeping if + * there's no update and no exclusive lock present. + */ + if (ZHEAP_XID_IS_LOCKED_ONLY(infomask) && + !ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) && + !ZHEAP_XID_IS_EXCL_LOCKED(infomask)) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release + * the buffer lock. For this we check if ItemId is not deleted + * and refresh the tuple offset position in page. If TID is + * already delete marked due to pruning, then get new ctid, so + * that we can lock the new tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + /* + * Make sure it's still an appropriate lock, else start over. + * See above about allowing xid to change. + */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(zhtup.t_data->t_infomask) || + ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(zhtup.t_data->t_infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(zhtup.t_data->t_infomask)) + goto check_tup_satisfies_update; + + /* Skip sleeping */ + require_sleep = false; + } + } + else if (mode == LockTupleNoKeyExclusive) + { + LockTupleMode old_lock_mode; + TransactionId current_tup_xid; + bool buf_lock_reacquired = false; + + old_lock_mode = get_old_lock_mode(infomask); + + /* + * If we're requesting NoKeyExclusive, we might also be able to + * avoid sleeping; just ensure that there is no conflicting lock + * already acquired. + */ + if (ZHeapTupleHasMultiLockers(infomask)) + { + if (!DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_lock_mode), + HWLOCKMODE_from_locktupmode(mode))) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + buf_lock_reacquired = true; + } + } + else if (old_lock_mode == LockTupleKeyShare) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + buf_lock_reacquired = true; + } + + if (buf_lock_reacquired) + { + /* + * Also take care of cases when page is pruned after we release + * the buffer lock. For this we check if ItemId is not deleted + * and refresh the tuple offset position in page. If TID is + * already delete marked due to pruning, then get new ctid, so + * that we can lock the new tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + ZHeapTupleGetTransInfo(&zhtup, *buffer, NULL, NULL, ¤t_tup_xid, + NULL, NULL, false); + + if (xid_infomask_changed(zhtup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + goto check_tup_satisfies_update; + /* Skip sleeping */ + require_sleep = false; + } + } + + /* + * As a check independent from those above, we can also avoid sleeping + * if the current transaction is the sole locker of the tuple. Note + * that the strength of the lock already held is irrelevant; this is + * not about recording the lock (which will be done regardless of this + * optimization, below). Also, note that the cases where we hold a + * lock stronger than we are requesting are already handled above + * by not doing anything. 
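+ * So, if we are the sole locker, all that is left to do below is to
+ * re-acquire the buffer lock, verify that the tuple has not been pruned
+ * or re-stamped in the meantime, and then skip the sleep.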
+ */ + if (require_sleep && + !ZHeapTupleHasMultiLockers(infomask) && + TransactionIdIsCurrentTransactionId(xwait)) + { + TransactionId current_tup_xid; + + /* + * ... but if the xid changed in the meantime, start over + * + * Also take care of cases when page is pruned after we release + * the buffer lock. For this we check if ItemId is not deleted and + * refresh the tuple offset position in page. If TID is already + * delete marked due to pruning, then get new ctid, so that we can + * lock the new tuple. + */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + ZHeapTupleGetTransInfo(&zhtup, *buffer, NULL, NULL, ¤t_tup_xid, + NULL, NULL, false); + if (xid_infomask_changed(zhtup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + goto check_tup_satisfies_update; + require_sleep = false; + } + + if (require_sleep && result == HeapTupleUpdated) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + else if (require_sleep) + { + List *mlmembers = NIL; + bool upd_xact_aborted = false; + TransactionId current_tup_xid; + + /* + * Acquire tuple lock to establish our priority for the tuple, or + * die trying. LockTuple will release us when we are next-in-line + * for the tuple. We must do this even if we are share-locking. + * + * If we are forced to "start over" below, we keep the tuple lock; + * this arranges that we stay at the head of the line while + * rechecking tuple state. + */ + if (!heap_acquire_tuplock(relation, tid, mode, wait_policy, + &have_tuple_lock)) + { + /* + * This can only happen if wait_policy is Skip and the lock + * couldn't be obtained. + */ + result = HeapTupleWouldBlock; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + + if (ZHeapTupleHasMultiLockers(infomask)) + { + LockTupleMode old_lock_mode; + TransactionId update_xact; + + old_lock_mode = get_old_lock_mode(infomask); + + /* + * For aborted updates, we must allow to reverify the tuple in + * case it's values got changed. + */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(infomask)) + ZHeapTupleGetTransInfo(&zhtup, *buffer, NULL, NULL, &update_xact, + NULL, NULL, true); + else + update_xact = InvalidTransactionId; + + if (DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_lock_mode), + HWLOCKMODE_from_locktupmode(mode))) + { + /* + * There is a potential conflict. It is quite possible + * that by this time the locker has already been committed. + * So we need to check for conflict with all the possible + * lockers and wait for each of them. 
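+ * How we wait depends on the wait policy: block until each conflicting
+ * member finishes, return HeapTupleWouldBlock for LockWaitSkip, or
+ * raise an error for LockWaitError.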
+ */ + mlmembers = ZGetMultiLockMembers(relation, &zhtup, + *buffer, true); + + /* wait for multixact to end, or die trying */ + switch (wait_policy) + { + case LockWaitBlock: + ZMultiLockMembersWait(relation, mlmembers, &zhtup, + *buffer, update_xact, mode, + false, XLTW_Lock, NULL, + &upd_xact_aborted); + break; + case LockWaitSkip: + if (!ConditionalZMultiLockMembersWait(relation, + mlmembers, + *buffer, + update_xact, + mode, + NULL, + &upd_xact_aborted)) + { + result = HeapTupleWouldBlock; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + break; + case LockWaitError: + if (!ConditionalZMultiLockMembersWait(relation, + mlmembers, + *buffer, + update_xact, + mode, + NULL, + &upd_xact_aborted)) + ereport(ERROR, + (errcode(ERRCODE_LOCK_NOT_AVAILABLE), + errmsg("could not obtain lock on row in relation \"%s\"", + RelationGetRelationName(relation)))); + + break; + } + } + } + else + { + /* wait for regular transaction to end, or die trying */ + switch (wait_policy) + { + case LockWaitBlock: + { + if (xwait_subxid != InvalidSubTransactionId) + SubXactLockTableWait(xwait, xwait_subxid, relation, + &zhtup.t_self, XLTW_Lock); + else + XactLockTableWait(xwait, relation, &zhtup.t_self, + XLTW_Lock); + } + break; + case LockWaitSkip: + if (xwait_subxid != InvalidSubTransactionId) + { + if (!ConditionalSubXactLockTableWait(xwait, xwait_subxid)) + { + result = HeapTupleWouldBlock; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + } + else if (!ConditionalXactLockTableWait(xwait)) + { + result = HeapTupleWouldBlock; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + break; + case LockWaitError: + if (xwait_subxid != InvalidSubTransactionId) + { + if (!ConditionalSubXactLockTableWait(xwait, xwait_subxid)) + ereport(ERROR, + (errcode(ERRCODE_LOCK_NOT_AVAILABLE), + errmsg("could not obtain lock on row in relation \"%s\"", + RelationGetRelationName(relation)))); + } + else if (!ConditionalXactLockTableWait(xwait)) + ereport(ERROR, + (errcode(ERRCODE_LOCK_NOT_AVAILABLE), + errmsg("could not obtain lock on row in relation \"%s\"", + RelationGetRelationName(relation)))); + break; + } + } + + /* if there are updates, follow the update chain */ + if (follow_updates && !ZHEAP_XID_IS_LOCKED_ONLY(infomask)) + { + HTSU_Result res; + + if (!ZHeapTupleIsMoved(zhtup.t_data->t_infomask) && + !ItemPointerEquals(&zhtup.t_self, &ctid)) + { + res = zheap_lock_updated_tuple(relation, &zhtup, &ctid, + xid, mode, lockopr, cid, + &rollback_and_relocked); + + /* + * If the update was by some aborted transaction and its + * pending undo actions are applied now, then check the + * latest copy of the tuple. + */ + if (rollback_and_relocked) + { + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto check_tup_satisfies_update; + } + else if (res != HeapTupleMayBeUpdated) + { + result = res; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + } + } + + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release + * the buffer lock. For this we check if ItemId is not deleted and + * refresh the tuple offset position in page. If TID is already + * delete marked due to pruning, then get new ctid, so that we can + * lock the new tuple. 
+ */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + if (ZHeapTupleHasMultiLockers(infomask)) + { + List *new_mlmembers; + + /* + * If the aborted xact is for update, then we need to reverify + * the tuple. + */ + if (upd_xact_aborted) + goto check_tup_satisfies_update; + + new_mlmembers = ZGetMultiLockMembers(relation, &zhtup, + *buffer, false); + + /* + * Ensure, no new lockers have been added, if so, then start + * again. + */ + if (!ZMultiLockMembersSame(mlmembers, new_mlmembers)) + { + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + goto check_tup_satisfies_update; + } + + any_multi_locker_member_alive = + ZIsAnyMultiLockMemberRunning(new_mlmembers, &zhtup, + *buffer); + list_free_deep(mlmembers); + list_free_deep(new_mlmembers); + } + + /* + * xwait is done, but if xwait had just locked the tuple then some + * other xact could update/lock this tuple before we get to this + * point. Check for xid change, and start over if so. We need to + * do some special handling for lockers because their xid is never + * stored on the tuples. If there was a single locker on the + * tuple and that locker is gone and some new locker has locked + * the tuple, we won't be able to identify that by infomask/xid on + * the tuple, rather we need to fetch the locker xid. + */ + ZHeapTupleGetTransInfo(&zhtup, *buffer, NULL, NULL, + ¤t_tup_xid, NULL, NULL, false); + if (xid_infomask_changed(zhtup.t_data->t_infomask, infomask) || + !TransactionIdEquals(current_tup_xid, xwait)) + { + if (ZHEAP_XID_IS_LOCKED_ONLY(zhtup.t_data->t_infomask) && + !ZHeapTupleHasMultiLockers(zhtup.t_data->t_infomask) && + TransactionIdIsValid(single_locker_xid)) + { + TransactionId current_single_locker_xid = InvalidTransactionId; + + (void) GetLockerTransInfo(relation, &zhtup, *buffer, NULL, + NULL, ¤t_single_locker_xid, + NULL, NULL); + if (!TransactionIdEquals(single_locker_xid, + current_single_locker_xid)) + goto check_tup_satisfies_update; + + } + else + goto check_tup_satisfies_update; + } + } + + if (TransactionIdIsValid(xwait) && TransactionIdDidAbort(xwait)) + { + /* + * For aborted transaction, if the undo actions are not applied + * yet, then apply them before modifying the page. + */ + if (!TransactionIdIsCurrentTransactionId(xwait)) + zheap_exec_pending_rollback(relation, *buffer, + xwait_trans_slot, xwait); + + /* + * For aborted updates, we must allow to reverify the tuple in + * case it's values got changed. + */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(zhtup.t_data->t_infomask)) + goto check_tup_satisfies_update; + } + + /* + * We may lock if previous xid committed or aborted but only locked + * the tuple without updating it; or if we didn't have to wait at all + * for whatever reason. + */ + if (!require_sleep || + ZHEAP_XID_IS_LOCKED_ONLY(zhtup.t_data->t_infomask) || + result == HeapTupleMayBeUpdated) + result = HeapTupleMayBeUpdated; + else + result = HeapTupleUpdated; + } + else if (result == HeapTupleMayBeUpdated) + { + TransactionId xwait; + uint16 infomask; + + if (TransactionIdIsValid(single_locker_xid)) + xwait = single_locker_xid; + else + xwait = tup_xid; + + infomask = zhtup.t_data->t_infomask; + + /* + * If any subtransaction of the current top transaction already holds + * a lock as strong as or stronger than what we're requesting, we + * effectively hold the desired lock already. 
We *must* succeed + * without trying to take the tuple lock, else we will deadlock + * against anyone wanting to acquire a stronger lock. + * + * Note that inplace-updates without key updates are considered + * equivalent to lock mode LockTupleNoKeyExclusive. + */ + if (ZHeapTupleHasMultiLockers(infomask)) + { + List *mlmembers; + ListCell *lc; + + if (trans_slot_id != InvalidXactSlotId) + { + mlmembers = ZGetMultiLockMembersForCurrentXact(&zhtup, + trans_slot_id, urec_ptr); + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + /* + * Only members of our own transaction must be present in + * the list. + */ + Assert(TransactionIdIsCurrentTransactionId(mlmember->xid)); + + if (mlmember->mode >= mode) + { + list_free_deep(mlmembers); + result = HeapTupleMayBeUpdated; + goto out_locked; + } + } + + list_free_deep(mlmembers); + } + } + else if (TransactionIdIsCurrentTransactionId(xwait)) + { + tuple->t_tableOid = RelationGetRelid(relation); + tuple->t_len = zhtup.t_len; + tuple->t_self = zhtup.t_self; + tuple->t_data = palloc0(tuple->t_len); + memcpy(tuple->t_data, zhtup.t_data, zhtup.t_len); + + switch (mode) + { + case LockTupleKeyShare: + if (ZHEAP_XID_IS_KEYSHR_LOCKED(infomask) || + ZHEAP_XID_IS_SHR_LOCKED(infomask) || + ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(infomask) || + ZHeapTupleIsInPlaceUpdated(infomask)) + { + goto out_locked; + } + break; + case LockTupleShare: + if (ZHEAP_XID_IS_SHR_LOCKED(infomask) || + ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHEAP_XID_IS_EXCL_LOCKED(infomask) || + ZHeapTupleIsInPlaceUpdated(infomask)) + { + goto out_locked; + } + break; + case LockTupleNoKeyExclusive: + if (ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(infomask) || + ZHeapTupleIsInPlaceUpdated(infomask)) + { + goto out_locked; + } + break; + case LockTupleExclusive: + if (ZHeapTupleIsInPlaceUpdated(infomask) && + ZHEAP_XID_IS_EXCL_LOCKED(infomask)) + { + goto out_locked; + } + break; + } + } + } + +failed: + if (result != HeapTupleMayBeUpdated) + { + Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated || + result == HeapTupleWouldBlock); + Assert(ItemIdIsDeleted(lp) || + IsZHeapTupleModified(zhtup.t_data->t_infomask)); + + /* If item id is deleted, tuple can't be marked as moved. */ + if (!ItemIdIsDeleted(lp) && + ZHeapTupleIsMoved(zhtup.t_data->t_infomask)) + ItemPointerSetMovedPartitions(&hufd->ctid); + else + hufd->ctid = ctid; + hufd->xmax = tup_xid; + if (result == HeapTupleSelfUpdated) + hufd->cmax = tup_cid; + else + hufd->cmax = InvalidCommandId; + hufd->in_place_updated_or_locked = in_place_updated_or_locked; + goto out_locked; + } + + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. + */ + trans_slot_id = PageReserveTransactionSlot(relation, *buffer, + PageGetMaxOffsetNumber(page), + epoch, xid, &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + goto check_tup_satisfies_update; + + if (trans_slot_id == InvalidXactSlotId) + { + LockBuffer(*buffer, BUFFER_LOCK_UNLOCK); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Also take care of cases when page is pruned after we release + * the buffer lock. 
For this we check if ItemId is not deleted and + * refresh the tuple offset position in page. If TID is already + * delete marked due to pruning, then get new ctid, so that we can + * lock the new tuple. + */ + if (ItemIdIsDeleted(lp)) + { + ctid = *tid; + ZHeapPageGetNewCtid(*buffer, &ctid, &tup_xid, &tup_cid); + result = HeapTupleUpdated; + goto failed; + } + + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + goto check_tup_satisfies_update; + } + + /* transaction slot must be reserved before locking a tuple */ + Assert(trans_slot_id != InvalidXactSlotId); + + /* + * It's possible that tuple slot is now marked as frozen. Hence, we refetch + * the tuple here. + */ + Assert(!ItemIdIsDeleted(lp)); + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + + /* + * If the slot is marked as frozen, the latest modifier of the tuple must be + * frozen. + */ + if (ZHeapTupleHeaderGetXactSlot((ZHeapTupleHeader) (zhtup.t_data)) == ZHTUP_SLOT_FROZEN) + { + tup_trans_slot_id = ZHTUP_SLOT_FROZEN; + tup_xid = InvalidTransactionId; + } + + /* + * If all the members were lockers and are all gone, we can do away + * with the MULTI_LOCKERS bit. + */ + zheap_lock_tuple_guts(relation, *buffer, &zhtup, tup_xid, xid, mode, + lockopr, epoch, tup_trans_slot_id, trans_slot_id, + single_locker_xid, single_locker_trans_slot, + prev_urecptr, cid, !any_multi_locker_member_alive); + + tuple->t_tableOid = RelationGetRelid(relation); + tuple->t_len = zhtup.t_len; + tuple->t_self = zhtup.t_self; + tuple->t_data = palloc0(tuple->t_len); + + memcpy(tuple->t_data, zhtup.t_data, zhtup.t_len); + + result = HeapTupleMayBeUpdated; + +out_locked: + LockBuffer(*buffer, BUFFER_LOCK_UNLOCK); +out_unlocked: + + /* + * Don't update the visibility map here. Locking a tuple doesn't change + * visibility info. + */ + + /* + * Now that we have successfully marked the tuple as locked, we can + * release the lmgr tuple lock, if we had it. + */ + if (have_tuple_lock) + UnlockTupleTuplock(relation, tid, mode); + + return result; +} + +/* + * test_lockmode_for_conflict - Helper function for zheap_lock_updated_tuple. + * + * Given a lockmode held by the transaction identified with the given xid, + * does the current transaction need to wait, fail, or can it continue if + * it wanted to acquire a lock of the given mode (required_mode)? "needwait" + * is set to true if waiting is necessary; if it can continue, then + * HeapTupleMayBeUpdated is returned. To notify the caller if some pending + * rollback is applied, rollback_and_relocked is set to true. + */ +static HTSU_Result +test_lockmode_for_conflict(Relation rel, Buffer buf, ZHeapTuple zhtup, + UndoRecPtr urec_ptr, LockTupleMode old_mode, + TransactionId xid, int trans_slot_id, + LockTupleMode required_mode, bool has_update, + SubTransactionId *subxid, bool *needwait, + bool *rollback_and_relocked) +{ + *needwait = false; + + /* + * Note: we *must* check TransactionIdIsInProgress before + * TransactionIdDidAbort/Commit; see comment at top of tqual.c for an + * explanation. + */ + if (TransactionIdIsCurrentTransactionId(xid)) + { + /* + * The tuple has already been locked by our own transaction. This is + * very rare but can happen if multiple transactions are trying to + * lock an ancient version of the same tuple. 
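+ * Returning HeapTupleSelfUpdated lets the caller skip this tuple
+ * version and continue with the next one in the update chain.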
+ */
+ return HeapTupleSelfUpdated;
+ }
+ else if (TransactionIdIsInProgress(xid))
+ {
+ /*
+ * If the locking transaction is running, what we do depends on
+ * whether the lock modes conflict: if they do, then we must wait for
+ * it to finish; otherwise we can fall through to lock this tuple
+ * version without waiting.
+ */
+ if (DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_mode),
+ HWLOCKMODE_from_locktupmode(required_mode)))
+ {
+ *needwait = true;
+ if (subxid)
+ ZHeapTupleGetSubXid(zhtup, buf, urec_ptr, subxid);
+ }
+
+ /*
+ * If we set needwait above, then this value doesn't matter;
+ * otherwise, this value signals to the caller that it's okay to
+ * proceed.
+ */
+ return HeapTupleMayBeUpdated;
+ }
+ else if (TransactionIdDidAbort(xid))
+ {
+ /*
+ * For an aborted transaction, if the undo actions are not applied
+ * yet, then apply them before modifying the page.
+ */
+ zheap_exec_pending_rollback(rel, buf, trans_slot_id, xid);
+
+ /*
+ * If it was only a locker, then the lock is completely gone now and
+ * we can return success; but if it was an update, then after applying
+ * the pending actions, the tuple might have changed and we must report
+ * that to the caller. This allows the caller to reverify the tuple in
+ * case its values got changed.
+ */
+
+ *rollback_and_relocked = true;
+
+ return HeapTupleMayBeUpdated;
+ }
+ else if (TransactionIdDidCommit(xid))
+ {
+ /*
+ * The other transaction committed. If it was only a locker, then the
+ * lock is completely gone now and we can return success; but if it
+ * was an update, then what we do depends on whether the two lock
+ * modes conflict. If they conflict, then we must report an error to
+ * the caller. But if they don't, we can fall through to allow the
+ * current transaction to lock the tuple.
+ *
+ * Note: the reason we worry about has_update here is because as soon
+ * as a transaction ends, all its locks are gone and meaningless, and
+ * thus we can ignore them; whereas its updates persist. In the
+ * TransactionIdIsInProgress case, above, we don't need to check
+ * because we know the lock is still "alive" and thus a conflict always
+ * needs to be checked.
+ */
+ if (!has_update)
+ return HeapTupleMayBeUpdated;
+
+ if (DoLockModesConflict(HWLOCKMODE_from_locktupmode(old_mode),
+ HWLOCKMODE_from_locktupmode(required_mode)))
+ /* bummer */
+ return HeapTupleUpdated;
+
+ return HeapTupleMayBeUpdated;
+ }
+
+ /* Not in progress, not aborted, not committed -- must have crashed */
+ return HeapTupleMayBeUpdated;
+}
+
+/*
+ * zheap_lock_updated_tuple - Lock all the versions of an updated tuple.
+ *
+ * Fetch the tuple pointed to by tid in rel, reserve a transaction slot on
+ * its page for the given xid and mark the tuple as locked by that xid with
+ * the given mode; if this tuple is updated, recurse to lock the new version
+ * as well. During chain traversal, we might find some intermediate version
+ * which is pruned (because a non-inplace update got committed and only the
+ * line pointer of that version remains), so we need to continue fetching
+ * the newer versions to lock them. The bool rollback_and_relocked is used
+ * to notify the caller that the update was performed by an aborted
+ * transaction and its pending undo actions have been applied here.
+ *
+ * Note that it is important to lock all the versions that are from
+ * non-committed transactions, but if the transaction that has created the
+ * new version is committed, we only care to lock its latest version.
+ * + */ +static HTSU_Result +zheap_lock_updated_tuple(Relation rel, ZHeapTuple tuple, ItemPointer ctid, + TransactionId xid, LockTupleMode mode, + LockOper lockopr, CommandId cid, + bool *rollback_and_relocked) +{ + HTSU_Result result; + ZHeapTuple mytup; + UndoRecPtr prev_urecptr; + Buffer buf; + Page page; + ItemPointerData tupid; + TransactionId tup_xid; + int tup_trans_slot; + TransactionId priorXmax = InvalidTransactionId; + uint32 epoch; + uint64 epoch_xid; + int trans_slot_id; + bool lock_reacquired; + OffsetNumber offnum; + + ItemPointerCopy(ctid, &tupid); + + if (rollback_and_relocked) + *rollback_and_relocked = false; + + for (;;) + { + ZHeapTupleData zhtup; + ItemId lp; + uint16 old_infomask; + UndoRecPtr urec_ptr; + + if (!zheap_fetch(rel, SnapshotAny, ctid, &mytup, &buf, false, NULL)) + { + /* + * if we fail to find the updated version of the tuple, it's + * because it was vacuumed/pruned/rolledback away after its creator + * transaction aborted. So behave as if we got to the end of the + * chain, and there's no further tuple to lock: return success to + * caller. + */ + if (mytup == NULL) + return HeapTupleMayBeUpdated; + + /* + * If we reached the end of the chain, we're done, so return + * success. See EvalPlanQualZFetch for detailed reason. + */ + if (TransactionIdIsValid(priorXmax) && + !ValidateTuplesXact(mytup, SnapshotAny, buf, priorXmax, true)) + return HeapTupleMayBeUpdated; + + /* deleted or moved to another partition, so forget about it */ + if (ZHeapTupleIsMoved(mytup->t_data->t_infomask) || + ItemPointerEquals(&(mytup->t_self), ctid)) + return HeapTupleMayBeUpdated; + + /* updated row should have xid matching this xmax */ + ZHeapTupleGetTransInfo(mytup, buf, NULL, NULL, &priorXmax, NULL, + NULL, true); + + /* continue to lock the next version of tuple */ + continue; + } + +lock_tuple: + urec_ptr = InvalidUndoRecPtr; + + CHECK_FOR_INTERRUPTS(); + + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + /* + * If we reached the end of the chain, we're done, so return + * success. See EvalPlanQualZFetch for detailed reason. + */ + if (TransactionIdIsValid(priorXmax) && + !ValidateTuplesXact(mytup, SnapshotAny, buf, priorXmax, false)) + { + UnlockReleaseBuffer(buf); + return HeapTupleMayBeUpdated; + } + + ZHeapTupleGetTransInfo(mytup, buf, &tup_trans_slot, &epoch_xid, + &tup_xid, NULL, &urec_ptr, false); + old_infomask = mytup->t_data->t_infomask; + + /* + * If this tuple was created by an aborted (sub)transaction, then we + * already locked the last live one in the chain, thus we're done, so + * return success. + */ + if (!IsZHeapTupleModified(old_infomask) && + TransactionIdDidAbort(tup_xid)) + { + result = HeapTupleMayBeUpdated; + goto out_locked; + } + + /* + * If this tuple version has been updated or locked by some concurrent + * transaction(s), what we do depends on whether our lock mode + * conflicts with what those other transactions hold, and also on the + * status of them. + */ + if (IsZHeapTupleModified(old_infomask)) + { + SubTransactionId subxid = InvalidSubTransactionId; + LockTupleMode old_lock_mode; + bool needwait; + bool has_update = false; + + if (ZHeapTupleHasMultiLockers(old_infomask)) + { + List *mlmembers; + ListCell *lc; + TransactionId update_xact = InvalidTransactionId; + + /* + * As we always maintain strongest lock mode on the tuple, it + * must be pointing to the transaction id of the updater. 
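+ * So, if the tuple is not marked as locked-only, tup_xid is the
+ * updater's xid; we remember it so that a conflict with that member is
+ * treated as an update conflict rather than a plain lock conflict.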
+ */ + if (!ZHEAP_XID_IS_LOCKED_ONLY(old_infomask)) + update_xact = tup_xid; + + mlmembers = ZGetMultiLockMembers(rel, mytup, buf, false); + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + if (TransactionIdIsValid(update_xact)) + { + has_update = (update_xact == mlmember->xid) ? + true : false; + } + + result = test_lockmode_for_conflict(rel, + buf, + NULL, + InvalidUndoRecPtr, + mlmember->mode, + mlmember->xid, + mlmember->trans_slot_id, + mode, has_update, + NULL, + &needwait, + rollback_and_relocked); + + /* + * If the update was by some aborted transaction with + * pending rollback, then it's undo actions are applied. + * Now, notify the caller to check for the latest + * copy of the tuple. + */ + if (*rollback_and_relocked) + { + list_free_deep(mlmembers); + goto out_locked; + } + + if (result == HeapTupleSelfUpdated) + { + list_free_deep(mlmembers); + goto next; + } + + if (needwait) + { + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + if (mlmember->subxid != InvalidSubTransactionId) + SubXactLockTableWait(mlmember->xid, mlmember->subxid, + rel, &mytup->t_self, + XLTW_LockUpdated); + else + XactLockTableWait(mlmember->xid, rel, + &mytup->t_self, + XLTW_LockUpdated); + + list_free_deep(mlmembers); + goto lock_tuple; + } + if (result != HeapTupleMayBeUpdated) + { + list_free_deep(mlmembers); + goto out_locked; + } + } + } + else + { + /* + * For a non-multi locker, we first need to compute the + * corresponding lock mode by using the infomask bits. + */ + if (ZHEAP_XID_IS_LOCKED_ONLY(old_infomask)) + { + /* + * We don't expect to lock updated version of a tuple if + * there is only a single locker on the tuple and previous + * modifier is all-visible. + */ + Assert(!(tup_trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo))); + + if (ZHEAP_XID_IS_KEYSHR_LOCKED(old_infomask)) + old_lock_mode = LockTupleKeyShare; + else if (ZHEAP_XID_IS_SHR_LOCKED(old_infomask)) + old_lock_mode = LockTupleShare; + else if (ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(old_infomask)) + old_lock_mode = LockTupleNoKeyExclusive; + else if (ZHEAP_XID_IS_EXCL_LOCKED(old_infomask)) + old_lock_mode = LockTupleExclusive; + else + { + /* LOCK_ONLY can't be present alone */ + pg_unreachable(); + } + } + else + { + has_update = true; + /* it's an update, but which kind? */ + if (old_infomask & ZHEAP_XID_EXCL_LOCK) + old_lock_mode = LockTupleExclusive; + else + old_lock_mode = LockTupleNoKeyExclusive; + } + + result = test_lockmode_for_conflict(rel, buf, mytup, urec_ptr, + old_lock_mode, tup_xid, + tup_trans_slot, mode, + has_update, &subxid, + &needwait, + rollback_and_relocked); + + /* + * If the update was by some aborted transaction with + * pending rollback, then it's undo actions are applied. + * Now, notify the caller to check for the latest + * copy of the tuple. + */ + if (*rollback_and_relocked) + goto out_locked; + + /* + * If the tuple was already locked by ourselves in a previous + * iteration of this (say zheap_lock_tuple was forced to + * restart the locking loop because of a change in xid), then + * we hold the lock already on this tuple version and we don't + * need to do anything; and this is not an error condition + * either. We just need to skip this tuple and continue + * locking the next version in the update chain. 
+ */ + if (result == HeapTupleSelfUpdated) + goto next; + + if (needwait) + { + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + if (subxid != InvalidSubTransactionId) + SubXactLockTableWait(tup_xid, subxid, rel, + &mytup->t_self, + XLTW_LockUpdated); + else + XactLockTableWait(tup_xid, rel, &mytup->t_self, + XLTW_LockUpdated); + goto lock_tuple; + } + if (result != HeapTupleMayBeUpdated) + { + goto out_locked; + } + } + } + + epoch = GetEpochForXid(xid); + offnum = ItemPointerGetOffsetNumber(&mytup->t_self); + + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. + */ + trans_slot_id = PageReserveTransactionSlot(rel, buf, offnum, epoch, xid, + &prev_urecptr, &lock_reacquired); + if (lock_reacquired) + goto lock_tuple; + + if (trans_slot_id == InvalidXactSlotId) + { + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + goto lock_tuple; + } + + /* transaction slot must be reserved before locking a tuple */ + Assert(trans_slot_id != InvalidXactSlotId); + + page = BufferGetPage(buf); + lp = PageGetItemId(page, offnum); + + Assert(ItemIdIsNormal(lp)); + + /* + * It's possible that tuple slot is now marked as frozen. Hence, we refetch + * the tuple here. + */ + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + zhtup.t_tableOid = mytup->t_tableOid; + zhtup.t_self = mytup->t_self; + + /* + * If the slot is marked as frozen, the latest modifier of the tuple must be + * frozen. + */ + if (ZHeapTupleHeaderGetXactSlot((ZHeapTupleHeader) (zhtup.t_data)) == ZHTUP_SLOT_FROZEN) + { + tup_trans_slot = ZHTUP_SLOT_FROZEN; + tup_xid = InvalidTransactionId; + } + + zheap_lock_tuple_guts(rel, buf, &zhtup, tup_xid, xid, mode, lockopr, + epoch, tup_trans_slot, trans_slot_id, + InvalidTransactionId, InvalidXactSlotId, + prev_urecptr, cid, false); + +next: + /* + * if we find the end of update chain, or if the transaction that has + * updated the tuple is aborter, we're done. + */ + if (TransactionIdDidAbort(tup_xid) || + ZHeapTupleIsMoved(mytup->t_data->t_infomask) || + ItemPointerEquals(&mytup->t_self, ctid) || + ZHEAP_XID_IS_LOCKED_ONLY(mytup->t_data->t_infomask)) + { + result = HeapTupleMayBeUpdated; + goto out_locked; + } + + /* + * Updated row should have xid matching this xmax. + * + * XXX Using tup_xid will work as this must be the xid of updater if + * any on the tuple; that is because we always maintain the strongest + * locker information on the tuple. + */ + priorXmax = tup_xid; + + /* + * As we still hold a snapshot to which priorXmax is not visible, neither + * the transaction slot on tuple can be marked as frozen nor the + * corresponding undo be discarded. + */ + Assert(TransactionIdIsValid(priorXmax)); + + /* be tidy */ + zheap_freetuple(mytup); + UnlockReleaseBuffer(buf); + } + + result = HeapTupleMayBeUpdated; + +out_locked: + UnlockReleaseBuffer(buf); + + return result; +} + +/* + * zheap_lock_tuple_guts - Helper function for locking the tuple. + * + * It locks the tuple in given mode, writes an undo and WAL for the + * operation. + * + * It is the responsibility of caller to lock and unlock the buffer ('buf'). 
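+ * The undo record is prepared and inserted here as well; the undo
+ * insertion, the page modification and the WAL logging all happen
+ * inside a single critical section.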
+ */ +static void +zheap_lock_tuple_guts(Relation rel, Buffer buf, ZHeapTuple zhtup, + TransactionId tup_xid, TransactionId xid, + LockTupleMode mode, LockOper lockopr, uint32 epoch, + int tup_trans_slot_id, int trans_slot_id, + TransactionId single_locker_xid, + int single_locker_trans_slot, UndoRecPtr prev_urecptr, + CommandId cid, bool clear_multi_locker) +{ + TransactionId oldestXidHavingUndo; + UndoRecPtr urecptr; + UnpackedUndoRecord undorecord; + int new_trans_slot_id; + uint16 old_infomask, temp_infomask; + uint16 new_infomask = 0; + Page page; + xl_undolog_meta undometa; + bool hasSubXactLock = false; + + page = BufferGetPage(buf); + + /* Compute the new xid and infomask to store into the tuple. */ + old_infomask = zhtup->t_data->t_infomask; + + temp_infomask = old_infomask; + if (ZHeapTupleHasMultiLockers(old_infomask) && clear_multi_locker) + old_infomask &= ~ZHEAP_MULTI_LOCKERS; + compute_new_xid_infomask(zhtup, buf, tup_xid, tup_trans_slot_id, + temp_infomask, xid, trans_slot_id, + single_locker_xid, mode, lockopr, + &new_infomask, &new_trans_slot_id); + + + /* Acquire subtransaction lock, if current transaction is a subtransaction. */ + if (IsSubTransaction()) + { + SubXactLockTableInsert(GetCurrentSubTransactionId()); + hasSubXactLock = true; + } + + /* + * If the last transaction that has updated the tuple is already too + * old, then consider it as frozen which means it is all-visible. This + * ensures that we don't need to store epoch in the undo record to check + * if the undo tuple belongs to previous epoch and hence all-visible. See + * comments atop of file ztqual.c. + */ + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + if (TransactionIdPrecedes(tup_xid, oldestXidHavingUndo)) + tup_xid = FrozenTransactionId; + + /* + * Prepare an undo record. We need to separately store the latest + * transaction id that has changed the tuple to ensure that we don't + * try to process the tuple in undo chain that is already discarded. + * See GetTupleFromUndo. + */ + if (ZHeapTupleHasMultiLockers(new_infomask)) + undorecord.uur_type = UNDO_XID_MULTI_LOCK_ONLY; + else if (lockopr == LockForUpdate) + undorecord.uur_type = UNDO_XID_LOCK_FOR_UPDATE; + else + undorecord.uur_type = UNDO_XID_LOCK_ONLY; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = rel->rd_id; + undorecord.uur_prevxid = tup_xid; + undorecord.uur_xid = xid; + /* + * While locking the tuple, we set the command id as FirstCommandId since + * it doesn't modify the tuple, just updates the infomask. + */ + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = ItemPointerGetBlockNumber(&(zhtup->t_self)); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&(zhtup->t_self)); + + initStringInfo(&undorecord.uur_tuple); + initStringInfo(&undorecord.uur_payload); + + /* + * Here, we are storing zheap tuple header which is required to + * reconstruct the old copy of tuple. + */ + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) zhtup->t_data, + SizeofZHeapTupleHeader); + + /* + * We keep the lock mode in undo record as for multi lockers we can't have + * that information in tuple header. We need lock mode later to detect + * conflicts. 
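+ * The lock mode is stored at the start of the undo payload; a TPD slot
+ * number and/or a subtransaction id may follow it, with uur_info flags
+ * recording which of them are present.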
+ */ + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &mode, + sizeof(LockTupleMode)); + + if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SLOT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &tup_trans_slot_id, + sizeof(tup_trans_slot_id)); + } + + /* + * Store subtransaction id in undo record. See SubXactLockTableWait + * to know why we need to store subtransaction id in undo. + */ + if (hasSubXactLock) + { + SubTransactionId subxid = GetCurrentSubTransactionId(); + + undorecord.uur_info |= UREC_INFO_PAYLOAD_CONTAINS_SUBXACT; + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &subxid, + sizeof(subxid)); + } + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(rel), + &undometa); + + + START_CRIT_SECTION(); + + InsertPreparedUndo(); + + /* + * We never set the locker slot on the tuple, so pass set_tpd_map_slot flag + * as false from the locker. From all other places it should always be + * passed as true so that the proper slot get set in the TPD offset map if + * its a TPD slot. + */ + PageSetUNDO(undorecord, buf, trans_slot_id, + (lockopr == LockForUpdate) ? true : false, + epoch, xid, urecptr, NULL, 0); + + ZHeapTupleHeaderSetXactSlot(zhtup->t_data, new_trans_slot_id); + zhtup->t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + zhtup->t_data->t_infomask |= new_infomask; + + MarkBufferDirty(buf); + + /* + * Do xlog stuff + */ + if (RelationNeedsWAL(rel)) + { + xl_zheap_lock xlrec; + xl_undo_header xlundohdr; + XLogRecPtr recptr; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + + /* + * Store the information required to generate undo record during + * replay. + */ + xlundohdr.reloid = undorecord.uur_reloid; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = prev_urecptr; + + xlrec.prev_xid = tup_xid; + xlrec.offnum = ItemPointerGetOffsetNumber(&zhtup->t_self); + xlrec.infomask = zhtup->t_data->t_infomask; + xlrec.trans_slot_id = new_trans_slot_id; + xlrec.flags = 0; + if (new_trans_slot_id != trans_slot_id) + { + Assert(new_trans_slot_id == tup_trans_slot_id); + xlrec.flags |= XLZ_LOCK_TRANS_SLOT_FOR_UREC; + } + else if (tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + xlrec.flags |= XLZ_LOCK_CONTAINS_TPD_SLOT; + + if (hasSubXactLock) + xlrec.flags |= XLZ_LOCK_CONTAINS_SUBXACT; + if (lockopr == LockForUpdate) + xlrec.flags |= XLZ_LOCK_FOR_UPDATE; + +prepare_xlog: + /* LOG undolog meta if this is the first WAL after the checkpoint. */ + LogUndoMetaData(&undometa); + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + XLogBeginInsert(); + XLogRegisterBuffer(0, buf, REGBUF_STANDARD); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + (void) RegisterTPDBuffer(page, 1); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xlrec, SizeOfZHeapLock); + + /* + * We always include old tuple header for undo in WAL record + * irrespective of full page image is taken or not. This is done + * since savings for not including a zheap tuple header are less + * compared to code complexity. However in future, if required we + * can do it similar to what we have done in zheap_update or + * zheap_delete. 
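+ * (Registering the header with XLogRegisterData rather than
+ * XLogRegisterBufData is what keeps it in the WAL record even when a
+ * full page image is taken.)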
+ */
+ XLogRegisterData((char *) undorecord.uur_tuple.data,
+ SizeofZHeapTupleHeader);
+ XLogRegisterData((char *) &mode, sizeof(LockTupleMode));
+ if (xlrec.flags & XLZ_LOCK_TRANS_SLOT_FOR_UREC)
+ XLogRegisterData((char *) &trans_slot_id, sizeof(trans_slot_id));
+ else if (xlrec.flags & XLZ_LOCK_CONTAINS_TPD_SLOT)
+ XLogRegisterData((char *) &tup_trans_slot_id, sizeof(tup_trans_slot_id));
+
+ recptr = XLogInsertExtended(RM_ZHEAP_ID, XLOG_ZHEAP_LOCK, RedoRecPtr,
+ doPageWrites);
+ if (recptr == InvalidXLogRecPtr)
+ {
+ ResetRegisteredTPDBuffers();
+ goto prepare_xlog;
+ }
+
+ PageSetLSN(page, recptr);
+ if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS)
+ TPDPageSetLSN(page, recptr);
+ }
+ END_CRIT_SECTION();
+
+ pfree(undorecord.uur_tuple.data);
+ pfree(undorecord.uur_payload.data);
+ UnlockReleaseUndoBuffers();
+ UnlockReleaseTPDBuffers();
+}
+
+/*
+ * compute_new_xid_infomask - Given the old values of the tuple header's
+ * infomask, compute the new values for the tuple header, which include the
+ * lock mode, the new infomask and the transaction slot.
+ *
+ * We don't clear the multi-lockers bit in this function as for that we need
+ * to ensure that all the lockers are gone. Unfortunately, it is not easy to
+ * do that as we would need to traverse all the undo chains for the current
+ * page, and doing that here, in quite a common code path, doesn't seem
+ * advisable. We clear this bit lazily when we detect a conflict, as at that
+ * point we anyway need to traverse the undo chains for the page.
+ *
+ * We ensure that the tuple always points to the transaction slot of the
+ * latest inserter/updater, except for cases where we lock first and then
+ * update the tuple (aka locks via the EvalPlanQual mechanism). For example,
+ * say after a committed insert/update a new request arrives to lock the
+ * tuple in key-share mode; we will keep the inserter's/updater's slot on the
+ * tuple and set the multi-locker and key-share bits. If the inserter/updater
+ * is already known to have a frozen slot (visible to everyone), we will set
+ * the key-share locker bit and the tuple will indicate a frozen slot.
+ * Similarly, for a new updater, if the tuple has a single locker, then the
+ * undo will have a frozen tuple, and for multi-lockers the updater's undo
+ * will have the previous inserter/updater slot; in both cases the new tuple
+ * will point to the updater's slot. Now, the rollback of a single locker
+ * will set the frozen slot on the tuple, whereas the rollback of a
+ * multi-locker won't change the slot information on the tuple. We don't
+ * want to keep the locker's slot on the tuple because, after a rollback, we
+ * would lose track of the last updater/inserter.
+ *
+ * When we are locking for the purpose of updating the tuple, we don't need
+ * to preserve the previous updater's information and we also keep the latest
+ * slot on the tuple. This is only true when there are no previous lockers on
+ * the tuple.
+ */ +static void +compute_new_xid_infomask(ZHeapTuple zhtup, Buffer buf, TransactionId tup_xid, + int tup_trans_slot, uint16 old_infomask, + TransactionId add_to_xid, int trans_slot, + TransactionId single_locker_xid, LockTupleMode mode, + LockOper lockoper, uint16 *result_infomask, + int *result_trans_slot) +{ + int new_trans_slot; + uint16 new_infomask; + bool old_tuple_has_update = false; + bool is_update = false; + + Assert(TransactionIdIsValid(add_to_xid)); + + new_infomask = 0; + new_trans_slot = trans_slot; + is_update = (lockoper == ForUpdate || lockoper == LockForUpdate); + + if ((IsZHeapTupleModified(old_infomask) && + TransactionIdIsInProgress(tup_xid)) || + ZHeapTupleHasMultiLockers(old_infomask)) + { + ZGetMultiLockInfo(old_infomask, tup_xid, tup_trans_slot, + add_to_xid, &new_infomask, &new_trans_slot, + &mode, &old_tuple_has_update, is_update); + } + else if (!is_update && + TransactionIdIsInProgress(single_locker_xid)) + { + LockTupleMode old_mode; + + /* + * When there is a single in-progress locker on the tuple and previous + * inserter/updater became all visible, we've to set multi-locker flag + * and highest lock mode. If current transaction tries to reacquire + * a lock, we don't set multi-locker flag. + */ + Assert(ZHEAP_XID_IS_LOCKED_ONLY(old_infomask)); + if (single_locker_xid != add_to_xid) + { + new_infomask |= ZHEAP_MULTI_LOCKERS; + new_trans_slot = tup_trans_slot; + } + + old_mode = get_old_lock_mode(old_infomask); + + /* Acquire the strongest of both. */ + if (mode < old_mode) + mode = old_mode; + + /* Keep the old tuple slot as it is */ + new_trans_slot = tup_trans_slot; + } + else if (!is_update && + TransactionIdIsInProgress(tup_xid)) + { + /* + * Normally if the tuple is not modified and the current transaction + * is in progress, the other transaction can't lock the tuple except + * itself. + * + * However, this can happen while locking the updated tuple chain. We + * keep the transaction slot of original tuple as that will allow us to + * check the visibility of tuple by just referring the current + * transaction slot. + */ + Assert((tup_xid == add_to_xid) || (mode == LockTupleKeyShare)); + + if (tup_xid != add_to_xid) + { + new_infomask |= ZHEAP_MULTI_LOCKERS; + new_trans_slot = tup_trans_slot; + } + } + else if (!is_update && + tup_trans_slot == ZHTUP_SLOT_FROZEN) + { + /* + * It's a frozen update or insert, so the locker must not change the + * slot on a tuple. The lockmode to be used on tuple is computed + * below. There could be a single committed/aborted locker (multilocker + * case is handled in the first condition). In that case, we can ignore + * the locker. If the locker is still in progress, it'll be handled in + * above case. + */ + new_trans_slot = ZHTUP_SLOT_FROZEN; + } + else if (!is_update && + !ZHEAP_XID_IS_LOCKED_ONLY(old_infomask) && + tup_trans_slot != ZHTUP_SLOT_FROZEN && + (TransactionIdDidCommit(tup_xid) + || !TransactionIdIsValid(tup_xid))) + { + /* + * It's a committed update or insert, so we gotta preserve him as + * updater of the tuple. Also, indicate that tuple has multiple + * lockers. + * + * Tuple xid could be invalid if the corresponding transaction is + * discarded or the tuple is marked as frozen. The later case is + * handled in the above condition (slot frozen). In the former case, + * we can consider it as a committed update or insert. 
+ */ + old_tuple_has_update = true; + new_infomask |= ZHEAP_MULTI_LOCKERS; + + if (ZHEAP_XID_IS_EXCL_LOCKED(old_infomask)) + new_infomask |= ZHEAP_XID_EXCL_LOCK; + else if (ZHEAP_XID_IS_NOKEY_EXCL_LOCKED(old_infomask)) + new_infomask |= ZHEAP_XID_NOKEY_EXCL_LOCK; + else + { + /* + * Tuple must not be locked in any other mode as we are here + * because either the tuple is updated or inserted and the + * corresponding transaction is committed. + */ + Assert(!(ZHEAP_XID_IS_KEYSHR_LOCKED(old_infomask) || + ZHEAP_XID_IS_SHR_LOCKED(old_infomask))); + } + + if (ZHeapTupleIsInPlaceUpdated(old_infomask)) + new_infomask |= ZHEAP_INPLACE_UPDATED; + else if (ZHeapTupleIsUpdated(old_infomask)) + new_infomask |= ZHEAP_UPDATED; + else + { + /* + * This is a freshly inserted tuple, allow to set the requested + * lock mode on tuple. + */ + old_tuple_has_update = false; + } + + new_trans_slot = tup_trans_slot; + + if (old_tuple_has_update) + goto infomask_is_computed; + } + else if (!is_update && + ZHEAP_XID_IS_LOCKED_ONLY(old_infomask) && + tup_trans_slot != ZHTUP_SLOT_FROZEN && + (TransactionIdDidCommit(tup_xid) + || !TransactionIdIsValid(tup_xid))) + { + LockTupleMode old_mode; + + /* + * This case arises for non-inplace updates when the newly inserted + * tuple is marked as locked-only, but multi-locker bit is not set. + * + * See comments in above condition to know when tup_xid can be + * invalid. + */ + new_infomask |= ZHEAP_MULTI_LOCKERS; + + /* The tuple is locked-only. */ + Assert(!(old_infomask & + (ZHEAP_DELETED | ZHEAP_UPDATED | ZHEAP_INPLACE_UPDATED))); + + old_mode = get_old_lock_mode(old_infomask); + + /* Acquire the strongest of both. */ + if (mode < old_mode) + mode = old_mode; + + /* Keep the old tuple slot as it is */ + new_trans_slot = tup_trans_slot; + } + + if (is_update && !ZHeapTupleHasMultiLockers(new_infomask)) + { + if (lockoper == LockForUpdate) + { + /* + * When we are locking for the purpose of updating the tuple, we + * don't need to preserve previous updater's information. + */ + new_infomask |= ZHEAP_XID_LOCK_ONLY; + if (mode == LockTupleExclusive) + new_infomask |= ZHEAP_XID_EXCL_LOCK; + else + new_infomask |= ZHEAP_XID_NOKEY_EXCL_LOCK; + } + else if (mode == LockTupleExclusive) + new_infomask |= ZHEAP_XID_EXCL_LOCK; + } + else + { + if (!is_update && !old_tuple_has_update) + new_infomask |= ZHEAP_XID_LOCK_ONLY; + switch (mode) + { + case LockTupleKeyShare: + new_infomask |= ZHEAP_XID_KEYSHR_LOCK; + break; + case LockTupleShare: + new_infomask |= ZHEAP_XID_SHR_LOCK; + break; + case LockTupleNoKeyExclusive: + new_infomask |= ZHEAP_XID_NOKEY_EXCL_LOCK; + break; + case LockTupleExclusive: + new_infomask |= ZHEAP_XID_EXCL_LOCK; + break; + default: + elog(ERROR, "invalid lock mode"); + } + } + +infomask_is_computed: + + *result_infomask = new_infomask; + + if (result_trans_slot) + *result_trans_slot = new_trans_slot; + + /* + * We store the reserved transaction slot only when we update the + * tuple. For lock only, we keep the old transaction slot in the + * tuple. + */ + Assert(is_update || new_trans_slot == tup_trans_slot); + } + +/* + * zheap_finish_speculative - mark speculative insertion as successful + * + * To successfully finish a speculative insertion we have to clear speculative + * flag from tuple. See heap_finish_speculative why it is important to clear + * the information of speculative insertion on tuple. 
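+ * Here that amounts to clearing the ZHEAP_SPECULATIVE_INSERT bit in the
+ * tuple's infomask and WAL-logging a confirm record.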
+ */ +void +zheap_finish_speculative(Relation relation, ZHeapTuple tuple) +{ + Buffer buffer; + Page page; + OffsetNumber offnum; + ItemId lp = NULL; + ZHeapTupleHeader zhtup; + + buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(&(tuple->t_self))); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + page = (Page) BufferGetPage(buffer); + + offnum = ItemPointerGetOffsetNumber(&(tuple->t_self)); + if (PageGetMaxOffsetNumber(page) >= offnum) + lp = PageGetItemId(page, offnum); + + if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp)) + elog(ERROR, "invalid lp"); + + zhtup = (ZHeapTupleHeader) PageGetItem(page, lp); + + /* NO EREPORT(ERROR) from here till changes are logged */ + START_CRIT_SECTION(); + + Assert(ZHeapTupleHeaderIsSpeculative(tuple->t_data)); + + MarkBufferDirty(buffer); + + /* Clear the speculative insertion marking from the tuple. */ + zhtup->t_infomask &= ~ZHEAP_SPECULATIVE_INSERT; + + /* XLOG stuff */ + if (RelationNeedsWAL(relation)) + { + xl_zheap_confirm xlrec; + XLogRecPtr recptr; + + xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self); + xlrec.flags = XLZ_SPEC_INSERT_SUCCESS; + + XLogBeginInsert(); + + /* We want the same filtering on this as on a plain insert */ + XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN); + + XLogRegisterData((char *) &xlrec, SizeOfZHeapConfirm); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + recptr = XLogInsert(RM_ZHEAP2_ID, XLOG_ZHEAP_CONFIRM); + + PageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + UnlockReleaseBuffer(buffer); +} + +/* + * zheap_abort_speculative - kill a speculatively inserted tuple + * + * Marks a tuple that was speculatively inserted in the same command as dead. + * That makes it immediately appear as dead to all transactions, including our + * own. In particular, it makes another backend inserting a duplicate key + * value won't unnecessarily wait for our whole transaction to finish (it'll + * just wait for our speculative insertion to finish). + * + * The functionality is same as heap_abort_speculative, but we achieve it + * differently. + */ +void +zheap_abort_speculative(Relation relation, ZHeapTuple tuple) +{ + TransactionId xid = GetTopTransactionId(); + TransactionId current_tup_xid; + ItemPointer tid = &(tuple->t_self); + ItemId lp; + ZHeapTupleHeader zhtuphdr; + Page page; + BlockNumber block; + Buffer buffer; + OffsetNumber offnum; + int out_slot_no PG_USED_FOR_ASSERTS_ONLY; + + Assert(ItemPointerIsValid(tid)); + + block = ItemPointerGetBlockNumber(tid); + buffer = ReadBuffer(relation, block); + page = BufferGetPage(buffer); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + offnum = ItemPointerGetOffsetNumber(tid); + lp = PageGetItemId(page, offnum); + Assert(ItemIdIsNormal(lp)); + + zhtuphdr = (ZHeapTupleHeader) PageGetItem(page, lp); + + /* + * Sanity check that the tuple really is a speculatively inserted tuple, + * inserted by us. + */ + out_slot_no = GetTransactionSlotInfo(buffer, + offnum, + ZHeapTupleHeaderGetXactSlot(zhtuphdr), + NULL, + ¤t_tup_xid, + NULL, + true, + false); + + /* As the transaction is still open, the slot can't be frozen. 
*/ + Assert(out_slot_no != ZHTUP_SLOT_FROZEN); + Assert(current_tup_xid != InvalidTransactionId); + + if (current_tup_xid != xid) + elog(ERROR, "attempted to kill a tuple inserted by another transaction"); + if (!(IsToastRelation(relation) || ZHeapTupleHeaderIsSpeculative(zhtuphdr))) + elog(ERROR, "attempted to kill a non-speculative tuple"); + Assert(!IsZHeapTupleModified(zhtuphdr->t_infomask)); + + START_CRIT_SECTION(); + + /* + * The tuple will become DEAD immediately. Flag that this page is a + * candidate for pruning. The action here is exactly same as what we do + * for rolling back insert. + */ + ItemIdSetDead(lp); + ZPageSetPrunable(page, xid); + + MarkBufferDirty(buffer); + + /* + * XLOG stuff + * + * The WAL records generated here match heap_delete(). The same recovery + * routines are used. + */ + if (RelationNeedsWAL(relation)) + { + xl_zheap_confirm xlrec; + XLogRecPtr recptr; + + xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self); + xlrec.flags = XLZ_SPEC_INSERT_FAILED; + + XLogBeginInsert(); + + XLogRegisterData((char *) &xlrec, SizeOfZHeapConfirm); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + /* No replica identity & replication origin logged */ + + recptr = XLogInsert(RM_ZHEAP2_ID, XLOG_ZHEAP_CONFIRM); + + PageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + if (ZHeapTupleHasExternal(tuple)) + { + Assert(!IsToastRelation(relation)); + ztoast_delete(relation, tuple, true); + } + + /* + * Never need to mark tuple for invalidation, since catalogs don't support + * speculative insertion + */ + + /* Now we can release the buffer */ + ReleaseBuffer(buffer); + + /* count deletion, as we counted the insertion too */ + pgstat_count_heap_delete(relation); +} + +/* + * zheap_freetuple + */ +void +zheap_freetuple(ZHeapTuple zhtup) +{ + pfree(zhtup); +} + +/* + * znocachegetattr - This is same as nocachegetattr except that it takes + * ZHeapTuple as input. + * + * Note that for zheap, cached offsets are not used and we always start + * deforming with the actual byte from where the first attribute starts. See + * atop zheap_compute_data_size. + */ +Datum +znocachegetattr(ZHeapTuple tuple, + int attnum, + TupleDesc tupleDesc) +{ + ZHeapTupleHeader tup = tuple->t_data; + Form_pg_attribute thisatt; + Datum ret_datum = (Datum) 0; + char *tp; /* ptr to data part of tuple */ + bits8 *bp = tup->t_bits; /* ptr to null bitmap in tuple */ + int off; /* current offset within data */ + int i; + + attnum--; + tp = (char *) tup; + + /* + * For each non-null attribute, we have to first account for alignment + * padding before the attr, then advance over the attr based on its + * length. Nulls have no storage and no alignment padding either. 
+ */ + off = tup->t_hoff; + + for (i = 0;; i++) /* loop exit is at "break" */ + { + Form_pg_attribute att = TupleDescAttr(tupleDesc, i); + + if (ZHeapTupleHasNulls(tuple) && att_isnull(i, bp)) + { + continue; /* this cannot be the target att */ + } + + if (att->attlen == -1) + { + off = att_align_pointer(off, att->attalign, -1, + tp + off); + } + else if (!att->attbyval) + { + /* not varlena, so safe to use att_align_nominal */ + off = att_align_nominal(off, att->attalign); + } + + if (i == attnum) + break; + + off = att_addlength_pointer(off, att->attlen, tp + off); + } + + thisatt = TupleDescAttr(tupleDesc, attnum); + if (thisatt->attbyval) + memcpy(&ret_datum, tp + off, thisatt->attlen); + else + ret_datum = PointerGetDatum((char *) (tp + off)); + + return ret_datum; +} + +TransactionId +zheap_fetchinsertxid(ZHeapTuple zhtup, Buffer buffer) +{ + UndoRecPtr urec_ptr; + TransactionId xid = InvalidTransactionId; + int trans_slot_id = InvalidXactSlotId; + int prev_trans_slot_id; + TransactionId result; + BlockNumber blk; + OffsetNumber offnum; + UnpackedUndoRecord *urec; + ZHeapTuple undo_tup; + + prev_trans_slot_id = ZHeapTupleHeaderGetXactSlot(zhtup->t_data); + blk = ItemPointerGetBlockNumber(&zhtup->t_self); + offnum = ItemPointerGetOffsetNumber(&zhtup->t_self); + (void) GetTransactionSlotInfo(buffer, + offnum, + prev_trans_slot_id, + NULL, + NULL, + &urec_ptr, + true, + false); + undo_tup = zhtup; + + while(true) + { + urec = UndoFetchRecord(urec_ptr, blk, offnum, xid, NULL, ZHeapSatisfyUndoRecord); + if (urec != NULL) + { + /* + * If we have valid undo record, then check if we have + * reached the insert log and return the corresponding + * transaction id. + */ + if (urec->uur_type == UNDO_INSERT || + urec->uur_type == UNDO_MULTI_INSERT || + urec->uur_type == UNDO_INPLACE_UPDATE) + { + result = urec->uur_xid; + UndoRecordRelease(urec); + break; + } + + undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id, + NULL, (undo_tup) == (zhtup) ? false : true, + BufferGetPage(buffer)); + + xid = urec->uur_prevxid; + urec_ptr = urec->uur_blkprev; + UndoRecordRelease(urec); + if (!UndoRecPtrIsValid(urec_ptr)) + { + zheap_freetuple(undo_tup); + result = FrozenTransactionId; + break; + } + + + /* + * Change the undo chain if the undo tuple is stamped + * with the different transaction slot. + */ + if (trans_slot_id != prev_trans_slot_id) + { + (void) GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &urec_ptr, + true, + true); + prev_trans_slot_id = trans_slot_id; + } + zhtup = undo_tup; + } + else + { + /* + * Undo record could be null only when it's undo log + * is/about to be discarded. We cannot use any assert + * for checking is the log is actually discarded, since + * UndoFetchRecord can return NULL for the records which + * are not yet discarded but are about to be discarded. + */ + result = FrozenTransactionId; + break; + } + } + + return result; +} + +/* ---------------- + * zheap_getsysattr + * + * Fetch the value of a system attribute for a tuple. + * + * This provides same information as heap_getsysattr, but for zheap tuple. + * ---------------- + */ +Datum +zheap_getsysattr(ZHeapTuple zhtup, Buffer buf, int attnum, + TupleDesc tupleDesc, bool *isnull) +{ + Datum result; + TransactionId xid = InvalidTransactionId; + bool release_buf = false; + + Assert(zhtup); + + /* + * For xmin,xmax,cmin and cmax we may need to fetch the information from + * the undo record, so ensure we have the valid buffer. 
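The deforming loop in znocachegetattr above always walks the tuple from the first attribute, aligning the running offset before each non-null column and then skipping the column's length. Below is a simplified, self-contained model of that walk over fixed-width columns; MiniAttr and ALIGN_TO are stand-ins for the catalog attribute form and the att_align_* macros, and varlena and null handling are left out.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Simplified attribute descriptor; Form_pg_attribute has many more fields. */
    typedef struct MiniAttr
    {
        int16_t attlen;     /* fixed length in bytes (varlena handling omitted) */
        size_t  attalign;   /* required alignment: 1, 2, 4 or 8 */
    } MiniAttr;

    /* Round 'off' up to the next multiple of the power-of-two alignment 'a'. */
    #define ALIGN_TO(off, a) (((off) + ((a) - 1)) & ~((size_t) (a) - 1))

    /* Walk the data area the way the loop above does: align, then skip lengths. */
    size_t mini_attr_offset(const MiniAttr *atts, int natts, int target, size_t off)
    {
        for (int i = 0; i < natts; i++)
        {
            off = ALIGN_TO(off, atts[i].attalign);
            if (i == target)
                return off;
            off += (size_t) atts[i].attlen;
        }
        return (size_t) -1;     /* target attribute number out of range */
    }

    int main(void)
    {
        MiniAttr atts[] = {{4, 4}, {1, 1}, {8, 8}};

        /* the third column starts at offset 8 when deforming begins at offset 0 */
        printf("%zu\n", mini_attr_offset(atts, 3, 2, 0));
        return 0;
    }

Because zheap does not keep cached attribute offsets, this walk starts from the tuple's hoff every time, which is exactly what the loop above does.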
+ */ + if (!BufferIsValid(buf) && + ((attnum == MinTransactionIdAttributeNumber) || + (attnum == MaxTransactionIdAttributeNumber) || + (attnum == MinCommandIdAttributeNumber) || + (attnum == MaxCommandIdAttributeNumber))) + { + Relation rel = relation_open(zhtup->t_tableOid, NoLock); + buf = ReadBuffer(rel, ItemPointerGetBlockNumber(&(zhtup->t_self))); + relation_close(rel, NoLock); + release_buf = true; + } + + /* Currently, no sys attribute ever reads as NULL. */ + *isnull = false; + + switch (attnum) + { + case SelfItemPointerAttributeNumber: + /* pass-by-reference datatype */ + result = PointerGetDatum(&(zhtup->t_self)); + break; + case MinTransactionIdAttributeNumber: + { + /* + * Fixme - Need to check whether we need any handling of epoch here. + */ + uint64 epoch_xid; + ZHeapTupleGetTransInfo(zhtup, buf, NULL, &epoch_xid, &xid, + NULL, NULL, false); + + if (!TransactionIdIsValid(xid) || epoch_xid < + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + xid = FrozenTransactionId; + + result = TransactionIdGetDatum(xid); + } + break; + case MaxTransactionIdAttributeNumber: + case MinCommandIdAttributeNumber: + case MaxCommandIdAttributeNumber: + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("xmax, cmin, and cmax are not supported for zheap tuples"))); + break; + case TableOidAttributeNumber: + result = ObjectIdGetDatum(zhtup->t_tableOid); + break; + default: + elog(ERROR, "invalid attnum: %d", attnum); + result = 0; /* keep compiler quiet */ + break; + } + + if (release_buf) + ReleaseBuffer(buf); + + return result; +} + +/* --------------------- + * zheap_attisnull - returns TRUE if zheap tuple attribute is not present + * --------------------- + */ +bool +zheap_attisnull(ZHeapTuple tup, int attnum, TupleDesc tupleDesc) +{ + if (attnum > (int) ZHeapTupleHeaderGetNatts(tup->t_data)) + return true; + + /* + * We allow a NULL tupledesc for relations not expected to have missing + * values, such as catalog relations and indexes. + */ + Assert(!tupleDesc || attnum <= tupleDesc->natts); + if (attnum > (int) ZHeapTupleHeaderGetNatts(tup->t_data)) + { + if (tupleDesc && TupleDescAttr(tupleDesc, attnum - 1)->atthasmissing) + return false; + else + return true; + } + + if (attnum > 0) + { + if (ZHeapTupleNoNulls(tup)) + return false; + return att_isnull(attnum - 1, tup->t_data->t_bits); + } + + switch (attnum) + { + case TableOidAttributeNumber: + case SelfItemPointerAttributeNumber: + case MinTransactionIdAttributeNumber: + case MinCommandIdAttributeNumber: + case MaxTransactionIdAttributeNumber: + case MaxCommandIdAttributeNumber: + /* these are never null */ + break; + default: + elog(ERROR, "invalid attnum: %d", attnum); + } + + return false; +} + +/* + * Check if the specified attribute's value is same in both given tuples. + * Subroutine for ZHeapDetermineModifiedColumns. + */ +static bool +zheap_tuple_attr_equals(TupleDesc tupdesc, int attrnum, + ZHeapTuple tup1, ZHeapTuple tup2) +{ + Datum value1, + value2; + bool isnull1, + isnull2; + Form_pg_attribute att; + + /* + * If it's a whole-tuple reference, say "not equal". It's not really + * worth supporting this case, since it could only succeed after a no-op + * update, which is hardly a case worth optimizing for. + */ + if (attrnum == 0) + return false; + + /* + * Likewise, automatically say "not equal" for any system attribute other + * than OID and tableOID; we cannot expect these to be consistent in a HOT + * chain, or even to be set correctly yet in the new tuple. 
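zheap_attisnull above relies on the usual PostgreSQL null-bitmap convention: one bit per column, and a set bit means the column is not null, so att_isnull reports null when the bit is clear. A standalone equivalent of that test, assuming that convention:

    #include <stdbool.h>
    #include <stdint.h>

    /* Bit 'att' (0-based) clear in the null bitmap means the column is null. */
    bool mini_att_isnull(int att, const uint8_t *bits)
    {
        return (bits[att >> 3] & (1 << (att & 0x07))) == 0;
    }

The extra range check against ZHeapTupleHeaderGetNatts handles columns added after the tuple was stored: they read as null unless the attribute carries a missing (default) value.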
+ */ + if (attrnum < 0) + { + if (attrnum != TableOidAttributeNumber) + return false; + } + + /* + * Extract the corresponding values. XXX this is pretty inefficient if + * there are many indexed columns. Should HeapDetermineModifiedColumns do + * a single heap_deform_tuple call on each tuple, instead? But that + * doesn't work for system columns ... + */ + value1 = zheap_getattr(tup1, attrnum, tupdesc, &isnull1); + value2 = zheap_getattr(tup2, attrnum, tupdesc, &isnull2); + + /* + * If one value is NULL and other is not, then they are certainly not + * equal + */ + if (isnull1 != isnull2) + return false; + + /* + * If both are NULL, they can be considered equal. + */ + if (isnull1) + return true; + + /* + * We do simple binary comparison of the two datums. This may be overly + * strict because there can be multiple binary representations for the + * same logical value. But we should be OK as long as there are no false + * positives. Using a type-specific equality operator is messy because + * there could be multiple notions of equality in different operator + * classes; furthermore, we cannot safely invoke user-defined functions + * while holding exclusive buffer lock. + */ + if (attrnum <= 0) + { + /* The only allowed system columns are OIDs, so do this */ + return (DatumGetObjectId(value1) == DatumGetObjectId(value2)); + } + else + { + Assert(attrnum <= tupdesc->natts); + att = TupleDescAttr(tupdesc, attrnum - 1); + return datumIsEqual(value1, value2, att->attbyval, att->attlen); + } +} + +/* + * ZHeapDetermineModifiedColumns - Check which columns are being updated. + * This is same as HeapDetermineModifiedColumns except that it takes + * ZHeapTuple as input. + */ +static Bitmapset * +ZHeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols, + ZHeapTuple oldtup, ZHeapTuple newtup) +{ + int attnum; + Bitmapset *modified = NULL; + + while ((attnum = bms_first_member(interesting_cols)) >= 0) + { + attnum += FirstLowInvalidHeapAttributeNumber; + + if (!zheap_tuple_attr_equals(RelationGetDescr(relation), + attnum, oldtup, newtup)) + modified = bms_add_member(modified, + attnum - FirstLowInvalidHeapAttributeNumber); + } + + return modified; +} + +/* + * ----------- + * Zheap transaction information related API's. + * ----------- + */ + +/* + * GetTransactionSlotInfo - Get the required transaction slot info. We also + * return the transaction slot number, if the transaction slot is in TPD entry. + * + * We can directly call this function to get transaction slot info if we are + * sure that the corresponding tuple is not deleted or we don't care if the + * tuple has multi-locker flag in which case we need to call + * ZHeapTupleGetTransInfo. + * + * NoTPDBufLock - See TPDPageGetTransactionSlotInfo. + * TPDSlot - true, if the passed transaction_slot_id is the slot number in TPD + * entry. + */ +int +GetTransactionSlotInfo(Buffer buf, OffsetNumber offset, int trans_slot_id, + uint32 *epoch, TransactionId *xid, + UndoRecPtr *urec_ptr, bool NoTPDBufLock, bool TPDSlot) +{ + ZHeapPageOpaque opaque; + Page page; + PageHeader phdr PG_USED_FOR_ASSERTS_ONLY; + int out_trans_slot_id = trans_slot_id; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + /* + * Fetch the required information from the transaction slot. The + * transaction slot can either be on the heap page or TPD page. 
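ZHeapDetermineModifiedColumns and zheap_tuple_attr_equals above reduce to a per-column binary comparison of the old and new tuple values. A self-contained sketch of the same idea over plain fixed-width columns is shown below, returning a bitmask of changed columns; the real code works with a Bitmapset and zheap_getattr and special-cases system attributes.

    #include <stdbool.h>
    #include <stdint.h>

    /* Bit 'col' is set in the result iff that column differs (assumes ncols <= 32). */
    uint32_t mini_modified_columns(const int32_t *oldvals, const int32_t *newvals,
                                   const bool *oldnull, const bool *newnull,
                                   int ncols)
    {
        uint32_t modified = 0;

        for (int col = 0; col < ncols; col++)
        {
            if (oldnull[col] != newnull[col])
                modified |= (1u << col);    /* null on one side only: changed */
            else if (!oldnull[col] && oldvals[col] != newvals[col])
                modified |= (1u << col);    /* simple binary comparison */
        }
        return modified;
    }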
+ */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + { + if (epoch) + *epoch = 0; + if (xid) + *xid = InvalidTransactionId; + if (urec_ptr) + *urec_ptr = InvalidUndoRecPtr; + } + else if (trans_slot_id < ZHEAP_PAGE_TRANS_SLOTS || + (trans_slot_id == ZHEAP_PAGE_TRANS_SLOTS && + !ZHeapPageHasTPDSlot(phdr))) + { + if (epoch) + *epoch = opaque->transinfo[trans_slot_id - 1].xid_epoch; + if (xid) + *xid = opaque->transinfo[trans_slot_id - 1].xid; + if (urec_ptr) + *urec_ptr = opaque->transinfo[trans_slot_id - 1].urec_ptr; + } + else + { + Assert((ZHeapPageHasTPDSlot(phdr))); + if (TPDSlot) + { + /* + * The heap page's last transaction slot data is copied over to + * first slot in TPD entry, so we need fetch it from there. See + * AllocateAndFormTPDEntry. + */ + if (trans_slot_id == ZHEAP_PAGE_TRANS_SLOTS) + trans_slot_id = ZHEAP_PAGE_TRANS_SLOTS + 1; + out_trans_slot_id = TPDPageGetTransactionSlotInfo(buf, + trans_slot_id, + InvalidOffsetNumber, + epoch, + xid, + urec_ptr, + NoTPDBufLock, + false); + } + else + { + Assert(offset != InvalidOffsetNumber); + out_trans_slot_id = TPDPageGetTransactionSlotInfo(buf, + trans_slot_id, + offset, + epoch, + xid, + urec_ptr, + NoTPDBufLock, + false); + } + } + + return out_trans_slot_id; +} + +/* + * PageSetUNDO - Set the transaction information pointer for a given + * transaction slot. + */ +void +PageSetUNDO(UnpackedUndoRecord undorecord, Buffer buffer, int trans_slot_id, + bool set_tpd_map_slot, uint32 epoch, TransactionId xid, + UndoRecPtr urecptr, OffsetNumber *usedoff, int ucnt) +{ + ZHeapPageOpaque opaque; + Page page = BufferGetPage(buffer); + PageHeader phdr; + + Assert(trans_slot_id != InvalidXactSlotId); + + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + /* + * Set the required information in the transaction slot. The transaction + * slot can either be on the heap page or TPD page. + * + * During recovery, we set the required information in TPD separately + * only if required. + */ + if (trans_slot_id < ZHEAP_PAGE_TRANS_SLOTS || + (trans_slot_id == ZHEAP_PAGE_TRANS_SLOTS && + !ZHeapPageHasTPDSlot(phdr))) + { + opaque->transinfo[trans_slot_id - 1].xid_epoch = epoch; + opaque->transinfo[trans_slot_id - 1].xid = xid; + opaque->transinfo[trans_slot_id - 1].urec_ptr = urecptr; + } + /* TPD information is set separately during recovery. */ + else if (!InRecovery) + { + if (ucnt <= 0) + { + Assert(ucnt == 0); + + usedoff = &undorecord.uur_offset; + ucnt++; + } + + TPDPageSetUndo(buffer, trans_slot_id, set_tpd_map_slot, epoch, xid, + urecptr, usedoff, ucnt); + } + + elog(DEBUG1, "undo record: TransSlot: %d, Epoch: %d, TransactionId: %d, urec: " UndoRecPtrFormat ", prev_urec: " UINT64_FORMAT ", block: %d, offset: %d, undo_op: %d, xid_tup: %d, reloid: %d", + trans_slot_id, epoch, xid, urecptr, undorecord.uur_blkprev, undorecord.uur_block, undorecord.uur_offset, undorecord.uur_type, + undorecord.uur_prevxid, undorecord.uur_reloid); +} + +/* + * PageSetTransactionSlotInfo - Set the transaction slot info for the given + * slot. + * + * This is similar to PageSetUNDO except that it doesn't need to update offset + * map in TPD. 
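PageSetUNDO above writes three fields into a transaction slot that lives either in the zheap page's special space or in a TPD entry. A miniature version of the in-page case follows; MiniTransInfo mirrors the xid_epoch/xid/urec_ptr triple visible in the code, and the TPD path is not modeled.

    #include <stdint.h>

    /* Mirrors the per-slot fields that PageSetUNDO fills in for an in-page slot. */
    typedef struct MiniTransInfo
    {
        uint32_t xid_epoch;
        uint32_t xid;        /* TransactionId */
        uint64_t urec_ptr;   /* UndoRecPtr of the slot's latest undo record */
    } MiniTransInfo;

    /*
     * Slot numbers carried on tuples are 1-based (0 means "frozen"), so the
     * array index is trans_slot_id - 1, exactly as in the code above.
     */
    void mini_page_set_undo(MiniTransInfo *slots, int trans_slot_id,
                            uint32_t epoch, uint32_t xid, uint64_t urecptr)
    {
        slots[trans_slot_id - 1].xid_epoch = epoch;
        slots[trans_slot_id - 1].xid = xid;
        slots[trans_slot_id - 1].urec_ptr = urecptr;
    }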
+ */ +void +PageSetTransactionSlotInfo(Buffer buf, int trans_slot_id, uint32 epoch, + TransactionId xid, UndoRecPtr urec_ptr) +{ + ZHeapPageOpaque opaque; + Page page; + PageHeader phdr; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + if (trans_slot_id < ZHEAP_PAGE_TRANS_SLOTS || + (trans_slot_id == ZHEAP_PAGE_TRANS_SLOTS && + !ZHeapPageHasTPDSlot(phdr))) + { + opaque->transinfo[trans_slot_id - 1].xid_epoch = epoch; + opaque->transinfo[trans_slot_id - 1].xid = xid; + opaque->transinfo[trans_slot_id - 1].urec_ptr = urec_ptr; + } + else + { + TPDPageSetTransactionSlotInfo(buf, trans_slot_id, epoch, xid, + urec_ptr); + } +} + +/* + * PageGetTransactionSlotId - Get the transaction slot for the given epoch and + * xid. + * + * If the slot is not in the TPD page but the caller has asked to lock the TPD + * buffer than do so. tpd_page_locked will be set to true if the required page + * is locked, false, otherwise. + */ +int +PageGetTransactionSlotId(Relation rel, Buffer buf, uint32 epoch, + TransactionId xid, UndoRecPtr *urec_ptr, + bool keepTPDBufLock, bool locktpd, + bool *tpd_page_locked) +{ + ZHeapPageOpaque opaque; + Page page; + PageHeader phdr; + int slot_no; + int total_slots_in_page; + bool check_tpd; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + if (ZHeapPageHasTPDSlot(phdr)) + { + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS - 1; + check_tpd = true; + } + else + { + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS; + check_tpd = false; + } + + /* Check if the required slot exists on the page. */ + for (slot_no = 0; slot_no < total_slots_in_page; slot_no++) + { + if (opaque->transinfo[slot_no].xid_epoch == epoch && + opaque->transinfo[slot_no].xid == xid) + { + *urec_ptr = opaque->transinfo[slot_no].urec_ptr; + + /* Check if TPD has page slot, then lock TPD page */ + if (locktpd && ZHeapPageHasTPDSlot(phdr)) + { + Assert(tpd_page_locked); + *tpd_page_locked = TPDPageLock(rel, buf); + } + + return slot_no + 1; + } + } + + /* Check if the slot exists on the TPD page. */ + if (check_tpd) + { + int tpd_e_slot; + + tpd_e_slot = TPDPageGetSlotIfExists(rel, buf, InvalidOffsetNumber, + epoch, xid, urec_ptr, + keepTPDBufLock, false); + if (tpd_e_slot != InvalidXactSlotId) + { + /* + * If we get the valid slot then the TPD page must be locked and + * the lock will be retained if asked for. + */ + if (tpd_page_locked) + *tpd_page_locked = keepTPDBufLock; + return tpd_e_slot; + } + } + else + { + /* + * Lock the TPD page if the caller has instructed so and the page + * has tpd slot. + */ + if (locktpd && ZHeapPageHasTPDSlot(phdr)) + { + Assert(tpd_page_locked); + *tpd_page_locked = TPDPageLock(rel, buf); + } + } + + return InvalidXactSlotId; +} + +/* + * PageGetTransactionSlotInfo - Get the transaction slot info for the given + * slot no. + */ +void +PageGetTransactionSlotInfo(Buffer buf, int slot_no, uint32 *epoch, + TransactionId *xid, UndoRecPtr *urec_ptr, + bool keepTPDBufLock) +{ + ZHeapPageOpaque opaque; + Page page; + PageHeader phdr; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + /* + * Fetch the required information from the transaction slot. The + * transaction slot can either be on the heap page or TPD page. 
+ */ + if (slot_no < ZHEAP_PAGE_TRANS_SLOTS || + (slot_no == ZHEAP_PAGE_TRANS_SLOTS && + !ZHeapPageHasTPDSlot(phdr))) + { + if (epoch) + *epoch = opaque->transinfo[slot_no - 1].xid_epoch; + if (xid) + *xid = opaque->transinfo[slot_no - 1].xid; + if (urec_ptr) + *urec_ptr = opaque->transinfo[slot_no - 1].urec_ptr; + } + else + { + Assert((ZHeapPageHasTPDSlot(phdr))); + (void)TPDPageGetTransactionSlotInfo(buf, + slot_no, + InvalidOffsetNumber, + epoch, + xid, + urec_ptr, + false, + true); + } +} + +/* + * PageReserveTransactionSlot - Reserve the transaction slot in page. + * + * This function returns transaction slot number if either the page already + * has some slot that contains the transaction info or there is an empty + * slot or it manages to reuse some existing slot or it manages to get the + * slot in TPD; otherwise retruns InvalidXactSlotId. + * + * Note that we always return array location of slot plus one as zeroth slot + * number is reserved for frozen slot number (ZHTUP_SLOT_FROZEN). + */ +int +PageReserveTransactionSlot(Relation relation, Buffer buf, OffsetNumber offset, + uint32 epoch, TransactionId xid, + UndoRecPtr *urec_ptr, bool *lock_reacquired) +{ + ZHeapPageOpaque opaque; + Page page; + PageHeader phdr; + int latestFreeTransSlot = InvalidXactSlotId; + int slot_no; + int total_slots_in_page; + bool check_tpd; + + *lock_reacquired = false; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + if (ZHeapPageHasTPDSlot(phdr)) + { + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS - 1; + check_tpd = true; + } + else + { + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS; + check_tpd = false; + } + + /* + * For temp relations, we don't have to check all the slots since + * no other backend can access the same relation. If a slot is available, + * we return it from here. Else, we freeze the slot in PageFreezeTransSlots. + * + * XXX For temp tables, oldestXidWithEpochHavingUndo is not relevant as + * the undo for them can be discarded on commit. Hence, comparing xid + * with oldestXidWithEpochHavingUndo during visibility checks can lead to + * incorrect behavior. To avoid that, we can mark the tuple as frozen + * for any previous transaction id. In that way, we don't have to + * compare the previous xid of tuple with oldestXidWithEpochHavingUndo. + */ + if (RELATION_IS_LOCAL(relation)) + { + /* We can't access temp tables of other backends. 
*/ + Assert(!RELATION_IS_OTHER_TEMP(relation)); + + slot_no = 0; + if (opaque->transinfo[slot_no].xid_epoch == epoch && + opaque->transinfo[slot_no].xid == xid) + { + *urec_ptr = opaque->transinfo[slot_no].urec_ptr; + return (slot_no + 1); + } + else if (opaque->transinfo[slot_no].xid == InvalidTransactionId && + latestFreeTransSlot == InvalidXactSlotId) + latestFreeTransSlot = slot_no; + } + else + { + for (slot_no = 0; slot_no < total_slots_in_page; slot_no++) + { + if (opaque->transinfo[slot_no].xid_epoch == epoch && + opaque->transinfo[slot_no].xid == xid) + { + *urec_ptr = opaque->transinfo[slot_no].urec_ptr; + return (slot_no + 1); + } + else if (opaque->transinfo[slot_no].xid == InvalidTransactionId && + latestFreeTransSlot == InvalidXactSlotId) + latestFreeTransSlot = slot_no; + } + } + + /* Check if we already have a slot on the TPD page */ + if (check_tpd) + { + int tpd_e_slot; + + tpd_e_slot = TPDPageGetSlotIfExists(relation, buf, offset, epoch, + xid, urec_ptr, true, true); + if (tpd_e_slot != InvalidXactSlotId) + return tpd_e_slot; + } + + + if (latestFreeTransSlot >= 0) + { + *urec_ptr = opaque->transinfo[latestFreeTransSlot].urec_ptr; + return (latestFreeTransSlot + 1); + } + + /* no transaction slot available, try to reuse some existing slot */ + if (PageFreezeTransSlots(relation, buf, lock_reacquired, NULL, 0)) + { + /* + * If the lock is reacquired inside, then we allow callers to reverify + * the condition whether then can still perform the required + * operation. + */ + if (*lock_reacquired) + return InvalidXactSlotId; + + /* + * TPD entry might get pruned in TPDPageGetSlotIfExists, so recheck + * it. + */ + if (ZHeapPageHasTPDSlot(phdr)) + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS - 1; + else + total_slots_in_page = ZHEAP_PAGE_TRANS_SLOTS; + + for (slot_no = 0; slot_no < total_slots_in_page; slot_no++) + { + if (opaque->transinfo[slot_no].xid == InvalidTransactionId) + { + *urec_ptr = opaque->transinfo[slot_no].urec_ptr; + return (slot_no + 1); + } + } + + /* + * After freezing transaction slots, we should get atleast one free + * slot. + */ + Assert(false); + } + Assert (!RELATION_IS_LOCAL(relation)); + + /* + * Reserve the transaction slot in TPD. First we check if there already + * exists an TPD entry for this page, then reserve in that, otherwise, + * allocate a new TPD entry and reserve the slot in it. + */ + if (ZHeapPageHasTPDSlot(phdr)) + { + int tpd_e_slot; + + tpd_e_slot = TPDPageReserveTransSlot(relation, buf, offset, + urec_ptr, lock_reacquired); + + if (tpd_e_slot != InvalidXactSlotId) + return tpd_e_slot; + + /* + * Fixme : We should allow to allocate bigger TPD entries or support + * chained TPD entries. + */ + return InvalidXactSlotId; + } + else + { + slot_no = TPDAllocateAndReserveTransSlot(relation, buf, offset, + urec_ptr); + if (slot_no != InvalidXactSlotId) + return slot_no; + } + + /* no transaction slot available */ + return InvalidXactSlotId; +} + +/* + * zheap_freeze_or_invalidate_tuples - Clear the slot information or set + * invalid_xact flags. 
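The first half of PageReserveTransactionSlot above boils down to: reuse the slot that already belongs to this (epoch, xid) if there is one, otherwise remember the first empty slot. The standalone sketch below mirrors only that part; the real function additionally tries to freeze old slots and to fall back to a TPD entry, and the caller later stamps the reserved slot via PageSetUNDO. Types and the empty-slot test (xid == 0, i.e. InvalidTransactionId) are simplifications.

    #include <stdint.h>

    #define MINI_INVALID_XACT_SLOT (-1)    /* stands in for InvalidXactSlotId */

    typedef struct MiniSlot
    {
        uint32_t xid_epoch;
        uint32_t xid;    /* 0 plays the role of InvalidTransactionId */
    } MiniSlot;

    /* Return a 1-based slot number for (epoch, xid), or the first free slot. */
    int mini_reserve_slot(const MiniSlot *slots, int nslots,
                          uint32_t epoch, uint32_t xid)
    {
        int free_slot = MINI_INVALID_XACT_SLOT;

        for (int slot_no = 0; slot_no < nslots; slot_no++)
        {
            if (slots[slot_no].xid_epoch == epoch && slots[slot_no].xid == xid)
                return slot_no + 1;          /* this transaction already has a slot */
            if (slots[slot_no].xid == 0 && free_slot == MINI_INVALID_XACT_SLOT)
                free_slot = slot_no;         /* remember the first empty slot */
        }
        return (free_slot == MINI_INVALID_XACT_SLOT) ? MINI_INVALID_XACT_SLOT
                                                     : free_slot + 1;
    }

Returning the array index plus one matches the convention noted above: slot number 0 is reserved for ZHTUP_SLOT_FROZEN.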
+ *
+ * Process all the tuples on the page and match their transaction slot with
+ * the input slot array.  If a tuple points to one of those slots, then set
+ * the tuple's slot to ZHTUP_SLOT_FROZEN if isFrozen is true; otherwise, set
+ * the ZHEAP_INVALID_XACT_SLOT flag on the tuple.
+ */
+void
+zheap_freeze_or_invalidate_tuples(Buffer buf, int nSlots, int *slots,
+                                  bool isFrozen, bool TPDSlot)
+{
+    OffsetNumber offnum, maxoff;
+    Page page = BufferGetPage(buf);
+    int i;
+
+    /* clear the slot info from tuples */
+    maxoff = PageGetMaxOffsetNumber(page);
+
+    for (offnum = FirstOffsetNumber;
+         offnum <= maxoff;
+         offnum = OffsetNumberNext(offnum))
+    {
+        ZHeapTupleHeader tup_hdr;
+        ItemId itemid;
+        int trans_slot;
+
+        itemid = PageGetItemId(page, offnum);
+
+        if (ItemIdIsDead(itemid))
+            continue;
+
+        if (!ItemIdIsUsed(itemid))
+        {
+            if (!ItemIdHasPendingXact(itemid))
+                continue;
+            trans_slot = ItemIdGetTransactionSlot(itemid);
+        }
+        else if (ItemIdIsDeleted(itemid))
+        {
+            trans_slot = ItemIdGetTransactionSlot(itemid);
+        }
+        else
+        {
+            tup_hdr = (ZHeapTupleHeader) PageGetItem(page, itemid);
+            trans_slot = ZHeapTupleHeaderGetXactSlot(tup_hdr);
+        }
+
+        /* If we are freezing a TPD slot, then get the actual slot from the TPD. */
+        if (TPDSlot)
+        {
+            /* Tuple is not pointing to a TPD slot, so skip it. */
+            if (trans_slot < ZHEAP_PAGE_TRANS_SLOTS)
+                continue;
+
+            /*
+             * If we came here to freeze a TPD slot, then fetch the exact slot
+             * info from the TPD.
+             */
+            trans_slot = TPDPageGetTransactionSlotInfo(buf, trans_slot, offnum,
+                                                       NULL, NULL, NULL, false,
+                                                       false);
+
+            /*
+             * The input slots array always stores the slot index starting
+             * from 0, even for TPD slots.  So convert the slot number into
+             * that index.
+             */
+            trans_slot -= (ZHEAP_PAGE_TRANS_SLOTS + 1);
+        }
+        else
+        {
+            /*
+             * The slot number on the tuple is always the array location of
+             * the slot plus one, so we need to subtract one here before
+             * comparing it with the frozen slots.  See
+             * PageReserveTransactionSlot.
+             */
+            trans_slot -= 1;
+        }
+
+        for (i = 0; i < nSlots; i++)
+        {
+            if (trans_slot == slots[i])
+            {
+                /*
+                 * Set the tuple's transaction slot to frozen to indicate that
+                 * the tuple is all-visible, and mark deleted itemids as dead.
+                 */
+                if (isFrozen)
+                {
+                    if (!ItemIdIsUsed(itemid))
+                    {
+                        /* This must be an unused entry that has xact information. */
+                        Assert(ItemIdHasPendingXact(itemid));
+
+                        /*
+                         * The pending xact must be committed if the
+                         * corresponding slot is being marked as frozen.  So,
+                         * clear the pending xact and transaction slot
+                         * information from the itemid.
+                         */
+                        ItemIdSetUnused(itemid);
+                    }
+                    else if (ItemIdIsDeleted(itemid))
+                    {
+                        /*
+                         * The deleted item must not be visible to anyone if
+                         * the corresponding slot is being marked as frozen.
+                         * So, mark it as dead.
+                         */
+                        ItemIdSetDead(itemid);
+                    }
+                    else
+                    {
+                        tup_hdr = (ZHeapTupleHeader) PageGetItem(page, itemid);
+                        ZHeapTupleHeaderSetXactSlot(tup_hdr, ZHTUP_SLOT_FROZEN);
+                    }
+                }
+                else
+                {
+                    /*
+                     * We just set the invalid xact flag on the tuple/itemid to
+                     * indicate that for this tuple/itemid we need to fetch the
+                     * transaction information from the undo record.  Also, we
+                     * ensure to clear the transaction information from an
+                     * unused itemid.
+                     */
+                    if (!ItemIdIsUsed(itemid))
+                    {
+                        /* This must be an unused entry that has xact information. */
+                        Assert(ItemIdHasPendingXact(itemid));
+
+                        /*
+                         * The pending xact is committed.  So, clear the
+                         * pending xact and transaction slot information from
+                         * the itemid.
+ */ + ItemIdSetUnused(itemid); + } + else if (ItemIdIsDeleted(itemid)) + ItemIdSetInvalidXact(itemid); + else + { + tup_hdr = (ZHeapTupleHeader) PageGetItem(page, itemid); + tup_hdr->t_infomask |= ZHEAP_INVALID_XACT_SLOT; + } + break; + } + break; + } + } + } +} + +/* + * GetCompletedSlotOffsets + * + * Find all the tuples pointing to the transaction slots for committed + * transactions. + */ +void +GetCompletedSlotOffsets(Page page, int nCompletedXactSlots, + int *completed_slots, + OffsetNumber *offset_completed_slots, + int *numOffsets) +{ + int noffsets = 0; + OffsetNumber offnum, maxoff; + + maxoff = PageGetMaxOffsetNumber(page); + + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ZHeapTupleHeader tup_hdr; + ItemId itemid; + int i, trans_slot; + + itemid = PageGetItemId(page, offnum); + + if (ItemIdIsDead(itemid)) + continue; + + if (!ItemIdIsUsed(itemid)) + { + if (!ItemIdHasPendingXact(itemid)) + continue; + trans_slot = ItemIdGetTransactionSlot(itemid); + } + else if (ItemIdIsDeleted(itemid)) + { + if ((ItemIdGetVisibilityInfo(itemid) & ITEMID_XACT_INVALID)) + continue; + trans_slot = ItemIdGetTransactionSlot(itemid); + } + else + { + tup_hdr = (ZHeapTupleHeader) PageGetItem(page, itemid); + if (ZHeapTupleHasInvalidXact(tup_hdr->t_infomask)) + continue; + trans_slot = ZHeapTupleHeaderGetXactSlot(tup_hdr); + } + + for (i = 0; i < nCompletedXactSlots; i++) + { + /* + * we don't need to include the tuples that have not changed + * since the last time as the special undo record for them can + * be found in the undo chain of their present slot. + */ + if (trans_slot == completed_slots[i]) + { + offset_completed_slots[noffsets++] = offnum; + break; + } + } + } + + *numOffsets = noffsets; +} + +/* + * PageFreezeTransSlots - Make the transaction slots available for reuse. + * + * This function tries to free up some existing transaction slots so that + * they can be reused. To reuse the slot, it needs to ensure one of the below + * conditions: + * (a) the xid is committed, all-visible and doesn't have pending rollback + * to perform. + * (b) if the xid is committed, then ensure to mark a special flag on the + * tuples that are modified by that xid on the current page. + * (c) if the xid is rolledback, then ensure that rollback is performed or + * at least undo actions for this page have been replayed. + * + * For committed/aborted transactions, we simply clear the xid from the + * transaction slot and undo record pointer is kept as it is to ensure that + * we don't break the undo chain for that slot. We also mark the tuples that + * are modified by committed xid with a special flag indicating that slot for + * this tuple is reused. The special flag is just an indication that the + * transaction information of the transaction that has modified the tuple can + * be retrieved from the undo. + * + * If we don't do so, then after that slot got reused for some other + * unrelated transaction, it might become tricky to traverse the undo chain. + * In such a case, it is quite possible that the particular tuple has not + * been modified, but it is still pointing to transaction slot which has been + * reused by new transaction and that transaction is still not committed. + * During the visibility check for such a tuple, it can appear that the tuple + * is modified by current transaction which is clearly wrong and can lead to + * wrong results. 
One such case would be when we try to fetch the commandid + * for that tuple to check the visibility, it will fetch the commandid for a + * different transaction that is already committed. + * + * The basic principle used here is to ensure that we can always fetch the + * transaction information of tuple until it is frozen (committed and + * all-visible). + * + * This also ensures that we are consistent with how other operations work in + * zheap i.e the tuple always reflect the current state. + * + * We don't need any special handling for the tuples that are locked by + * multiple transactions (aka tuples that have MULTI_LOCKERS bit set). + * Basically, we always maintain either strongest lockers or latest lockers + * (when all the lockers are of same mode) transaction slot on the tuple. + * In either case, we should be able to detect the visibility of tuple based + * on the latest locker information. + * + * This function assumes that the caller already has Exclusive lock on the + * buffer. + * + * This function returns true if it manages to free some transaction slot, + * false otherwise. + */ +bool +PageFreezeTransSlots(Relation relation, Buffer buf, bool *lock_reacquired, + TransInfo *transinfo, int num_slots) +{ + uint64 oldestXidWithEpochHavingUndo; + int slot_no; + int *frozen_slots = NULL; + int nFrozenSlots = 0; + int *completed_xact_slots = NULL; + uint16 nCompletedXactSlots = 0; + int *aborted_xact_slots = NULL; + int nAbortedXactSlots = 0; + bool TPDSlot; + Page page; + bool result = false; + + page = BufferGetPage(buf); + + /* + * If the num_slots is 0 then the caller wants to freeze the page slots so + * get the transaction slots information from the page. + */ + if (num_slots == 0) + { + PageHeader phdr; + ZHeapPageOpaque opaque; + + phdr = (PageHeader) page; + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + if (ZHeapPageHasTPDSlot(phdr)) + num_slots = ZHEAP_PAGE_TRANS_SLOTS - 1; + else + num_slots = ZHEAP_PAGE_TRANS_SLOTS; + + transinfo = opaque->transinfo; + TPDSlot = false; + } + else + { + Assert(num_slots > 0); + TPDSlot = true; + } + + oldestXidWithEpochHavingUndo = pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo); + + frozen_slots = palloc0(num_slots * sizeof(int)); + + /* + * Clear the slot information from tuples. The basic idea is to collect + * all the transaction slots that can be cleared. Then traverse the page + * to see if any tuple has marking for any of the slots, if so, just clear + * the slot information from the tuple. + * + * For temp relations, we can freeze the first slot since no other backend + * can access the same relation. + */ + if (RELATION_IS_LOCAL(relation)) + frozen_slots[nFrozenSlots++] = 0; + else + { + for (slot_no = 0; slot_no < num_slots; slot_no++) + { + uint64 slot_xid_epoch = transinfo[slot_no].xid_epoch; + TransactionId slot_xid = transinfo[slot_no].xid; + + /* + * Transaction slot can be considered frozen if it belongs to previous + * epoch or transaction id is old enough that it is all visible. + */ + slot_xid_epoch = MakeEpochXid(slot_xid_epoch, slot_xid); + + if (slot_xid_epoch < oldestXidWithEpochHavingUndo) + frozen_slots[nFrozenSlots++] = slot_no; + } + } + + if (nFrozenSlots > 0) + { + TransactionId latestxid = InvalidTransactionId; + int i; + int slot_no; + + + START_CRIT_SECTION(); + + /* clear the transaction slot info on tuples */ + zheap_freeze_or_invalidate_tuples(buf, nFrozenSlots, frozen_slots, + true, TPDSlot); + + /* Initialize the frozen slots. 
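The freeze test above compares a slot's transaction against oldestXidWithEpochHavingUndo using a 64-bit epoch-qualified transaction id. Assuming MakeEpochXid packs the epoch into the high 32 bits and the xid into the low 32 bits, which is what this comparison relies on, the check looks like the sketch below.

    #include <stdbool.h>
    #include <stdint.h>

    /* 64-bit epoch-qualified xid: epoch in the high half, xid in the low half. */
    uint64_t mini_make_epoch_xid(uint32_t epoch, uint32_t xid)
    {
        return ((uint64_t) epoch << 32) | xid;
    }

    /* A slot is freezable once its transaction precedes the oldest xid having undo. */
    bool mini_slot_is_freezable(uint32_t slot_epoch, uint32_t slot_xid,
                                uint64_t oldest_xid_with_epoch_having_undo)
    {
        return mini_make_epoch_xid(slot_epoch, slot_xid) <
               oldest_xid_with_epoch_having_undo;
    }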
*/ + if (TPDSlot) + { + for (i = 0; i < nFrozenSlots; i++) + { + int tpd_slot_id; + + slot_no = frozen_slots[i]; + + /* Remember the latest xid. */ + if (TransactionIdFollows(transinfo[slot_no].xid, latestxid)) + latestxid = transinfo[slot_no].xid; + + /* Calculate the actual slot no. */ + tpd_slot_id = slot_no + ZHEAP_PAGE_TRANS_SLOTS + 1; + + /* Initialize the TPD slot. */ + TPDPageSetTransactionSlotInfo(buf, tpd_slot_id, 0, + InvalidTransactionId, + InvalidUndoRecPtr); + } + } + else + { + for (i = 0; i < nFrozenSlots; i++) + { + slot_no = frozen_slots[i]; + + /* Remember the latest xid. */ + if (TransactionIdFollows(transinfo[slot_no].xid, latestxid)) + latestxid = transinfo[slot_no].xid; + + transinfo[slot_no].xid_epoch = 0; + transinfo[slot_no].xid = InvalidTransactionId; + transinfo[slot_no].urec_ptr = InvalidUndoRecPtr; + } + } + + MarkBufferDirty(buf); + + /* + * Xlog Stuff + * + * Log all the frozen_slots number for which we need to clear the + * transaction slot information. Also, note down the latest xid + * corresponding to the frozen slots. This is required to ensure that + * no standby query conflicts with the frozen xids. + */ + if (RelationNeedsWAL(relation)) + { + xl_zheap_freeze_xact_slot xlrec = {0}; + XLogRecPtr recptr; + + XLogBeginInsert(); + + xlrec.nFrozen = nFrozenSlots; + xlrec.lastestFrozenXid = latestxid; + + XLogRegisterData((char *) &xlrec, SizeOfZHeapFreezeXactSlot); + + /* + * Ideally we need the frozen slots information when WAL needs to be + * applied on the page, but in case of the TPD slots freeze we need + * the frozen slot information for both heap page as well as for the + * TPD page. So the problem is that if we register with any one of + * the buffer it might happen that the data did not registered due + * to fpw of that buffer but we need that data for another buffer. + */ + XLogRegisterData((char *) frozen_slots, nFrozenSlots * sizeof(int)); + XLogRegisterBuffer(0, buf, REGBUF_STANDARD); + if (TPDSlot) + RegisterTPDBuffer(page, 1); + + recptr = XLogInsert(RM_ZHEAP_ID, XLOG_ZHEAP_FREEZE_XACT_SLOT); + PageSetLSN(page, recptr); + + if (TPDSlot) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + result = true; + goto cleanup; + } + + Assert(!RELATION_IS_LOCAL(relation)); + completed_xact_slots = palloc0(num_slots * sizeof(int)); + aborted_xact_slots = palloc0(num_slots * sizeof(int)); + + /* + * Try to reuse transaction slots of committed/aborted transactions. This + * is just like above but it will maintain a link to the previous + * transaction undo record in this slot. This is to ensure that if there + * is still any alive snapshot to which this transaction is not visible, + * it can fetch the record from undo and check the visibility. + */ + for (slot_no = 0; slot_no < num_slots; slot_no++) + { + if (!TransactionIdIsInProgress(transinfo[slot_no].xid)) + { + if (TransactionIdDidCommit(transinfo[slot_no].xid)) + completed_xact_slots[nCompletedXactSlots++] = slot_no; + else + aborted_xact_slots[nAbortedXactSlots++] = slot_no; + } + } + + if (nCompletedXactSlots > 0) + { + int i; + int slot_no; + + + START_CRIT_SECTION(); + + /* clear the transaction slot info on tuples */ + zheap_freeze_or_invalidate_tuples(buf, nCompletedXactSlots, + completed_xact_slots, false, TPDSlot); + + /* + * Clear the xid information from the slot but keep the undo record + * pointer as it is so that undo records of the transaction are + * accessible by traversing slot's undo chain even though the slots + * are reused. 
+ */ + if (TPDSlot) + { + for (i = 0; i < nCompletedXactSlots; i++) + { + int tpd_slot_id; + + slot_no = completed_xact_slots[i]; + /* calculate the actual slot no. */ + tpd_slot_id = slot_no + ZHEAP_PAGE_TRANS_SLOTS + 1; + + /* Clear xid from the TPD slot but keep the urec_ptr intact. */ + TPDPageSetTransactionSlotInfo(buf, tpd_slot_id, 0, + InvalidTransactionId, + transinfo[slot_no].urec_ptr); + } + } + else + { + for (i = 0; i < nCompletedXactSlots; i++) + { + slot_no = completed_xact_slots[i]; + transinfo[slot_no].xid_epoch = 0; + transinfo[slot_no].xid = InvalidTransactionId; + } + } + MarkBufferDirty(buf); + + /* + * Xlog Stuff + */ + if (RelationNeedsWAL(relation)) + { + XLogRecPtr recptr; + + XLogBeginInsert(); + + + /* See comments while registering frozen slot. */ + XLogRegisterData((char *) &nCompletedXactSlots, sizeof(uint16)); + XLogRegisterData((char *) completed_xact_slots, nCompletedXactSlots * sizeof(int)); + + XLogRegisterBuffer(0, buf, REGBUF_STANDARD); + + if (TPDSlot) + RegisterTPDBuffer(page, 1); + + recptr = XLogInsert(RM_ZHEAP_ID, XLOG_ZHEAP_INVALID_XACT_SLOT); + PageSetLSN(page, recptr); + + if (TPDSlot) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + result = true; + goto cleanup; + } + else if (nAbortedXactSlots) + { + int i; + int slot_no; + UndoRecPtr *urecptr = palloc(nAbortedXactSlots * sizeof(UndoRecPtr)); + TransactionId *xid = palloc(nAbortedXactSlots * sizeof(TransactionId)); + uint32 *epoch = palloc(nAbortedXactSlots * sizeof(uint32)); + + /* Collect slot information before releasing the lock. */ + for (i = 0; i < nAbortedXactSlots; i++) + { + urecptr[i] = transinfo[aborted_xact_slots[i]].urec_ptr; + xid[i] = transinfo[aborted_xact_slots[i]].xid; + epoch[i] = transinfo[aborted_xact_slots[i]].xid_epoch; + } + + /* + * We need to release and the lock before applying undo actions for a + * page as we might need to traverse the long undo chain for a page. + */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + /* + * Instead of just unlocking the TPD buffer like heap buffer its ok to + * unlock and release, because next time while trying to reserve the + * slot if we get the slot in TPD then anyway we will pin it again. + */ + if (TPDSlot) + UnlockReleaseTPDBuffers(); + + for (i = 0; i < nAbortedXactSlots; i++) + { + slot_no = aborted_xact_slots[i] + 1; + process_and_execute_undo_actions_page(urecptr[i], + relation, + buf, + epoch[i], + xid[i], + slot_no); + } + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + *lock_reacquired = true; + pfree(urecptr); + pfree(xid); + pfree(epoch); + + result = true; + goto cleanup; + } + +cleanup: + if (frozen_slots != NULL) + pfree(frozen_slots); + if (completed_xact_slots != NULL) + pfree(completed_xact_slots); + if (aborted_xact_slots != NULL) + pfree(aborted_xact_slots); + + return result; +} + +/* + * ZHeapTupleGetCid - Retrieve command id from tuple's undo record. + * + * It is expected that the caller of this function has atleast read lock + * on the buffer. + */ +CommandId +ZHeapTupleGetCid(ZHeapTuple zhtup, Buffer buf, UndoRecPtr urec_ptr, + int trans_slot_id) +{ + UnpackedUndoRecord *urec; + UndoRecPtr undo_rec_ptr; + CommandId current_cid; + TransactionId xid; + uint64 epoch_xid; + uint32 epoch; + bool TPDSlot = true; + int out_slot_no; + + /* + * For undo tuple caller will pass the valid slot id otherwise we can get it + * directly from the tuple. 
+ */ + if (trans_slot_id == InvalidXactSlotId) + { + trans_slot_id = ZHeapTupleHeaderGetXactSlot(zhtup->t_data); + TPDSlot = false; + } + + /* + * If urec_ptr is not provided, fetch the latest undo pointer from the page. + */ + if (!UndoRecPtrIsValid(urec_ptr)) + { + out_slot_no = GetTransactionSlotInfo(buf, + ItemPointerGetOffsetNumber(&zhtup->t_self), + trans_slot_id, + &epoch, + &xid, + &undo_rec_ptr, + true, + TPDSlot); + } + else + { + out_slot_no = GetTransactionSlotInfo(buf, + ItemPointerGetOffsetNumber(&zhtup->t_self), + trans_slot_id, + &epoch, + &xid, + NULL, + true, + TPDSlot); + undo_rec_ptr = urec_ptr; + } + + if (out_slot_no == ZHTUP_SLOT_FROZEN) + return InvalidCommandId; + + epoch_xid = (uint64 ) epoch; + epoch_xid = MakeEpochXid(epoch_xid, xid); + + if (epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return InvalidCommandId; + + Assert(UndoRecPtrIsValid(undo_rec_ptr)); + urec = UndoFetchRecord(undo_rec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + if (urec == NULL) + return InvalidCommandId; + + current_cid = urec->uur_cid; + + UndoRecordRelease(urec); + + return current_cid; +} + +/* + * ZHeapTupleGetCtid - Retrieve tuple id from tuple's undo record. + * + * It is expected that caller of this function has atleast read lock + * on the buffer and we call it only for non-inplace-updated tuples. + */ +void +ZHeapTupleGetCtid(ZHeapTuple zhtup, Buffer buf, UndoRecPtr urec_ptr, + ItemPointer ctid) +{ + *ctid = zhtup->t_self; + ZHeapPageGetCtid(ZHeapTupleHeaderGetXactSlot(zhtup->t_data), buf, + urec_ptr, ctid); +} + +/* + * ZHeapTupleGetSubXid - Retrieve subtransaction id from tuple's undo record. + * + * It is expected that caller of this function has atleast read lock. + * + * Note that we don't handle ZHEAP_INVALID_XACT_SLOT as this function is only + * called for in-progress transactions. If we need to call it for some other + * purpose, then we might need to deal with ZHEAP_INVALID_XACT_SLOT. + */ +void +ZHeapTupleGetSubXid(ZHeapTuple zhtup, Buffer buf, UndoRecPtr urec_ptr, + SubTransactionId *subxid) +{ + UnpackedUndoRecord *urec; + + *subxid = InvalidSubTransactionId; + + Assert(UndoRecPtrIsValid(urec_ptr)); + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* + * We mostly expect urec here to be valid as it try to fetch + * subtransactionid of tuples that are visible to the snapshot, so + * corresponding undo record can't be discarded. + * + * In case when it is called while index creation, it might be possible + * that the transaction that updated the tuple is committed and is not + * present the calling transaction's snapshot (it uses snapshotany while + * index creation), hence undo is discarded. + */ + if (urec == NULL) + return; + + if (urec->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SUBXACT) + { + Assert(urec->uur_payload.len > 0); + + /* + * For UNDO_UPDATE, we first store the CTID, then transaction slot + * and after that subtransaction id in payload. For + * UNDO_XID_LOCK_ONLY, we first store the Lockmode, then transaction + * slot and after that subtransaction id. So retrieve accordingly. 
+ */ + if (urec->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + { + if (urec->uur_type == UNDO_UPDATE) + *subxid = *(int *) ((char *) urec->uur_payload.data + + sizeof(ItemPointerData) + sizeof(TransactionId)); + else if (urec->uur_type == UNDO_XID_LOCK_ONLY || + urec->uur_type == UNDO_XID_LOCK_FOR_UPDATE || + urec->uur_type == UNDO_XID_MULTI_LOCK_ONLY) + *subxid = *(int *) ((char *) urec->uur_payload.data + + sizeof(LockTupleMode) + sizeof(TransactionId)); + else + *subxid = *(int *) ((char *) urec->uur_payload.data + + sizeof(TransactionId)); + } + else + { + if (urec->uur_type == UNDO_UPDATE) + *subxid = *(int *) ((char *) urec->uur_payload.data + + sizeof(ItemPointerData)); + else if (urec->uur_type == UNDO_XID_LOCK_ONLY || + urec->uur_type == UNDO_XID_LOCK_FOR_UPDATE || + urec->uur_type == UNDO_XID_MULTI_LOCK_ONLY) + *subxid = *(int *) ((char *) urec->uur_payload.data + + sizeof(LockTupleMode)); + else + *subxid = *(SubTransactionId *) urec->uur_payload.data; + } + } + + UndoRecordRelease(urec); +} + +/* + * ZHeapTupleGetSpecToken - Retrieve speculative token from tuple's undo + * record. + * + * It is expected that caller of this function has atleast read lock + * on the buffer. + */ +void +ZHeapTupleGetSpecToken(ZHeapTuple zhtup, Buffer buf, UndoRecPtr urec_ptr, + uint32 *specToken) +{ + UnpackedUndoRecord *urec; + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* + * We always expect urec to be valid as it try to fetch speculative token + * of tuples for which inserting transaction hasn't been committed. So, + * corresponding undo record can't be discarded. + */ + Assert(urec); + + *specToken = *(uint32 *) urec->uur_payload.data; + + UndoRecordRelease(urec); +} + +/* + * ZHeapTupleGetTransInfo - Retrieve transaction information of transaction + * that has modified the tuple. + * + * nobuflock indicates whether caller has lock on the buffer 'buf'. If nobuflock + * is false, we rely on the supplied tuple zhtup to fetch the slot and undo + * information. Otherwise, we take buffer lock and fetch the actual tuple. + */ +void +ZHeapTupleGetTransInfo(ZHeapTuple zhtup, Buffer buf, int *trans_slot, + uint64 *epoch_xid_out, TransactionId *xid_out, + CommandId *cid_out, UndoRecPtr *urec_ptr_out, + bool nobuflock) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + UndoRecPtr urec_ptr; + uint64 epoch; + uint32 tmp_epoch; + TransactionId xid = InvalidTransactionId; + CommandId cid; + ItemId lp; + Page page; + ItemPointer tid = &(zhtup->t_self); + int trans_slot_id; + OffsetNumber offnum = ItemPointerGetOffsetNumber(tid); + bool is_invalid_slot = false; + + /* + * We are going to access special space in the page to retrieve the + * transaction information and that requires share lock on buffer. + */ + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_SHARE); + + page = BufferGetPage(buf); + lp = PageGetItemId(page, offnum); + Assert(ItemIdIsNormal(lp) || ItemIdIsDeleted(lp)); + if (!ItemIdIsDeleted(lp)) + { + if (nobuflock) + { + /* + * If the tuple is updated such that its transaction slot has + * been changed, then we will never be able to get the correct + * tuple from undo. To avoid, that we get the latest tuple from + * page rather than relying on it's in-memory copy. 
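The payload parsing above is plain offset arithmetic: skip whatever was stored ahead of the subtransaction id (a CTID for updates or a lock mode for lockers, plus an optional transaction slot), then read it. A generic, self-contained version follows; the byte counts are parameters here rather than the real sizeof(ItemPointerData)/sizeof(TransactionId), and memcpy is used to avoid the unaligned reads that the direct pointer casts above assume are safe.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * Read the subtransaction id stored after 'leading_bytes' of payload and,
     * optionally, a transaction slot of 'slot_bytes' bytes.
     */
    uint32_t mini_read_subxid(const unsigned char *payload, size_t leading_bytes,
                              bool has_slot, size_t slot_bytes)
    {
        size_t   off = leading_bytes + (has_slot ? slot_bytes : 0);
        uint32_t subxid;

        memcpy(&subxid, payload + off, sizeof(subxid));    /* avoids unaligned reads */
        return subxid;
    }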
+ */ + zhtup->t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup->t_len = ItemIdGetLength(lp); + tuple = zhtup->t_data; + } + trans_slot_id = ZHeapTupleHeaderGetXactSlot(tuple); + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + goto slot_is_frozen; + trans_slot_id = GetTransactionSlotInfo(buf, offnum, trans_slot_id, + &tmp_epoch, &xid, &urec_ptr, + true, false); + /* + * It is quite possible that the item is showing some + * valid transaction slot, but actual slot has been frozen. + * This can happen when the slot belongs to TPD entry and + * the corresponding TPD entry is pruned. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + goto slot_is_frozen; + if (ZHeapTupleHasInvalidXact(tuple->t_infomask)) + is_invalid_slot = true; + } + else + { + /* + * If it's deleted and pruned, we fetch the slot and undo information + * from the item pointer itself. + */ + trans_slot_id = ItemIdGetTransactionSlot(lp); + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + goto slot_is_frozen; + trans_slot_id = GetTransactionSlotInfo(buf, offnum, trans_slot_id, + &tmp_epoch, &xid, &urec_ptr, + true, false); + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + goto slot_is_frozen; + if (ItemIdGetVisibilityInfo(lp) & ITEMID_XACT_INVALID) + is_invalid_slot = true; + } + + /* + * We need to fetch all the transaction related information from undo + * record for the tuples that point to a slot that gets invalidated for + * reuse at some point of time. See PageFreezeTransSlots. + */ + if (is_invalid_slot) + { + xid = InvalidTransactionId; + FetchTransInfoFromUndo(zhtup, &epoch, &xid, &cid, &urec_ptr, false); + } + else if (ZHeapTupleHasMultiLockers(tuple->t_infomask)) + { + /* + * When we take a lock on the tuple, we never set locker's slot on the + * tuple. However, we use the newly computed infomask for the tuple + * and write its current infomask in undo due to which + * INVALID_XACT_SLOT bit of the tuple will move to undo. In such + * cases, if we need the previous inserter/updater's transaction id, + * we've to skip locker's undo records. + */ + xid = InvalidTransactionId; + FetchTransInfoFromUndo(zhtup, &epoch, &xid, &cid, &urec_ptr, true); + } + else + { + if(cid_out && TransactionIdIsCurrentTransactionId(xid)) + { + lp = PageGetItemId(page, offnum); + if (!ItemIdIsDeleted(lp)) + cid = ZHeapTupleGetCid(zhtup, buf, InvalidUndoRecPtr, InvalidXactSlotId); + else + cid = ZHeapPageGetCid(buf, trans_slot_id, tmp_epoch, xid, + urec_ptr, offnum); + } + epoch = (uint64) tmp_epoch; + } + + goto done; + +slot_is_frozen: + trans_slot_id = ZHTUP_SLOT_FROZEN; + epoch = 0; + xid = InvalidTransactionId; + cid = InvalidCommandId; + urec_ptr = InvalidUndoRecPtr; + +done: + /* Set the value of required parameters. */ + if (trans_slot) + *trans_slot = trans_slot_id; + if (epoch_xid_out) + *epoch_xid_out = MakeEpochXid(epoch, xid); + if (xid_out) + *xid_out = xid; + if (cid_out) + *cid_out = cid; + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + if (urec_ptr_out) + *urec_ptr_out = urec_ptr; + + return; +} + +/* + * ZHeapPageGetCid - Retrieve command id from tuple's undo record. + * + * This is similar to ZHeapTupleGetCid with a difference that here we use + * transaction slot to fetch the appropriate undo record. It is expected that + * the caller of this function has atleast read lock on the buffer. 
+ */ +CommandId +ZHeapPageGetCid(Buffer buf, int trans_slot, uint32 epoch, TransactionId xid, + UndoRecPtr urec_ptr, OffsetNumber off) +{ + UnpackedUndoRecord *urec; + CommandId current_cid; + uint64 epoch_xid; + + epoch_xid = (uint64) epoch; + epoch_xid = MakeEpochXid(epoch_xid, xid); + + if (epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return InvalidCommandId; + + urec = UndoFetchRecord(urec_ptr, + BufferGetBlockNumber(buf), + off, + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + if (urec == NULL) + return InvalidCommandId; + + current_cid = urec->uur_cid; + + UndoRecordRelease(urec); + + return current_cid; +} + + +/* + * ZHeapPageGetCtid - Retrieve tuple id from tuple's undo record. + * + * It is expected that caller of this function has atleast read lock. + */ +void +ZHeapPageGetCtid(int trans_slot, Buffer buf, UndoRecPtr urec_ptr, + ItemPointer ctid) +{ + UnpackedUndoRecord *urec; + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(ctid), + ItemPointerGetOffsetNumber(ctid), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* + * We always expect urec here to be valid as it try to fetch ctid of + * tuples that are visible to the snapshot, so corresponding undo record + * can't be discarded. + */ + Assert(urec); + + /* + * The tuple should be deleted/updated previously. Else, the caller should + * not be calling this function. + */ + Assert(urec->uur_type == UNDO_DELETE || urec->uur_type == UNDO_UPDATE); + + /* + * For a deleted tuple, ctid refers to self. + */ + if (urec->uur_type != UNDO_DELETE) + { + Assert(urec->uur_payload.len > 0); + *ctid = *(ItemPointer) urec->uur_payload.data; + } + + UndoRecordRelease(urec); +} + + +/* + * ValidateTuplesXact - Check if the tuple is modified by priorXmax. + * + * We need to traverse the undo chain of tuple to see if any of its + * prior version is modified by priorXmax. + * + * nobuflock indicates whether caller has lock on the buffer 'buf'. + */ +bool +ValidateTuplesXact(ZHeapTuple tuple, Snapshot snapshot, Buffer buf, + TransactionId priorXmax, bool nobuflock) +{ + ZHeapTupleData zhtup; + UnpackedUndoRecord *urec = NULL; + UndoRecPtr urec_ptr; + ZHeapTuple undo_tup = NULL; + ItemPointer tid = &(tuple->t_self); + ItemId lp; + Page page; + TransactionId xid; + TransactionId prev_undo_xid = InvalidTransactionId; + uint32 epoch; + int trans_slot_id = InvalidXactSlotId; + int prev_trans_slot_id; + OffsetNumber offnum; + bool valid = false; + + /* + * As we are going to access special space in the page to retrieve the + * transaction information share lock on buffer is required. + */ + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_SHARE); + + page = BufferGetPage(buf); + offnum = ItemPointerGetOffsetNumber(tid); + lp = PageGetItemId(page, offnum); + + zhtup.t_tableOid = tuple->t_tableOid; + zhtup.t_self = *tid; + + if(ItemIdIsDead(lp) || !ItemIdHasStorage(lp)) + { + /* + * If the tuple is already removed by Rollbacks/pruning, then we + * don't need to proceed further. + */ + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + return false; + } + else if (!ItemIdIsDeleted(lp)) + { + /* + * If the tuple is updated such that its transaction slot has been + * changed, then we will never be able to get the correct tuple from undo. + * To avoid, that we get the latest tuple from page rather than relying on + * it's in-memory copy. 
+ */ + zhtup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zhtup.t_len = ItemIdGetLength(lp); + trans_slot_id = ZHeapTupleHeaderGetXactSlot(zhtup.t_data); + trans_slot_id = GetTransactionSlotInfo(buf, offnum, trans_slot_id, + &epoch, &xid, &urec_ptr, true, + false); + } + else + { + ZHeapTuple vis_tuple; + trans_slot_id = ItemIdGetTransactionSlot(lp); + trans_slot_id = GetTransactionSlotInfo(buf, offnum, trans_slot_id, + &epoch, &xid, &urec_ptr, true, + false); + + /* + * XXX for now we shall get a visible undo tuple for the given + * dirty snapshot. The tuple data is needed below in + * CopyTupleFromUndoRecord and some undo records will not have + * tuple data and mask info with them. + * */ + vis_tuple = ZHeapGetVisibleTuple(ItemPointerGetOffsetNumber(tid), + snapshot, buf, NULL); + Assert(vis_tuple != NULL); + zhtup.t_data = vis_tuple->t_data; + zhtup.t_len = vis_tuple->t_len; + } + + /* + * Current xid on tuple must not precede oldestXidHavingUndo as it + * will be greater than priorXmax which was not visible to our + * snapshot. + */ + Assert(trans_slot_id != ZHTUP_SLOT_FROZEN); + + if (TransactionIdEquals(xid, priorXmax)) + { + valid = true; + goto tuple_is_valid; + } + + undo_tup = &zhtup; + + /* + * Current xid on tuple must not precede RecentGlobalXmin as it will be + * greater than priorXmax which was not visible to our snapshot. + */ + Assert(TransactionIdEquals(xid, InvalidTransactionId) || + !TransactionIdPrecedes(xid, RecentGlobalXmin)); + + do + { + prev_trans_slot_id = trans_slot_id; + Assert(prev_trans_slot_id != ZHTUP_SLOT_FROZEN); + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&undo_tup->t_self), + ItemPointerGetOffsetNumber(&undo_tup->t_self), + prev_undo_xid, + NULL, + ZHeapSatisfyUndoRecord); + + /* + * As we still hold a snapshot to which priorXmax is not visible, neither + * the transaction slot on tuple can be marked as frozen nor the + * corresponding undo be discarded. + */ + Assert(urec != NULL); + + if (TransactionIdEquals(urec->uur_xid, priorXmax)) + { + valid = true; + goto tuple_is_valid; + } + + /* don't free the tuple passed by caller */ + undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id, NULL, + (undo_tup) == (&zhtup) ? false : true, + page); + + Assert(!TransactionIdPrecedes(urec->uur_prevxid, RecentGlobalXmin)); + + prev_undo_xid = urec->uur_prevxid; + + /* + * Change the undo chain if the undo tuple is stamped with the different + * transaction slot. + */ + if (prev_trans_slot_id != trans_slot_id) + { + trans_slot_id = GetTransactionSlotInfo(buf, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &urec_ptr, + true, + true); + } + else + urec_ptr = urec->uur_blkprev; + + UndoRecordRelease(urec); + urec = NULL; + } while (UndoRecPtrIsValid(urec_ptr)); + +tuple_is_valid: + if (urec) + UndoRecordRelease(urec); + if (undo_tup && undo_tup != &zhtup) + pfree(undo_tup); + + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + return valid; +} + +/* + * Initialize zheap page. + */ +void +ZheapInitPage(Page page, Size pageSize) +{ + ZHeapPageOpaque opaque; + int i; + + /* + * The size of the opaque space depends on the number of transaction + * slots in a page. We set it to default here. 
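The do-while loop in ValidateTuplesXact above is, at its core, a walk down the tuple's undo chain until a version stamped with priorXmax is found or the chain ends. The toy model below captures just that shape with an in-memory linked list; the real code additionally copies each older tuple version out of the undo record and may switch to a different slot's chain.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* A toy undo record: the xid that created it and a link to the older record. */
    typedef struct MiniUndoRec
    {
        uint32_t xid;
        const struct MiniUndoRec *prev;    /* plays the role of uur_blkprev */
    } MiniUndoRec;

    /* Walk the chain until 'prior_xmax' is found or the undiscarded undo runs out. */
    bool mini_chain_contains_xid(const MiniUndoRec *rec, uint32_t prior_xmax)
    {
        for (; rec != NULL; rec = rec->prev)
        {
            if (rec->xid == prior_xmax)
                return true;
        }
        return false;
    }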
+ */ + PageInit(page, pageSize, ZHEAP_PAGE_TRANS_SLOTS * sizeof(TransInfo)); + + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + for (i = 0; i < ZHEAP_PAGE_TRANS_SLOTS; i++) + { + opaque->transinfo[i].xid_epoch = 0; + opaque->transinfo[i].xid = InvalidTransactionId; + opaque->transinfo[i].urec_ptr = InvalidUndoRecPtr; + } +} + +/* + * zheap_init_meta_page - Initialize the metapage. + */ +void +zheap_init_meta_page(Buffer metabuf, BlockNumber first_blkno, + BlockNumber last_blkno) +{ + ZHeapMetaPage metap; + Page page; + + page = BufferGetPage(metabuf); + PageInit(page, BufferGetPageSize(metabuf), 0); + + metap = ZHeapPageGetMeta(page); + metap->zhm_magic = ZHEAP_MAGIC; + metap->zhm_version = ZHEAP_VERSION; + metap->zhm_first_used_tpd_page = first_blkno; + metap->zhm_last_used_tpd_page = last_blkno; + + /* + * Set pd_lower just past the end of the metadata. This is essential, + * because without doing so, metadata will be lost if xlog.c compresses + * the page. + */ + ((PageHeader) page)->pd_lower = + ((char *) metap + sizeof(ZHeapMetaPageData)) - (char *) page; +} + +/* + * ZheapInitMetaPage - Allocate and initialize the zheap metapage. + */ +void +ZheapInitMetaPage(Relation rel, ForkNumber forkNum) +{ + Buffer buf; + bool use_wal; + + buf = ReadBufferExtended(rel, forkNum, P_NEW, RBM_NORMAL, NULL); + if (BufferGetBlockNumber(buf) != ZHEAP_METAPAGE) + elog(ERROR, "unexpected zheap relation size: %u, should be %u", + BufferGetBlockNumber(buf), ZHEAP_METAPAGE); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + START_CRIT_SECTION(); + + zheap_init_meta_page(buf, InvalidBlockNumber, InvalidBlockNumber); + MarkBufferDirty(buf); + + /* + * WAL log creation of metapage if the relation is persistent, or this is the + * init fork. Init forks for unlogged relations always need to be WAL + * logged. + */ + use_wal = RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM; + + if (use_wal) + log_newpage_buffer(buf, true); + + END_CRIT_SECTION(); + + UnlockReleaseBuffer(buf); +} + +/* + * ----------- + * Zheap scan related API's. + * ----------- + */ + +/* + * zinitscan - same as initscan except for tuple initialization + */ +static void +zinitscan(ZHeapScanDesc scan, ScanKey key, bool keep_startblock) +{ + bool allow_strat; + bool allow_sync; + + /* + * Determine the number of blocks we have to scan. + * + * It is sufficient to do this once at scan start, since any tuples added + * while the scan is in progress will be invisible to my snapshot anyway. + * (That is not true when using a non-MVCC snapshot. However, we couldn't + * guarantee to return tuples added after scan start anyway, since they + * might go into pages we already scanned. To guarantee consistent + * results for a non-MVCC snapshot, the caller must hold some higher-level + * lock that ensures the interesting tuple(s) won't change.) + */ + if (scan->rs_scan.rs_parallel != NULL) + scan->rs_scan.rs_nblocks = scan->rs_scan.rs_parallel->phs_nblocks; + else + scan->rs_scan.rs_nblocks = RelationGetNumberOfBlocks(scan->rs_scan.rs_rd); + + /* + * If the table is large relative to NBuffers, use a bulk-read access + * strategy and enable synchronized scanning (see syncscan.c). Although + * the thresholds for these features could be different, we make them the + * same so that there are only two behaviors to tune rather than four. + * (However, some callers need to be able to disable one or both of these + * behaviors, independently of the size of the table; also there is a GUC + * variable that can disable synchronized scanning.) 
+ * + * Note that heap_parallelscan_initialize has a very similar test; if you + * change this, consider changing that one, too. + */ + if (!RelationUsesLocalBuffers(scan->rs_scan.rs_rd) && + scan->rs_scan.rs_nblocks > NBuffers / 4) + { + allow_strat = scan->rs_scan.rs_allow_strat; + allow_sync = scan->rs_scan.rs_allow_sync; + } + else + allow_strat = allow_sync = false; + + if (allow_strat) + { + /* During a rescan, keep the previous strategy object. */ + if (scan->rs_strategy == NULL) + scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD); + } + else + { + if (scan->rs_strategy != NULL) + FreeAccessStrategy(scan->rs_strategy); + scan->rs_strategy = NULL; + } + + if (scan->rs_scan.rs_parallel != NULL) + { + /* For parallel scan, believe whatever ParallelHeapScanDesc says. */ + scan->rs_scan.rs_syncscan = scan->rs_scan.rs_parallel->phs_syncscan; + } + else if (keep_startblock) + { + /* + * When rescanning, we want to keep the previous startblock setting, + * so that rewinding a cursor doesn't generate surprising results. + * Reset the active syncscan setting, though. + */ + scan->rs_scan.rs_syncscan = (allow_sync && synchronize_seqscans); + } + else if (allow_sync && synchronize_seqscans) + { + scan->rs_scan.rs_syncscan = true; + scan->rs_scan.rs_startblock = ss_get_location(scan->rs_scan.rs_rd, scan->rs_scan.rs_nblocks); + /* Skip metapage */ + if (scan->rs_scan.rs_startblock == ZHEAP_METAPAGE) + scan->rs_scan.rs_startblock = ZHEAP_METAPAGE + 1; + } + else + { + scan->rs_scan.rs_syncscan = false; + scan->rs_scan.rs_startblock = ZHEAP_METAPAGE + 1; + } + + scan->rs_scan.rs_numblocks = InvalidBlockNumber; + scan->rs_inited = false; + scan->rs_cbuf = InvalidBuffer; + scan->rs_cblock = InvalidBlockNumber; + + /* page-at-a-time fields are always invalid when not rs_inited */ + + /* + * copy the scan key, if appropriate + */ + if (key != NULL) + memcpy(scan->rs_scan.rs_key, key, scan->rs_scan.rs_nkeys * sizeof(ScanKeyData)); + + /* + * Currently, we don't have a stats counter for bitmap heap scans (but the + * underlying bitmap index scans will be counted) or sample scans (we only + * update stats for tuple fetches there) + */ + if (!scan->rs_scan.rs_bitmapscan && !scan->rs_scan.rs_samplescan) + pgstat_count_heap_scan(scan->rs_scan.rs_rd); +} + +/* ---------------- + * zheap_rescan - similar to heap_rescan + * ---------------- + */ +void +zheap_rescan(TableScanDesc sscan, ScanKey key, bool set_params, + bool allow_strat, bool allow_sync, bool allow_pagemode) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + + if (set_params) + { + scan->rs_scan.rs_allow_strat = allow_strat; + scan->rs_scan.rs_allow_sync = allow_sync; + scan->rs_scan.rs_pageatatime = allow_pagemode && IsMVCCSnapshot(scan->rs_scan.rs_snapshot); + } + + /* + * unpin scan buffers + */ + if (BufferIsValid(scan->rs_cbuf)) + ReleaseBuffer(scan->rs_cbuf); + + /* + * reinitialize scan descriptor + */ + zinitscan(scan, key, true); + + /* + * reset parallel scan, if present + */ + if (scan->rs_scan.rs_parallel != NULL) + { + ParallelTableScanDesc parallel_scan; + + /* + * Caller is responsible for making sure that all workers have + * finished the scan before calling this. 
+ */ + parallel_scan = scan->rs_scan.rs_parallel; + pg_atomic_write_u64(¶llel_scan->phs_nallocated, 0); + } +} + +/* + * zheap_beginscan - same as heap_beginscan except for tuple initialization + */ +TableScanDesc +zheap_beginscan(Relation relation, Snapshot snapshot, + int nkeys, ScanKey key, + ParallelTableScanDesc parallel_scan, + bool allow_strat, + bool allow_sync, + bool allow_pagemode, + bool is_bitmapscan, + bool is_samplescan, + bool temp_snap) +{ + ZHeapScanDesc scan; + + /* + * increment relation ref count while scanning relation + * + * This is just to make really sure the relcache entry won't go away while + * the scan has a pointer to it. Caller should be holding the rel open + * anyway, so this is redundant in all normal scenarios... + */ + RelationIncrementReferenceCount(relation); + + /* + * allocate and initialize scan descriptor + */ + scan = (ZHeapScanDesc) palloc(sizeof(ZHeapScanDescData)); + + scan->rs_scan.rs_rd = relation; + scan->rs_scan.rs_snapshot = snapshot; + scan->rs_scan.rs_nkeys = nkeys; + scan->rs_scan.rs_bitmapscan = is_bitmapscan; + scan->rs_scan.rs_samplescan = is_samplescan; + scan->rs_strategy = NULL; /* set in zinitscan */ + scan->rs_scan.rs_startblock = 0; /* set in initscan */ + scan->rs_scan.rs_allow_strat = allow_strat; + scan->rs_scan.rs_allow_sync = allow_sync; + scan->rs_scan.rs_temp_snap = temp_snap; + scan->rs_scan.rs_parallel = parallel_scan; + scan->rs_ntuples = 0; // ZBORKED ? + + /* + * we can use page-at-a-time mode if it's an MVCC-safe snapshot + */ + scan->rs_scan.rs_pageatatime = allow_pagemode && snapshot && IsMVCCSnapshot(snapshot); + + /* + * For a seqscan in a serializable transaction, acquire a predicate lock + * on the entire relation. This is required not only to lock all the + * matching tuples, but also to conflict with new insertions into the + * table. In an indexscan, we take page locks on the index pages covering + * the range specified in the scan qual, but in a heap scan there is + * nothing more fine-grained to lock. A bitmap scan is a different story, + * there we have already scanned the index and locked the index pages + * covering the predicate. But in that case we still have to lock any + * matching heap tuples. + */ + if (!is_bitmapscan && snapshot) + PredicateLockRelation(relation, snapshot); + + scan->rs_cztup = NULL; + + + /* + * we do this here instead of in initscan() because heap_rescan also calls + * initscan() and we don't want to allocate memory again + */ + if (nkeys > 0) + scan->rs_scan.rs_key = (ScanKey) palloc(sizeof(ScanKeyData) * nkeys); + else + scan->rs_scan.rs_key = NULL; + + zinitscan(scan, key, false); + + return (TableScanDesc) scan; +} + +/* + * zheap_setscanlimits - restrict range of a zheapscan + * + * startBlk is the page to start at + * numBlks is number of pages to scan (InvalidBlockNumber means "all") + */ +void +zheap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlks) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + + Assert(!scan->rs_inited); /* else too late to change */ + Assert(!scan->rs_scan.rs_syncscan); /* else rs_startblock is + * significant */ + + /* + * Check startBlk is valid (but allow case of zero blocks...). + * Consider meta-page as well. + */ + Assert(startBlk == 0 || startBlk < scan->rs_scan.rs_nblocks || + startBlk == ZHEAP_METAPAGE + 1); + + scan->rs_scan.rs_startblock = startBlk; + scan->rs_scan.rs_numblocks = numBlks; +} + +/* ---------------- + * heap_update_snapshot + * + * Update snapshot info in heap scan descriptor. 
+ * ---------------- + */ +void +zheap_update_snapshot(TableScanDesc sscan, Snapshot snapshot) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + + Assert(IsMVCCSnapshot(snapshot)); + + RegisterSnapshot(snapshot); + scan->rs_scan.rs_snapshot = snapshot; + scan->rs_scan.rs_temp_snap = true; +} + +/* + * zheapgetpage - Same as heapgetpage, but operate on zheap page and + * in page-at-a-time mode, visible tuples are stored in rs_visztuples. + * + * It returns false, if we can't scan the page (like in case of TPD page), + * otherwise, return true. + */ +bool +zheapgetpage(TableScanDesc sscan, BlockNumber page) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + Buffer buffer; + Snapshot snapshot; + Page dp; + int lines; + int ntup; + OffsetNumber lineoff; + ItemId lpp; + bool all_visible; + uint8 vmstatus; + Buffer vmbuffer = InvalidBuffer; + + Assert(page < scan->rs_scan.rs_nblocks); + + /* release previous scan buffer, if any */ + if (BufferIsValid(scan->rs_cbuf)) + { + ReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + } + + // ZBORKED + if (page == ZHEAP_METAPAGE) + { + /* needs to be udpated to keep track of scan position */ + scan->rs_cblock = page; + return false; + } + + /* + * Be sure to check for interrupts at least once per page. Checks at + * higher code levels won't be able to stop a seqscan that encounters many + * pages' worth of consecutive dead tuples. + */ + CHECK_FOR_INTERRUPTS(); + + /* read page using selected strategy */ + buffer = ReadBufferExtended(scan->rs_scan.rs_rd, MAIN_FORKNUM, page, + RBM_NORMAL, scan->rs_strategy); + scan->rs_cblock = page; + + /* + * We must hold share lock on the buffer content while examining tuple + * visibility. Afterwards, however, the tuples we have found to be + * visible are guaranteed good as long as we hold the buffer pin. + */ + LockBuffer(buffer, BUFFER_LOCK_SHARE); + + dp = BufferGetPage(buffer); + + /* + * Skip TPD pages. As of now, the size of special space in TPD pages is + * different from other zheap pages like metapage and regular zheap page, + * however, if that changes, we might need to explicitly store pagetype + * flag somewhere. + * + * Fixme - As an exception, the size of special space for zheap page + * with one transaction slot will match with TPD page's special size. + */ + if (PageGetSpecialSize(dp) == MAXALIGN(sizeof(TPDPageOpaqueData))) + { + UnlockReleaseBuffer(buffer); + return false; + } + else if (!scan->rs_scan.rs_pageatatime) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + scan->rs_cbuf = buffer; + return true; + } + + snapshot = scan->rs_scan.rs_snapshot; + + /* + * Prune and repair fragmentation for the whole page, if possible. + * Fixme - Pruning is required in zheap for deletes, so we need to + * make it work. + */ + /* heap_page_prune_opt(scan->rs_rd, buffer); */ + + TestForOldSnapshot(snapshot, scan->rs_scan.rs_rd, dp); + lines = PageGetMaxOffsetNumber(dp); + ntup = 0; + + /* + * If the all-visible flag indicates that all tuples on the page are + * visible to everyone, we can skip the per-tuple visibility tests. + * + * Note: In hot standby, a tuple that's already visible to all + * transactions in the master might still be invisible to a read-only + * transaction in the standby. We partly handle this problem by tracking + * the minimum xmin of visible tuples as the cut-off XID while marking a + * page all-visible on master and WAL log that along with the visibility + * map SET operation. 
In hot standby, we wait for (or abort) all
+	 * transactions that potentially cannot see one or more tuples on the
+	 * page.  That's how index-only scans work fine in hot standby.
+	 */
+
+	vmstatus = visibilitymap_get_status(scan->rs_scan.rs_rd, page, &vmbuffer);
+
+	all_visible = (vmstatus & VISIBILITYMAP_ALL_VISIBLE) &&
+		!snapshot->takenDuringRecovery;
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+
+	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
+		 lineoff <= lines;
+		 lineoff++, lpp++)
+	{
+		if (ItemIdIsNormal(lpp) || ItemIdIsDeleted(lpp))
+		{
+			ZHeapTuple	loctup = NULL;
+			ZHeapTuple	resulttup = NULL;
+			Size		loctup_len;
+			bool		valid = false;
+			ItemPointerData tid;
+
+			ItemPointerSet(&tid, page, lineoff);
+
+			if (ItemIdIsDeleted(lpp))
+			{
+				if (all_visible)
+				{
+					valid = false;
+					resulttup = NULL;
+				}
+				else
+				{
+					resulttup = ZHeapGetVisibleTuple(lineoff, snapshot, buffer,
+													 NULL);
+					valid = resulttup ? true : false;
+				}
+			}
+			else
+			{
+				loctup_len = ItemIdGetLength(lpp);
+
+				loctup = palloc(ZHEAPTUPLESIZE + loctup_len);
+				loctup->t_data = (ZHeapTupleHeader) ((char *) loctup + ZHEAPTUPLESIZE);
+
+				loctup->t_tableOid = RelationGetRelid(scan->rs_scan.rs_rd);
+				loctup->t_len = loctup_len;
+				loctup->t_self = tid;
+
+				/*
+				 * We always need to make a copy of zheap tuple as once we
+				 * release the buffer, an in-place update can change the tuple.
+				 */
+				memcpy(loctup->t_data,
+					   ((ZHeapTupleHeader) PageGetItem((Page) dp, lpp)),
+					   loctup->t_len);
+
+				if (all_visible)
+				{
+					valid = true;
+					resulttup = loctup;
+				}
+				else
+				{
+					resulttup = ZHeapTupleSatisfies(loctup, snapshot,
+													buffer, NULL);
+					valid = resulttup ? true : false;
+				}
+			}
+
+			/*
+			 * If any prior version is visible, we pass latest visible as
+			 * true. The state of latest version of tuple is determined by
+			 * the called function.
+			 *
+			 * Note that, it's possible that tuple is updated in-place and
+			 * we're seeing some prior version of that. We handle that case
+			 * in ZHeapTupleHasSerializableConflictOut.
+			 */
+			CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, (void *) &tid,
+											buffer, snapshot);
+
+			if (valid)
+				scan->rs_visztuples[ntup++] = resulttup;
+		}
+	}
+
+	UnlockReleaseBuffer(buffer);
+
+	Assert(ntup <= MaxZHeapTuplesPerPage);
+	scan->rs_ntuples = ntup;
+
+	return true;
+}
+
+/* ----------------
+ *	zheapgettup_pagemode - fetch next zheap tuple in page-at-a-time mode
+ *
+ * Note that here we process only regular zheap pages, meta and tpd pages are
+ * skipped.
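+ *
+ * For reference, zheap_getnext() and zheap_getnextslot() below choose
+ * between the two fetch paths based on the page-at-a-time flag, roughly:
+ *
+ *		if (scan->rs_scan.rs_pageatatime)
+ *			zhtup = zheapgettup_pagemode(scan, direction);
+ *		else
+ *			zhtup = zheapgettup(scan, direction);
+ *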
+ * ---------------- + */ +static ZHeapTuple +zheapgettup_pagemode(ZHeapScanDesc scan, + ScanDirection dir) +{ + ZHeapTuple tuple = scan->rs_cztup; + bool backward = ScanDirectionIsBackward(dir); + BlockNumber page; + bool finished; + bool valid; + int lines; + int lineindex; + int linesleft; + int i = 0; + + /* + * calculate next starting lineindex, given scan direction + */ + if (ScanDirectionIsForward(dir)) + { + if (!scan->rs_inited) + { + /* + * return null immediately if relation is empty + */ + if (scan->rs_scan.rs_nblocks == ZHEAP_METAPAGE + 1 || + scan->rs_scan.rs_numblocks == ZHEAP_METAPAGE + 1) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + tuple = NULL; + return tuple; + } + if (scan->rs_scan.rs_parallel != NULL) + { + table_parallelscan_startblock_init(&scan->rs_scan); + + page = table_parallelscan_nextpage(&scan->rs_scan); + + /* Skip metapage */ + if (page == ZHEAP_METAPAGE) + page = table_parallelscan_nextpage(&scan->rs_scan); + + /* Other processes might have already finished the scan. */ + if (page == InvalidBlockNumber) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + tuple = NULL; + return tuple; + } + } + else + page = scan->rs_scan.rs_startblock; /* first page */ + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + goto get_next_page; + + lineindex = 0; + scan->rs_inited = true; + } + else + { + /* continue from previously returned page/tuple */ + page = scan->rs_cblock; /* current page */ + lineindex = scan->rs_cindex + 1; + } + + /*dp = BufferGetPage(scan->rs_cbuf); + TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);*/ + lines = scan->rs_ntuples; + /* page and lineindex now reference the next visible tid */ + + linesleft = lines - lineindex; + } + else if (backward) + { + /* backward parallel scan not supported */ + Assert(scan->rs_scan.rs_parallel == NULL); + + if (!scan->rs_inited) + { + /* + * return null immediately if relation is empty + */ + if (scan->rs_scan.rs_nblocks == ZHEAP_METAPAGE + 1 || + scan->rs_scan.rs_numblocks == ZHEAP_METAPAGE + 1) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + tuple = NULL; + return tuple; + } + + /* + * Disable reporting to syncscan logic in a backwards scan; it's + * not very likely anyone else is doing the same thing at the same + * time, and much more likely that we'll just bollix things for + * forward scanners. + */ + scan->rs_scan.rs_syncscan = false; + /* start from last page of the scan */ + if (scan->rs_scan.rs_startblock > ZHEAP_METAPAGE + 1) + page = scan->rs_scan.rs_startblock - 1; + else + page = scan->rs_scan.rs_nblocks - 1; + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + goto get_next_page; + } + else + { + /* continue from previously returned page/tuple */ + page = scan->rs_cblock; /* current page */ + } + + lines = scan->rs_ntuples; + + if (!scan->rs_inited) + { + lineindex = lines - 1; + scan->rs_inited = true; + } + else + { + lineindex = scan->rs_cindex - 1; + } + /* page and lineindex now reference the previous visible tid */ + + linesleft = lineindex + 1; + } + else + { + /* + * In executor it seems NoMovementScanDirection is nothing but + * do-nothing flag so we should not be here. The else part is still + * here to keep the code as in heapgettup_pagemode. 
+ */ + Assert(false); + return NULL; + } + +get_next_tuple: + /* + * advance the scan until we find a qualifying tuple or run out of stuff + * to scan + */ + while (linesleft > 0) + { + tuple = scan->rs_visztuples[lineindex]; + scan->rs_cindex = lineindex; + return tuple; + } + + /* + * if we get here, it means we've exhausted the items on this page and + * it's time to move to the next. + * For now we shall free all of the zheap tuples stored in rs_visztuples. + * Later a better memory management is required. + */ + for (i = 0; i < scan->rs_ntuples; i++) + zheap_freetuple(scan->rs_visztuples[i]); + scan->rs_ntuples = 0; + +get_next_page: + for (;;) + { + if (backward) + { + finished = (page == scan->rs_scan.rs_startblock) || + (scan->rs_scan.rs_numblocks != InvalidBlockNumber ? --scan->rs_scan.rs_numblocks == 0 : false); + if (page == ZHEAP_METAPAGE + 1) + page = scan->rs_scan.rs_nblocks; + page--; + } + else if (scan->rs_scan.rs_parallel != NULL) + { + page = table_parallelscan_nextpage(&scan->rs_scan); + /* Skip metapage */ + if (page == ZHEAP_METAPAGE) + page = table_parallelscan_nextpage(&scan->rs_scan); + finished = (page == InvalidBlockNumber); + } + else + { + page++; + if (page >= scan->rs_scan.rs_nblocks) + page = 0; + + if (page == ZHEAP_METAPAGE) + { + /* + * Since, we're skipping the metapage, we should update the scan + * location if sync scan is enabled. + */ + if (scan->rs_scan.rs_syncscan) + ss_report_location(scan->rs_scan.rs_rd, page); + page++; + } + + finished = (page == scan->rs_scan.rs_startblock) || + (scan->rs_scan.rs_numblocks != InvalidBlockNumber ? --scan->rs_scan.rs_numblocks == 0 : false); + + /* + * Report our new scan position for synchronization purposes. We + * don't do that when moving backwards, however. That would just + * mess up any other forward-moving scanners. + * + * Note: we do this before checking for end of scan so that the + * final state of the position hint is back at the start of the + * rel. That's not strictly necessary, but otherwise when you run + * the same query multiple times the starting position would shift + * a little bit backwards on every invocation, which is confusing. + * We don't guarantee any specific ordering in general, though. + */ + if (scan->rs_scan.rs_syncscan) + ss_report_location(scan->rs_scan.rs_rd, page); + } + + /* + * return NULL if we've exhausted all the pages + */ + if (finished) + { + if (BufferIsValid(scan->rs_cbuf)) + ReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + scan->rs_cblock = InvalidBlockNumber; + tuple = NULL; + scan->rs_inited = false; + return tuple; + } + + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + continue; + + if (!scan->rs_inited) + scan->rs_inited = true; + lines = scan->rs_ntuples; + linesleft = lines; + if (backward) + lineindex = lines - 1; + else + lineindex = 0; + + goto get_next_tuple; + } +} + +/* + * Similar to heapgettup, but for fetching zheap tuple. + * + * Note that here we process only regular zheap pages, meta and tpd pages are + * skipped. 
+ */ +static ZHeapTuple +zheapgettup(ZHeapScanDesc scan, + ScanDirection dir) +{ + ZHeapTuple tuple = scan->rs_cztup; + Snapshot snapshot = scan->rs_scan.rs_snapshot; + bool backward = ScanDirectionIsBackward(dir); + BlockNumber page; + bool finished; + bool valid; + Page dp; + int lines; + OffsetNumber lineoff; + int linesleft; + ItemId lpp; + + /* + * calculate next starting lineoff, given scan direction + */ + if (ScanDirectionIsForward(dir)) + { + if (!scan->rs_inited) + { + /* + * return null immediately if relation is empty + */ + if (scan->rs_scan.rs_nblocks == ZHEAP_METAPAGE + 1 || + scan->rs_scan.rs_numblocks == ZHEAP_METAPAGE + 1) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + return NULL; + } + if (scan->rs_scan.rs_parallel != NULL) + { + table_parallelscan_startblock_init(&scan->rs_scan); + + page = table_parallelscan_nextpage(&scan->rs_scan); + + /* Skip metapage */ + if (page == ZHEAP_METAPAGE) + page = table_parallelscan_nextpage(&scan->rs_scan); + + /* Other processes might have already finished the scan. */ + if (page == InvalidBlockNumber) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + return NULL; + } + } + else + page = scan->rs_scan.rs_startblock; /* first page */ + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + goto get_next_page; + lineoff = FirstOffsetNumber; /* first offnum */ + scan->rs_inited = true; + } + else + { + /* continue from previously returned page/tuple */ + page = scan->rs_cblock; /* current page */ + lineoff = /* next offnum */ + OffsetNumberNext(ItemPointerGetOffsetNumber(&(tuple->t_self))); + } + + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + + dp = BufferGetPage(scan->rs_cbuf); + TestForOldSnapshot(snapshot, scan->rs_scan.rs_rd, dp); + lines = PageGetMaxOffsetNumber(dp); + /* page and lineoff now reference the physically next tid */ + + linesleft = lines - lineoff + 1; + } + else if (backward) + { + /* backward parallel scan not supported */ + Assert(scan->rs_scan.rs_parallel == NULL); + + if (!scan->rs_inited) + { + /* + * return null immediately if relation is empty + */ + if (scan->rs_scan.rs_nblocks == ZHEAP_METAPAGE + 1 || + scan->rs_scan.rs_numblocks == ZHEAP_METAPAGE + 1) + { + Assert(!BufferIsValid(scan->rs_cbuf)); + return NULL; + } + + /* + * Disable reporting to syncscan logic in a backwards scan; it's + * not very likely anyone else is doing the same thing at the same + * time, and much more likely that we'll just bollix things for + * forward scanners. + */ + scan->rs_scan.rs_syncscan = false; + /* start from last page of the scan */ + if (scan->rs_scan.rs_startblock > ZHEAP_METAPAGE + 1) + page = scan->rs_scan.rs_startblock - 1; + else + page = scan->rs_scan.rs_nblocks - 1; + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + goto get_next_page; + } + else + { + /* continue from previously returned page/tuple */ + page = scan->rs_cblock; /* current page */ + } + + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + + dp = BufferGetPage(scan->rs_cbuf); + TestForOldSnapshot(snapshot, scan->rs_scan.rs_rd, dp); + lines = PageGetMaxOffsetNumber(dp); + + if (!scan->rs_inited) + { + lineoff = lines; /* final offnum */ + scan->rs_inited = true; + } + else + { + lineoff = /* previous offnum */ + OffsetNumberPrev(ItemPointerGetOffsetNumber(&(tuple->t_self))); + } + /* page and lineoff now reference the physically previous tid */ + + linesleft = lineoff; + } + else + { + /* + * In executor it seems NoMovementScanDirection is nothing but + * do-nothing flag so we should not be here. 
The else part is still + * here to keep the code as in heapgettup_pagemode. + */ + Assert(false); + + return NULL; + } + + /* + * advance the scan until we find a qualifying tuple or run out of stuff + * to scan + */ + lpp = PageGetItemId(dp, lineoff); + +get_next_tuple: + while (linesleft > 0) + { + if (ItemIdIsNormal(lpp)) + { + ZHeapTuple tuple = NULL; + ZHeapTuple loctup = NULL; + Size loctup_len; + bool valid = false; + ItemPointerData tid; + + ItemPointerSet(&tid, page, lineoff); + + loctup_len = ItemIdGetLength(lpp); + + loctup = palloc(ZHEAPTUPLESIZE + loctup_len); + loctup->t_data = (ZHeapTupleHeader) ((char *) loctup + ZHEAPTUPLESIZE); + + loctup->t_tableOid = RelationGetRelid(scan->rs_scan.rs_rd); + loctup->t_len = loctup_len; + loctup->t_self = tid; + + /* + * We always need to make a copy of zheap tuple as once we release + * the buffer an in-place update can change the tuple. + */ + memcpy(loctup->t_data, ((ZHeapTupleHeader) PageGetItem((Page) dp, lpp)), loctup->t_len); + + tuple = ZHeapTupleSatisfies(loctup, snapshot, scan->rs_cbuf, NULL); + valid = tuple ? true : false; + + /* + * If any prior version is visible, we pass latest visible as + * true. The state of latest version of tuple is determined by + * the called function. + * + * Note that, it's possible that tuple is updated in-place and + * we're seeing some prior version of that. We handle that case + * in ZHeapTupleHasSerializableConflictOut. + */ + CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, (void *) &tid, + scan->rs_cbuf, snapshot); + + if (valid) + { + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + return tuple; + } + } + + /* + * otherwise move to the next item on the page + */ + --linesleft; + if (backward) + { + --lpp; /* move back in this page's ItemId array */ + --lineoff; + } + else + { + ++lpp; /* move forward in this page's ItemId array */ + ++lineoff; + } + } + + /* + * if we get here, it means we've exhausted the items on this page and + * it's time to move to the next. + */ + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + +get_next_page: + for (;;) + { + /* + * advance to next/prior page and detect end of scan + */ + if (backward) + { + finished = (page == scan->rs_scan.rs_startblock) || + (scan->rs_scan.rs_numblocks != InvalidBlockNumber ? --scan->rs_scan.rs_numblocks == 0 : false); + if (page == ZHEAP_METAPAGE + 1) + page = scan->rs_scan.rs_nblocks; + page--; + } + else if (scan->rs_scan.rs_parallel != NULL) + { + page = table_parallelscan_nextpage(&scan->rs_scan); + /* Skip metapage */ + if (page == ZHEAP_METAPAGE) + page = table_parallelscan_nextpage(&scan->rs_scan); + finished = (page == InvalidBlockNumber); + } + else + { + page++; + if (page >= scan->rs_scan.rs_nblocks) + page = 0; + + if (page == ZHEAP_METAPAGE) + { + /* + * Since, we're skipping the metapage, we should update the scan + * location if sync scan is enabled. + */ + if (scan->rs_scan.rs_syncscan) + ss_report_location(scan->rs_scan.rs_rd, page); + page++; + } + + finished = (page == scan->rs_scan.rs_startblock) || + (scan->rs_scan.rs_numblocks != InvalidBlockNumber ? --scan->rs_scan.rs_numblocks == 0 : false); + + /* + * Report our new scan position for synchronization purposes. We + * don't do that when moving backwards, however. That would just + * mess up any other forward-moving scanners. + * + * Note: we do this before checking for end of scan so that the + * final state of the position hint is back at the start of the + * rel. 
That's not strictly necessary, but otherwise when you run + * the same query multiple times the starting position would shift + * a little bit backwards on every invocation, which is confusing. + * We don't guarantee any specific ordering in general, though. + */ + if (scan->rs_scan.rs_syncscan) + ss_report_location(scan->rs_scan.rs_rd, page); + } + + /* + * return NULL if we've exhausted all the pages + */ + if (finished) + { + if (BufferIsValid(scan->rs_cbuf)) + ReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + scan->rs_cblock = InvalidBlockNumber; + scan->rs_inited = false; + return NULL; + } + + valid = zheapgetpage(&scan->rs_scan, page); + if (!valid) + continue; + + if (!scan->rs_inited) + scan->rs_inited = true; + + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + + dp = BufferGetPage(scan->rs_cbuf); + TestForOldSnapshot(snapshot, scan->rs_scan.rs_rd, dp); + lines = PageGetMaxOffsetNumber((Page) dp); + linesleft = lines; + if (backward) + { + lineoff = lines; + lpp = PageGetItemId(dp, lines); + } + else + { + lineoff = FirstOffsetNumber; + lpp = PageGetItemId(dp, FirstOffsetNumber); + } + + goto get_next_tuple; + } +} +#ifdef ZHEAPDEBUGALL +#define ZHEAPDEBUG_1 \ + elog(DEBUG2, "zheap_getnext([%s,nkeys=%d],dir=%d) called", \ + RelationGetRelationName(scan->rs_rd), scan->rs_nkeys, (int) direction) +#define ZHEAPDEBUG_2 \ + elog(DEBUG2, "zheap_getnext returning EOS") +#define ZHEAPDEBUG_3 \ + elog(DEBUG2, "zheap_getnext returning tuple") +#else +#define ZHEAPDEBUG_1 +#define ZHEAPDEBUG_2 +#define ZHEAPDEBUG_3 +#endif /* !defined(ZHEAPDEBUGALL) */ + + +ZHeapTuple +zheap_getnext(TableScanDesc sscan, ScanDirection direction) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + ZHeapTuple zhtup = NULL; + + /* Skip metapage */ + if (scan->rs_scan.rs_startblock == ZHEAP_METAPAGE) + scan->rs_scan.rs_startblock = ZHEAP_METAPAGE + 1; + + /* Note: no locking manipulations needed */ + + ZHEAPDEBUG_1; /* zheap_getnext( info ) */ + + /* + * The key will be passed only for catalog table scans and catalog tables + * are always a heap table!. So incase of zheap it should be set to NULL. + */ + Assert (scan->rs_scan.rs_key == NULL); + + if (scan->rs_scan.rs_pageatatime) + zhtup = zheapgettup_pagemode(scan, direction); + else + zhtup = zheapgettup(scan, direction); + + if (zhtup == NULL) + { + ZHEAPDEBUG_2; /* zheap_getnext returning EOS */ + return NULL; + } + + scan->rs_cztup = zhtup; + + /* + * if we get here it means we have a new current scan tuple, so point to + * the proper return buffer and return the tuple. + */ + ZHEAPDEBUG_3; /* zheap_getnext returning tuple */ + + pgstat_count_heap_getnext(scan->rs_scan.rs_rd); + + return zhtup; +} + +TupleTableSlot * +zheap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + ZHeapTuple zhtup = NULL; + + /* Skip metapage */ + if (scan->rs_scan.rs_startblock == ZHEAP_METAPAGE) + scan->rs_scan.rs_startblock = ZHEAP_METAPAGE + 1; + + ZHEAPDEBUG_1; /* zheap_getnext( info ) */ + + /* + * The key will be passed only for catalog table scans and catalog tables + * are always a heap table!. So incase of zheap it should be set to NULL. 
+ */ + Assert (scan->rs_scan.rs_key == NULL); + + if (scan->rs_scan.rs_pageatatime) + zhtup = zheapgettup_pagemode(scan, direction); + else + zhtup = zheapgettup(scan, direction); + + if (zhtup == NULL) + { + ZHEAPDEBUG_2; /* zheap_getnext returning EOS */ + ExecClearTuple(slot); + return NULL; + } + + scan->rs_cztup = zhtup; + + /* + * if we get here it means we have a new current scan tuple, so point to + * the proper return buffer and return the tuple. + */ + ZHEAPDEBUG_3; /* zheap_getnext returning tuple */ + + pgstat_count_heap_getnext(scan->rs_scan.rs_rd); + + return ExecStoreZTuple(zhtup, slot, scan->rs_cbuf, false); +} + +bool +zheap_scan_bitmap_pagescan(TableScanDesc sscan, + TBMIterateResult *tbmres) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + BlockNumber page = tbmres->blockno; + Page dp; + Buffer buffer; + Snapshot snapshot; + int ntup; + + scan->rs_cindex = 0; + scan->rs_ntuples = 0; + + /* + * Ignore any claimed entries past what we think is the end of the + * relation. (This is probably not necessary given that we got at + * least AccessShareLock on the table before performing any of the + * indexscans, but let's be safe.) + */ + if (page >= scan->rs_scan.rs_nblocks) + return false; + + if (page == ZHEAP_METAPAGE) + return false; + + scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf, + scan->rs_scan.rs_rd, + page); + buffer = scan->rs_cbuf; + snapshot = scan->rs_scan.rs_snapshot; + + ntup = 0; + + /* + * We must hold share lock on the buffer content while examining tuple + * visibility. Afterwards, however, the tuples we have found to be + * visible are guaranteed good as long as we hold the buffer pin. + */ + LockBuffer(buffer, BUFFER_LOCK_SHARE); + dp = (Page) BufferGetPage(buffer); + + /* + * Skip TPD pages. As of now, the size of special space in TPD pages is + * different from other zheap pages like metapage and regular zheap page, + * however, if that changes, we might need to explicitly store pagetype + * flag somewhere. + * + * Fixme - As an exception, the size of special space for zheap page + * with one transaction slot will match with TPD page's special size. + */ + if (PageGetSpecialSize(dp) == MAXALIGN(sizeof(TPDPageOpaqueData))) + { + UnlockReleaseBuffer(buffer); + return false; + } + /* + * We need two separate strategies for lossy and non-lossy cases. + */ + if (tbmres->ntuples >= 0) + { + /* + * Bitmap is non-lossy, so we just look through the offsets listed in + * tbmres; + */ + int curslot; + + for (curslot = 0; curslot < tbmres->ntuples; curslot++) + { + OffsetNumber offnum = tbmres->offsets[curslot]; + ItemPointerData tid; + ZHeapTuple ztuple; + + ItemPointerSet(&tid, page, offnum); + ztuple = zheap_search_buffer(&tid, scan->rs_scan.rs_rd, buffer, snapshot, NULL); + if (ztuple != NULL) + scan->rs_visztuples[ntup++] = ztuple; + } + } + else + { + /* + * Bitmap is lossy, so we must examine each item pointer on the page. 
+ */ + OffsetNumber maxoff = PageGetMaxOffsetNumber(dp); + OffsetNumber offnum; + + for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum)) + { + ItemId lpp; + ZHeapTuple loctup = NULL; + ZHeapTuple resulttup = NULL; + Size loctup_len; + bool valid = false; + ItemPointerData tid; + + lpp = PageGetItemId(dp, offnum); + if (!ItemIdIsNormal(lpp)) + continue; + + ItemPointerSet(&tid, page, offnum); + loctup_len = ItemIdGetLength(lpp); + + loctup = palloc(ZHEAPTUPLESIZE + loctup_len); + loctup->t_data = (ZHeapTupleHeader) ((char *) loctup + ZHEAPTUPLESIZE); + + loctup->t_tableOid = RelationGetRelid(scan->rs_scan.rs_rd); + loctup->t_len = loctup_len; + loctup->t_self = tid; + + /* + * We always need to make a copy of zheap tuple as once we release + * the buffer an in-place update can change the tuple. + */ + memcpy(loctup->t_data, ((ZHeapTupleHeader) PageGetItem((Page) dp, lpp)), loctup->t_len); + + resulttup = ZHeapTupleSatisfies(loctup, snapshot, buffer, NULL); + valid = resulttup ? true : false; + + if (valid) + { + PredicateLockTid(scan->rs_scan.rs_rd, &(resulttup->t_self), snapshot, + IsSerializableXact() ? + zheap_fetchinsertxid(resulttup, buffer) : + InvalidTransactionId); + } + + /* + * If any prior version is visible, we pass latest visible as + * true. The state of latest version of tuple is determined by + * the called function. + * + * Note that, it's possible that tuple is updated in-place and + * we're seeing some prior version of that. We handle that case + * in ZHeapTupleHasSerializableConflictOut. + */ + CheckForSerializableConflictOut(valid, scan->rs_scan.rs_rd, (void *) &tid, + buffer, snapshot); + + if (valid) + scan->rs_visztuples[ntup++] = resulttup; + } + } + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + Assert(ntup <= MaxZHeapTuplesPerPage); + scan->rs_ntuples = ntup; + return true; +} + +bool +zheap_scan_bitmap_pagescan_next(TableScanDesc sscan, struct TupleTableSlot *slot) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + + if (scan->rs_cindex < 0 || scan->rs_cindex >= scan->rs_ntuples) + return false; + + scan->rs_cztup = scan->rs_visztuples[scan->rs_cindex]; + + /* + * Set up the result slot to point to this tuple. We don't need + * to keep the pin on the buffer, since we only scan tuples in page + * mode. + */ + ExecStoreZTuple(scan->rs_cztup, + slot, + InvalidBuffer, + true); + + scan->rs_cindex++; + + return true; +} + +/* + * zheap_search_buffer - search tuple satisfying snapshot + * + * On entry, *tid is the TID of a tuple, and buffer is the buffer holding + * this tuple. We search for the first visible member satisfying the given + * snapshot. If one is found, we return the tuple, in addition to updating + * *tid. Return NULL otherwise. + * + * The caller must already have pin and (at least) share lock on the buffer; + * it is still pinned/locked at exit. Also, We do not report any pgstats + * count; caller may do so if wanted. 
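+ *
+ * A minimal illustrative caller (zheap_search() below does essentially
+ * this) looks roughly like:
+ *
+ *		buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+ *		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ *		ztuple = zheap_search_buffer(tid, relation, buffer, snapshot, NULL);
+ *		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ *		ReleaseBuffer(buffer);
+ *
+ * where ztuple is just a local ZHeapTuple variable.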
+ */ +ZHeapTuple +zheap_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, + Snapshot snapshot, bool *all_dead) +{ + Page dp = (Page) BufferGetPage(buffer); + ItemId lp; + OffsetNumber offnum; + ZHeapTuple loctup = NULL; + ZHeapTupleData loctup_tmp; + ZHeapTuple resulttup = NULL; + Size loctup_len; + + if (all_dead) + *all_dead = false; + + Assert(ItemPointerGetBlockNumber(tid) == BufferGetBlockNumber(buffer)); + offnum = ItemPointerGetOffsetNumber(tid); + /* check for bogus TID */ + if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp)) + return NULL; + + lp = PageGetItemId(dp, offnum); + + /* check for unused or dead items */ + if (!(ItemIdIsNormal(lp) || ItemIdIsDeleted(lp))) + { + if (all_dead) + *all_dead = true; + return NULL; + } + + /* + * If the record is deleted, its place in the page might have been taken + * by another of its kind. Try to get it from the UNDO if it is still + * visible. + */ + if (ItemIdIsDeleted(lp)) + { + resulttup = ZHeapGetVisibleTuple(offnum, snapshot, buffer, all_dead); + } + else + { + loctup_len = ItemIdGetLength(lp); + + loctup = palloc(ZHEAPTUPLESIZE + loctup_len); + loctup->t_data = (ZHeapTupleHeader) ((char *) loctup + ZHEAPTUPLESIZE); + + loctup->t_tableOid = RelationGetRelid(relation); + loctup->t_len = loctup_len; + loctup->t_self = *tid; + + /* + * We always need to make a copy of zheap tuple as once we release the + * buffer an in-place update can change the tuple. + */ + memcpy(loctup->t_data, ((ZHeapTupleHeader) PageGetItem((Page) dp, lp)), loctup->t_len); + + /* If it's visible per the snapshot, we must return it */ + resulttup = ZHeapTupleSatisfies(loctup, snapshot, buffer, NULL); + } + + if (resulttup) + PredicateLockTid(relation, &(resulttup->t_self), snapshot, + IsSerializableXact() ? + zheap_fetchinsertxid(resulttup, buffer) : + InvalidTransactionId); + + /* + * If any prior version is visible, we pass latest visible as + * true. The state of latest version of tuple is determined by + * the called function. + * + * Note that, it's possible that tuple is updated in-place and + * we're seeing some prior version of that. We handle that case + * in ZHeapTupleHasSerializableConflictOut. + */ + CheckForSerializableConflictOut((resulttup != NULL), relation, (void *) tid, + buffer, snapshot); + + if (resulttup) + { + /* set the tid */ + *tid = resulttup->t_self; + } + else if (!ItemIdIsDeleted(lp)) + { + /* + * Temporarily get the copy of tuple from page to check if tuple is + * surely dead. We can't rely on the copy of local tuple (loctup) + * that is prepared for the visibility test as that would have been + * freed. + */ + loctup_tmp.t_tableOid = RelationGetRelid(relation); + loctup_tmp.t_data = (ZHeapTupleHeader) PageGetItem((Page) dp, lp); + loctup_tmp.t_len = ItemIdGetLength(lp); + loctup_tmp.t_self = *tid; + + /* + * If we can't see it, maybe no one else can either. At caller + * request, check whether tuple is dead to all transactions. + */ + if (!resulttup && all_dead && + ZHeapTupleIsSurelyDead(&loctup_tmp, + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo), + buffer)) + *all_dead = true; + } + else + { + /* For deleted item pointers, we've already set the value for all_dead. */ + return NULL; + } + + return resulttup; +} + +/* + * zheap_search - search for a zheap tuple satisfying snapshot. + * + * This is the same API as zheap_search_buffer, except that the caller + * does not provide the buffer containing the page, rather we access it + * locally. 
+ */ +bool +zheap_search(ItemPointer tid, Relation relation, Snapshot snapshot, + bool *all_dead) +{ + Buffer buffer; + ZHeapTuple zheapTuple = NULL; + + buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid)); + LockBuffer(buffer, BUFFER_LOCK_SHARE); + zheapTuple = zheap_search_buffer(tid, relation, buffer, snapshot, all_dead); + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + ReleaseBuffer(buffer); + + return (zheapTuple != NULL); +} + +/* + * zheap_fetch - Fetch a tuple based on TID. + * + * This function is quite similar to heap_fetch with few differences like + * it will always allocate the memory for tuple and do a memcpy of the tuple + * instead of pointing it to disk tuple. It is the responsibility of the + * caller to free the tuple. + */ +bool +zheap_fetch(Relation relation, + Snapshot snapshot, + ItemPointer tid, + ZHeapTuple *tuple, + Buffer *userbuf, + bool keep_buf, + Relation stats_relation) +{ + ZHeapTuple resulttup; + ItemId lp; + Buffer buffer; + Page page; + Size tup_len; + OffsetNumber offnum; + bool valid; + ItemPointerData ctid; + + /* + * Fetch and pin the appropriate page of the relation. + */ + buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid)); + + /* + * Need share lock on buffer to examine tuple commit status. + */ + LockBuffer(buffer, BUFFER_LOCK_SHARE); + page = BufferGetPage(buffer); + + /* + * We'd better check for out-of-range offnum in case of VACUUM since the + * TID was obtained. Exit if this is metapage. + */ + offnum = ItemPointerGetOffsetNumber(tid); + if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(page) || + BufferGetBlockNumber(buffer) == ZHEAP_METAPAGE) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + if (keep_buf) + *userbuf = buffer; + else + { + ReleaseBuffer(buffer); + *userbuf = InvalidBuffer; + } + *tuple = NULL; + return false; + } + + /* + * get the item line pointer corresponding to the requested tid + */ + lp = PageGetItemId(page, offnum); + + /* + * Must check for dead and unused items. + */ + if (!ItemIdIsNormal(lp) && !ItemIdIsDeleted(lp)) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + if (keep_buf) + *userbuf = buffer; + else + { + ReleaseBuffer(buffer); + *userbuf = InvalidBuffer; + } + *tuple = NULL; + return false; + } + + *tuple = NULL; + if (ItemIdIsDeleted(lp)) + { + CommandId tup_cid; + TransactionId tup_xid; + + resulttup = ZHeapGetVisibleTuple(offnum, snapshot, buffer, NULL); + ctid = *tid; + ZHeapPageGetNewCtid(buffer, &ctid, &tup_xid, &tup_cid); + valid = resulttup ? true : false; + } + else + { + /* + * fill in *tuple fields + */ + tup_len = ItemIdGetLength(lp); + + *tuple = palloc(ZHEAPTUPLESIZE + tup_len); + (*tuple)->t_data = (ZHeapTupleHeader) ((char *) (*tuple) + ZHEAPTUPLESIZE); + + (*tuple)->t_tableOid = RelationGetRelid(relation); + (*tuple)->t_len = tup_len; + (*tuple)->t_self = *tid; + + /* + * We always need to make a copy of zheap tuple as once we release + * the lock on buffer an in-place update can change the tuple. + */ + memcpy((*tuple)->t_data, ((ZHeapTupleHeader) PageGetItem(page, lp)), tup_len); + ItemPointerSetInvalid(&ctid); + + /* + * check time qualification of tuple, then release lock + */ + resulttup = ZHeapTupleSatisfies(*tuple, snapshot, buffer, &ctid); + valid = resulttup ? true : false; + } + + if (valid) + PredicateLockTid(relation, &((resulttup)->t_self), snapshot, + IsSerializableXact() ? + zheap_fetchinsertxid(resulttup, buffer) : + InvalidTransactionId); + + /* + * If any prior version is visible, we pass latest visible as + * true. 
The state of latest version of tuple is determined by
+	 * the called function.
+	 *
+	 * Note that, it's possible that tuple is updated in-place and
+	 * we're seeing some prior version of that. We handle that case
+	 * in ZHeapTupleHasSerializableConflictOut.
+	 */
+	CheckForSerializableConflictOut(valid, relation, (void *) tid,
+									buffer, snapshot);
+
+	/*
+	 * Pass back the ctid if the tuple is invisible because it was updated.
+	 * Apart from SnapshotAny, ctid must be changed only when the current
+	 * tuple is not visible.
+	 */
+	if (ItemPointerIsValid(&ctid))
+	{
+		if (snapshot == SnapshotAny || !valid)
+		{
+			*tid = ctid;
+		}
+	}
+
+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	if (valid)
+	{
+		/*
+		 * All checks passed, so return the tuple as valid. Caller is now
+		 * responsible for releasing the buffer.
+		 */
+		*userbuf = buffer;
+		*tuple = resulttup;
+
+		/* Count the successful fetch against appropriate rel, if any */
+		if (stats_relation != NULL)
+			pgstat_count_heap_fetch(stats_relation);
+
+		return true;
+	}
+
+	/* Tuple failed time qual, but maybe caller wants to see it anyway. */
+	if (keep_buf)
+		*userbuf = buffer;
+	else
+	{
+		ReleaseBuffer(buffer);
+		*userbuf = InvalidBuffer;
+	}
+
+	return false;
+}
+
+/*
+ * zheap_fetch_undo_guts
+ *
+ * Main function for fetching the previous version of the tuple from the undo
+ * storage.
+ */
+ZHeapTuple
+zheap_fetch_undo_guts(ZHeapTuple ztuple, Buffer buffer, ItemPointer tid)
+{
+	UnpackedUndoRecord *urec;
+	UndoRecPtr	urec_ptr;
+	ZHeapTuple	undo_tup;
+	int			out_slot_no PG_USED_FOR_ASSERTS_ONLY;
+
+	out_slot_no = GetTransactionSlotInfo(buffer,
+										 ItemPointerGetOffsetNumber(tid),
+										 ZHeapTupleHeaderGetXactSlot(ztuple->t_data),
+										 NULL,
+										 NULL,
+										 &urec_ptr,
+										 true,
+										 false);
+
+	/*
+	 * See the Asserts below to know why the transaction slot can't be frozen.
+	 */
+	Assert(out_slot_no != ZHTUP_SLOT_FROZEN);
+
+	urec = UndoFetchRecord(urec_ptr,
+						   ItemPointerGetBlockNumber(tid),
+						   ItemPointerGetOffsetNumber(tid),
+						   InvalidTransactionId,
+						   NULL,
+						   ZHeapSatisfyUndoRecord);
+
+	/*
+	 * This function is used by triggers to retrieve the previous version of
+	 * the tuple from the undo log. Since the transaction that is updating
+	 * the tuple is still in progress, neither the undo record can be
+	 * discarded nor can its transaction slot be reused.
+	 */
+	Assert(urec != NULL);
+	Assert(urec->uur_type == UNDO_INPLACE_UPDATE);
+
+	undo_tup = CopyTupleFromUndoRecord(urec, NULL, NULL, NULL, false, NULL);
+	UndoRecordRelease(urec);
+
+	return undo_tup;
+}
+
+/*
+ * zheap_fetch_undo
+ *
+ * Fetch the previous version of the tuple from the undo. In case of an
+ * in-place update, the old and new tuples have the same TID, and the trigger
+ * code stores just that TID for fetching both the old and the new tuple; so
+ * this function must be called to fetch the older tuple.
+ */
+bool
+zheap_fetch_undo(Relation relation,
+				 Snapshot snapshot,
+				 ItemPointer tid,
+				 ZHeapTuple *tuple,
+				 Buffer *userbuf,
+				 Relation stats_relation)
+{
+	ZHeapTuple	undo_tup;
+	Buffer		buffer;
+
+	if (!zheap_fetch(relation, snapshot, tid, tuple, &buffer, true, NULL))
+		return false;
+
+	undo_tup = zheap_fetch_undo_guts(*tuple, buffer, tid);
+	zheap_freetuple(*tuple);
+	*tuple = undo_tup;
+
+	ReleaseBuffer(buffer);
+
+	return true;
+}
+
+/*
+ * ZHeapTupleHeaderAdvanceLatestRemovedXid - Advance the latestRemovedXid, if
+ * the tuple is deleted by a transaction greater than latestRemovedXid. This
+ * is required to generate conflicts on Hot Standby.
+ *
+ * If we change this function then we need a similar change in
+ * *_xlog_vacuum_get_latestRemovedXid functions as well.
+ *
+ * This is quite similar to HeapTupleHeaderAdvanceLatestRemovedXid.
+ */
+void
+ZHeapTupleHeaderAdvanceLatestRemovedXid(ZHeapTupleHeader tuple,
+										TransactionId xid,
+										TransactionId *latestRemovedXid)
+{
+	/*
+	 * Ignore tuples inserted by an aborted transaction.
+	 *
+	 * XXX we can ignore the tuple if it was non-in-place updated/deleted
+	 * by the inserting transaction, but for that we need to traverse the
+	 * complete undo chain to find the root tuple; is it really worth it?
+	 */
+	if (TransactionIdDidCommit(xid))
+	{
+		Assert (tuple->t_infomask & ZHEAP_DELETED ||
+				tuple->t_infomask & ZHEAP_UPDATED);
+		if (TransactionIdFollows(xid, *latestRemovedXid))
+			*latestRemovedXid = xid;
+	}
+
+	/* *latestRemovedXid may still be invalid at end */
+}
+
+/*
+ * ----------
+ * Page related API's.  Eventually we might need to split these API's
+ * into a separate file like bufzpage.c or buf_zheap_page.c or
+ * something like that.
+ * ----------
+ */
+
+/*
+ * ZPageAddItemExtended - Add an item to a zheap page.
+ *
+ * This is similar to PageAddItemExtended except for the max tuples that can
+ * be accommodated on a page and the alignment for each item (Ideally, we
+ * don't need to align space between tuples as we always make a copy of the
+ * tuple to support in-place updates. However, there are places in zheap code
+ * where we access the tuple header directly from the page (ex. zheap_delete,
+ * zheap_update, etc.) for which we need them to be aligned at a two-byte
+ * boundary). It additionally handles the itemids that are marked as unused,
+ * but still can't be reused.
+ *
+ * Callers pass a valid input_page only in case they are constructing the
+ * in-memory copy of tuples and then directly sync the page.
+ */
+OffsetNumber
+ZPageAddItemExtended(Buffer buffer,
+					 Page input_page,
+					 Item item,
+					 Size size,
+					 OffsetNumber offsetNumber,
+					 int flags,
+					 bool NoTPDBufLock)
+{
+	Page		page;
+	Size		alignedSize;
+	PageHeader	phdr;
+	int			lower;
+	int			upper;
+	ItemId		itemId;
+	OffsetNumber limit;
+	bool		needshuffle = false;
+
+	/* Either one of buffer or page could be valid. */
+	if (BufferIsValid(buffer))
+	{
+		Assert(!PageIsValid(input_page));
+		page = BufferGetPage(buffer);
+	}
+	else
+	{
+		Assert(PageIsValid(input_page));
+		page = input_page;
+	}
+
+	phdr = (PageHeader) page;
+
+	/*
+	 * Be wary about corrupted page pointers
+	 */
+	if (phdr->pd_lower < SizeOfPageHeaderData ||
+		phdr->pd_lower > phdr->pd_upper ||
+		phdr->pd_upper > phdr->pd_special ||
+		phdr->pd_special > BLCKSZ)
+		ereport(PANIC,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+						phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+	/*
+	 * Select offsetNumber to place the new item at
+	 */
+	limit = OffsetNumberNext(PageGetMaxOffsetNumber(page));
+
+	/* was offsetNumber passed in?
*/ + if (OffsetNumberIsValid(offsetNumber)) + { + /* yes, check it */ + if ((flags & PAI_OVERWRITE) != 0) + { + if (offsetNumber < limit) + { + itemId = PageGetItemId(phdr, offsetNumber); + if (ItemIdIsUsed(itemId) || ItemIdHasStorage(itemId)) + { + elog(WARNING, "will not overwrite a used ItemId"); + return InvalidOffsetNumber; + } + } + } + else + { + if (offsetNumber < limit) + needshuffle = true; /* need to move existing linp's */ + } + } + else + { + /* offsetNumber was not passed in, so find a free slot */ + /* if no free slot, we'll put it at limit (1st open slot) */ + if (PageHasFreeLinePointers(phdr)) + { + bool hasPendingXact = false; + + /* + * Look for "recyclable" (unused) ItemId. We check for no storage + * as well, just to be paranoid --- unused items should never have + * storage. + */ + for (offsetNumber = 1; offsetNumber < limit; offsetNumber++) + { + itemId = PageGetItemId(phdr, offsetNumber); + if (!ItemIdIsUsed(itemId) && !ItemIdHasStorage(itemId)) + { + /* + * We allow Unused entries to be reused only if there is no + * transaction information for the entry or the transaction + * is committed. + */ + if (ItemIdHasPendingXact(itemId)) + { + TransactionId xid; + UndoRecPtr urec_ptr; + int trans_slot_id = ItemIdGetTransactionSlot(itemId); + uint32 epoch; + + /* + * We can't reach here for a valid input page as the + * callers passed it for the pages that wouldn't have + * been pruned. + */ + Assert(!PageIsValid(input_page)); + + /* + * Here, we are relying on the transaction information in + * slot as if the corresponding slot has been reused, then + * transaction information from the entry would have been + * cleared. See PageFreezeTransSlots. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + break; + trans_slot_id = GetTransactionSlotInfo(buffer, offsetNumber, + trans_slot_id, &epoch, &xid, + &urec_ptr, NoTPDBufLock, false); + /* + * It is quite possible that the item is showing some + * valid transaction slot, but actual slot has been frozen. + * This can happen when the slot belongs to TPD entry and + * the corresponding TPD entry is pruned. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + break; + if (TransactionIdIsValid(xid) && + !TransactionIdDidCommit(xid)) + { + hasPendingXact = true; + continue; + } + } + break; + } + } + if (offsetNumber >= limit && !hasPendingXact) + { + /* the hint is wrong, so reset it */ + PageClearHasFreeLinePointers(phdr); + } + } + else + { + /* don't bother searching if hint says there's no free slot */ + offsetNumber = limit; + } + } + + /* Reject placing items beyond the first unused line pointer */ + if (offsetNumber > limit) + { + elog(WARNING, "specified item offset is too large"); + return InvalidOffsetNumber; + } + + /* Reject placing items beyond heap boundary, if heap */ + if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxZHeapTuplesPerPage) + { + elog(WARNING, "can't put more than MaxHeapTuplesPerPage items in a heap page"); + return InvalidOffsetNumber; + } + + /* + * Compute new lower and upper pointers for page, see if it'll fit. + * + * Note: do arithmetic as signed ints, to avoid mistakes if, say, + * size > pd_upper. + */ + if (offsetNumber == limit || needshuffle) + lower = phdr->pd_lower + sizeof(ItemIdData); + else + lower = phdr->pd_lower; + + alignedSize = SHORTALIGN(size); + + upper = (int) phdr->pd_upper - (int) alignedSize; + + if (lower > upper) + return InvalidOffsetNumber; + + /* + * OK to insert the item. First, shuffle the existing pointers if needed. 
+ */ + itemId = PageGetItemId(phdr, offsetNumber); + + if (needshuffle) + memmove(itemId + 1, itemId, + (limit - offsetNumber) * sizeof(ItemIdData)); + + /* set the item pointer */ + ItemIdSetNormal(itemId, upper, size); + + /* + * Items normally contain no uninitialized bytes. Core bufpage consumers + * conform, but this is not a necessary coding rule; a new index AM could + * opt to depart from it. However, data type input functions and other + * C-language functions that synthesize datums should initialize all + * bytes; datumIsEqual() relies on this. Testing here, along with the + * similar check in printtup(), helps to catch such mistakes. + * + * Values of the "name" type retrieved via index-only scans may contain + * uninitialized bytes; see comment in btrescan(). Valgrind will report + * this as an error, but it is safe to ignore. + */ + VALGRIND_CHECK_MEM_IS_DEFINED(item, size); + + /* copy the item's data onto the page */ + memcpy((char *) page + upper, item, size); + + /* adjust page header */ + phdr->pd_lower = (LocationIndex) lower; + phdr->pd_upper = (LocationIndex) upper; + + return offsetNumber; +} + +/* + * PageGetZHeapFreeSpace + * Returns the size of the free (allocatable) space on a zheap page, + * reduced by the space needed for a new line pointer. + * + * This is same as PageGetHeapFreeSpace except for max tuples that can + * be accomodated on a page or the way unused items are dealt. + */ +Size +PageGetZHeapFreeSpace(Page page) +{ + Size space; + + space = PageGetFreeSpace(page); + if (space > 0) + { + OffsetNumber offnum, + nline; + + nline = PageGetMaxOffsetNumber(page); + if (nline >= MaxZHeapTuplesPerPage) + { + if (PageHasFreeLinePointers((PageHeader) page)) + { + /* + * Since this is just a hint, we must confirm that there is + * indeed a free line pointer + */ + for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum)) + { + ItemId lp = PageGetItemId(page, offnum); + + /* + * The unused items that have pending xact information + * can't be reused. + */ + if (!ItemIdIsUsed(lp) && !ItemIdHasPendingXact(lp)) + break; + } + + if (offnum > nline) + { + /* + * The hint is wrong, but we can't clear it here since we + * don't have the ability to mark the page dirty. + */ + space = 0; + } + } + else + { + /* + * Although the hint might be wrong, PageAddItem will believe + * it anyway, so we must believe it too. + */ + space = 0; + } + } + } + return space; +} + +/* + * RelationPutZHeapTuple - Same as RelationPutHeapTuple, but for ZHeapTuple. + */ +static void +RelationPutZHeapTuple(Relation relation, + Buffer buffer, + ZHeapTuple tuple) +{ + OffsetNumber offnum; + + /* Add the tuple to the page. Caller must ensure to have a TPD page lock. */ + offnum = ZPageAddItem(buffer, NULL, (Item) tuple->t_data, tuple->t_len, + InvalidOffsetNumber, false, true, false); + + if (offnum == InvalidOffsetNumber) + elog(PANIC, "failed to add tuple to page"); + + /* Update tuple->t_self to the actual position where it was stored */ + ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum); +} + +/* + * CopyTupleFromUndoRecord + * Extract the tuple from undo record. Deallocate the previous version + * of tuple and form the new version. + * + * trans_slot_id - If non-NULL, then populate it with the transaction slot of + * transaction that has modified the tuple. + * cid - output command id + * free_zhtup - if true, free the previous version of tuple. 
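+ *
+ * Illustrative only: callers that walk an undo chain typically drive this
+ * function in a loop of roughly the following shape (compare the chain
+ * traversal earlier in this file), where blkno and offnum stand for the
+ * tuple's block and offset number:
+ *
+ *		urec = UndoFetchRecord(urec_ptr, blkno, offnum, prev_undo_xid,
+ *							   NULL, ZHeapSatisfyUndoRecord);
+ *		undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id,
+ *										   NULL, free_zhtup, page);
+ *		urec_ptr = urec->uur_blkprev;
+ *		UndoRecordRelease(urec);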
+ */ +ZHeapTuple +CopyTupleFromUndoRecord(UnpackedUndoRecord *urec, ZHeapTuple zhtup, + int *trans_slot_id, CommandId *cid, bool free_zhtup, + Page page) +{ + ZHeapTuple undo_tup; + + switch (urec->uur_type) + { + case UNDO_INSERT: + { + Assert(zhtup != NULL); + + /* + * We need to deal with undo of root tuple only for a special + * case where during non-inplace update operation, we + * propagate the lockers information to the freshly inserted + * tuple. But, we've to make sure the inserted tuple is locked only. + */ + Assert(ZHEAP_XID_IS_LOCKED_ONLY(zhtup->t_data->t_infomask)); + + undo_tup = palloc(ZHEAPTUPLESIZE + zhtup->t_len); + undo_tup->t_data = (ZHeapTupleHeader) ((char *) undo_tup + ZHEAPTUPLESIZE); + + undo_tup->t_tableOid = zhtup->t_tableOid; + undo_tup->t_len = zhtup->t_len; + undo_tup->t_self = zhtup->t_self; + memcpy(undo_tup->t_data, zhtup->t_data, zhtup->t_len); + + /* + * Ensure to clear the visibility related information from + * the tuple. This is required for the cases where the passed + * in tuple has lock only flags set on it. + */ + undo_tup->t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + + /* + * Free the previous version of tuple, see comments in + * UNDO_INPLACE_UPDATE case. + */ + if (free_zhtup) + zheap_freetuple(zhtup); + + /* Retrieve the TPD transaction slot from payload */ + if (trans_slot_id) + { + if (urec->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + *trans_slot_id = *(int *) urec->uur_payload.data; + else + *trans_slot_id = ZHeapTupleHeaderGetXactSlot(undo_tup->t_data); + } + if (cid) + *cid = urec->uur_cid; + } + break; + case UNDO_XID_LOCK_ONLY: + case UNDO_XID_LOCK_FOR_UPDATE: + case UNDO_XID_MULTI_LOCK_ONLY: + { + ZHeapTupleHeader undo_tup_hdr; + + Assert(zhtup != NULL); + + undo_tup_hdr = (ZHeapTupleHeader) urec->uur_tuple.data; + + /* + * For locked tuples, undo tuple data is always same as prior + * tuple's data as we don't modify it. + */ + undo_tup = palloc(ZHEAPTUPLESIZE + zhtup->t_len); + undo_tup->t_data = (ZHeapTupleHeader) ((char *) undo_tup + ZHEAPTUPLESIZE); + + undo_tup->t_tableOid = zhtup->t_tableOid; + undo_tup->t_len = zhtup->t_len; + undo_tup->t_self = zhtup->t_self; + memcpy(undo_tup->t_data, zhtup->t_data, zhtup->t_len); + + /* + * Free the previous version of tuple, see comments in + * UNDO_INPLACE_UPDATE case. + */ + if (free_zhtup) + zheap_freetuple(zhtup); + + /* + * override the tuple header values with values fetched from + * undo record + */ + undo_tup->t_data->t_infomask2 = undo_tup_hdr->t_infomask2; + undo_tup->t_data->t_infomask = undo_tup_hdr->t_infomask; + undo_tup->t_data->t_hoff = undo_tup_hdr->t_hoff; + + /* Retrieve the TPD transaction slot from payload */ + if (trans_slot_id) + { + if (urec->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + { + /* + * We first store the Lockmode and then transaction slot in + * payload, so retrieve it accordingly. + */ + *trans_slot_id = *(int *) ((char *) urec->uur_payload.data + sizeof(LockTupleMode)); + } + else + *trans_slot_id = ZHeapTupleHeaderGetXactSlot(undo_tup->t_data); + } + } + break; + case UNDO_DELETE: + case UNDO_UPDATE: + case UNDO_INPLACE_UPDATE: + { + Size offset = 0; + uint32 undo_tup_len; + + /* + * After this point, the previous version of tuple won't be used. + * If we don't free the previous version, then we might accumulate + * lot of memory when many prior versions needs to be traversed. 
+ * + * XXX One way to save deallocation and allocation of memory is to + * only make a copy of prior version of tuple when it is determined + * that the version is visible to current snapshot. In practise, + * we don't need to traverse many prior versions, so let's be tidy. + */ + undo_tup_len = *((uint32 *) &urec->uur_tuple.data[offset]); + + undo_tup = palloc(ZHEAPTUPLESIZE + undo_tup_len); + undo_tup->t_data = (ZHeapTupleHeader) ((char *) undo_tup + ZHEAPTUPLESIZE); + + memcpy(&undo_tup->t_len, &urec->uur_tuple.data[offset], sizeof(uint32)); + offset += sizeof(uint32); + + memcpy(&undo_tup->t_self, &urec->uur_tuple.data[offset], sizeof(ItemPointerData)); + offset += sizeof(ItemPointerData); + + memcpy(&undo_tup->t_tableOid, &urec->uur_tuple.data[offset], sizeof(Oid)); + offset += sizeof(Oid); + + memcpy(undo_tup->t_data, (ZHeapTupleHeader) &urec->uur_tuple.data[offset], undo_tup_len); + + /* Retrieve the TPD transaction slot from payload */ + if (trans_slot_id) + { + if (urec->uur_info & UREC_INFO_PAYLOAD_CONTAINS_SLOT) + { + /* + * For UNDO_UPDATE, we first store the CTID and then + * transaction slot, so retrieve it accordingly. + */ + if (urec->uur_type == UNDO_UPDATE) + *trans_slot_id = *(int *) ((char *) urec->uur_payload.data + sizeof(ItemPointerData)); + else + *trans_slot_id = *(int *) urec->uur_payload.data; + } + else + *trans_slot_id = ZHeapTupleHeaderGetXactSlot(undo_tup->t_data); + } + + if (free_zhtup) + zheap_freetuple(zhtup); + } + break; + default: + elog(ERROR, "unsupported undo record type"); + /* + * During tests, we take down the server to notice the error easily. + * This can be removed later. + */ + Assert(0); + } + + /* + * If the undo tuple is pointing to the last slot of the page and the page + * has TPD slots that means the last slot information must move to the + * first slot of the TPD page so change the slot number as per that. + */ + if (page && (*trans_slot_id == ZHEAP_PAGE_TRANS_SLOTS) && + ZHeapPageHasTPDSlot((PageHeader) page)) + *trans_slot_id = ZHEAP_PAGE_TRANS_SLOTS + 1; + + return undo_tup; +} + +/* + * ZHeapGetUsableOffsetRanges + * + * Given a page and a set of tuples, it calculates how many tuples can fit in + * the page and the contiguous ranges of free offsets that can be used/reused + * in the same page to store those tuples. + */ +ZHeapFreeOffsetRanges * +ZHeapGetUsableOffsetRanges(Buffer buffer, + ZHeapTuple *tuples, + int ntuples, + Size saveFreeSpace) +{ + Page page; + PageHeader phdr; + int nthispage; + Size used_space; + Size avail_space; + OffsetNumber limit, offsetNumber; + ZHeapFreeOffsetRanges *zfree_offset_ranges; + + page = BufferGetPage(buffer); + phdr = (PageHeader) page; + + zfree_offset_ranges = (ZHeapFreeOffsetRanges *) + palloc0(sizeof(ZHeapFreeOffsetRanges)); + + zfree_offset_ranges->nranges = 0; + limit = OffsetNumberNext(PageGetMaxOffsetNumber(page)); + avail_space = PageGetExactFreeSpace(page); + nthispage = 0; + used_space = 0; + + if (PageHasFreeLinePointers(phdr)) + { + bool in_range = false; + /* + * Look for "recyclable" (unused) ItemId. We check for no storage + * as well, just to be paranoid --- unused items should never have + * storage. 
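+ *
+ * We stop as soon as offsets have been found for all the tuples we were
+ * asked to place, or as soon as the next tuple no longer fits in the
+ * remaining free space.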
+ */ + for (offsetNumber = 1; offsetNumber < limit; offsetNumber++) + { + ItemId itemId = PageGetItemId(phdr, offsetNumber); + + if (nthispage >= ntuples) + { + /* No more tuples to insert */ + break; + } + if (!ItemIdIsUsed(itemId) && !ItemIdHasStorage(itemId)) + { + ZHeapTuple zheaptup = tuples[nthispage]; + Size needed_space = used_space + zheaptup->t_len + saveFreeSpace; + + /* Check if we can fit this tuple in the page */ + if (avail_space < needed_space) + { + /* No more space to insert tuples in this page */ + break; + } + + used_space += zheaptup->t_len; + nthispage++; + + if (!in_range) + { + /* Start of a new range */ + zfree_offset_ranges->nranges++; + zfree_offset_ranges->startOffset[zfree_offset_ranges->nranges - 1] = offsetNumber; + in_range = true; + } + zfree_offset_ranges->endOffset[zfree_offset_ranges->nranges - 1] = offsetNumber; + } + else + { + in_range = false; + } + } + } + + /* + * Now, there are no free line pointers. Check whether we can insert another + * tuple in the page, then we'll insert another range starting from limit to + * max required offset number. We can decide the actual end offset for this + * range while inserting tuples in the buffer. + */ + if ((limit <= MaxZHeapTuplesPerPage) && (nthispage < ntuples)) + { + ZHeapTuple zheaptup = tuples[nthispage]; + Size needed_space = used_space + sizeof(ItemIdData) + + zheaptup->t_len + saveFreeSpace; + + /* Check if we can fit this tuple + a new offset in the page */ + if (avail_space >= needed_space) + { + OffsetNumber max_required_offset; + int required_tuples = ntuples - nthispage; + + /* + * Choose minimum among MaxOffsetNumber and the maximum offsets + * required for tuples. + */ + max_required_offset = Min(MaxOffsetNumber, (limit + required_tuples)); + + zfree_offset_ranges->nranges++; + zfree_offset_ranges->startOffset[zfree_offset_ranges->nranges - 1] = limit; + zfree_offset_ranges->endOffset[zfree_offset_ranges->nranges - 1] = max_required_offset; + } + } + + return zfree_offset_ranges; +} + +/* + * zheap_multi_insert - insert multiple tuple into a zheap + * + * Similar to heap_multi_insert(), but inserts zheap tuples. + */ +void +zheap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples, + CommandId cid, int options, BulkInsertState bistate) +{ + ZHeapTuple *zheaptuples; + int i; + int ndone; + char *scratch = NULL; + Page page; + bool needwal; + bool need_tuple_data = RelationIsLogicallyLogged(relation); + bool need_cids = RelationIsAccessibleInLogicalDecoding(relation); + Size saveFreeSpace; + TransactionId xid = GetTopTransactionId(); + uint32 epoch = GetEpochForXid(xid); + xl_undolog_meta undometa; + bool lock_reacquired; + bool skip_undo; + + needwal = RelationNeedsWAL(relation); + saveFreeSpace = RelationGetTargetPageFreeSpace(relation, + HEAP_DEFAULT_FILLFACTOR); + /* + * We can skip inserting undo records if the tuples are to be marked + * as frozen. + */ + skip_undo = (options & HEAP_INSERT_FROZEN); + + /* Toast and set header data in all the tuples */ + zheaptuples = palloc(ntuples * sizeof(ZHeapTuple)); + for (i = 0; i < ntuples; i++) + { + zheaptuples[i] = zheap_prepare_insert(relation, ExecGetZHeapTupleFromSlot(slots[i]), options); + + if (slots[i]->tts_tableOid != InvalidOid) + zheaptuples[i]->t_tableOid = slots[i]->tts_tableOid; + } + + /* + * Allocate some memory to use for constructing the WAL record. Using + * palloc() within a critical section is not safe, so we allocate this + * beforehand. 
This has consideration that offset ranges and tuples to be + * stored in page will have size lesser than BLCKSZ. This is true since a + * zheap page contains page header and transaction slots in special area + * which are not stored in scratch area. In future, if we reduce the number + * of transaction slots to one, we may need to allocate twice the BLCKSZ of + * scratch area. + */ + if (needwal) + scratch = palloc(BLCKSZ); + + /* + * See heap_multi_insert to know why checking conflicts is important + * before actually inserting the tuple. + */ + CheckForSerializableConflictIn(relation, NULL, InvalidBuffer); + + ndone = 0; + while (ndone < ntuples) + { + Buffer buffer; + Buffer vmbuffer = InvalidBuffer; + bool all_visible_cleared = false; + int nthispage = 0; + int trans_slot_id; + int ucnt = 0; + UndoRecPtr urecptr = InvalidUndoRecPtr, + prev_urecptr = InvalidUndoRecPtr; + UnpackedUndoRecord *undorecord = NULL; + ZHeapFreeOffsetRanges *zfree_offset_ranges; + OffsetNumber usedoff[MaxOffsetNumber]; + OffsetNumber max_required_offset; + uint8 vm_status; + + CHECK_FOR_INTERRUPTS(); + +reacquire_buffer: + /* + * Find buffer where at least the next tuple will fit. If the page is + * all-visible, this will also pin the requisite visibility map page. + */ + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + + buffer = RelationGetBufferForZTuple(relation, zheaptuples[ndone]->t_len, + InvalidBuffer, options, bistate, + &vmbuffer, NULL); + page = BufferGetPage(buffer); + + /* + * Get the unused offset ranges in the page. This is required for + * deciding the number of undo records to be prepared later. + */ + zfree_offset_ranges = ZHeapGetUsableOffsetRanges(buffer, + &zheaptuples[ndone], + ntuples - ndone, + saveFreeSpace); + + /* + * We've ensured at least one tuple fits in the page. So, there'll be + * at least one offset range. + */ + Assert(zfree_offset_ranges->nranges > 0); + + max_required_offset = + zfree_offset_ranges->endOffset[zfree_offset_ranges->nranges - 1]; + + /* + * If we're not inserting an undo record, we don't have to reserve + * a transaction slot as well. + */ + if (!skip_undo) + { + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. + */ + trans_slot_id = PageReserveTransactionSlot(relation, + buffer, + max_required_offset, + epoch, + xid, + &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + goto reacquire_buffer; + + if (trans_slot_id == InvalidXactSlotId) + { + UnlockReleaseBuffer(buffer); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + goto reacquire_buffer; + } + + /* transaction slot must be reserved before adding tuple to page */ + Assert(trans_slot_id != InvalidXactSlotId); + + /* + * For every contiguous free or new offsets, we insert an undo record. + * In the payload data of each undo record, we store the start and end + * available offset for a contiguous range. 
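+ * The payload itself is filled in only after the tuples have actually been
+ * placed on the page, since the end offset of the last range is not known
+ * until then.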
+ */ + undorecord = (UnpackedUndoRecord *) palloc(zfree_offset_ranges->nranges + * sizeof(UnpackedUndoRecord)); + /* Start UNDO prepare Stuff */ + urecptr = prev_urecptr; + for (i = 0; i < zfree_offset_ranges->nranges; i++) + { + /* prepare an undo record */ + undorecord[i].uur_type = UNDO_MULTI_INSERT; + undorecord[i].uur_info = 0; + undorecord[i].uur_prevlen = 0; /* Fixme - need to figure out how to set this value and then decide whether to WAL log it */ + undorecord[i].uur_reloid = relation->rd_id; + undorecord[i].uur_prevxid = FrozenTransactionId; + undorecord[i].uur_xid = xid; + undorecord[i].uur_cid = cid; + undorecord[i].uur_fork = MAIN_FORKNUM; + undorecord[i].uur_blkprev = urecptr; + undorecord[i].uur_block = BufferGetBlockNumber(buffer); + undorecord[i].uur_tuple.len = 0; + undorecord[i].uur_offset = 0; + undorecord[i].uur_payload.len = 2 * sizeof(OffsetNumber); + } + + UndoSetPrepareSize(undorecord, zfree_offset_ranges->nranges, + InvalidTransactionId, + UndoPersistenceForRelation(relation), &undometa); + + for (i = 0; i < zfree_offset_ranges->nranges; i++) + { + undorecord[i].uur_blkprev = urecptr; + urecptr = PrepareUndoInsert(&undorecord[i], + InvalidTransactionId, + UndoPersistenceForRelation(relation), + NULL); + + initStringInfo(&undorecord[i].uur_payload); + } + + Assert(UndoRecPtrIsValid(urecptr)); + elog(DEBUG1, "Undo record prepared: %d for Block Number: %d", + zfree_offset_ranges->nranges, BufferGetBlockNumber(buffer)); + /* End UNDO prepare Stuff */ + } + + /* + * If there is a valid vmbuffer get its status. The vmbuffer will not + * be valid if operated page is newly extended, see + * RelationGetBufferForZTupleand. Also, anyway by default vm status + * bits are clear for those pages hence no need to clear it again! + */ + vm_status = visibilitymap_get_status(relation, + BufferGetBlockNumber(buffer), &vmbuffer); + + /* + * Lock the TPD page before starting critical section. We might need + * to access it in ZPageAddItemExtended. Note that if the transaction + * slot belongs to TPD entry, then the TPD page must be locked during + * slot reservation. + * + * XXX We can optimize this by avoid taking TPD page lock unless the page + * has some unused item which requires us to fetch the transaction + * information from TPD. + */ + if (trans_slot_id <= ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) page) && + PageHasFreeLinePointers((PageHeader) page)) + TPDPageLock(relation, buffer); + + /* NO EREPORT(ERROR) from here till changes are logged */ + START_CRIT_SECTION(); + + /* + * RelationGetBufferForZTuple has ensured that the first tuple fits. + * Keep calm and put that on the page, and then as many other tuples + * as fit. + */ + nthispage = 0; + for (i = 0; i < zfree_offset_ranges->nranges; i++) + { + OffsetNumber offnum; + + for (offnum = zfree_offset_ranges->startOffset[i]; + offnum <= zfree_offset_ranges->endOffset[i]; + offnum++) + { + ZHeapTuple zheaptup; + + if (ndone + nthispage == ntuples) + break; + + zheaptup = zheaptuples[ndone + nthispage]; + + /* Make sure that the tuple fits in the page. */ + if (PageGetZHeapFreeSpace(page) < zheaptup->t_len + saveFreeSpace) + break; + + if (!(options & HEAP_INSERT_FROZEN)) + ZHeapTupleHeaderSetXactSlot(zheaptup->t_data, trans_slot_id); + + RelationPutZHeapTuple(relation, buffer, zheaptup); + + /* + * Let's make sure that we've decided the offset ranges + * correctly. 
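+ * The offset at which the tuple was actually placed must match the offset
+ * we predicted for it while computing the free offset ranges.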
+ */ + Assert(offnum == ItemPointerGetOffsetNumber(&(zheaptup->t_self))); + + /* track used offsets */ + usedoff[ucnt++] = offnum; + + /* + * We don't use heap_multi_insert for catalog tuples yet, but + * better be prepared... + * Fixme: This won't work as it needs to access cmin/cmax which + * we probably needs to retrieve from TPD or UNDO. + */ + if (needwal && need_cids) + { + /* log_heap_new_cid(relation, heaptup); */ + } + nthispage++; + } + + /* + * Store the offset ranges in undo payload. We've not calculated the + * end offset for the last range previously. Hence, we set it to + * offnum - 1. There is no harm in doing the same for previous undo + * records as well. + */ + zfree_offset_ranges->endOffset[i] = offnum - 1; + if (!skip_undo) + { + appendBinaryStringInfo(&undorecord[i].uur_payload, + (char *) &zfree_offset_ranges->startOffset[i], + sizeof(OffsetNumber)); + appendBinaryStringInfo(&undorecord[i].uur_payload, + (char *) &zfree_offset_ranges->endOffset[i], + sizeof(OffsetNumber)); + } + elog(DEBUG1, "start offset: %d, end offset: %d", + zfree_offset_ranges->startOffset[i], zfree_offset_ranges->endOffset[i]); + } + + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) || + (vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + all_visible_cleared = true; + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), + vmbuffer, VISIBILITYMAP_VALID_BITS); + } + + /* + * XXX Should we set PageSetPrunable on this page ? See heap_insert() + */ + + MarkBufferDirty(buffer); + + if (!skip_undo) + { + /* Insert the undo */ + InsertPreparedUndo(); + + /* + * We're sending the undo record for debugging purpose. So, just send + * the last one. + */ + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + PageSetUNDO(undorecord[zfree_offset_ranges->nranges - 1], + buffer, + trans_slot_id, + true, + epoch, + xid, + urecptr, + usedoff, + ucnt); + } + else + { + PageSetUNDO(undorecord[zfree_offset_ranges->nranges - 1], + buffer, + trans_slot_id, + true, + epoch, + xid, + urecptr, + NULL, + 0); + } + } + + /* XLOG stuff */ + if (needwal) + { + xl_undo_header xlundohdr; + XLogRecPtr recptr; + xl_zheap_multi_insert *xlrec; + uint8 info = XLOG_ZHEAP_MULTI_INSERT; + char *tupledata; + int totaldatalen; + char *scratchptr = scratch; + bool init; + int bufflags = 0; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + + /* + * Store the information required to generate undo record during + * replay. All undo records have same information apart from the + * payload data. Hence, we can copy the same from the last record. + */ + xlundohdr.reloid = relation->rd_id; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = prev_urecptr; + + /* allocate xl_zheap_multi_insert struct from the scratch area */ + xlrec = (xl_zheap_multi_insert *) scratchptr; + xlrec->flags = all_visible_cleared ? 
XLZ_INSERT_ALL_VISIBLE_CLEARED : 0; + if (skip_undo) + xlrec->flags |= XLZ_INSERT_IS_FROZEN; + xlrec->ntuples = nthispage; + scratchptr += SizeOfZHeapMultiInsert; + + /* copy the offset ranges as well */ + memcpy((char *) scratchptr, (char *) &zfree_offset_ranges->nranges, sizeof(int)); + scratchptr += sizeof(int); + for (i = 0; i < zfree_offset_ranges->nranges; i++) + { + memcpy((char *)scratchptr, (char *)&zfree_offset_ranges->startOffset[i], sizeof(OffsetNumber)); + scratchptr += sizeof(OffsetNumber); + memcpy((char *)scratchptr, (char *)&zfree_offset_ranges->endOffset[i], sizeof(OffsetNumber)); + scratchptr += sizeof(OffsetNumber); + } + + /* the rest of the scratch space is used for tuple data */ + tupledata = scratchptr; + + /* + * Write out an xl_multi_insert_tuple and the tuple data itself + * for each tuple. + */ + for (i = 0; i < nthispage; i++) + { + ZHeapTuple zheaptup = zheaptuples[ndone + i]; + xl_multi_insert_ztuple *tuphdr; + int datalen; + + /* xl_multi_insert_tuple needs two-byte alignment. */ + tuphdr = (xl_multi_insert_ztuple *) SHORTALIGN(scratchptr); + scratchptr = ((char *) tuphdr) + SizeOfMultiInsertZTuple; + + tuphdr->t_infomask2 = zheaptup->t_data->t_infomask2; + tuphdr->t_infomask = zheaptup->t_data->t_infomask; + tuphdr->t_hoff = zheaptup->t_data->t_hoff; + + /* write bitmap [+ padding] [+ oid] + data */ + datalen = zheaptup->t_len - SizeofZHeapTupleHeader; + memcpy(scratchptr, + (char *) zheaptup->t_data + SizeofZHeapTupleHeader, + datalen); + tuphdr->datalen = datalen; + scratchptr += datalen; + } + totaldatalen = scratchptr - tupledata; + Assert((scratchptr - scratch) < BLCKSZ); + + if (need_tuple_data) + xlrec->flags |= XLZ_INSERT_CONTAINS_NEW_TUPLE; + + /* + * Signal that this is the last xl_zheap_multi_insert record + * emitted by this call to zheap_multi_insert(). Needed for logical + * decoding so it knows when to cleanup temporary data. + */ + if (ndone + nthispage == ntuples) + xlrec->flags |= XLZ_INSERT_LAST_IN_MULTI; + + /* + * If the page was previously empty, we can reinit the page + * instead of restoring the whole thing. + */ + init = (ItemPointerGetOffsetNumber(&(zheaptuples[ndone]->t_self)) == FirstOffsetNumber && + PageGetMaxOffsetNumber(page) == FirstOffsetNumber + nthispage - 1); + + if (init) + { + info |= XLOG_ZHEAP_INIT_PAGE; + bufflags |= REGBUF_WILL_INIT; + } + + /* + * If we're doing logical decoding, include the new tuple data + * even if we take a full-page image of the page. + */ + if (need_tuple_data) + bufflags |= REGBUF_KEEP_DATA; + +prepare_xlog: + /* LOG undolog meta if this is the first WAL after the checkpoint. */ + LogUndoMetaData(&undometa); + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + + XLogBeginInsert(); + /* copy undo related info in maindata */ + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + /* copy xl_multi_insert_tuple in maindata */ + XLogRegisterData((char *) xlrec, tupledata - scratch); + + /* If we've skipped undo insertion, we don't need a slot in page. 
*/ + if (!skip_undo && trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + xlrec->flags |= XLZ_INSERT_CONTAINS_TPD_SLOT; + XLogRegisterData((char *) &trans_slot_id, sizeof(trans_slot_id)); + } + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags); + + /* copy tuples in block data */ + XLogRegisterBufData(0, tupledata, totaldatalen); + if (xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + (void) RegisterTPDBuffer(page, 1); + + /* filtering by origin on a row level is much more efficient */ + XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN); + + recptr = XLogInsertExtended(RM_ZHEAP_ID, info, RedoRecPtr, + doPageWrites); + if (recptr == InvalidXLogRecPtr) + { + ResetRegisteredTPDBuffers(); + goto prepare_xlog; + } + + PageSetLSN(page, recptr); + if (xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + /* be tidy */ + if (!skip_undo) + { + for (i = 0; i < zfree_offset_ranges->nranges; i++) + pfree(undorecord[i].uur_payload.data); + pfree(undorecord); + } + pfree(zfree_offset_ranges); + + UnlockReleaseBuffer(buffer); + if (vmbuffer != InvalidBuffer) + ReleaseBuffer(vmbuffer); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + + ndone += nthispage; + } + + /* + * We're done with the actual inserts. Check for conflicts again, to + * ensure that all rw-conflicts in to these inserts are detected. Without + * this final check, a sequential scan of the heap may have locked the + * table after the "before" check, missing one opportunity to detect the + * conflict, and then scanned the table before the new tuples were there, + * missing the other chance to detect the conflict. + * + * For heap inserts, we only need to check for table-level SSI locks. Our + * new tuples can't possibly conflict with existing tuple locks, and heap + * page locks are only consolidated versions of tuple locks; they do not + * lock "gaps" as index page locks do. So we don't need to specify a + * buffer when making the call. + */ + CheckForSerializableConflictIn(relation, NULL, InvalidBuffer); + + /* + * If tuples are cachable, mark them for invalidation from the caches in + * case we abort. Note it is OK to do this after releasing the buffer, + * because the heaptuples data structure is all in local memory, not in + * the shared buffer. + */ + if (IsCatalogRelation(relation)) + { + /* + for (i = 0; i < ntuples; i++) + CacheInvalidateHeapTuple(relation, zheaptuples[i], NULL); */ + } + + /* + * Copy t_self fields back to the caller's original tuples. This does + * nothing for untoasted tuples (tuples[i] == heaptuples[i)], but it's + * probably faster to always copy than check. + */ + for (i = 0; i < ntuples; i++) + slots[i]->tts_tid = zheaptuples[i]->t_self; + + pgstat_count_heap_insert(relation, ntuples); +} + +/* + * Mask a zheap page before performing consistency checks on it. + */ +void +zheap_mask(char *pagedata, BlockNumber blkno) +{ + Page page = (Page) pagedata; + + mask_page_lsn_and_checksum(page); + + mask_page_hint_bits(page); + mask_unused_space(page); + + if (PageGetSpecialSize(page) == MAXALIGN(BLCKSZ)) + { + ZHeapMetaPage metap PG_USED_FOR_ASSERTS_ONLY; + metap = ZHeapPageGetMeta(page); + /* It's a meta-page, no need to mask further. */ + Assert(metap->zhm_magic == ZHEAP_MAGIC); + Assert(metap->zhm_version == ZHEAP_VERSION); + return; + } + + if (PageGetSpecialSize(page) == MAXALIGN(sizeof(TPDPageOpaqueData))) + { + /* It's a TPD page, no need to mask further. 
*/ + return; + } +} + +/* + * Per-undorecord callback from UndoFetchRecord to check whether + * an undorecord satisfies the given conditions. + */ +bool +ZHeapSatisfyUndoRecord(UnpackedUndoRecord* urec, BlockNumber blkno, + OffsetNumber offset, TransactionId xid) +{ + Assert(urec != NULL); + Assert(blkno != InvalidBlockNumber); + + if ((urec->uur_block != blkno || + (TransactionIdIsValid(xid) && !TransactionIdEquals(xid, urec->uur_xid)))) + return false; + + switch (urec->uur_type) + { + case UNDO_MULTI_INSERT: + { + OffsetNumber start_offset; + OffsetNumber end_offset; + + start_offset = ((OffsetNumber *) urec->uur_payload.data)[0]; + end_offset = ((OffsetNumber *) urec->uur_payload.data)[1]; + + if (offset >= start_offset && offset <= end_offset) + return true; + } + break; + case UNDO_ITEMID_UNUSED: + { + /* + * We don't expect to check the visibility of any unused item, + * but the undo record of same can be present in chain which + * we need to ignore. + */ + } + break; + default: + { + Assert(offset != InvalidOffsetNumber); + if (urec->uur_offset == offset) + return true; + } + break; + } + + return false; +} + +/* + * zheap_get_latest_tid - get the latest tid of a specified tuple + * + * Functionally, it serves the same purpose as heap_get_latest_tid(), but it + * follows a different way of traversing the ctid chain of updated tuples. + */ +void +zheap_get_latest_tid(Relation relation, + Snapshot snapshot, + ItemPointer tid) +{ + BlockNumber blk; + ItemPointerData ctid; + TransactionId priorXmax; + int tup_len; + + /* this is to avoid Assert failures on bad input */ + if (!ItemPointerIsValid(tid)) + return; + + /* + * Since this can be called with user-supplied TID, don't trust the input + * too much. (RelationGetNumberOfBlocks is an expensive check, so we + * don't check t_ctid links again this way. Note that it would not do to + * call it just once and save the result, either.) + */ + blk = ItemPointerGetBlockNumber(tid); + if (blk >= RelationGetNumberOfBlocks(relation)) + elog(ERROR, "block number %u is out of range for relation \"%s\"", + blk, RelationGetRelationName(relation)); + + /* + * Loop to chase down ctid links. At top of loop, ctid is the tuple we + * need to examine, and *tid is the TID we will return if ctid turns out + * to be bogus. + * + * Note that we will loop until we reach the end of the t_ctid chain. + * Depending on the snapshot passed, there might be at most one visible + * version of the row, but we don't try to optimize for that. + */ + ctid = *tid; + priorXmax = InvalidTransactionId; + for (;;) + { + Buffer buffer; + Page page; + OffsetNumber offnum; + ItemId lp; + ZHeapTuple tp = NULL; + ZHeapTuple resulttup = NULL; + ItemPointerData new_ctid; + uint16 infomask; + + /* + * Read, pin, and lock the page. + */ + buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(&ctid)); + LockBuffer(buffer, BUFFER_LOCK_SHARE); + page = BufferGetPage(buffer); + + /* + * Check for bogus item number. This is not treated as an error + * condition because it can happen while following a ctid link. We + * just assume that the prior tid is OK and return it unchanged. 
+ */ + offnum = ItemPointerGetOffsetNumber(&ctid); + if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(page)) + { + UnlockReleaseBuffer(buffer); + break; + } + lp = PageGetItemId(page, offnum); + if (!ItemIdIsNormal(lp)) + { + UnlockReleaseBuffer(buffer); + break; + } + + /* + * We always need to make a copy of zheap tuple; if an older version is + * returned from the undo record, the passed in tuple gets freed. + */ + tup_len = ItemIdGetLength(lp); + tp = palloc(ZHEAPTUPLESIZE + tup_len); + tp->t_data = (ZHeapTupleHeader) (((char *) tp) + ZHEAPTUPLESIZE); + tp->t_tableOid = RelationGetRelid(relation); + tp->t_len = tup_len; + tp->t_self = ctid; + + memcpy(tp->t_data, ((ZHeapTupleHeader) PageGetItem(page, lp)), + tup_len); + + /* Save the infomask. The tuple might get freed, as mentioned above */ + infomask = tp->t_data->t_infomask; + + /* + * Ensure that the tuple is same as what we are expecting. If the + * the current or any prior version of tuple doesn't contain the + * effect of priorXmax, then the slot must have been recycled and + * reused for an unrelated tuple. This implies that the latest + * version of the row was deleted, so we need do nothing. + */ + if (TransactionIdIsValid(priorXmax) && + !ValidateTuplesXact(tp, snapshot, buffer, priorXmax, false)) + { + UnlockReleaseBuffer(buffer); + break; + } + + /* + * Get the transaction which modified this tuple. Ideally we need to + * get this only when there is a ctid chain to follow. But since the + * visibility function frees the tuple, we have to do this here + * regardless of the existence of a ctid chain. + */ + ZHeapTupleGetTransInfo(tp, buffer, NULL, NULL, &priorXmax, NULL, NULL, + false); + + /* + * Check time qualification of tuple; if visible, set it as the new + * result candidate. + */ + ItemPointerSetInvalid(&new_ctid); + resulttup = ZHeapTupleSatisfies(tp, snapshot, buffer, &new_ctid); + + /* + * If any prior version is visible, we pass latest visible as + * true. The state of latest version of tuple is determined by + * the called function. + * + * Note that, it's possible that tuple is updated in-place and + * we're seeing some prior version of that. We handle that case + * in ZHeapTupleHasSerializableConflictOut. + */ + CheckForSerializableConflictOut((resulttup != NULL), relation, + (void *) &ctid, + buffer, snapshot); + + /* Pass back the tuple ctid if it's visible */ + if (resulttup != NULL) + *tid = ctid; + + /* If there's a valid ctid link, follow it, else we're done. */ + if (!ItemPointerIsValid(&new_ctid) || + ZHEAP_XID_IS_LOCKED_ONLY(infomask) || + ZHeapTupleIsMoved(infomask) || + ItemPointerEquals(&ctid, &new_ctid)) + { + if (resulttup != NULL) + zheap_freetuple(resulttup); + UnlockReleaseBuffer(buffer); + break; + } + + ctid = new_ctid; + + if (resulttup != NULL) + zheap_freetuple(resulttup); + UnlockReleaseBuffer(buffer); + } /* end of loop */ +} + +/* + * Perform XLogInsert for a zheap-visible operation. vm_buffer is the buffer + * containing the corresponding visibility map block. The vm_buffer should + * have already been modified and dirtied. 
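+ *
+ * Only the visibility map buffer is registered with the WAL record; the
+ * zheap buffer's block number is carried in the record itself (heapBlk).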
+ */ +XLogRecPtr +log_zheap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer, + TransactionId cutoff_xid, uint8 vmflags) +{ + xl_zheap_visible xlrec; + XLogRecPtr recptr; + + Assert(BufferIsValid(heap_buffer)); + Assert(BufferIsValid(vm_buffer)); + + xlrec.cutoff_xid = cutoff_xid; + xlrec.flags = vmflags; + xlrec.heapBlk = BufferGetBlockNumber(heap_buffer); + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfZHeapVisible); + + XLogRegisterBuffer(0, vm_buffer, 0); + + recptr = XLogInsert(RM_ZHEAP2_ID, XLOG_ZHEAP_VISIBLE); + + return recptr; +} + +/* + * GetTransactionsSlotsForPage - returns transaction slots for a zheap page + * + * This method returns all the transaction slots for the input zheap page + * including the corresponding TPD page. It also returns the corresponding + * TPD buffer if there is one. + */ +TransInfo * +GetTransactionsSlotsForPage(Relation rel, Buffer buf, int *total_trans_slots, + BlockNumber *tpd_blkno) +{ + Page page; + PageHeader phdr; + TransInfo *tpd_trans_slots; + TransInfo *trans_slots = NULL; + bool tpd_e_pruned; + + *total_trans_slots = 0; + if (tpd_blkno) + *tpd_blkno = InvalidBlockNumber; + + page = BufferGetPage(buf); + phdr = (PageHeader) page; + + if (ZHeapPageHasTPDSlot(phdr)) + { + int num_tpd_trans_slots; + + tpd_trans_slots = TPDPageGetTransactionSlots(rel, + buf, + InvalidOffsetNumber, + false, + false, + NULL, + &num_tpd_trans_slots, + NULL, + &tpd_e_pruned, + NULL); + if (!tpd_e_pruned) + { + ZHeapPageOpaque zopaque; + TransInfo last_trans_slot_info; + + zopaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + last_trans_slot_info = zopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1]; + + if (tpd_blkno) + *tpd_blkno = last_trans_slot_info.xid_epoch; + + /* + * The last slot in page contains TPD information, so we don't need to + * include it. + */ + *total_trans_slots = num_tpd_trans_slots + ZHEAP_PAGE_TRANS_SLOTS - 1; + trans_slots = (TransInfo *) + palloc(*total_trans_slots * sizeof(TransInfo)); + /* Copy the transaction slots from the page. */ + memcpy(trans_slots, page + phdr->pd_special, + (ZHEAP_PAGE_TRANS_SLOTS - 1) * sizeof(TransInfo)); + /* Copy the transaction slots from the tpd entry. */ + memcpy((char *) trans_slots + ((ZHEAP_PAGE_TRANS_SLOTS - 1) * sizeof(TransInfo)), + tpd_trans_slots, num_tpd_trans_slots * sizeof(TransInfo)); + + pfree(tpd_trans_slots); + } + } + + if (!ZHeapPageHasTPDSlot(phdr) || tpd_e_pruned) + { + Assert (trans_slots == NULL); + + *total_trans_slots = ZHEAP_PAGE_TRANS_SLOTS; + trans_slots = (TransInfo *) + palloc(*total_trans_slots * sizeof(TransInfo)); + memcpy(trans_slots, page + phdr->pd_special, + *total_trans_slots * sizeof(TransInfo)); + } + + Assert(*total_trans_slots >= ZHEAP_PAGE_TRANS_SLOTS); + + return trans_slots; +} + +/* + * CheckAndLockTPDPage - Check and lock the TPD page before starting critical + * section. + * + * We might need to access it in ZPageAddItemExtended. Note that if the + * transaction slot belongs to TPD entry, then the TPD page must be locked during + * slot reservation. Also, if the old buffer and new buffer refers to the + * same TPD page and the old transaction slot corresponds to a TPD slot, + * the TPD page must be locked during slot reservation. + * + * XXX We can optimize this by avoid taking TPD page lock unless the page + * has some unused item which requires us to fetch the transaction + * information from TPD. 
+ */ +static inline void +CheckAndLockTPDPage(Relation relation, int new_trans_slot_id, int old_trans_slot_id, + Buffer newbuf, Buffer oldbuf) +{ + if (new_trans_slot_id <= ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) BufferGetPage(newbuf)) && + PageHasFreeLinePointers((PageHeader)BufferGetPage(newbuf))) + { + /* + * If the old buffer and new buffer refers to the same TPD page + * and the old transaction slot corresponds to a TPD slot, + * we must have locked the TPD page during slot reservation. + */ + if (ZHeapPageHasTPDSlot((PageHeader) BufferGetPage(oldbuf)) && + (old_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS)) + { + Page oldpage, newpage; + ZHeapPageOpaque oldopaque, newopaque; + BlockNumber oldtpdblk, newtpdblk; + + oldpage = BufferGetPage(oldbuf); + newpage = BufferGetPage(newbuf); + oldopaque = (ZHeapPageOpaque) PageGetSpecialPointer(oldpage); + newopaque = (ZHeapPageOpaque) PageGetSpecialPointer(newpage); + + oldtpdblk = oldopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid_epoch; + newtpdblk = newopaque->transinfo[ZHEAP_PAGE_TRANS_SLOTS - 1].xid_epoch; + + if (oldtpdblk != newtpdblk) + TPDPageLock(relation, newbuf); + } + else + TPDPageLock(relation, newbuf); + } +} diff --git a/src/backend/access/zheap/zheapam_handler.c b/src/backend/access/zheap/zheapam_handler.c new file mode 100644 index 0000000000..2ecbeccc0d --- /dev/null +++ b/src/backend/access/zheap/zheapam_handler.c @@ -0,0 +1,1867 @@ +/*------------------------------------------------------------------------- + * + * zheapam_handler.c + * zheap table access method code + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/zheap/zheapam_handler.c + * + * + * NOTES + * This file contains the zheap_ routines which implement + * the POSTGRES zheap table access method used for all POSTGRES + * relations. 
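+ * Most routines here are thin wrappers that adapt the zheap_* functions
+ * to the table access method callback signatures.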
+ * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + + +#include "miscadmin.h" + +#include "access/zheap.h" +#include "access/relscan.h" +#include "access/rewritezheap.h" +#include "access/tableam.h" +#include "access/tpd.h" +#include "access/tsmapi.h" +#include "access/visibilitymap.h" +#include "access/zheapscan.h" +#include "access/zheaputils.h" +#include "access/xact.h" +#include "catalog/pg_am_d.h" +#include "catalog/catalog.h" +#include "catalog/storage_xlog.h" +#include "commands/vacuum.h" +#include "pgstat.h" +#include "storage/lmgr.h" +#include "storage/bufpage.h" +#include "storage/bufmgr.h" +#include "storage/predicate.h" +#include "storage/procarray.h" +#include "storage/smgr.h" +#include "utils/builtins.h" +#include "utils/rel.h" +#include "utils/tqual.h" + + +/* ---------------------------------------------------------------- + * storage AM support routines for heapam + * ---------------------------------------------------------------- + */ + +static bool +zheapam_fetch_row_version(Relation relation, + ItemPointer tid, + Snapshot snapshot, + TupleTableSlot *slot, + Relation stats_relation) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + Buffer buffer; + + ExecClearTuple(slot); + + if (zheap_fetch(relation, snapshot, tid, &zslot->tuple, &buffer, false, stats_relation)) + { + ExecStoreZTuple(zslot->tuple, slot, buffer, true); + ReleaseBuffer(buffer); + + slot->tts_tableOid = RelationGetRelid(relation); + + return true; + } + + slot->tts_tableOid = RelationGetRelid(relation); + + return false; +} + +/* + * Insert a heap tuple from a slot, which may contain an OID and speculative + * insertion token. + */ +static void +zheapam_insert(Relation relation, TupleTableSlot *slot, CommandId cid, + int options, BulkInsertState bistate) +{ + ZHeapTuple tuple = ExecGetZHeapTupleFromSlot(slot); + + /* Update the tuple with table oid */ + slot->tts_tableOid = RelationGetRelid(relation); + if (slot->tts_tableOid != InvalidOid) + tuple->t_tableOid = slot->tts_tableOid; + + /* Perform the insertion, and copy the resulting ItemPointer */ + zheap_insert(relation, tuple, cid, options, bistate); + ItemPointerCopy(&tuple->t_self, &slot->tts_tid); +} + +static void +zheapam_insert_speculative(Relation relation, TupleTableSlot *slot, CommandId cid, + int options, BulkInsertState bistate, uint32 specToken) +{ + ZHeapTuple tuple = ExecGetZHeapTupleFromSlot(slot); + + /* Update the tuple with table oid */ + slot->tts_tableOid = RelationGetRelid(relation); + if (slot->tts_tableOid != InvalidOid) + tuple->t_tableOid = slot->tts_tableOid; + +#ifdef ZBORKED + HeapTupleHeaderSetSpeculativeToken(tuple->t_data, specToken); +#endif + + /* Perform the insertion, and copy the resulting ItemPointer */ + zheap_insert(relation, tuple, cid, options, bistate); + ItemPointerCopy(&tuple->t_self, &slot->tts_tid); +} + +static void +zheapam_complete_speculative(Relation relation, TupleTableSlot *slot, uint32 spekToken, + bool succeeded) +{ + ZHeapTuple tuple = ExecGetZHeapTupleFromSlot(slot); + + /* adjust the tuple's state accordingly */ + if (!succeeded) + zheap_finish_speculative(relation, tuple); + else + { + zheap_abort_speculative(relation, tuple); + } +} + + +static HTSU_Result +zheapam_delete(Relation relation, ItemPointer tid, CommandId cid, + Snapshot snapshot, Snapshot crosscheck, bool wait, + HeapUpdateFailureData *hufd, bool changingPart) +{ + /* + * Currently Deleting of index tuples are handled at vacuum, in case + * if the storage 
itself is cleaning the dead tuples by itself, it is + * the time to call the index tuple deletion also. + */ + return zheap_delete(relation, tid, cid, crosscheck, snapshot, wait, hufd, changingPart); +} + + +/* + * Locks tuple and fetches its newest version and TID. + * + * relation - table containing tuple + * tid - TID of tuple to lock + * snapshot - snapshot indentifying required version (used for assert check only) + * slot - tuple to be returned + * cid - current command ID (used for visibility test, and stored into + * tuple's cmax if lock is successful) + * mode - indicates if shared or exclusive tuple lock is desired + * wait_policy - what to do if tuple lock is not available + * flags – indicating how do we handle updated tuples + * *hufd - filled in failure cases + * + * Function result may be: + * HeapTupleMayBeUpdated: lock was successfully acquired + * HeapTupleInvisible: lock failed because tuple was never visible to us + * HeapTupleSelfUpdated: lock failed because tuple updated by self + * HeapTupleUpdated: lock failed because tuple updated by other xact + * HeapTupleDeleted: lock failed because tuple deleted by other xact + * HeapTupleWouldBlock: lock couldn't be acquired and wait_policy is skip + * + * In the failure cases other than HeapTupleInvisible, the routine fills + * *hufd with the tuple's t_ctid, t_xmax (resolving a possible MultiXact, + * if necessary), and t_cmax (the last only for HeapTupleSelfUpdated, + * since we cannot obtain cmax from a combocid generated by another + * transaction). + * See comments for struct HeapUpdateFailureData for additional info. + */ +static HTSU_Result +zheapam_lock_tuple(Relation relation, ItemPointer tid, Snapshot snapshot, + TupleTableSlot *slot, CommandId cid, LockTupleMode mode, + LockWaitPolicy wait_policy, uint8 flags, + HeapUpdateFailureData *hufd) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + HTSU_Result result; + Buffer buffer; + ZHeapTuple tuple = &zslot->tupdata; + bool doWeirdEval = (flags & TUPLE_LOCK_FLAG_WEIRD) != 0; + + hufd->traversed = false; + +retry: + result = zheap_lock_tuple(relation, tid, cid, mode, wait_policy, + (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) ? true : false, + doWeirdEval, + snapshot, tuple, &buffer, hufd); + + if (result == HeapTupleUpdated && + (flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION)) + { + SnapshotData SnapshotDirty; + TransactionId priorXmax = hufd->xmax; + + ReleaseBuffer(buffer); + + /* Should not encounter speculative tuple on recheck */ + Assert(!(tuple->t_data->t_infomask & ZHEAP_SPECULATIVE_INSERT)); + + /* it was updated, so look at the updated version */ + *tid = hufd->ctid; + /* updated row should have xmin matching this xmax */ + priorXmax = hufd->xmax; + + if (ItemPointerEquals(&hufd->ctid, &tuple->t_self) && false) + { + /* tuple was deleted, so give up */ + return HeapTupleDeleted; + } + + /* + * fetch target tuple + * + * Loop here to deal with updated or busy tuples + */ + InitDirtySnapshot(SnapshotDirty); + for (;;) + { + /* check whether next version would be in a different partition */ + if (ItemPointerIndicatesMovedPartitions(&hufd->ctid)) + ereport(ERROR, + (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), + errmsg("tuple to be locked was already moved to another partition due to concurrent update"))); + + if (zheap_fetch(relation, &SnapshotDirty, tid, &tuple, &buffer, true, NULL)) + { + /* + * Ensure that the tuple is same as what we are expecting. 
If the + * the current or any prior version of tuple doesn't contain the + * effect of priorXmax, then the slot must have been recycled and + * reused for an unrelated tuple. This implies that the latest + * version of the row was deleted, so we need do nothing. + */ + if (!ValidateTuplesXact(tuple, &SnapshotDirty, buffer, priorXmax, false)) + { + ReleaseBuffer(buffer); + return HeapTupleDeleted; + } + + /* otherwise xmin should not be dirty... */ + if (TransactionIdIsValid(SnapshotDirty.xmin)) + elog(ERROR, "t_xmin is uncommitted in tuple to be updated"); + + /* + * If tuple is being updated by other (sub)transaction then we have to + * wait for its commit/abort, or die trying. + */ + if (SnapshotDirty.subxid != InvalidSubTransactionId && + TransactionIdIsValid(SnapshotDirty.xmax)) + { + ReleaseBuffer(buffer); + switch (wait_policy) + { + case LockWaitBlock: + SubXactLockTableWait(SnapshotDirty.xmax, + SnapshotDirty.subxid, + relation, &tuple->t_self, + XLTW_FetchUpdated); + break; + case LockWaitSkip: + if (!ConditionalSubXactLockTableWait(SnapshotDirty.xmax, + SnapshotDirty.subxid)) + return result; /* skip instead of waiting */ + break; + case LockWaitError: + if (ConditionalSubXactLockTableWait(SnapshotDirty.xmax, + SnapshotDirty.subxid)) + ereport(ERROR, + (errcode(ERRCODE_LOCK_NOT_AVAILABLE), + errmsg("could not obtain lock on row in relation \"%s\"", + RelationGetRelationName(relation)))); + + break; + } + continue; /* loop back to repeat zheap_fetch */ + } + else if (TransactionIdIsValid(SnapshotDirty.xmax)) + { + ReleaseBuffer(buffer); + switch (wait_policy) + { + case LockWaitBlock: + XactLockTableWait(SnapshotDirty.xmax, relation, + &tuple->t_self, XLTW_FetchUpdated); + break; + case LockWaitSkip: + if (!ConditionalXactLockTableWait(SnapshotDirty.xmax)) + return result; /* skip instead of waiting */ + break; + case LockWaitError: + if (!ConditionalXactLockTableWait(SnapshotDirty.xmax)) + ereport(ERROR, + (errcode(ERRCODE_LOCK_NOT_AVAILABLE), + errmsg("could not obtain lock on row in relation \"%s\"", + RelationGetRelationName(relation)))); + break; + } + continue; /* loop back to repeat zheap_fetch */ + } + + /* + * If tuple was inserted by our own transaction, we have to check + * cmin against es_output_cid: cmin >= current CID means our + * command cannot see the tuple, so we should ignore it. Otherwise + * zheap_lock_tuple() will throw an error, and so would any later + * attempt to update or delete the tuple. (We need not check cmax + * because ZHeapTupleSatisfiesDirty will consider a tuple deleted + * by our transaction dead, regardless of cmax.) We just checked + * that priorXmax == xmin, so we can test that variable instead of + * doing ZHeapTupleHeaderGetXid again. + */ + if (TransactionIdIsCurrentTransactionId(priorXmax)) + { + LockBuffer(buffer, BUFFER_LOCK_SHARE); + /* + * Fixme -If the tuple is updated such that its transaction slot + * has been changed, then we will never be able to get the correct + * tuple from undo. To avoid, that we need to get the latest tuple + * from page rather than relying on it's in-memory copy. See + * ValidateTuplesXact. + */ + if (ZHeapTupleGetCid(tuple, buffer, InvalidUndoRecPtr, + InvalidXactSlotId) >= cid) + { + UnlockReleaseBuffer(buffer); + // ZBORKED: is this correct? 
+ return HeapTupleSelfUpdated; + } + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + } + + doWeirdEval = true; + hufd->traversed = true; + ReleaseBuffer(buffer); + goto retry; + } + + /* + * If we don't get any tuple, the latest version of the row must have + * been deleted, so we need do nothing. + */ + if (tuple == NULL) + { + ReleaseBuffer(buffer); + return HeapTupleDeleted; + } + + /* Ensure that the tuple is same as what we are expecting as above. */ + if (!ValidateTuplesXact(tuple, &SnapshotDirty, buffer, priorXmax, true)) + { + if (BufferIsValid(buffer)) + ReleaseBuffer(buffer); + return HeapTupleDeleted; + } + + /* check whether next version would be in a different partition */ + if (ZHeapTupleIsMoved(tuple->t_data->t_infomask)) + ereport(ERROR, + (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), + errmsg("tuple to be locked was already moved to another partition due to concurrent update"))); + + if (ItemPointerEquals(&(tuple->t_self), tid)) + { + /* deleted, so forget about it */ + ReleaseBuffer(buffer); + return HeapTupleDeleted; + } + + /* updated row should have xid matching this xmax */ + ZHeapTupleGetTransInfo(tuple, buffer, NULL, NULL, &priorXmax, NULL, + NULL, true); + + /* + * As we still hold a snapshot to which priorXmax is not visible, neither + * the transaction slot on tuple can be marked as frozen nor the + * corresponding undo be discarded. + */ + Assert(TransactionIdIsValid(priorXmax)); + + priorXmax = hufd->xmax; + + /* be tidy */ + zheap_freetuple(tuple); + ReleaseBuffer(buffer); + /* loop back to fetch next in chain */ + } + } + + slot->tts_tableOid = RelationGetRelid(relation); + ExecStoreZTuple(tuple, slot, buffer, false); + ReleaseBuffer(buffer); // FIXME: invent option to just transfer pin? + + return result; +} + + +static HTSU_Result +zheapam_update(Relation relation, ItemPointer otid, TupleTableSlot *slot, + CommandId cid, Snapshot snapshot, Snapshot crosscheck, + bool wait, HeapUpdateFailureData *hufd, LockTupleMode *lockmode, + bool *update_indexes) +{ + ZHeapTuple ztuple = ExecGetZHeapTupleFromSlot(slot); + HTSU_Result result; + + /* Update the tuple with table oid */ + if (slot->tts_tableOid != InvalidOid) + ztuple->t_tableOid = slot->tts_tableOid; + + result = zheap_update(relation, otid, ztuple, cid, crosscheck, snapshot, + wait, hufd, lockmode); + ItemPointerCopy(&ztuple->t_self, &slot->tts_tid); + + slot->tts_tableOid = RelationGetRelid(relation); + + /* + * Note: instead of having to update the old index tuples associated with + * the heap tuple, all we do is form and insert new index tuples. This is + * because UPDATEs are actually DELETEs and INSERTs, and index tuple + * deletion is done later by VACUUM (see notes in ExecDelete). All we do + * here is insert new index tuples. -cim 9/27/89 + */ + + /* + * insert index entries for tuple + * + * Note: heap_update returns the tid (location) of the new tuple in the + * t_self field. + * + * If it's a HOT update, we mustn't insert new index entries. 
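+ *
+ * For zheap the analogous case is an in-place update: the TID does not
+ * change, so no new index entries are needed.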
+ */ + *update_indexes = result == HeapTupleMayBeUpdated && + !ZHeapTupleIsInPlaceUpdated(ztuple->t_data->t_infomask); + + return result; +} + +static const TupleTableSlotOps * +zheapam_slot_callbacks(Relation relation) +{ + return &TTSOpsZHeapTuple; +} + +static bool +zheapam_satisfies(Relation rel, TupleTableSlot *slot, Snapshot snapshot) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + Buffer buffer; + Page page; + ItemId lp; + ItemPointer tid; + ZHeapTupleData zhtup; + ZHeapTuple tup; + Oid tableoid; + bool res; + + Assert(TTS_IS_ZHEAP(slot)); + Assert(zslot->tuple); + + tableoid = zslot->tuple->t_tableOid; + tid = &(zslot->tuple->t_self); + + buffer = ReadBuffer(rel, ItemPointerGetBlockNumber(tid)); + LockBuffer(buffer, BUFFER_LOCK_SHARE); + + page = BufferGetPage(buffer); + lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid)); + + /* + * Since the current transaction has inserted/updated the tuple, it + * can't be deleted. + */ + Assert(ItemIdIsNormal(lp)); + + zhtup.t_tableOid = tableoid; + zhtup.t_data = (ZHeapTupleHeader) PageGetItem((Page) page, lp); + zhtup.t_len = ItemIdGetLength(lp); + zhtup.t_self = *tid; + + tup = ZHeapTupleSatisfies(&zhtup, snapshot, buffer, tid); + + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + ReleaseBuffer(buffer); + + if (!tup) + { + /* satisfies routine returned no tuple, so clearly invisible */ + res = false; + } + else if (tup->t_len != zslot->tuple->t_len) + { + /* length differs, the input tuple can't be visible */ + res = false; + } + else if (memcmp(tup->t_data, zslot->tuple->t_data, zhtup.t_len) != 0) + { + /* + * ZBORKED: compare tuple contents, to be sure the tuple returned by + * the visibility routine is the input tuple. There *got* to be a + * better solution than this. + */ + res = false; + } + else + res = true; + + if (tup && tup != &zhtup) + pfree(tup); + + return res; +} + +static IndexFetchTableData* +zheapam_begin_index_fetch(Relation rel) +{ + IndexFetchHeapData *hscan = palloc0(sizeof(IndexFetchHeapData)); + + hscan->xs_base.rel = rel; + hscan->xs_cbuf = InvalidBuffer; + //hscan->xs_continue_hot = false; + + return &hscan->xs_base; +} + + +static void +zheapam_reset_index_fetch(IndexFetchTableData* scan) +{ + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan; + + if (BufferIsValid(hscan->xs_cbuf)) + { + ReleaseBuffer(hscan->xs_cbuf); + hscan->xs_cbuf = InvalidBuffer; + } + + //hscan->xs_continue_hot = false; +} + +static void +zheapam_end_index_fetch(IndexFetchTableData* scan) +{ + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan; + + zheapam_reset_index_fetch(scan); + + pfree(hscan); +} + +static bool +zheapam_fetch_follow(struct IndexFetchTableData *scan, + ItemPointer tid, + Snapshot snapshot, + TupleTableSlot *slot, + bool *call_again, bool *all_dead) +{ + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan; + ZHeapTuple zheapTuple = NULL; + + /* + * No HOT chains in zheap. 
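+ * A single fetch per TID is therefore sufficient, and call_again must
+ * never be set on entry.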
+ */ + Assert(!*call_again); + + /* Switch to correct buffer if we don't have it already */ + hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf, + hscan->xs_base.rel, + ItemPointerGetBlockNumber(tid)); + + LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE); + zheapTuple = zheap_search_buffer(tid, hscan->xs_base.rel, + hscan->xs_cbuf, + snapshot, + all_dead); + LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_UNLOCK); + + if (zheapTuple) + { + slot->tts_tableOid = RelationGetRelid(scan->rel); + ExecStoreZTuple(zheapTuple, slot, hscan->xs_cbuf, false); + } + + return zheapTuple != NULL; +} + +/* + * Similar to IndexBuildHeapRangeScan, but for zheap relations. + */ +static double +IndexBuildZHeapRangeScan(Relation heapRelation, + Relation indexRelation, + IndexInfo *indexInfo, + bool allow_sync, + bool anyvisible, + BlockNumber start_blockno, + BlockNumber numblocks, + IndexBuildCallback callback, + void *callback_state, + TableScanDesc sscan) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + bool is_system_catalog; + bool checking_uniqueness; + HeapTuple heapTuple; + ZHeapTuple zheapTuple; + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + double reltuples; + ExprState *predicate; + TupleTableSlot *slot; + ZHeapTupleTableSlot *zslot; + EState *estate; + ExprContext *econtext; + Snapshot snapshot; + TransactionId OldestXmin; + bool need_unregister_snapshot = false; + SubTransactionId subxid_xwait = InvalidSubTransactionId; + + /* + * sanity checks + */ + Assert(OidIsValid(indexRelation->rd_rel->relam)); + Assert(RelationStorageIsZHeap(heapRelation)); + + /* Remember if it's a system catalog */ + is_system_catalog = IsSystemRelation(heapRelation); + + /* See whether we're verifying uniqueness/exclusion properties */ + checking_uniqueness = (indexInfo->ii_Unique || + indexInfo->ii_ExclusionOps != NULL); + + /* + * "Any visible" mode is not compatible with uniqueness checks; make sure + * only one of those is requested. + */ + Assert(!(anyvisible && checking_uniqueness)); + + /* + * Need an EState for evaluation of index expressions and partial-index + * predicates. Also a slot to hold the current tuple. + */ + estate = CreateExecutorState(); + econtext = GetPerTupleExprContext(estate); + slot = table_gimmegimmeslot(heapRelation, NULL); + zslot = (ZHeapTupleTableSlot *) slot; + + /* Arrange for econtext's scan tuple to be the tuple under test */ + econtext->ecxt_scantuple = slot; + + /* Set up execution state for predicate, if any. */ + predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate); + + + heapTuple = (HeapTuple) palloc0(SizeofHeapTupleHeader); + + /* + * Prepare for scan of the base relation. In a normal index build, we use + * SnapshotAny because we must retrieve all tuples and do our own time + * qual checks (because we have to index RECENTLY_DEAD tuples). In a + * concurrent build, or during bootstrap, we take a regular MVCC snapshot + * and index whatever's live according to that. + */ + OldestXmin = InvalidTransactionId; + + /* okay to ignore lazy VACUUMs here */ + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent) + OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM); + + if (!scan) + { + /* + * Serial index build. + * + * Must begin our own heap scan in this case. We may also need to + * register a snapshot whose lifetime is under our direct control. 
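+ *
+ * OldestXmin is invalid only for concurrent builds and during bootstrap;
+ * in that case we scan with a registered MVCC snapshot. Otherwise we use
+ * SnapshotAny and do our own visibility checks with
+ * ZHeapTupleSatisfiesOldestXmin below.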
+ */ + if (!TransactionIdIsValid(OldestXmin)) + { + snapshot = RegisterSnapshot(GetTransactionSnapshot()); + need_unregister_snapshot = true; + } + else + snapshot = SnapshotAny; + + sscan = table_beginscan_strat(heapRelation, /* relation */ + snapshot, /* snapshot */ + 0, /* number of keys */ + NULL, /* scan key */ + true, /* buffer access strategy OK */ + allow_sync); /* syncscan OK? */ + scan = (ZHeapScanDesc) sscan; + } + else + { + /* + * Parallel index build. + * + * Parallel case never registers/unregisters own snapshot. Snapshot + * is taken from parallel heap scan, and is SnapshotAny or an MVCC + * snapshot, based on same criteria as serial case. + */ + Assert(!IsBootstrapProcessingMode()); + Assert(allow_sync); + snapshot = scan->rs_scan.rs_snapshot; + } + + /* + * Must call GetOldestXmin() with SnapshotAny. Should never call + * GetOldestXmin() with MVCC snapshot. (It's especially worth checking + * this for parallel builds, since ambuild routines that support parallel + * builds must work these details out for themselves.) + */ + Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot)); + Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) : + !TransactionIdIsValid(OldestXmin)); + Assert(snapshot == SnapshotAny || !anyvisible); + + /* set our scan endpoints */ + if (!allow_sync) + zheap_setscanlimits(sscan, start_blockno, numblocks); + else + { + /* syncscan can only be requested on whole relation */ + Assert(start_blockno == 0); + start_blockno = ZHEAP_METAPAGE + 1; + Assert(numblocks == InvalidBlockNumber); + } + + reltuples = 0; + + /* + * Scan all tuples in the base relation. + */ + // ZBORKED: move to slot API + while ((zheapTuple = zheap_getnext(sscan, ForwardScanDirection)) != NULL) + { + bool tupleIsAlive; + ZHeapTuple targztuple = NULL; + + CHECK_FOR_INTERRUPTS(); + + if (snapshot == SnapshotAny) + { + /* do our own time qual check */ + bool indexIt; + TransactionId xwait; + + recheck: + + /* + * We could possibly get away with not locking the buffer here, + * since caller should hold ShareLock on the relation, but let's + * be conservative about it. + */ + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + + targztuple = zheap_copytuple(zheapTuple); + switch (ZHeapTupleSatisfiesOldestXmin(&targztuple, OldestXmin, + scan->rs_cbuf, &xwait, &subxid_xwait)) + { + case HEAPTUPLE_DEAD: + /* Definitely dead, we can ignore it */ + indexIt = false; + tupleIsAlive = false; + break; + case HEAPTUPLE_LIVE: + /* Normal case, index and unique-check it */ + indexIt = true; + tupleIsAlive = true; + break; + case HEAPTUPLE_RECENTLY_DEAD: + /* + * If tuple is recently deleted then we must index it + * anyway to preserve MVCC semantics. (Pre-existing + * transactions could try to use the index after we finish + * building it, and may need to see such tuples.) + */ + indexIt = true; + tupleIsAlive = false; + break; + case HEAPTUPLE_INSERT_IN_PROGRESS: + + /* + * In "anyvisible" mode, this tuple is visible and we + * don't need any further checks. + */ + if (anyvisible) + { + indexIt = true; + tupleIsAlive = true; + break; + } + + /* + * Since caller should hold ShareLock or better, normally + * the only way to see this is if it was inserted earlier + * in our own transaction. However, it can happen in + * system catalogs, since we tend to release write lock + * before commit there. Give a warning if neither case + * applies. 
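+ *
+ * Note that in zheap the in-progress inserter may be identified at
+ * subtransaction level, in which case we wait on the subtransaction
+ * lock below.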
+ */ + if (!TransactionIdIsCurrentTransactionId(xwait)) + { + if (!is_system_catalog) + elog(WARNING, "concurrent insert in progress within table \"%s\"", + RelationGetRelationName(heapRelation)); + + /* + * If we are performing uniqueness checks, indexing + * such a tuple could lead to a bogus uniqueness + * failure. In that case we wait for the inserting + * transaction to finish and check again. + */ + if (checking_uniqueness) + { + /* + * Must drop the lock on the buffer before we wait + */ + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + if (subxid_xwait != InvalidSubTransactionId) + SubXactLockTableWait(xwait, subxid_xwait, heapRelation, + &zheapTuple->t_self, + XLTW_InsertIndexUnique); + else + XactLockTableWait(xwait, heapRelation, + &zheapTuple->t_self, + XLTW_InsertIndexUnique); + CHECK_FOR_INTERRUPTS(); + + if (targztuple != NULL) + pfree(targztuple); + + goto recheck; + } + } + + /* + * We must index such tuples, since if the index build + * commits then they're good. + */ + indexIt = true; + tupleIsAlive = true; + break; + case HEAPTUPLE_DELETE_IN_PROGRESS: + + /* + * As with INSERT_IN_PROGRESS case, this is unexpected + * unless it's our own deletion or a system catalog; but + * in anyvisible mode, this tuple is visible. + */ + if (anyvisible) + { + indexIt = true; + tupleIsAlive = false; + break; + } + + if (!TransactionIdIsCurrentTransactionId(xwait)) + { + if (!is_system_catalog) + elog(WARNING, "concurrent insert in progress within table \"%s\"", + RelationGetRelationName(heapRelation)); + + /* + * If we are performing uniqueness checks, indexing + * such a tuple could lead to a bogus uniqueness + * failure. In that case we wait for the inserting + * transaction to finish and check again. + */ + if (checking_uniqueness) + { + /* + * Must drop the lock on the buffer before we wait + */ + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + if (subxid_xwait != InvalidTransactionId) + SubXactLockTableWait(xwait, subxid_xwait, + heapRelation, + &zheapTuple->t_self, + XLTW_InsertIndexUnique); + else + XactLockTableWait(xwait, heapRelation, + &zheapTuple->t_self, + XLTW_InsertIndexUnique); + CHECK_FOR_INTERRUPTS(); + + if (targztuple != NULL) + pfree(targztuple); + + goto recheck; + } + + /* + * Otherwise index it but don't check for uniqueness, + * the same as a RECENTLY_DEAD tuple. + */ + indexIt = true; + } + else + { + /* + * It's a regular tuple deleted by our own xact. Index + * it but don't check for uniqueness, the same as a + * RECENTLY_DEAD tuple. + */ + indexIt = true; + } + /* In any case, exclude the tuple from unique-checking */ + tupleIsAlive = false; + break; + default: + elog(ERROR, "unexpected ZHeapTupleSatisfiesOldestXmin result"); + indexIt = tupleIsAlive = false; /* keep compiler quiet */ + break; + } + + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + + if (!indexIt) + continue; + } + else + { + /* zheap_getnext did the time qual check */ + tupleIsAlive = true; + targztuple = zheapTuple; + } + + reltuples += 1; + + MemoryContextReset(econtext->ecxt_per_tuple_memory); + + /* Set up for predicate or expression evaluation */ + /* ZBORKED: shouldfree = true if !scan->rs_pagescan.rs_pageatatime */ + ExecStoreZTuple(zheapTuple, slot, InvalidBuffer, false); + + /* + * In a partial index, discard tuples that don't satisfy the + * predicate. + */ + if (predicate != NULL) + { + if (!ExecQual(predicate, econtext)) + { + /* + * For SnapshotAny, targztuple is locally palloced above. So, + * free it. 
+ */ + if (snapshot == SnapshotAny && targztuple != NULL) + pfree(targztuple); + continue; + } + } + + /* + * For the current tuple, extract all the attributes we use in this + * index, and note which are null. This also performs evaluation + * of any expressions needed. + * + * NOTE: We can't free the zheap tuple fetched by the scan method before + * next iteration since this tuple is also referenced by scan->rs_cztup. + * which is used by zheap scan API's to fetch the next tuple. But, for + * forming and creating the index, we've to store the correct version of + * the tuple in the slot. Hence, after forming the index and calling the + * callback function, we restore the zheap tuple fetched by the scan + * method in the slot. + */ + zslot->tuple = targztuple; + FormIndexDatum(indexInfo, + slot, + estate, + values, + isnull); + + /* + * FIXME: buildCallback functions accepts heaptuple as an argument. But, + * it needs only the tid. So, we set t_self for the zheap tuple and call + * the AM's callback. + */ + heapTuple->t_self = zheapTuple->t_self; + + /* Call the AM's callback routine to process the tuple */ + callback(indexRelation, heapTuple, values, isnull, tupleIsAlive, + callback_state); + + zslot->tuple = zheapTuple; + + /* + * For SnapshotAny, targztuple is locally palloced above. So, + * free it. + */ + if (snapshot == SnapshotAny && targztuple != NULL) + pfree(targztuple); + } + + table_endscan(sscan); + + /* we can now forget our snapshot, if set and registered by us */ + if (need_unregister_snapshot) + UnregisterSnapshot(snapshot); + + ExecDropSingleTupleTableSlot(slot); + + pfree(heapTuple); + + /* These may have been pointing to the now-gone estate */ + indexInfo->ii_ExpressionsState = NIL; + indexInfo->ii_PredicateState = NULL; + + return reltuples; +} + +/* + * validate_index_zheapscan - second table scan for concurrent index build + * + * This has much code in common with IndexBuildZHeapScan, but it's enough + * different that it seems cleaner to have two routines not one. + */ +static void +validate_index_zheapscan(Relation heapRelation, + Relation indexRelation, + IndexInfo *indexInfo, + Snapshot snapshot, + ValidateIndexState *state) +{ + TableScanDesc sscan; + HeapScanDesc scan; + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + ExprState *predicate; + TupleTableSlot *slot; + EState *estate; + ExprContext *econtext; + bool in_index[MaxHeapTuplesPerPage]; + + /* state variables for the merge */ + ItemPointer indexcursor = NULL; + ItemPointerData decoded; + bool tuplesort_empty = false; + + /* + * sanity checks + */ + Assert(OidIsValid(indexRelation->rd_rel->relam)); + + /* + * Need an EState for evaluation of index expressions and partial-index + * predicates. Also a slot to hold the current tuple. + */ + estate = CreateExecutorState(); + econtext = GetPerTupleExprContext(estate); + slot = table_gimmegimmeslot(heapRelation, NULL); + + /* Arrange for econtext's scan tuple to be the tuple under test */ + econtext->ecxt_scantuple = slot; + + /* Set up execution state for predicate, if any. */ + predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate); + + /* + * Prepare for scan of the base relation. We need just those tuples + * satisfying the passed-in reference snapshot. We must disable syncscan + * here, because it's critical that we read from block zero forward to + * match the sorted TIDs. 
+ */ + sscan = table_beginscan_strat(heapRelation, /* relation */ + snapshot, /* snapshot */ + 0, /* number of keys */ + NULL, /* scan key */ + true, /* buffer access strategy OK */ + false); /* syncscan not OK */ + scan = (HeapScanDesc) sscan; + + /* + * Scan all tuples matching the snapshot. + */ + while (zheap_getnextslot(sscan, ForwardScanDirection, slot)) + { + OffsetNumber offnum = ItemPointerGetOffsetNumber(&slot->tts_tid); + + CHECK_FOR_INTERRUPTS(); + + state->htups += 1; + + /* + * "merge" by skipping through the index tuples until we find or pass + * the current tuple. + */ + while (!tuplesort_empty && + (!indexcursor || + ItemPointerCompare(indexcursor, &slot->tts_tid) < 0)) + { + Datum ts_val; + bool ts_isnull; + + if (indexcursor) + { + /* + * Remember index items seen earlier on the current heap page + */ + if (ItemPointerGetBlockNumber(indexcursor) == scan->rs_cblock) + in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true; + } + + tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true, + &ts_val, &ts_isnull, NULL); + Assert(tuplesort_empty || !ts_isnull); + if (!tuplesort_empty) + { + itemptr_decode(&decoded, DatumGetInt64(ts_val)); + indexcursor = &decoded; + + /* If int8 is pass-by-ref, free (encoded) TID Datum memory */ +#ifndef USE_FLOAT8_BYVAL + pfree(DatumGetPointer(ts_val)); +#endif + } + else + { + /* Be tidy */ + indexcursor = NULL; + } + } + + /* + * If the tuplesort has overshot *and* we didn't see a match earlier, + * then this tuple is missing from the index, so insert it. + */ + if ((tuplesort_empty || + ItemPointerCompare(indexcursor, &slot->tts_tid) > 0) && + !in_index[offnum - 1]) + { + + /* Set up for predicate or expression evaluation */ + + /* + * In a partial index, discard tuples that don't satisfy the + * predicate. + */ + if (predicate != NULL) + { + if (!ExecQual(predicate, econtext)) + continue; + } + + /* + * For the current heap tuple, extract all the attributes we use + * in this index, and note which are null. This also performs + * evaluation of any expressions needed. + */ + FormIndexDatum(indexInfo, + slot, + estate, + values, + isnull); + + /* + * You'd think we should go ahead and build the index tuple here, + * but some index AMs want to do further processing on the data + * first. So pass the values[] and isnull[] arrays, instead. + */ + + /* + * If the tuple is already committed dead, you might think we + * could suppress uniqueness checking, but this is no longer true + * in the presence of HOT, because the insert is actually a proxy + * for a uniqueness check on the whole HOT-chain. That is, the + * tuple we have here could be dead because it was already + * HOT-updated, and if so the updating transaction will not have + * thought it should insert index entries. The index AM will + * check the whole HOT-chain and correctly detect a conflict if + * there is one. + */ + + index_insert(indexRelation, + values, + isnull, + &slot->tts_tid, + heapRelation, + indexInfo->ii_Unique ? 
+ UNIQUE_CHECK_YES : UNIQUE_CHECK_NO, + indexInfo); + + state->tups_inserted += 1; + + MemoryContextReset(econtext->ecxt_per_tuple_memory); + } + } + + table_endscan(sscan); + + ExecDropSingleTupleTableSlot(slot); + + FreeExecutorState(estate); + + /* These may have been pointing to the now-gone estate */ + indexInfo->ii_ExpressionsState = NIL; + indexInfo->ii_PredicateState = NULL; +} + +static void +zheapam_scan_analyze_next_block(TableScanDesc sscan, BlockNumber blockno, BufferAccessStrategy bstrategy) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + + /* + * We must maintain a pin on the target page's buffer to ensure that + * the maxoffset value stays good (else concurrent VACUUM might delete + * tuples out from under us). Hence, pin the page until we are done + * looking at it. We also choose to hold sharelock on the buffer + * throughout --- we could release and re-acquire sharelock for each + * tuple, but since we aren't doing much work per tuple, the extra + * lock traffic is probably better avoided. + */ + scan->rs_cblock = blockno; + scan->rs_cindex = FirstOffsetNumber; + if (blockno != ZHEAP_METAPAGE) + { + scan->rs_cbuf = ReadBufferExtended(scan->rs_scan.rs_rd, MAIN_FORKNUM, blockno, + RBM_NORMAL, bstrategy); + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + } +} + +static bool +zheapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, double *liverows, double *deadrows, TupleTableSlot *slot) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + Page targpage; + OffsetNumber maxoffset; + ZHeapTupleTableSlot *zslot; + + Assert(TTS_IS_ZHEAP(slot)); + zslot = (ZHeapTupleTableSlot *) slot; + + if (scan->rs_cblock == ZHEAP_METAPAGE) + return false; + + targpage = BufferGetPage(scan->rs_cbuf); + maxoffset = PageGetMaxOffsetNumber(targpage); + + /* Skip TPD pages for zheap relations. */ + if (PageGetSpecialSize(targpage) == sizeof(TPDPageOpaqueData)) + { + UnlockReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + + return false; + } + + /* Inner loop over all tuples on the selected page */ + for (; scan->rs_cindex <= maxoffset; scan->rs_cindex++) + { + ItemId itemid; + ZHeapTuple targtuple = &zslot->tupdata; + Size targztuple_len; + bool sample_it = false; + TransactionId xid; + + itemid = PageGetItemId(targpage, scan->rs_cindex); + + /* + * For zheap, we need to count delete committed rows towards + * dead rows which would have been same, if the tuple was + * present in heap. + */ + if (ItemIdIsDeleted(itemid)) + { + *deadrows += 1; + continue; + } + + /* + * We ignore unused and redirect line pointers. DEAD line + * pointers should be counted as dead, because we need vacuum to + * run to get rid of them. Note that this rule agrees with the + * way that heap_page_prune() counts things. 
+ */ + if (!ItemIdIsNormal(itemid)) + { + if (ItemIdIsDead(itemid)) + *deadrows += 1; + continue; + } + + targztuple_len = ItemIdGetLength(itemid); + targtuple->t_len = targztuple_len; + targtuple->t_data = palloc(targztuple_len); + targtuple->t_tableOid = RelationGetRelid(scan->rs_scan.rs_rd); + + ItemPointerSet(&targtuple->t_self, scan->rs_cblock, scan->rs_cindex); + memcpy(targtuple->t_data, + ((ZHeapTupleHeader) PageGetItem((Page) targpage, itemid)), + targztuple_len); + + switch (ZHeapTupleSatisfiesOldestXmin(&targtuple, OldestXmin, scan->rs_cbuf, &xid, NULL)) + { + case HEAPTUPLE_LIVE: + sample_it = true; + *liverows += 1; + break; + + case HEAPTUPLE_DEAD: + case HEAPTUPLE_RECENTLY_DEAD: + /* Count dead and recently-dead rows */ + *deadrows += 1; + break; + + case HEAPTUPLE_INSERT_IN_PROGRESS: + + /* + * Insert-in-progress rows are not counted. We assume + * that when the inserting transaction commits or aborts, + * it will send a stats message to increment the proper + * count. This works right only if that transaction ends + * after we finish analyzing the table; if things happen + * in the other order, its stats update will be + * overwritten by ours. However, the error will be large + * only if the other transaction runs long enough to + * insert many tuples, so assuming it will finish after us + * is the safer option. + * + * A special case is that the inserting transaction might + * be our own. In this case we should count and sample + * the row, to accommodate users who load a table and + * analyze it in one transaction. (pgstat_report_analyze + * has to adjust the numbers we send to the stats + * collector to make this come out right.) + */ + if (TransactionIdIsCurrentTransactionId(xid)) + { + sample_it = true; + *liverows += 1; + } + break; + + case HEAPTUPLE_DELETE_IN_PROGRESS: + + /* + * We count delete-in-progress rows as still live, using + * the same reasoning given above; but we don't bother to + * include them in the sample. + * + * If the delete was done by our own transaction, however, + * we must count the row as dead to make + * pgstat_report_analyze's stats adjustments come out + * right. (Note: this works out properly when the row was + * both inserted and deleted in our xact.) + */ + if (TransactionIdIsCurrentTransactionId(xid)) + *deadrows += 1; + else + *liverows += 1; + break; + + default: + elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result"); + break; + } + + if (sample_it) + { + ExecStoreZTuple(targtuple, slot, InvalidBuffer, false); + scan->rs_cindex++; + + /* note that we leave the buffer locked here! 
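The share lock and pin taken in zheapam_scan_analyze_next_block are deliberately carried over to the next call; they are only dropped in the page-exhausted path below, so the page contents and its max offset stay stable while the caller consumes the sampled tuple.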
*/ + return true; + } + } + + /* Now release the lock and pin on the page */ + UnlockReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + + return false; +} + +static bool +zheap_scan_sample_next_block(TableScanDesc sscan, struct SampleScanState *scanstate) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + TsmRoutine *tsm = scanstate->tsmroutine; + BlockNumber blockno; + + /* return false immediately if relation is empty */ + if (scan->rs_scan.rs_nblocks == 0) + return false; + +nextblock: + if (tsm->NextSampleBlock) + { + blockno = tsm->NextSampleBlock(scanstate, scan->rs_scan.rs_nblocks); + scan->rs_cblock = blockno; + } + else + { + /* scanning table sequentially */ + + if (scan->rs_cblock == InvalidBlockNumber) + { + Assert(!scan->rs_inited); + blockno = scan->rs_scan.rs_startblock; + } + else + { + Assert(scan->rs_inited); + + blockno = scan->rs_cblock + 1; + + if (blockno >= scan->rs_scan.rs_nblocks) + { + /* wrap to begining of rel, might not have started at 0 */ + blockno = 0; + } + + /* + * Report our new scan position for synchronization purposes. + * + * Note: we do this before checking for end of scan so that the + * final state of the position hint is back at the start of the + * rel. That's not strictly necessary, but otherwise when you run + * the same query multiple times the starting position would shift + * a little bit backwards on every invocation, which is confusing. + * We don't guarantee any specific ordering in general, though. + */ + if (scan->rs_scan.rs_syncscan) + ss_report_location(scan->rs_scan.rs_rd, blockno); + + if (blockno == scan->rs_scan.rs_startblock) + { + blockno = InvalidBlockNumber; + } + } + } + + if (!BlockNumberIsValid(blockno)) + { + if (BufferIsValid(scan->rs_cbuf)) + ReleaseBuffer(scan->rs_cbuf); + scan->rs_cbuf = InvalidBuffer; + scan->rs_cblock = InvalidBlockNumber; + scan->rs_inited = false; + + return false; + } + + scan->rs_inited = true; + + /* + * If the target block isn't valid, e.g. because it's a tpd page, got to + * the next block. + */ + if (!zheapgetpage(sscan, blockno)) + { + CHECK_FOR_INTERRUPTS(); + goto nextblock; + } + + return true; +} + +static bool +zheap_scan_sample_next_tuple(TableScanDesc sscan, struct SampleScanState *scanstate, TupleTableSlot *slot) +{ + ZHeapScanDesc scan = (ZHeapScanDesc) sscan; + TsmRoutine *tsm = scanstate->tsmroutine; + BlockNumber blockno = scan->rs_cblock; + bool pagemode = scan->rs_scan.rs_pageatatime; + Page page; + bool all_visible; + OffsetNumber maxoffset; + uint8 vmstatus; + Buffer vmbuffer = InvalidBuffer; + + ExecClearTuple(slot); + + if (scan->rs_cindex == -1) + return false; + + /* + * When not using pagemode, we must lock the buffer during tuple + * visibility checks. + */ + if (!pagemode) + { + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); + page = (Page) BufferGetPage(scan->rs_cbuf); + maxoffset = PageGetMaxOffsetNumber(page); + + if (!scan->rs_scan.rs_snapshot->takenDuringRecovery) + { + vmstatus = visibilitymap_get_status(scan->rs_scan.rs_rd, + BufferGetBlockNumber(scan->rs_cbuf), + &vmbuffer); + + all_visible = vmstatus; + + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + } + else + all_visible = false; + } + else + { + maxoffset = scan->rs_ntuples; + } + + for (;;) + { + OffsetNumber tupoffset; + + CHECK_FOR_INTERRUPTS(); + + /* Ask the tablesample method which tuples to check on this page. 
*/ + tupoffset = tsm->NextSampleTuple(scanstate, + blockno, + maxoffset); + + if (OffsetNumberIsValid(tupoffset)) + { + ZHeapTuple tuple; + + if (!pagemode) + { + ItemId itemid; + bool visible; + ZHeapTuple loctup = NULL; + Size loctup_len; + ItemPointerData tid; + + /* Skip invalid tuple pointers. */ + itemid = PageGetItemId(page, tupoffset); + if (!ItemIdIsNormal(itemid)) + continue; + + ItemPointerSet(&tid, blockno, tupoffset); + loctup_len = ItemIdGetLength(itemid); + + loctup = palloc(ZHEAPTUPLESIZE + loctup_len); + loctup->t_data = (ZHeapTupleHeader) ((char *) loctup + + ZHEAPTUPLESIZE); + + loctup->t_tableOid = RelationGetRelid(scan->rs_scan.rs_rd); + loctup->t_len = loctup_len; + loctup->t_self = tid; + + /* + * We always need to make a copy of zheap tuple as once we release + * the buffer an in-place update can change the tuple. + */ + memcpy(loctup->t_data, + ((ZHeapTupleHeader) PageGetItem((Page) page, itemid)), + loctup->t_len); + + if (all_visible) + { + tuple = loctup; + visible = true; + } + else + { + tuple = ZHeapTupleSatisfies(loctup, + scan->rs_scan.rs_snapshot, + scan->rs_cbuf, + NULL); + visible = (tuple != NULL); + } + + /* + * If any prior version is visible, we pass latest visible as + * true. The state of latest version of tuple is determined by + * the called function. + * + * Note that, it's possible that tuple is updated in-place and + * we're seeing some prior version of that. We handle that case + * in ZHeapTupleHasSerializableConflictOut. + */ + CheckForSerializableConflictOut(visible, scan->rs_scan.rs_rd, (void *) &tid, + scan->rs_cbuf, scan->rs_scan.rs_snapshot); + + /* Try next tuple from same page. */ + if (!visible) + continue; + + ExecStoreZTuple(tuple, slot, InvalidBuffer, false); + + /* Found visible tuple, return it. */ + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + + /* Count successfully-fetched tuples as heap fetches */ + pgstat_count_heap_getnext(scan->rs_scan.rs_rd); + + return true; + } + else + { + tuple = scan->rs_visztuples[tupoffset - 1]; + if (tuple == NULL) + continue; + + ExecStoreZTuple(tuple, slot, InvalidBuffer, false); + + return true; + } + } + else + { + /* + * If we get here, it means we've exhausted the items on this page and + * it's time to move to the next. + */ + if (!pagemode) + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + + break; + } + } + + return false; +} + +static void +zheap_copy_for_cluster(Relation OldHeap, Relation NewHeap, Relation OldIndex, + bool use_sort, + TransactionId OldestXmin, TransactionId FreezeXid, MultiXactId MultiXactCutoff, + double *num_tuples, double *tups_vacuumed, double *tups_recently_dead) +{ + RewriteZheapState rwstate; + IndexScanDesc indexScan; + TableScanDesc heapScan; + bool use_wal; + Tuplesortstate *tuplesort; + TupleDesc oldTupDesc = RelationGetDescr(OldHeap); + TupleDesc newTupDesc = RelationGetDescr(NewHeap); + TupleTableSlot *slot; + int natts; + Datum *values; + bool *isnull; + + /* + * We need to log the copied data in WAL iff WAL archiving/streaming is + * enabled AND it's a WAL-logged rel. 
+ */ + use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap); + + /* use_wal off requires smgr_targblock be initially invalid */ + Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber); + + /* Preallocate values/isnull arrays */ + natts = newTupDesc->natts; + values = (Datum *) palloc(natts * sizeof(Datum)); + isnull = (bool *) palloc(natts * sizeof(bool)); + + /* Initialize the rewrite operation */ + rwstate = begin_zheap_rewrite(OldHeap, NewHeap, OldestXmin, FreezeXid, + MultiXactCutoff, use_wal); + + + /* Set up sorting if wanted */ + if (use_sort) + tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex, + maintenance_work_mem, + NULL, false); + else + tuplesort = NULL; + + /* + * Prepare to scan the OldHeap. + * + * We don't have a way to copy visibility information in zheap, so we + * just copy LIVE tuples. See comments atop rewritezheap.c + * + * While scanning, we skip meta and tpd pages (done by *getnext API's) + * which is okay because we mark the tuples as frozen. However, when we + * extend current implementation to copy visibility information of tuples, + * we would require to copy meta page and or TPD page information as well + */ + if (OldIndex != NULL && !use_sort) + { + heapScan = NULL; + indexScan = index_beginscan(OldHeap, OldIndex, GetTransactionSnapshot(), 0, 0); + index_rescan(indexScan, NULL, 0, NULL, 0); + } + else + { + heapScan = table_beginscan(OldHeap, GetTransactionSnapshot(), 0, (ScanKey) NULL); + indexScan = NULL; + } + + slot = table_gimmegimmeslot(OldHeap, NULL); + + /* + * Scan through the OldHeap, either in OldIndex order or sequentially; + * copy each tuple into the NewHeap, or transiently to the tuplesort + * module. Note that we don't bother sorting dead tuples (they won't get + * to the new table anyway). While scanning, we skip meta and tpd pages + * (done by *getnext API's) which is okay because we mark the tuples as + * frozen. However, when we extend current implementation to copy + * visibility information of tuples, we would require to copy meta page + * and or TPD page information as well. + */ + for (;;) + { + CHECK_FOR_INTERRUPTS(); + + if (indexScan != NULL) + { + if (!index_getnext_slot(indexScan, ForwardScanDirection, slot)) + break; + + /* Since we used no scan keys, should never need to recheck */ + if (indexScan->xs_recheck) + elog(ERROR, "CLUSTER does not support lossy index conditions"); + } + else + { + if (!table_scan_getnextslot(heapScan, ForwardScanDirection, slot)) + break; + } + + num_tuples += 1; + if (tuplesort != NULL) + tuplesort_puttupleslot(tuplesort, slot); + else + reform_and_rewrite_ztuple(ExecGetZHeapTupleFromSlot(slot), oldTupDesc, newTupDesc, + values, isnull, rwstate); + } + + if (indexScan != NULL) + index_endscan(indexScan); + if (heapScan != NULL) + table_endscan(heapScan); + + ExecDropSingleTupleTableSlot(slot); + + /* + * In scan-and-sort mode, complete the sort, then read out all live tuples + * from the tuplestore and write them to the new relation. 
+ */ + if (tuplesort != NULL) + { + tuplesort_performsort(tuplesort); + + for (;;) + { + HeapTuple heapTuple; + ZHeapTuple zheapTuple; + + CHECK_FOR_INTERRUPTS(); + + heapTuple = tuplesort_getheaptuple(tuplesort, true); + if (heapTuple == NULL) + break; + + zheapTuple = heap_to_zheap(heapTuple, oldTupDesc); + + reform_and_rewrite_ztuple(zheapTuple, oldTupDesc, newTupDesc, + values, isnull, rwstate); + /* be tidy */ + pfree(zheapTuple); + } + + tuplesort_end(tuplesort); + } + + /* Write out any remaining tuples, and fsync if needed */ + end_zheap_rewrite(rwstate); + + /* Clean up */ + pfree(values); + pfree(isnull); +} + +/* + * Set up an init fork for an unlogged table so that it can be correctly + * reinitialized on restart. An immediate sync is required even if the + * page has been logged, because the write did not go through + * shared_buffers and therefore a concurrent checkpoint may have moved + * the redo pointer past our xlog record. Recovery may as well remove it + * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE + * record. Therefore, logging is necessary even if wal_level=minimal. + */ +static void +zheap_create_init_fork(Relation rel) +{ + Assert(rel->rd_rel->relkind == RELKIND_RELATION || + rel->rd_rel->relkind == RELKIND_MATVIEW || + rel->rd_rel->relkind == RELKIND_TOASTVALUE); + RelationOpenSmgr(rel); + smgrcreate(rel->rd_smgr, INIT_FORKNUM, false); + log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM); + smgrimmedsync(rel->rd_smgr, INIT_FORKNUM); + + /* ZBORKED: This causes separate WAL, which doesn't seem optimal */ + ZheapInitMetaPage(rel, INIT_FORKNUM); +} + +static const TableAmRoutine zheapam_methods = { + .type = T_TableAmRoutine, + + .slot_callbacks = zheapam_slot_callbacks, + + .snapshot_satisfies = zheapam_satisfies, + + .scan_begin = zheap_beginscan, + .scansetlimits = zheap_setscanlimits, + .scan_getnextslot = zheap_getnextslot, + .scan_end = heap_endscan, + .scan_rescan = zheap_rescan, + .scan_update_snapshot = zheap_update_snapshot, + + .scan_bitmap_pagescan = zheap_scan_bitmap_pagescan, + .scan_bitmap_pagescan_next = zheap_scan_bitmap_pagescan_next, + + .scan_sample_next_block = zheap_scan_sample_next_block, + .scan_sample_next_tuple = zheap_scan_sample_next_tuple, + + .tuple_fetch_row_version = zheapam_fetch_row_version, + .tuple_fetch_follow = zheapam_fetch_follow, + .tuple_insert = zheapam_insert, + .tuple_insert_speculative = zheapam_insert_speculative, + .tuple_complete_speculative = zheapam_complete_speculative, + .tuple_delete = zheapam_delete, + .tuple_update = zheapam_update, + .tuple_lock = zheapam_lock_tuple, + .multi_insert = zheap_multi_insert, + + .tuple_get_latest_tid = zheap_get_latest_tid, + + .relation_vacuum = lazy_vacuum_zheap_rel, + .scan_analyze_next_block = zheapam_scan_analyze_next_block, + .scan_analyze_next_tuple = zheapam_scan_analyze_next_tuple, + .relation_copy_for_cluster = zheap_copy_for_cluster, + .relation_create_init_fork = zheap_create_init_fork, + .relation_sync = heap_sync, + + .begin_index_fetch = zheapam_begin_index_fetch, + .reset_index_fetch = zheapam_reset_index_fetch, + .end_index_fetch = zheapam_end_index_fetch, + + .index_build_range_scan = IndexBuildZHeapRangeScan, + .index_validate_scan = validate_index_zheapscan +}; + +Datum +zheap_tableam_handler(PG_FUNCTION_ARGS) +{ + PG_RETURN_POINTER(&zheapam_methods); +} diff --git a/src/backend/access/zheap/zheapamutils.c b/src/backend/access/zheap/zheapamutils.c new file mode 100644 index 0000000000..f4bf32d56b --- /dev/null +++ 
b/src/backend/access/zheap/zheapamutils.c @@ -0,0 +1,121 @@ +/*------------------------------------------------------------------------- + * + * zheapamutils.c + * zheap utility method code + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/zheap/zheapamutils.c + * + * + * INTERFACE ROUTINES + * zheap_to_heap - convert zheap tuple to heap tuple + * + * NOTES + * This file contains utility functions for the zheap + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/htup_details.h" +#include "access/xact.h" +#include "access/zheap.h" +#include "access/zheaputils.h" +#include "storage/bufmgr.h" + +/* + * zheap_to_heap + * + * convert zheap tuple to heap tuple + */ +HeapTuple +zheap_to_heap(ZHeapTuple ztuple, TupleDesc tupDesc) +{ + HeapTuple tuple; + Datum *values = palloc0(sizeof(Datum) * tupDesc->natts); + bool *nulls = palloc0(sizeof(bool) * tupDesc->natts); + + zheap_deform_tuple(ztuple, tupDesc, values, nulls); + tuple = heap_form_tuple(tupDesc, values, nulls); + tuple->t_self = ztuple->t_self; + tuple->t_tableOid = ztuple->t_tableOid; + + pfree(values); + pfree(nulls); + + return tuple; +} + +/* + * zheap_to_heap + * + * convert zheap tuple to a minimal tuple + */ +MinimalTuple +zheap_to_minimal(ZHeapTuple ztuple, TupleDesc tupDesc) +{ + MinimalTuple tuple; + Datum *values = palloc0(sizeof(Datum) * tupDesc->natts); + bool *nulls = palloc0(sizeof(bool) * tupDesc->natts); + + zheap_deform_tuple(ztuple, tupDesc, values, nulls); + tuple = heap_form_minimal_tuple(tupDesc, values, nulls); + + pfree(values); + pfree(nulls); + + return tuple; +} + +/* + * heap_to_zheap + * + * convert heap tuple to zheap tuple + */ +ZHeapTuple +heap_to_zheap(HeapTuple tuple, TupleDesc tupDesc) +{ + ZHeapTuple ztuple; + Datum *values = palloc0(sizeof(Datum) * tupDesc->natts); + bool *nulls = palloc0(sizeof(bool) * tupDesc->natts); + + heap_deform_tuple(tuple, tupDesc, values, nulls); + ztuple = zheap_form_tuple(tupDesc, values, nulls); + ztuple->t_self = tuple->t_self; + ztuple->t_tableOid = tuple->t_tableOid; + + pfree(values); + pfree(nulls); + + return ztuple; +} + +/* ---------------- + * zheap_copytuple + * + * returns a copy of an entire tuple + * + * The ZHeapTuple struct, tuple header, and tuple data are all allocated + * as a single palloc() block. + * ---------------- + */ +ZHeapTuple +zheap_copytuple(ZHeapTuple tuple) +{ + ZHeapTuple newTuple; + + if (!ZHeapTupleIsValid(tuple) || tuple->t_data == NULL) + return NULL; + + newTuple = (ZHeapTuple) palloc(ZHEAPTUPLESIZE + tuple->t_len); + newTuple->t_len = tuple->t_len; + newTuple->t_self = tuple->t_self; + newTuple->t_tableOid = tuple->t_tableOid; + newTuple->t_data = (ZHeapTupleHeader) ((char *) newTuple + ZHEAPTUPLESIZE); + memcpy((char *) newTuple->t_data, (char *) tuple->t_data, tuple->t_len); + return newTuple; +} diff --git a/src/backend/access/zheap/zheapamxlog.c b/src/backend/access/zheap/zheapamxlog.c new file mode 100644 index 0000000000..0caa82d18d --- /dev/null +++ b/src/backend/access/zheap/zheapamxlog.c @@ -0,0 +1,2198 @@ +/*------------------------------------------------------------------------- + * + * zheapamxlog.c + * WAL replay logic for zheap. 
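As a usage illustration of the conversion helpers defined in zheapamutils.c above, here is a minimal round-trip sketch (not part of the patch; ztuple_roundtrip is a hypothetical name, a valid Relation and ZHeapTuple are assumed, and it relies only on calls defined in that file plus heap_freetuple and ItemPointerEquals):

/*
 * Sketch: convert a zheap tuple to a heap tuple and back.  Both
 * converters deform and re-form the tuple and preserve t_self and
 * t_tableOid, so the round trip yields an equivalent tuple.
 */
static ZHeapTuple
ztuple_roundtrip(Relation rel, ZHeapTuple ztup)
{
	TupleDesc	tupdesc = RelationGetDescr(rel);
	HeapTuple	htup = zheap_to_heap(ztup, tupdesc);
	ZHeapTuple	copy = heap_to_zheap(htup, tupdesc);

	Assert(ItemPointerEquals(&copy->t_self, &ztup->t_self));
	heap_freetuple(htup);

	return copy;
}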
+ * + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/zheap/zheapamxlog.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/visibilitymap.h" +#include "access/xlog.h" +#include "access/xlogutils.h" +#include "access/zheap.h" +#include "access/zheapam_xlog.h" +#include "storage/standby.h" +#include "storage/freespace.h" + +static void +zheap_xlog_insert(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr = (xl_undo_header *) XLogRecGetData(record); + xl_zheap_insert *xlrec; + Buffer buffer; + Page page; + union + { + ZHeapTupleHeaderData hdr; + char data[MaxZHeapTupleSize]; + } tbuf; + ZHeapTupleHeader zhtup; + UnpackedUndoRecord undorecord; + UndoRecPtr urecptr = InvalidUndoRecPtr; + xl_zheap_header xlhdr; + uint32 newlen; + RelFileNode target_node; + BlockNumber blkno; + ItemPointerData target_tid; + XLogRedoAction action; + int *tpd_trans_slot_id = NULL; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + bool skip_undo; + + xlrec = (xl_zheap_insert *) ((char *) xlundohdr + SizeOfUndoHeader); + if (xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + tpd_trans_slot_id = (int *) ((char *) xlrec + SizeOfZHeapInsert); + + XLogRecGetBlockTag(record, 0, &target_node, NULL, &blkno); + ItemPointerSetBlockNumber(&target_tid, blkno); + ItemPointerSetOffsetNumber(&target_tid, xlrec->offnum); + + /* + * The visibility map may need to be fixed even if the heap page is + * already up-to-date. + */ + if (xlrec->flags & XLZ_INSERT_ALL_VISIBLE_CLEARED) + { + Relation reln = CreateFakeRelcacheEntry(target_node); + Buffer vmbuffer = InvalidBuffer; + + visibilitymap_pin(reln, blkno, &vmbuffer); + visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + FreeFakeRelcacheEntry(reln); + } + + /* + * We can skip inserting undo records if the tuples are to be marked + * as frozen. + */ + skip_undo = (xlrec->flags & XLZ_INSERT_IS_FROZEN); + + if (!skip_undo) + { + /* prepare an undo record */ + undorecord.uur_type = UNDO_INSERT; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = xlundohdr->reloid; + undorecord.uur_prevxid = FrozenTransactionId; + undorecord.uur_xid = xid; + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = xlundohdr->blkprev; + undorecord.uur_block = ItemPointerGetBlockNumber(&target_tid); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&target_tid); + undorecord.uur_payload.len = 0; + undorecord.uur_tuple.len = 0; + + /* + * For speculative insertions, we store the dummy speculative token in the + * undorecord so that, the size of undorecord in DO function matches with + * the size of undorecord in REDO function. This ensures that, for INSERT + * ... ON CONFLICT statements, the assert condition used later in this + * function to ensure that the undo pointer in DO and REDO function remains + * the same is true. However, it might not be useful in the REDO function as + * it is just required in the master node to detect conflicts for insert ... + * on conflict. + * + * Fixme - Once we have undo consistency checker that we can remove the + * assertion as well dummy speculative token. 
+ */ + if (xlrec->flags & XLZ_INSERT_IS_SPECULATIVE) + { + uint32 dummy_specToken = 1; + + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *)&dummy_specToken, + sizeof(uint32)); + } + else + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + InsertPreparedUndo(); + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). + */ + + Assert(urecptr == xlundohdr->urec_ptr); + } + + /* + * If we inserted the first and only tuple on the page, re-initialize the + * page from scratch. + */ + if (XLogRecGetInfo(record) & XLOG_ZHEAP_INIT_PAGE) + { + /* It is asked for page init, insert should not have tpd slot. */ + Assert(!(xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT)); + buffer = XLogInitBufferForRedo(record, 0); + page = BufferGetPage(buffer); + ZheapInitPage(page, BufferGetPageSize(buffer)); + action = BLK_NEEDS_REDO; + } + else + action = XLogReadBufferForRedo(record, 0, &buffer); + if (action == BLK_NEEDS_REDO) + { + Size datalen; + char *data; + int trans_slot_id; + + page = BufferGetPage(buffer); + + if (PageGetMaxOffsetNumber(page) + 1 < xlrec->offnum) + elog(PANIC, "invalid max offset number"); + + data = XLogRecGetBlockData(record, 0, &datalen); + + newlen = datalen - SizeOfZHeapHeader; + Assert(datalen > SizeOfZHeapHeader && newlen <= MaxZHeapTupleSize); + memcpy((char *) &xlhdr, data, SizeOfZHeapHeader); + data += SizeOfZHeapHeader; + + zhtup = &tbuf.hdr; + MemSet((char *) zhtup, 0, SizeofZHeapTupleHeader); + /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */ + memcpy((char *) zhtup + SizeofZHeapTupleHeader, + data, + newlen); + newlen += SizeofZHeapTupleHeader; + zhtup->t_infomask2 = xlhdr.t_infomask2; + zhtup->t_infomask = xlhdr.t_infomask; + zhtup->t_hoff = xlhdr.t_hoff; + + if (ZPageAddItem(buffer, NULL, (Item) zhtup, newlen, xlrec->offnum, + true, true, true) == InvalidOffsetNumber) + elog(PANIC, "failed to add tuple"); + + if (!skip_undo) + { + if (tpd_trans_slot_id) + trans_slot_id = *tpd_trans_slot_id; + else + trans_slot_id = ZHeapTupleHeaderGetXactSlot(zhtup); + + PageSetUNDO(undorecord, buffer, trans_slot_id, false, xid_epoch, + xid, urecptr, NULL, 0); + } + + PageSetLSN(page, lsn); + + MarkBufferDirty(buffer); + } + + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + /* We can't have a valid transaction slot when we are skipping undo. */ + Assert(!skip_undo); + + /* + * We need to replay the record for TPD only when this record contains + * slot from TPD. 
+ */ + Assert(xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT); + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + TPDPageSetUndo(buffer, + *tpd_trans_slot_id, + true, + xid_epoch, + xid, + urecptr, + &undorecord.uur_offset, + 1); + TPDPageSetLSN(BufferGetPage(buffer), lsn); + } + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); +} + +static void +zheap_xlog_delete(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr = (xl_undo_header *) XLogRecGetData(record); + Size recordlen = XLogRecGetDataLen(record); + xl_zheap_delete *xlrec; + Buffer buffer; + Page page; + ZHeapTupleData zheaptup; + UnpackedUndoRecord undorecord; + UndoRecPtr urecptr; + RelFileNode target_node; + BlockNumber blkno; + ItemPointerData target_tid; + XLogRedoAction action; + Relation reln; + ItemId lp = NULL; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + int *tpd_trans_slot_id = NULL; + bool hasPayload = false; + + xlrec = (xl_zheap_delete *) ((char *) xlundohdr + SizeOfUndoHeader); + if (xlrec->flags & XLZ_DELETE_CONTAINS_TPD_SLOT) + tpd_trans_slot_id = (int *) ((char *) xlrec + SizeOfZHeapDelete); + + XLogRecGetBlockTag(record, 0, &target_node, NULL, &blkno); + ItemPointerSetBlockNumber(&target_tid, blkno); + ItemPointerSetOffsetNumber(&target_tid, xlrec->offnum); + + reln = CreateFakeRelcacheEntry(target_node); + + /* + * The visibility map may need to be fixed even if the heap page is + * already up-to-date. + */ + if (xlrec->flags & XLZ_DELETE_ALL_VISIBLE_CLEARED) + { + Buffer vmbuffer = InvalidBuffer; + + visibilitymap_pin(reln, blkno, &vmbuffer); + visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + } + + action = XLogReadBufferForRedo(record, 0, &buffer); + + page = BufferGetPage(buffer); + + if (PageGetMaxOffsetNumber(page) >= xlrec->offnum) + lp = PageGetItemId(page, xlrec->offnum); + + if (PageGetMaxOffsetNumber(page) < xlrec->offnum || !ItemIdIsNormal(lp)) + elog(PANIC, "invalid lp"); + + zheaptup.t_tableOid = RelationGetRelid(reln); + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + zheaptup.t_self = target_tid; + + /* + * If the WAL stream contains undo tuple, then replace it with the + * explicitly stored tuple. 
+ */ + if (xlrec->flags & XLZ_HAS_DELETE_UNDOTUPLE) + { + char *data; + xl_zheap_header xlhdr; + union + { + ZHeapTupleHeaderData hdr; + char data[MaxZHeapTupleSize]; + } tbuf; + ZHeapTupleHeader zhtup; + Size datalen; + + if (xlrec->flags & XLZ_DELETE_CONTAINS_TPD_SLOT) + { + data = (char *) xlrec + SizeOfZHeapDelete + + sizeof(*tpd_trans_slot_id); + datalen = recordlen - SizeOfUndoHeader - SizeOfZHeapDelete - + SizeOfZHeapHeader - sizeof(*tpd_trans_slot_id); + } + else + { + data = (char *) xlrec + SizeOfZHeapDelete; + datalen = recordlen - SizeOfUndoHeader - SizeOfZHeapDelete - + SizeOfZHeapHeader; + } + memcpy((char *) &xlhdr, data, SizeOfZHeapHeader); + data += SizeOfZHeapHeader; + + zhtup = &tbuf.hdr; + MemSet((char *) zhtup, 0, SizeofZHeapTupleHeader); + /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */ + memcpy((char *) zhtup + SizeofZHeapTupleHeader, + data, + datalen); + datalen += SizeofZHeapTupleHeader; + zhtup->t_infomask2 = xlhdr.t_infomask2; + zhtup->t_infomask = xlhdr.t_infomask; + zhtup->t_hoff = xlhdr.t_hoff; + + zheaptup.t_data = zhtup; + zheaptup.t_len = datalen; + } + + /* prepare an undo record */ + undorecord.uur_type = UNDO_DELETE; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = xlundohdr->reloid; + undorecord.uur_prevxid = xlrec->prevxid; + undorecord.uur_xid = xid; + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = xlundohdr->blkprev; + undorecord.uur_block = ItemPointerGetBlockNumber(&target_tid); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&target_tid); + + initStringInfo(&undorecord.uur_tuple); + + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_len, + sizeof(uint32)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_self, + sizeof(ItemPointerData)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &zheaptup.t_tableOid, + sizeof(Oid)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) zheaptup.t_data, + zheaptup.t_len); + + if (xlrec->flags & XLZ_DELETE_CONTAINS_TPD_SLOT) + { + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) tpd_trans_slot_id, + sizeof(*tpd_trans_slot_id)); + hasPayload = true; + } + + /* + * For sub-tranasctions, we store the dummy contains subxact token in the + * undorecord so that, the size of undorecord in DO function matches with + * the size of undorecord in REDO function. This ensures that, for + * sub-transactions, the assert condition used later in this + * function to ensure that the undo pointer in DO and REDO function remains + * the same is true. + */ + if (xlrec->flags & XLZ_DELETE_CONTAINS_SUBXACT) + { + SubTransactionId dummy_subXactToken = 1; + + if (!hasPayload) + { + initStringInfo(&undorecord.uur_payload); + hasPayload = true; + } + + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &dummy_subXactToken, + sizeof(SubTransactionId)); + } + + if (!hasPayload) + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + InsertPreparedUndo(); + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). 
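The appendBinaryStringInfo() sequence above packs the old tuple as length, TID, table OID and then the raw tuple data; the update replay below repeats the same layout. A sketch of the shared shape, with append_ztuple_to_undo as a hypothetical helper name (not part of the patch):

static void
append_ztuple_to_undo(UnpackedUndoRecord *urec, ZHeapTuple ztup)
{
	/* Pack the tuple in the layout the DO and REDO paths both expect. */
	initStringInfo(&urec->uur_tuple);
	appendBinaryStringInfo(&urec->uur_tuple,
						   (char *) &ztup->t_len, sizeof(uint32));
	appendBinaryStringInfo(&urec->uur_tuple,
						   (char *) &ztup->t_self, sizeof(ItemPointerData));
	appendBinaryStringInfo(&urec->uur_tuple,
						   (char *) &ztup->t_tableOid, sizeof(Oid));
	appendBinaryStringInfo(&urec->uur_tuple,
						   (char *) ztup->t_data, ztup->t_len);
}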
+ */ + Assert (urecptr == xlundohdr->urec_ptr); + + if (action == BLK_NEEDS_REDO) + { + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + ZHeapTupleHeaderSetXactSlot(zheaptup.t_data, xlrec->trans_slot_id); + zheaptup.t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + zheaptup.t_data->t_infomask = xlrec->infomask; + + if (xlrec->flags & XLZ_DELETE_IS_PARTITION_MOVE) + ZHeapTupleHeaderSetMovedPartitions(zheaptup.t_data); + + PageSetUNDO(undorecord, buffer, xlrec->trans_slot_id, + false, xid_epoch, xid, urecptr, NULL, 0); + + /* Mark the page as a candidate for pruning */ + ZPageSetPrunable(page, XLogRecGetXid(record)); + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + TPDPageSetUndo(buffer, + xlrec->trans_slot_id, + true, + xid_epoch, + xid, + urecptr, + &undorecord.uur_offset, + 1); + TPDPageSetLSN(page, lsn); + } + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + /* be tidy */ + pfree(undorecord.uur_tuple.data); + if (undorecord.uur_payload.len > 0) + pfree(undorecord.uur_payload.data); + + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + FreeFakeRelcacheEntry(reln); +} + +static void +zheap_xlog_update(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr; + xl_undo_header *xlnewundohdr = NULL; + xl_zheap_header xlhdr; + Size recordlen; + Size freespace = 0; + xl_zheap_update *xlrec; + Buffer oldbuffer, newbuffer; + Page oldpage, newpage; + ZHeapTupleData oldtup; + ZHeapTupleHeader newtup; + union + { + ZHeapTupleHeaderData hdr; + char data[MaxZHeapTupleSize]; + } tbuf; + UnpackedUndoRecord undorecord, newundorecord; + UndoRecPtr urecptr = InvalidUndoRecPtr; + UndoRecPtr newurecptr = InvalidUndoRecPtr; + RelFileNode rnode; + BlockNumber oldblk, newblk; + ItemPointerData oldtid, newtid; + XLogRedoAction oldaction, newaction; + Relation reln; + ItemId lp = NULL; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + int *old_tup_trans_slot_id = NULL; + int *new_trans_slot_id = NULL; + int trans_slot_id; + bool inplace_update; + + xlundohdr = (xl_undo_header *) XLogRecGetData(record); + xlrec = (xl_zheap_update *) ((char *) xlundohdr + SizeOfUndoHeader); + recordlen = XLogRecGetDataLen(record); + + if (xlrec->flags & XLZ_UPDATE_OLD_CONTAINS_TPD_SLOT) + { + old_tup_trans_slot_id = (int *) ((char *) xlrec + SizeOfZHeapUpdate); + } + if (xlrec->flags & XLZ_NON_INPLACE_UPDATE) + { + inplace_update = false; + if (old_tup_trans_slot_id) + xlnewundohdr = (xl_undo_header *) ((char *) old_tup_trans_slot_id + sizeof(*old_tup_trans_slot_id)); + else + xlnewundohdr = (xl_undo_header *) ((char *) xlrec + SizeOfZHeapUpdate); + + if (xlrec->flags & XLZ_UPDATE_NEW_CONTAINS_TPD_SLOT) + new_trans_slot_id = (int *) ((char *) xlnewundohdr + SizeOfUndoHeader); + } + else + { + inplace_update = true; + } + + XLogRecGetBlockTag(record, 0, &rnode, NULL, &newblk); + if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk)) + { + /* inplace updates are never done across pages */ + Assert(!inplace_update); + } + else + oldblk = newblk; + + ItemPointerSet(&oldtid, oldblk, xlrec->old_offnum); + ItemPointerSet(&newtid, newblk, xlrec->new_offnum); + + reln = CreateFakeRelcacheEntry(rnode); + + /* + * The visibility map may need to be fixed even if the zheap page is + * already up-to-date. 
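The pin/clear/release sequence used here also appears in the insert and delete replay routines above, and again for the new page further down. A sketch of how it could be shared, with zheap_redo_clear_vm as a hypothetical name (not part of the patch; it uses only calls already present in these routines):

static void
zheap_redo_clear_vm(Relation reln, BlockNumber blkno)
{
	Buffer		vmbuffer = InvalidBuffer;

	/* Clear all visibility map bits for the affected block during redo. */
	visibilitymap_pin(reln, blkno, &vmbuffer);
	visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
	ReleaseBuffer(vmbuffer);
}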
+ */ + if (xlrec->flags & XLZ_UPDATE_OLD_ALL_VISIBLE_CLEARED) + { + Buffer vmbuffer = InvalidBuffer; + + visibilitymap_pin(reln, oldblk, &vmbuffer); + visibilitymap_clear(reln, oldblk, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + } + + oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1, &oldbuffer); + + oldpage = BufferGetPage(oldbuffer); + + if (PageGetMaxOffsetNumber(oldpage) >= xlrec->old_offnum) + lp = PageGetItemId(oldpage, xlrec->old_offnum); + + if (PageGetMaxOffsetNumber(oldpage) < xlrec->old_offnum || !ItemIdIsNormal(lp)) + elog(PANIC, "invalid lp"); + + oldtup.t_tableOid = RelationGetRelid(reln); + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(oldpage, lp); + oldtup.t_len = ItemIdGetLength(lp); + oldtup.t_self = oldtid; + + /* + * If the WAL stream contains undo tuple, then replace it with the + * explicitly stored tuple. + */ + if (xlrec->flags & XLZ_HAS_UPDATE_UNDOTUPLE) + { + ZHeapTupleHeader zhtup; + Size datalen; + char *data; + + /* There is an additional undo header for non-inplace-update. */ + if (inplace_update) + { + if (old_tup_trans_slot_id) + { + data = (char *) ((char *) old_tup_trans_slot_id + sizeof(*old_tup_trans_slot_id)); + datalen = recordlen - SizeOfUndoHeader - SizeOfZHeapUpdate - + sizeof(*old_tup_trans_slot_id) - SizeOfZHeapHeader; + } + else + { + data = (char *) xlrec + SizeOfZHeapUpdate; + datalen = recordlen - SizeOfUndoHeader - SizeOfZHeapUpdate - SizeOfZHeapHeader; + } + } + else + { + if (old_tup_trans_slot_id && new_trans_slot_id) + { + datalen = recordlen - (2 * SizeOfUndoHeader) - SizeOfZHeapUpdate - + sizeof(*old_tup_trans_slot_id) - sizeof(*new_trans_slot_id) - + SizeOfZHeapHeader; + data = (char *) ((char *) new_trans_slot_id + sizeof(*new_trans_slot_id)); + } + else if (new_trans_slot_id) + { + datalen = recordlen - (2 * SizeOfUndoHeader) - SizeOfZHeapUpdate - + sizeof(*new_trans_slot_id) - SizeOfZHeapHeader; + data = (char *) ((char *) new_trans_slot_id + sizeof(*new_trans_slot_id)); + } + else if (old_tup_trans_slot_id) + { + datalen = recordlen - (2 * SizeOfUndoHeader) - SizeOfZHeapUpdate - + sizeof(*old_tup_trans_slot_id) - SizeOfZHeapHeader; + data = (char *) xlnewundohdr + SizeOfUndoHeader; + } + else + { + datalen = recordlen - (2 * SizeOfUndoHeader) - SizeOfZHeapUpdate - + SizeOfZHeapHeader; + data = (char *) xlnewundohdr + SizeOfUndoHeader; + } + } + + memcpy((char *) &xlhdr, data, SizeOfZHeapHeader); + data += SizeOfZHeapHeader; + + zhtup = &tbuf.hdr; + MemSet((char *) zhtup, 0, SizeofZHeapTupleHeader); + /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */ + memcpy((char *) zhtup + SizeofZHeapTupleHeader, + data, + datalen); + datalen += SizeofZHeapTupleHeader; + zhtup->t_infomask2 = xlhdr.t_infomask2; + zhtup->t_infomask = xlhdr.t_infomask; + zhtup->t_hoff = xlhdr.t_hoff; + + oldtup.t_data = zhtup; + oldtup.t_len = datalen; + } + + /* prepare an undo record */ + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = xlundohdr->reloid; + undorecord.uur_prevxid = xlrec->prevxid; + undorecord.uur_xid = xid; + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = xlundohdr->blkprev; + undorecord.uur_block = ItemPointerGetBlockNumber(&oldtid); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&oldtid); + undorecord.uur_payload.len = 0; + + initStringInfo(&undorecord.uur_tuple); + + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &oldtup.t_len, + sizeof(uint32)); + appendBinaryStringInfo(&undorecord.uur_tuple, 
+ (char *) &oldtup.t_self, + sizeof(ItemPointerData)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) &oldtup.t_tableOid, + sizeof(Oid)); + appendBinaryStringInfo(&undorecord.uur_tuple, + (char *) oldtup.t_data, + oldtup.t_len); + + if (inplace_update) + { + bool hasPayload = false; + + undorecord.uur_type = UNDO_INPLACE_UPDATE; + if (old_tup_trans_slot_id) + { + Assert(*old_tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS); + initStringInfo(&undorecord.uur_payload); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) old_tup_trans_slot_id, + sizeof(*old_tup_trans_slot_id)); + hasPayload = true; + } + + /* + * For sub-tranasctions, we store the dummy contains subxact token in the + * undorecord so that, the size of undorecord in DO function matches with + * the size of undorecord in REDO function. This ensures that, for + * sub-transactions, the assert condition used later in this + * function to ensure that the undo pointer in DO and REDO function remains + * the same is true. + */ + if (xlrec->flags & XLZ_UPDATE_CONTAINS_SUBXACT) + { + SubTransactionId dummy_subXactToken = 1; + + if (!hasPayload) + { + initStringInfo(&undorecord.uur_payload); + hasPayload = true; + } + + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &dummy_subXactToken, + sizeof(SubTransactionId)); + } + + if (!hasPayload) + undorecord.uur_payload.len = 0; + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + } + else + { + UnpackedUndoRecord undorec[2]; + + undorecord.uur_type = UNDO_UPDATE; + initStringInfo(&undorecord.uur_payload); + /* update new tuple location in undo record */ + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &newtid, + sizeof(ItemPointerData)); + /* add the TPD slot id */ + if (old_tup_trans_slot_id) + { + Assert(*old_tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) old_tup_trans_slot_id, + sizeof(*old_tup_trans_slot_id)); + } + + /* + * For sub-tranasctions, we store the dummy contains subxact token in the + * undorecord so that, the size of undorecord in DO function matches with + * the size of undorecord in REDO function. This ensures that, for + * sub-transactions, the assert condition used later in this + * function to ensure that the undo pointer in DO and REDO function remains + * the same is true. 
+ */ + if (xlrec->flags & XLZ_UPDATE_CONTAINS_SUBXACT) + { + SubTransactionId dummy_subXactToken = 1; + + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &dummy_subXactToken, + sizeof(SubTransactionId)); + } + + /* prepare an undo record for new tuple */ + newundorecord.uur_type = UNDO_INSERT; + newundorecord.uur_info = 0; + newundorecord.uur_prevlen = 0; + newundorecord.uur_reloid = xlnewundohdr->reloid; + newundorecord.uur_prevxid = xid; + newundorecord.uur_xid = xid; + newundorecord.uur_cid = FirstCommandId; + newundorecord.uur_fork = MAIN_FORKNUM; + newundorecord.uur_blkprev = xlnewundohdr->blkprev; + newundorecord.uur_block = ItemPointerGetBlockNumber(&newtid); + newundorecord.uur_offset = ItemPointerGetOffsetNumber(&newtid); + newundorecord.uur_tuple.len = 0; + + if (new_trans_slot_id) + { + Assert(*new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS); + initStringInfo(&newundorecord.uur_payload); + appendBinaryStringInfo(&newundorecord.uur_payload, + (char *) new_trans_slot_id, + sizeof(*new_trans_slot_id)); + } + else + newundorecord.uur_payload.len = 0; + + undorec[0] = undorecord; + undorec[1] = newundorecord; + + UndoSetPrepareSize(undorec, 2, xid, UNDO_PERMANENT, NULL); + undorecord = undorec[0]; + newundorecord = undorec[1]; + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + newurecptr = PrepareUndoInsert(&newundorecord, xid, UNDO_PERMANENT, NULL); + + Assert (newurecptr == xlnewundohdr->urec_ptr); + } + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). + */ + Assert (urecptr == xlundohdr->urec_ptr); + + InsertPreparedUndo(); + + /* Ensure old tuple points to the tuple in page. */ + oldtup.t_data = (ZHeapTupleHeader) PageGetItem(oldpage, lp); + oldtup.t_len = ItemIdGetLength(lp); + + /* First deal with old tuple */ + if (oldaction == BLK_NEEDS_REDO) + { + oldtup.t_data->t_infomask &= ~ZHEAP_VIS_STATUS_MASK; + oldtup.t_data->t_infomask = xlrec->old_infomask; + ZHeapTupleHeaderSetXactSlot(oldtup.t_data, xlrec->old_trans_slot_id); + + if (oldblk != newblk) + PageSetUNDO(undorecord, oldbuffer, xlrec->old_trans_slot_id, + false, xid_epoch, xid, urecptr, NULL, + 0); + + /* Mark the page as a candidate for pruning */ + if (!inplace_update) + ZPageSetPrunable(oldpage, XLogRecGetXid(record)); + + PageSetLSN(oldpage, lsn); + MarkBufferDirty(oldbuffer); + } + + /* + * Read the page the new tuple goes into, if different from old. + */ + if (oldblk == newblk) + { + newbuffer = oldbuffer; + newaction = oldaction; + } + else if (XLogRecGetInfo(record) & XLOG_ZHEAP_INIT_PAGE) + { + newbuffer = XLogInitBufferForRedo(record, 0); + newpage = (Page) BufferGetPage(newbuffer); + ZheapInitPage(newpage, BufferGetPageSize(newbuffer)); + newaction = BLK_NEEDS_REDO; + } + else + newaction = XLogReadBufferForRedo(record, 0, &newbuffer); + + newpage = BufferGetPage(newbuffer); + + /* + * The visibility map may need to be fixed even if the zheap page is + * already up-to-date. 
+ */ + if (xlrec->flags & XLZ_UPDATE_NEW_ALL_VISIBLE_CLEARED) + { + Buffer vmbuffer = InvalidBuffer; + + visibilitymap_pin(reln, newblk, &vmbuffer); + visibilitymap_clear(reln, newblk, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + } + + if (newaction == BLK_NEEDS_REDO) + { + uint16 prefixlen = 0, + suffixlen = 0; + char *newp; + char *recdata; + char *recdata_end; + Size datalen; + Size tuplen; + uint32 newlen; + + if (PageGetMaxOffsetNumber(newpage) + 1 < xlrec->new_offnum) + elog(PANIC, "invalid max offset number"); + + recdata = XLogRecGetBlockData(record, 0, &datalen); + recdata_end = recdata + datalen; + + if (xlrec->flags & XLZ_UPDATE_PREFIX_FROM_OLD) + { + Assert(newblk == oldblk); + memcpy(&prefixlen, recdata, sizeof(uint16)); + recdata += sizeof(uint16); + } + if (xlrec->flags & XLZ_UPDATE_SUFFIX_FROM_OLD) + { + Assert(newblk == oldblk); + memcpy(&suffixlen, recdata, sizeof(uint16)); + recdata += sizeof(uint16); + } + + memcpy((char *) &xlhdr, recdata, SizeOfZHeapHeader); + recdata += SizeOfZHeapHeader; + + tuplen = recdata_end - recdata; + Assert(tuplen <= MaxZHeapTupleSize); + + newtup = &tbuf.hdr; + MemSet((char *) newtup, 0, SizeofZHeapTupleHeader); + + /* + * Reconstruct the new tuple using the prefix and/or suffix from the + * old tuple, and the data stored in the WAL record. + */ + newp = (char *) newtup + SizeofZHeapTupleHeader; + if (prefixlen > 0) + { + int len; + + /* copy bitmap [+ padding] [+ oid] from WAL record */ + len = xlhdr.t_hoff - SizeofZHeapTupleHeader; + memcpy(newp, recdata, len); + recdata += len; + newp += len; + + /* copy prefix from old tuple */ + memcpy(newp, (char *) oldtup.t_data + oldtup.t_data->t_hoff, prefixlen); + newp += prefixlen; + + /* copy new tuple data from WAL record */ + len = tuplen - (xlhdr.t_hoff - SizeofZHeapTupleHeader); + memcpy(newp, recdata, len); + recdata += len; + newp += len; + } + else + { + /* + * copy bitmap [+ padding] [+ oid] + data from record, all in one + * go + */ + memcpy(newp, recdata, tuplen); + recdata += tuplen; + newp += tuplen; + } + Assert(recdata == recdata_end); + + /* copy suffix from old tuple */ + if (suffixlen > 0) + memcpy(newp, (char *) oldtup.t_data + oldtup.t_len - suffixlen, suffixlen); + + newlen = SizeofZHeapTupleHeader + tuplen + prefixlen + suffixlen; + newtup->t_infomask2 = xlhdr.t_infomask2; + newtup->t_infomask = xlhdr.t_infomask; + newtup->t_hoff = xlhdr.t_hoff; + if (new_trans_slot_id) + trans_slot_id = *new_trans_slot_id; + else + trans_slot_id = ZHeapTupleHeaderGetXactSlot(newtup); + + if (inplace_update) + { + /* + * For inplace updates, we copy the entire data portion including the + * tuple header. + */ + ItemIdChangeLen(lp, newlen); + memcpy((char *) oldtup.t_data, (char *) newtup, newlen); + + if (newlen < oldtup.t_len) + { + /* new tuple is smaller, a prunable cadidate */ + Assert (oldpage == newpage); + ZPageSetPrunable(newpage, XLogRecGetXid(record)); + } + + PageSetUNDO(undorecord, newbuffer, xlrec->old_trans_slot_id, + false, xid_epoch, xid, urecptr, + NULL, 0); + } + else + { + if (ZPageAddItem(newbuffer, NULL, (Item) newtup, newlen, xlrec->new_offnum, + true, true, true) == InvalidOffsetNumber) + elog(PANIC, "failed to add tuple"); + PageSetUNDO((newbuffer == oldbuffer) ? 
undorecord : newundorecord, + newbuffer, trans_slot_id, false, xid_epoch, xid, + newurecptr, NULL, 0); + } + + freespace = PageGetHeapFreeSpace(newpage); /* needed to update FSM below */ + + PageSetLSN(newpage, lsn); + MarkBufferDirty(newbuffer); + } + + /* replay the record for tpd buffer corresponding to oldbuf */ + if (XLogRecHasBlockRef(record, 2)) + { + if (XLogReadTPDBuffer(record, 2) == BLK_NEEDS_REDO) + { + OffsetNumber usedoff[2]; + int ucnt; + + if (!inplace_update && newbuffer == oldbuffer) + { + usedoff[0] = undorecord.uur_offset; + usedoff[1] = newundorecord.uur_offset; + ucnt = 2; + } + else + { + usedoff[0] = undorecord.uur_offset; + ucnt = 1; + } + if (xlrec->old_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + if (inplace_update) + { + TPDPageSetUndo(oldbuffer, + xlrec->old_trans_slot_id, + true, + xid_epoch, + xid, + urecptr, + usedoff, + ucnt); + } + else + { + TPDPageSetUndo(oldbuffer, + xlrec->old_trans_slot_id, + true, + xid_epoch, + xid, + (oldblk == newblk) ? newurecptr : urecptr, + usedoff, + ucnt); + } + TPDPageSetLSN(oldpage, lsn); + } + } + } + + /* replay the record for tpd buffer corresponding to newbuf */ + if (XLogRecHasBlockRef(record, 3)) + { + if (XLogReadTPDBuffer(record, 3) == BLK_NEEDS_REDO) + { + TPDPageSetUndo(newbuffer, + *new_trans_slot_id, + true, + xid_epoch, + xid, + newurecptr, + &newundorecord.uur_offset, + 1); + TPDPageSetLSN(newpage, lsn); + } + } + else if (new_trans_slot_id && (*new_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS)) + { + TPDPageSetUndo(newbuffer, + *new_trans_slot_id, + true, + xid_epoch, + xid, + newurecptr, + &newundorecord.uur_offset, + 1); + TPDPageSetLSN(newpage, lsn); + } + if (BufferIsValid(newbuffer) && newbuffer != oldbuffer) + UnlockReleaseBuffer(newbuffer); + if (BufferIsValid(oldbuffer)) + UnlockReleaseBuffer(oldbuffer); + + /* be tidy */ + pfree(undorecord.uur_tuple.data); + if (undorecord.uur_payload.len > 0) + pfree(undorecord.uur_payload.data); + + if (!inplace_update && newundorecord.uur_payload.len > 0) + pfree(newundorecord.uur_payload.data); + + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + FreeFakeRelcacheEntry(reln); + + /* + * Update the freespace. We don't need to update it for inplace updates as + * they won't freeup any space or consume any extra space assuming the new + * tuple is about the same size as the old one. See heap_xlog_update. + */ + if (newaction == BLK_NEEDS_REDO && !inplace_update && freespace < BLCKSZ / 5) + XLogRecordPageWithFreeSpace(rnode, newblk, freespace); +} + +static void +zheap_xlog_freeze_xact_slot(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer buffer; + Page page; + xl_zheap_freeze_xact_slot *xlrec = + (xl_zheap_freeze_xact_slot *) XLogRecGetData(record); + XLogRedoAction action, tpdaction = -1; + int *frozen; + int i; + bool hasTPDSlot = false; + + /* There must be some frozen slots.*/ + Assert(xlrec->nFrozen > 0); + + /* + * In Hot Standby mode, ensure that no running query conflicts with the + * frozen xids. + */ + if (InHotStandby) + { + RelFileNode rnode; + + /* + * FIXME: We need some handling for transaction wraparound. 
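+ * (Presumably the concern is that lastestFrozenXid is compared against
+ * standby snapshots as a plain TransactionId, without its epoch, so a
+ * comparison that spans a wraparound could cancel the wrong queries or
+ * none at all.)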
+ */ + TransactionId lastestFrozenXid = xlrec->lastestFrozenXid; + + XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL); + ResolveRecoveryConflictWithSnapshot(lastestFrozenXid, rnode); + } + + frozen = (int *) ((char *) xlrec + SizeOfZHeapFreezeXactSlot); + + action = XLogReadBufferForRedo(record, 0, &buffer); + if (XLogRecHasBlockRef(record, 1)) + { + tpdaction = XLogReadTPDBuffer(record, 1); + hasTPDSlot = true; + } + + page = BufferGetPage(buffer); + + if (action == BLK_NEEDS_REDO) + { + ZHeapPageOpaque opaque; + int slot_no; + if (hasTPDSlot) + { + zheap_freeze_or_invalidate_tuples(buffer, xlrec->nFrozen, frozen, + true, true); + } + else + { + zheap_freeze_or_invalidate_tuples(buffer, xlrec->nFrozen, frozen, + true, false); + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + /* Initialize the frozen slots. */ + for (i = 0; i < xlrec->nFrozen; i++) + { + slot_no = frozen[i]; + + opaque->transinfo[slot_no].xid_epoch = 0; + opaque->transinfo[slot_no].xid = InvalidTransactionId; + opaque->transinfo[slot_no].urec_ptr = InvalidUndoRecPtr; + } + } + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + + if (tpdaction == BLK_NEEDS_REDO) + { + /* Initialize the frozen slots. */ + for (i = 0; i < xlrec->nFrozen; i++) + { + int tpd_slot_id; + + /* Calculate the actual slot no. */ + tpd_slot_id = frozen[i] + ZHEAP_PAGE_TRANS_SLOTS + 1; + + /* Clear slot information from the TPD slot. */ + TPDPageSetTransactionSlotInfo(buffer, tpd_slot_id, 0, + InvalidTransactionId, + InvalidUndoRecPtr); + } + + TPDPageSetLSN(page, lsn); + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + UnlockReleaseTPDBuffers(); +} + +static void +zheap_xlog_invalid_xact_slot(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer buffer; + Page page; + char *data = XLogRecGetData(record); + uint16 nCompletedSlots; + XLogRedoAction action, tpdaction = -1; + int *completed_slots; + int i; + bool hasTPDSlot = false; + + nCompletedSlots = *(uint16 *) data; + + /* There must be some frozen slots.*/ + Assert(nCompletedSlots > 0); + + completed_slots = (int *) ((char *) data + sizeof(uint16)); + + action = XLogReadBufferForRedo(record, 0, &buffer); + if (XLogRecHasBlockRef(record, 1)) + { + tpdaction = XLogReadTPDBuffer(record, 1); + hasTPDSlot = true; + } + page = BufferGetPage(buffer); + + if (action == BLK_NEEDS_REDO) + { + ZHeapPageOpaque opaque; + int slot_no; + + opaque = (ZHeapPageOpaque) PageGetSpecialPointer(page); + + /* clear the transaction slot info on tuples. */ + if (hasTPDSlot) + { + zheap_freeze_or_invalidate_tuples(buffer, nCompletedSlots, + completed_slots, false, true); + } + else + { + zheap_freeze_or_invalidate_tuples(buffer, nCompletedSlots, + completed_slots, false, false); + + /* Clear xid from the slots. */ + for (i = 0; i < nCompletedSlots; i++) + { + slot_no = completed_slots[i]; + opaque->transinfo[slot_no].xid_epoch = 0; + opaque->transinfo[slot_no].xid = InvalidTransactionId; + } + } + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + if (tpdaction == BLK_NEEDS_REDO) + { + TransInfo *tpd_slots; + + /* + * Read TPD slot array. So that we can keep the slot urec_ptr + * intact while clearing the transaction id from the slot. + */ + tpd_slots = TPDPageGetTransactionSlots(NULL, buffer, + InvalidOffsetNumber, + true, false, NULL, NULL, + NULL, NULL, NULL); + + for (i = 0; i < nCompletedSlots; i++) + { + int tpd_slot_id; + + /* Calculate the actual slot no. 
*/ + tpd_slot_id = completed_slots[i] + ZHEAP_PAGE_TRANS_SLOTS + 1; + + /* Clear the XID information from the TPD. */ + TPDPageSetTransactionSlotInfo(buffer, tpd_slot_id, 0, + InvalidTransactionId, + tpd_slots[completed_slots[i]].urec_ptr); + } + + TPDPageSetLSN(page, lsn); + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + UnlockReleaseTPDBuffers(); +} + +static void +zheap_xlog_lock(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr = (xl_undo_header *) XLogRecGetData(record); + xl_zheap_lock *xlrec; + Buffer buffer; + Page page; + ZHeapTupleData zheaptup; + char *tup_hdr; + UnpackedUndoRecord undorecord; + UndoRecPtr urecptr; + RelFileNode target_node; + BlockNumber blkno; + ItemPointerData target_tid; + XLogRedoAction action; + Relation reln; + ItemId lp = NULL; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + int *trans_slot_for_urec = NULL; + int *tup_trans_slot_id = NULL; + int undo_slot_no; + + xlrec = (xl_zheap_lock *) ((char *) xlundohdr + SizeOfUndoHeader); + + XLogRecGetBlockTag(record, 0, &target_node, NULL, &blkno); + ItemPointerSet(&target_tid, blkno, xlrec->offnum); + + reln = CreateFakeRelcacheEntry(target_node); + action = XLogReadBufferForRedo(record, 0, &buffer); + page = BufferGetPage(buffer); + + if (PageGetMaxOffsetNumber(page) >= xlrec->offnum) + lp = PageGetItemId(page, xlrec->offnum); + + if (PageGetMaxOffsetNumber(page) < xlrec->offnum || !ItemIdIsNormal(lp)) + elog(PANIC, "invalid lp"); + + zheaptup.t_tableOid = RelationGetRelid(reln); + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + zheaptup.t_self = target_tid; + + /* + * WAL stream contains undo tuple header, replace it with the explicitly + * stored tuple header. + */ + tup_hdr = (char *) xlrec + SizeOfZHeapLock; + + /* prepare an undo record */ + if (ZHeapTupleHasMultiLockers(xlrec->infomask)) + undorecord.uur_type = UNDO_XID_MULTI_LOCK_ONLY; + else if (xlrec->flags & XLZ_LOCK_FOR_UPDATE) + undorecord.uur_type = UNDO_XID_LOCK_FOR_UPDATE; + else + undorecord.uur_type = UNDO_XID_LOCK_ONLY; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = xlundohdr->reloid; + undorecord.uur_prevxid = xlrec->prev_xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = xlundohdr->blkprev; + undorecord.uur_block = ItemPointerGetBlockNumber(&target_tid); + undorecord.uur_offset = ItemPointerGetOffsetNumber(&target_tid); + + initStringInfo(&undorecord.uur_payload); + initStringInfo(&undorecord.uur_tuple); + appendBinaryStringInfo(&undorecord.uur_tuple, + tup_hdr, + SizeofZHeapTupleHeader); + + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) (tup_hdr + SizeofZHeapTupleHeader), + sizeof(LockTupleMode)); + + if (xlrec->flags & XLZ_LOCK_TRANS_SLOT_FOR_UREC) + { + trans_slot_for_urec = (int *) ((char *) tup_hdr + + SizeofZHeapTupleHeader + sizeof(LockTupleMode)); + if (xlrec->trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &(xlrec->trans_slot_id), + sizeof(int)); + } + else if (xlrec->flags & XLZ_LOCK_CONTAINS_TPD_SLOT) + { + tup_trans_slot_id = (int *) ((char *) tup_hdr + + SizeofZHeapTupleHeader + sizeof(LockTupleMode)); + /* + * We must have logged the tuple's original transaction slot if it is a TPD + * slot. 
+ */ + Assert(*tup_trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS); + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) tup_trans_slot_id, + sizeof(*tup_trans_slot_id)); + } + + /* + * For sub-tranasctions, we store the dummy contains subxact token in the + * undorecord so that, the size of undorecord in DO function matches with + * the size of undorecord in REDO function. This ensures that, for + * sub-transactions, the assert condition used later in this + * function to ensure that the undo pointer in DO and REDO function remains + * the same is true. + */ + if (xlrec->flags & XLZ_LOCK_CONTAINS_SUBXACT) + { + SubTransactionId dummy_subXactToken = 1; + + appendBinaryStringInfo(&undorecord.uur_payload, + (char *) &dummy_subXactToken, + sizeof(SubTransactionId)); + } + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + InsertPreparedUndo(); + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). + */ + Assert (urecptr == xlundohdr->urec_ptr); + + if (trans_slot_for_urec) + undo_slot_no = *trans_slot_for_urec; + else + undo_slot_no = xlrec->trans_slot_id; + + if (action == BLK_NEEDS_REDO) + { + zheaptup.t_data = (ZHeapTupleHeader) PageGetItem(page, lp); + zheaptup.t_len = ItemIdGetLength(lp); + ZHeapTupleHeaderSetXactSlot(zheaptup.t_data, xlrec->trans_slot_id); + zheaptup.t_data->t_infomask = xlrec->infomask; + PageSetUNDO(undorecord, buffer, undo_slot_no, false, xid_epoch, + xid, urecptr, NULL, 0); + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + TPDPageSetUndo(buffer, + undo_slot_no, + (xlrec->flags & XLZ_LOCK_FOR_UPDATE) ? true : false, + xid_epoch, + xid, + urecptr, + &undorecord.uur_offset, + 1); + TPDPageSetLSN(page, lsn); + } + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + /* be tidy */ + pfree(undorecord.uur_tuple.data); + pfree(undorecord.uur_payload.data); + + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + FreeFakeRelcacheEntry(reln); +} + +static void +zheap_xlog_multi_insert(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr; + xl_zheap_multi_insert *xlrec; + RelFileNode rnode; + BlockNumber blkno; + Buffer buffer; + Page page; + union + { + ZHeapTupleHeaderData hdr; + char data[MaxZHeapTupleSize]; + } tbuf; + ZHeapTupleHeader zhtup; + uint32 newlen; + UnpackedUndoRecord *undorecord = NULL; + UndoRecPtr urecptr = InvalidUndoRecPtr, + prev_urecptr = InvalidUndoRecPtr; + int i; + int nranges; + int ucnt = 0; + OffsetNumber usedoff[MaxOffsetNumber]; + bool isinit = (XLogRecGetInfo(record) & XLOG_ZHEAP_INIT_PAGE) != 0; + XLogRedoAction action; + char *ranges_data; + int *tpd_trans_slot_id = NULL; + Size ranges_data_size = 0; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + ZHeapFreeOffsetRanges *zfree_offset_ranges; + bool skip_undo; + + xlundohdr = (xl_undo_header *) XLogRecGetData(record); + xlrec = (xl_zheap_multi_insert *) ((char *) xlundohdr + SizeOfUndoHeader); + + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); + + /* + * The visibility map may need to be fixed even if the heap page is + * already up-to-date. 
+ */ + if (xlrec->flags & XLZ_INSERT_ALL_VISIBLE_CLEARED) + { + Relation reln = CreateFakeRelcacheEntry(rnode); + Buffer vmbuffer = InvalidBuffer; + + visibilitymap_pin(reln, blkno, &vmbuffer); + visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS); + ReleaseBuffer(vmbuffer); + FreeFakeRelcacheEntry(reln); + } + + if (isinit) + { + /* It is asked for page init, insert should not have tpd slot. */ + Assert(!(xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT)); + buffer = XLogInitBufferForRedo(record, 0); + page = BufferGetPage(buffer); + ZheapInitPage(page, BufferGetPageSize(buffer)); + action = BLK_NEEDS_REDO; + } + else + action = XLogReadBufferForRedo(record, 0, &buffer); + + /* allocate the information related to offset ranges */ + ranges_data = (char *) xlrec + SizeOfZHeapMultiInsert; + + /* fetch number of distinct ranges */ + nranges = *(int *) ranges_data; + ranges_data += sizeof(int); + ranges_data_size += sizeof(int); + + zfree_offset_ranges = (ZHeapFreeOffsetRanges *) palloc0(sizeof(ZHeapFreeOffsetRanges)); + Assert(nranges > 0); + for (i = 0; i < nranges; i++) + { + memcpy(&zfree_offset_ranges->startOffset[i],(char *) ranges_data, sizeof(OffsetNumber)); + ranges_data += sizeof(OffsetNumber); + memcpy(&zfree_offset_ranges->endOffset[i],(char *) ranges_data, sizeof(OffsetNumber)); + ranges_data += sizeof(OffsetNumber); + } + + /* + * We can skip inserting undo records if the tuples are to be marked + * as frozen. + */ + skip_undo= (xlrec->flags & XLZ_INSERT_IS_FROZEN); + if (!skip_undo) + { + undorecord = (UnpackedUndoRecord *) palloc(nranges * sizeof(UnpackedUndoRecord)); + + /* Start UNDO prepare Stuff */ + prev_urecptr = xlundohdr->blkprev; + urecptr = prev_urecptr; + + for (i = 0; i < nranges; i++) + { + /* prepare an undo record */ + undorecord[i].uur_type = UNDO_MULTI_INSERT; + undorecord[i].uur_info = 0; + undorecord[i].uur_prevlen = 0; + undorecord[i].uur_reloid = xlundohdr->reloid; + undorecord[i].uur_prevxid = xid; + undorecord[i].uur_prevxid = FrozenTransactionId; + undorecord[i].uur_cid = FirstCommandId; + undorecord[i].uur_fork = MAIN_FORKNUM; + undorecord[i].uur_blkprev = urecptr; + undorecord[i].uur_block = blkno; + undorecord[i].uur_offset = 0; + undorecord[i].uur_tuple.len = 0; + undorecord[i].uur_payload.len = 2 * sizeof(OffsetNumber); + initStringInfo(&undorecord[i].uur_payload); + appendBinaryStringInfo(&undorecord[i].uur_payload, + (char *) ranges_data, + 2 * sizeof(OffsetNumber)); + + ranges_data += undorecord[i].uur_payload.len; + ranges_data_size += undorecord[i].uur_payload.len; + } + + UndoSetPrepareSize(undorecord, nranges, xid, + UNDO_PERMANENT, NULL); + for (i = 0; i < nranges; i++) + { + undorecord[i].uur_blkprev = urecptr; + urecptr = PrepareUndoInsert(&undorecord[i], xid, UNDO_PERMANENT, NULL); + } + + elog(DEBUG1, "Undo record prepared: %d for Block Number: %d", + nranges, blkno); + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). 
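+ * Redo prepares the same undo records, in the same order and of the same
+ * size, as the DO path did, so PrepareUndoInsert must hand back exactly
+ * the undo record pointer that was stamped into the WAL header; a
+ * mismatch would mean the undo log has diverged from the master.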
+ */ + Assert (urecptr == xlundohdr->urec_ptr); + + InsertPreparedUndo(); + } + + /* Get the tpd transaction slot number */ + if (xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT) + { + tpd_trans_slot_id = (int *) ((char *) xlrec + SizeOfZHeapMultiInsert + + ranges_data_size); + } + + /* Apply the wal for data */ + if (action == BLK_NEEDS_REDO) + { + char *tupdata; + char *endptr; + int trans_slot_id = 0; + int prev_trans_slot_id PG_USED_FOR_ASSERTS_ONLY; + Size len; + OffsetNumber offnum; + int j = 0; + bool first_time = true; + + prev_trans_slot_id = -1; + page = BufferGetPage(buffer); + + /* Tuples are stored as block data */ + tupdata = XLogRecGetBlockData(record, 0, &len); + endptr = tupdata + len; + + offnum = zfree_offset_ranges->startOffset[j]; + for (i = 0; i < xlrec->ntuples; i++) + { + xl_multi_insert_ztuple *xlhdr; + + /* + * If we're reinitializing the page, the tuples are stored in + * order from FirstOffsetNumber. Otherwise there's an array of + * offsets in the WAL record, and the tuples come after that. + */ + if (isinit) + offnum = FirstOffsetNumber + i; + else + { + /* + * Change the offset range if we've reached the end of current + * range. + */ + if (offnum > zfree_offset_ranges->endOffset[j]) + { + j++; + offnum = zfree_offset_ranges->startOffset[j]; + } + } + if (PageGetMaxOffsetNumber(page) + 1 < offnum) + elog(PANIC, "invalid max offset number"); + + xlhdr = (xl_multi_insert_ztuple *) SHORTALIGN(tupdata); + tupdata = ((char *) xlhdr) + SizeOfMultiInsertZTuple; + + newlen = xlhdr->datalen; + Assert(newlen <= MaxZHeapTupleSize); + zhtup = &tbuf.hdr; + MemSet((char *) zhtup, 0, SizeofZHeapTupleHeader); + /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */ + memcpy((char *) zhtup + SizeofZHeapTupleHeader, + (char *) tupdata, + newlen); + tupdata += newlen; + + newlen += SizeofZHeapTupleHeader; + zhtup->t_infomask2 = xlhdr->t_infomask2; + zhtup->t_infomask = xlhdr->t_infomask; + zhtup->t_hoff = xlhdr->t_hoff; + + if (ZPageAddItem(buffer, NULL, (Item) zhtup, newlen, offnum, + true, true, true) == InvalidOffsetNumber) + elog(PANIC, "failed to add tuple"); + + /* track used offsets */ + usedoff[ucnt++] = offnum; + + /* increase the offset to store next tuple */ + offnum++; + + if (!skip_undo) + { + if (tpd_trans_slot_id) + trans_slot_id = *tpd_trans_slot_id; + else + trans_slot_id = ZHeapTupleHeaderGetXactSlot(zhtup); + if (first_time) + { + prev_trans_slot_id = trans_slot_id; + first_time = false; + } + else + { + /* All the tuples must refer to same transaction slot. */ + Assert(prev_trans_slot_id == trans_slot_id); + prev_trans_slot_id = trans_slot_id; + } + } + } + + if (!skip_undo) + PageSetUNDO(undorecord[nranges-1], buffer, trans_slot_id, false, + xid_epoch, xid, urecptr, NULL, 0); + + PageSetLSN(page, lsn); + + MarkBufferDirty(buffer); + + if (tupdata != endptr) + elog(ERROR, "total tuple length mismatch"); + } + + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + /* + * We need to replay the record for TPD only when this record contains + * slot from TPD. 
+ */ + Assert(xlrec->flags & XLZ_INSERT_CONTAINS_TPD_SLOT); + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + /* prepare for the case where the data page is restored as is */ + if (ucnt == 0) + { + for (i = 0; i < nranges; i++) + { + OffsetNumber start_off, + end_off; + + start_off = ((OffsetNumber *) undorecord[i].uur_payload.data)[0]; + end_off = ((OffsetNumber *) undorecord[i].uur_payload.data)[1]; + + while (start_off <= end_off) + usedoff[ucnt++] = start_off++; + } + } + + TPDPageSetUndo(buffer, + *tpd_trans_slot_id, + true, + xid_epoch, + xid, + urecptr, + usedoff, + ucnt); + TPDPageSetLSN(BufferGetPage(buffer), lsn); + } + } + + /* be tidy */ + if (!skip_undo) + { + for (i = 0; i < nranges; i++) + pfree(undorecord[i].uur_payload.data); + pfree(undorecord); + } + pfree(zfree_offset_ranges); + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); +} + +/* + * Handles ZHEAP_CLEAN record type + */ +static void +zheap_xlog_clean(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_zheap_clean *xlrec = (xl_zheap_clean *) XLogRecGetData(record); + Buffer buffer; + Size freespace = 0; + RelFileNode rnode; + BlockNumber blkno; + XLogRedoAction action; + OffsetNumber *target_offnum; + Size *space_required; + + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); + + /* + * We're about to remove tuples. In Hot Standby mode, ensure that there's + * no queries running for which the removed tuples are still visible. + * + * Not all ZHEAP_CLEAN records remove tuples with xids, so we only want to + * conflict on the records that cause MVCC failures for user queries. If + * latestRemovedXid is invalid, skip conflict processing. + */ + if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid)) + ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode); + + /* + * If we have a full-page image, restore it (using a cleanup lock) and + * we're done. + */ + action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, + &buffer); + if (action == BLK_NEEDS_REDO) + { + Page page = (Page)BufferGetPage(buffer); + OffsetNumber *end; + OffsetNumber *deleted; + OffsetNumber *nowdead; + OffsetNumber *nowunused; + OffsetNumber tmp_target_off; + int ndeleted; + int ndead; + int nunused; + Size datalen; + Size tmp_spc_rqd; + + deleted = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen); + + ndeleted = xlrec->ndeleted; + ndead = xlrec->ndead; + end = (OffsetNumber *) ((char *) deleted + datalen); + nowdead = deleted + (ndeleted * 2); + nowunused = nowdead + ndead; + nunused = (end - nowunused); + Assert(nunused >= 0); + + /* Update all item pointers per the record, and repair fragmentation */ + if (xlrec->flags & XLZ_CLEAN_CONTAINS_OFFSET) + { + target_offnum = (OffsetNumber *) ((char *) xlrec + SizeOfZHeapClean); + space_required = (Size *) ((char *) target_offnum + sizeof(OffsetNumber)); + } + else + { + target_offnum = &tmp_target_off; + *target_offnum = InvalidOffsetNumber; + space_required = &tmp_spc_rqd; + *space_required = 0; + } + + zheap_page_prune_execute(buffer, *target_offnum, deleted, ndeleted, + nowdead, ndead, nowunused, nunused); + + if (xlrec->flags & XLZ_CLEAN_ALLOW_PRUNING) + { + bool pruned PG_USED_FOR_ASSERTS_ONLY = false; + Page tmppage = NULL; + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + * See comments atop zheap_page_prune_guts. 
+ */ + tmppage = PageGetTempPageCopy(BufferGetPage(buffer)); + ZPageRepairFragmentation(buffer, tmppage, *target_offnum, + *space_required, true, &pruned); + + /* + * Pruning must be successful at redo time, otherwise the page + * contents on master and standby might differ. + */ + Assert(pruned); + + /* be tidy. */ + pfree(tmppage); + } + + freespace = PageGetZHeapFreeSpace(page); /* needed to update FSM below */ + + /* + * Note: we don't worry about updating the page's prunability hints. + * At worst this will cause an extra prune cycle to occur soon. + */ + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + /* + * Update the FSM as well. + * + * XXX: Don't do this if the page was restored from full page image. We + * don't bother to update the FSM in that case, it doesn't need to be + * totally accurate anyway. + */ + if (action == BLK_NEEDS_REDO) + XLogRecordPageWithFreeSpace(rnode, blkno, freespace); +} + +/* + * Handles XLOG_ZHEAP_CONFIRM record type + */ +static void +zheap_xlog_confirm(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_zheap_confirm *xlrec = (xl_zheap_confirm *) XLogRecGetData(record); + Buffer buffer; + Page page; + OffsetNumber offnum; + ItemId lp = NULL; + ZHeapTupleHeader zhtup; + + if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO) + { + page = BufferGetPage(buffer); + + offnum = xlrec->offnum; + if (PageGetMaxOffsetNumber(page) >= offnum) + lp = PageGetItemId(page, offnum); + + if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp)) + elog(PANIC, "invalid lp"); + + zhtup = (ZHeapTupleHeader) PageGetItem(page, lp); + + if (xlrec->flags == XLZ_SPEC_INSERT_SUCCESS) + { + /* Confirm tuple as actually inserted */ + zhtup->t_infomask &= ~ZHEAP_SPECULATIVE_INSERT; + } + else + { + Assert(xlrec->flags == XLZ_SPEC_INSERT_FAILED); + ItemIdSetDead(lp); + ZPageSetPrunable(page, XLogRecGetXid(record)); + } + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); +} + +/* + * Handles XLOG_ZHEAP_UNUSED record type + */ +static void +zheap_xlog_unused(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_undo_header *xlundohdr; + xl_zheap_unused *xlrec; + UnpackedUndoRecord undorecord; + UndoRecPtr urecptr; + TransactionId xid = XLogRecGetXid(record); + uint32 xid_epoch = GetEpochForXid(xid); + uint16 i, uncnt; + Buffer buffer; + OffsetNumber *unused; + Size freespace = 0; + RelFileNode rnode; + BlockNumber blkno; + XLogRedoAction action; + + xlundohdr = (xl_undo_header *) XLogRecGetData(record); + xlrec = (xl_zheap_unused *) ((char *) xlundohdr + SizeOfUndoHeader); + /* extract the information related to unused offsets */ + unused = (OffsetNumber *) ((char *) xlrec + SizeOfZHeapUnused); + uncnt = xlrec->nunused; + + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); + + /* + * We're about to remove tuples. In Hot Standby mode, ensure that there's + * no queries running for which the removed tuples are still visible. + * + * Not all ZHEAP_UNUSED records remove tuples with xids, so we only want to + * conflict on the records that cause MVCC failures for user queries. If + * latestRemovedXid is invalid, skip conflict processing. 
+ */ + if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid)) + ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode); + + /* prepare an undo record */ + undorecord.uur_type = UNDO_ITEMID_UNUSED; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = xlundohdr->reloid; + undorecord.uur_prevxid = xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = FirstCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = xlundohdr->blkprev; + undorecord.uur_block = blkno; + undorecord.uur_offset = 0; + undorecord.uur_tuple.len = 0; + undorecord.uur_payload.len = uncnt * sizeof(OffsetNumber); + undorecord.uur_payload.data = + (char *) palloc(uncnt * sizeof(OffsetNumber)); + memcpy(undorecord.uur_payload.data, + (char *) unused, + undorecord.uur_payload.len); + + urecptr = PrepareUndoInsert(&undorecord, xid, UNDO_PERMANENT, NULL); + InsertPreparedUndo(); + + /* + * undo should be inserted at same location as it was during the actual + * insert (DO operation). + */ + Assert (urecptr == xlundohdr->urec_ptr); + + /* + * If we have a full-page image, restore it (using a cleanup lock) and + * we're done. + */ + action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, + &buffer); + if (action == BLK_NEEDS_REDO) + { + Page page = (Page) BufferGetPage(buffer); + + // ZBORKED: unsigned type, can't be smaller, compiler laments + // Assert(uncnt >= 0); + + for (i = 0; i < uncnt; i++) + { + ItemId itemid; + + itemid = PageGetItemId(page, unused[i]); + ItemIdSetUnusedExtended(itemid, xlrec->trans_slot_id); + } + + PageSetUNDO(undorecord, buffer, xlrec->trans_slot_id, false, xid_epoch, + xid, urecptr, NULL, 0); + + if (xlrec->flags & XLZ_UNUSED_ALLOW_PRUNING) + { + bool pruned PG_USED_FOR_ASSERTS_ONLY = false; + Page tmppage = NULL; + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + * See comments atop zheap_page_prune_guts. + */ + tmppage = PageGetTempPageCopy(BufferGetPage(buffer)); + ZPageRepairFragmentation(buffer, tmppage, InvalidOffsetNumber, + 0, true, &pruned); + + /* + * Pruning must be successful at redo time, otherwise the page + * contents on master and standby might differ. + */ + Assert(pruned); + + pfree(tmppage); + } + + freespace = PageGetZHeapFreeSpace(page); /* needed to update FSM below */ + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + + /* replay the record for tpd buffer */ + if (XLogRecHasBlockRef(record, 1)) + { + /* + * We need to replay the record for TPD only when this record contains + * slot from TPD. + */ + action = XLogReadTPDBuffer(record, 1); + if (action == BLK_NEEDS_REDO) + { + TPDPageSetUndo(buffer, + xlrec->trans_slot_id, + true, + xid_epoch, + xid, + urecptr, + unused, + uncnt); + TPDPageSetLSN(BufferGetPage(buffer), lsn); + } + } + + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + + /* + * Update the FSM as well. + * + * XXX: Don't do this if the page was restored from full page image. We + * don't bother to update the FSM in that case, it doesn't need to be + * totally accurate anyway. + */ + if (action == BLK_NEEDS_REDO) + XLogRecordPageWithFreeSpace(rnode, blkno, freespace); +} + +/* + * Replay XLOG_ZHEAP_VISIBLE record. 
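+ * This touches only the visibility map: after resolving any Hot Standby
+ * conflicts against cutoff_xid, it sets the requested visibility map bits
+ * for the zheap block recorded in heapBlk. The zheap page itself is not
+ * modified here.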
+ */ +static void +zheap_xlog_visible(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_zheap_visible *xlrec = (xl_zheap_visible *) XLogRecGetData(record); + Buffer vmbuffer = InvalidBuffer; + RelFileNode rnode; + + XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL); + + /* + * If there are any Hot Standby transactions running that have an xmin + * horizon old enough that this page isn't all-visible for them, they + * might incorrectly decide that an index-only scan can skip a zheap fetch. + * + * NB: It might be better to throw some kind of "soft" conflict here that + * forces any index-only scan that is in flight to perform zheap fetches, + * rather than killing the transaction outright. + */ + if (InHotStandby) + ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode); + + if (XLogReadBufferForRedoExtended(record, 0, RBM_ZERO_ON_ERROR, false, + &vmbuffer) == BLK_NEEDS_REDO) + { + Page vmpage = BufferGetPage(vmbuffer); + Relation reln; + BlockNumber blkno = xlrec->heapBlk;; + + /* initialize the page if it was read as zeros */ + if (PageIsNew(vmpage)) + PageInit(vmpage, BLCKSZ, 0); + + /* + * XLogReadBufferForRedoExtended locked the buffer. But + * visibilitymap_set will handle locking itself. + */ + LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK); + + reln = CreateFakeRelcacheEntry(rnode); + visibilitymap_pin(reln, blkno, &vmbuffer); + + /* + * Don't set the bit if replay has already passed this point. + * + * It might be safe to do this unconditionally; if replay has passed + * this point, we'll replay at least as far this time as we did + * before, and if this bit needs to be cleared, the record responsible + * for doing so should be again replayed, and clear it. For right + * now, out of an abundance of conservatism, we use the same test here + * we did for the zheap page. If this results in a dropped bit, no + * real harm is done; and the next VACUUM will fix it. 
+ */ + if (lsn > PageGetLSN(vmpage)) + visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer, + xlrec->cutoff_xid, xlrec->flags); + + ReleaseBuffer(vmbuffer); + FreeFakeRelcacheEntry(reln); + } + else if (BufferIsValid(vmbuffer)) + UnlockReleaseBuffer(vmbuffer); +} + +void +zheap_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info & XLOG_ZHEAP_OPMASK) + { + case XLOG_ZHEAP_INSERT: + zheap_xlog_insert(record); + break; + case XLOG_ZHEAP_DELETE: + zheap_xlog_delete(record); + break; + case XLOG_ZHEAP_UPDATE: + zheap_xlog_update(record); + break; + case XLOG_ZHEAP_FREEZE_XACT_SLOT: + zheap_xlog_freeze_xact_slot(record); + break; + case XLOG_ZHEAP_INVALID_XACT_SLOT: + zheap_xlog_invalid_xact_slot(record); + break; + case XLOG_ZHEAP_LOCK: + zheap_xlog_lock(record); + break; + case XLOG_ZHEAP_MULTI_INSERT: + zheap_xlog_multi_insert(record); + break; + case XLOG_ZHEAP_CLEAN: + zheap_xlog_clean(record); + break; + default: + elog(PANIC, "zheap_redo: unknown op code %u", info); + } +} + +void +zheap2_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info & XLOG_ZHEAP_OPMASK) + { + case XLOG_ZHEAP_CONFIRM: + zheap_xlog_confirm(record); + break; + case XLOG_ZHEAP_UNUSED: + zheap_xlog_unused(record); + break; + case XLOG_ZHEAP_VISIBLE: + zheap_xlog_visible(record); + break; + default: + elog(PANIC, "zheap2_redo: unknown op code %u", info); + } +} diff --git a/src/backend/access/zheap/zhio.c b/src/backend/access/zheap/zhio.c new file mode 100644 index 0000000000..df6656bbf7 --- /dev/null +++ b/src/backend/access/zheap/zhio.c @@ -0,0 +1,403 @@ +/*------------------------------------------------------------------------- + * + * zhio.c + * POSTGRES zheap access method input/output code. + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/zheap/zhio.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/tpd.h" +#include "access/visibilitymap.h" +#include "access/zheap.h" +#include "access/zhio.h" +#include "access/zhtup.h" +#include "storage/bufmgr.h" +#include "storage/freespace.h" +#include "storage/lmgr.h" +#include "storage/smgr.h" + +/* + * RelationGetBufferForZTuple + * + * Returns pinned and exclusive-locked buffer of a page in given relation + * with free space >= given len. + * + * This is quite similar to RelationGetBufferForTuple except for zheap + * specific handling. If the last page where tuple needs to be inserted is a + * TPD page, we skip it and directly extend the relation. We could instead + * check the previous page, but scanning relation backwards could be costly, + * so we avoid it for now. As we don't align tuples in zheap, use actual + * length to find the required buffer. + */ +Buffer +RelationGetBufferForZTuple(Relation relation, Size len, + Buffer otherBuffer, int options, + BulkInsertState bistate, + Buffer *vmbuffer, Buffer *vmbuffer_other) +{ + bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM); + Buffer buffer = InvalidBuffer; + Page page; + Size pageFreeSpace = 0, + saveFreeSpace = 0; + BlockNumber targetBlock, + otherBlock; + bool needLock = false; + bool recheck = true; + bool tpdPage = false; + + /* Bulk insert is not supported for updates, only inserts. 
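+ * A caller therefore passes either a BulkInsertState (insert path) or an
+ * otherBuffer (update path), never both, which is what the assertion
+ * below checks.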
*/ + Assert(otherBuffer == InvalidBuffer || !bistate); + + len = SHORTALIGN(len); + + /* + * If we're gonna fail for oversize tuple, do it right away + */ + if (len > MaxZHeapTupleSize) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("row is too big: size %zu, maximum size %zu", + len, MaxZHeapTupleSize))); + + /* Compute desired extra freespace due to fillfactor option */ + saveFreeSpace = RelationGetTargetPageFreeSpace(relation, + HEAP_DEFAULT_FILLFACTOR); + + if (otherBuffer != InvalidBuffer) + otherBlock = BufferGetBlockNumber(otherBuffer); + else + otherBlock = InvalidBlockNumber; /* just to keep compiler quiet */ + + /* + * We first try to put the tuple on the same page we last inserted a tuple + * on, as cached in the BulkInsertState or relcache entry. If that + * doesn't work, we ask the Free Space Map to locate a suitable page. + * Since the FSM's info might be out of date, we have to be prepared to + * loop around and retry multiple times. (To insure this isn't an infinite + * loop, we must update the FSM with the correct amount of free space on + * each page that proves not to be suitable.) If the FSM has no record of + * a page with enough free space, we give up and extend the relation. + * + * When use_fsm is false, we either put the tuple onto the existing target + * page or extend the relation. + */ + if (len + saveFreeSpace > MaxZHeapTupleSize) + { + /* can't fit, don't bother asking FSM */ + targetBlock = InvalidBlockNumber; + use_fsm = false; + } + else if (bistate && bistate->current_buf != InvalidBuffer) + targetBlock = BufferGetBlockNumber(bistate->current_buf); + else + targetBlock = RelationGetTargetBlock(relation); + + if (targetBlock == InvalidBlockNumber && use_fsm) + { + /* + * We have no cached target page, so ask the FSM for an initial + * target. + */ + targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); + + /* + * If the FSM knows nothing of the rel, try the last page before we + * give up and extend. This avoids one-tuple-per-page syndrome during + * bootstrapping or in a recently-started system. + */ + if (targetBlock == InvalidBlockNumber) + { + BlockNumber nblocks = RelationGetNumberOfBlocks(relation); + + /* + * In zheap, first page is always a meta page, so we need to + * skip it for tuple insertions. + */ + if (nblocks > ZHEAP_METAPAGE + 1) + targetBlock = nblocks - 1; + } + } + +loop: + while (targetBlock != InvalidBlockNumber) + { + /* + * Read and exclusive-lock the target block, as well as the other + * block if one was given, taking suitable care with lock ordering and + * the possibility they are the same block. + * + * If the page-level all-visible flag is set, caller will need to + * clear both that and the corresponding visibility map bit. However, + * by the time we return, we'll have x-locked the buffer, and we don't + * want to do any I/O while in that state. So we check the bit here + * before taking the lock, and pin the page if it appears necessary. + * Checking without the lock creates a risk of getting the wrong + * answer, so we'll have to recheck after acquiring the lock. 
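+ * The branches below differ mainly in lock ordering: whenever two buffers
+ * are involved, the lower-numbered block is locked first, so that two
+ * backends acquiring the same pair of pages cannot deadlock against each
+ * other.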
+ */ + if (otherBuffer == InvalidBuffer) + { + /* easy case */ + buffer = ReadBufferBI(relation, targetBlock, bistate); + visibilitymap_pin(relation, targetBlock, vmbuffer); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + } + else if (otherBlock == targetBlock) + { + /* also easy case */ + buffer = otherBuffer; + visibilitymap_pin(relation, targetBlock, vmbuffer); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + } + else if (otherBlock < targetBlock) + { + /* lock other buffer first */ + buffer = ReadBuffer(relation, targetBlock); + visibilitymap_pin(relation, targetBlock, vmbuffer); + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + } + else + { + /* lock target buffer first */ + buffer = ReadBuffer(relation, targetBlock); + visibilitymap_pin(relation, targetBlock, vmbuffer); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + } + + if (PageGetSpecialSize(BufferGetPage(buffer)) == MAXALIGN(sizeof(TPDPageOpaqueData))) + { + tpdPage = true; + page = BufferGetPage(buffer); + + /* If the tpd page is empty, then we can use it as an empty zheap page. */ + if (PageIsEmpty(page)) + { + ZheapInitPage(page, BufferGetPageSize(buffer)); + tpdPage = false; + } + } + + if (!tpdPage) + { + /* + * We now have the target page (and the other buffer, if any) pinned + * and locked. However, since our initial PageIsAllVisible checks + * were performed before acquiring the lock, the results might now be + * out of date, either for the selected victim buffer, or for the + * other buffer passed by the caller. In that case, we'll need to + * give up our locks, go get the pin(s) we failed to get earlier, and + * re-lock. That's pretty painful, but hopefully shouldn't happen + * often. + * + * Note that there's a small possibility that we didn't pin the page + * above but still have the correct page pinned anyway, either because + * we've already made a previous pass through this loop, or because + * caller passed us the right page anyway. + * + * Note also that it's possible that by the time we get the pin and + * retake the buffer locks, the visibility map bit will have been + * cleared by some other backend anyway. In that case, we'll have + * done a bit of extra work for no gain, but there's no real harm + * done. + * + * Fixme: GetVisibilityMapPins use PageIsAllVisible which is not + * required for zheap, so either we need to rewrite that function or + * somehow avoid the usage of that call. + */ + if (otherBuffer == InvalidBuffer || buffer <= otherBuffer) + GetVisibilityMapPins(relation, buffer, otherBuffer, + targetBlock, otherBlock, vmbuffer, + vmbuffer_other); + else + GetVisibilityMapPins(relation, otherBuffer, buffer, + otherBlock, targetBlock, vmbuffer_other, + vmbuffer); + + /* + * Now we can check to see if there's enough free space here. If so, + * we're done. + */ + page = BufferGetPage(buffer); + pageFreeSpace = PageGetZHeapFreeSpace(page); + if (len + saveFreeSpace <= pageFreeSpace) + { + /* use this page as future insert target, too */ + RelationSetTargetBlock(relation, targetBlock); + return buffer; + } + } + + /* + * Not enough space or a tpd page, so we must give up our page locks + * and pin (if any) and prepare to look elsewhere. We don't care + * which order we unlock the two buffers in, so this can be slightly + * simpler than the code above. 
+ */ + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + if (otherBuffer == InvalidBuffer) + ReleaseBuffer(buffer); + else if (otherBlock != targetBlock) + { + LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK); + ReleaseBuffer(buffer); + } + + /* + * If this a tpd page or FSM doesn't need to be updated, always fall + * out of the loop and extend. + */ + if (!use_fsm || tpdPage) + break; + + /* + * Update FSM as to condition of this page, and ask for another page + * to try. + */ + targetBlock = RecordAndGetPageWithFreeSpace(relation, + targetBlock, + pageFreeSpace, + len + saveFreeSpace); + } + + /* + * Have to extend the relation. + * + * We have to use a lock to ensure no one else is extending the rel at the + * same time, else we will both try to initialize the same new page. We + * can skip locking for new or temp relations, however, since no one else + * could be accessing them. + */ + needLock = !RELATION_IS_LOCAL(relation); + +recheck: + /* + * If we need the lock but are not able to acquire it immediately, we'll + * consider extending the relation by multiple blocks at a time to manage + * contention on the relation extension lock. However, this only makes + * sense if we're using the FSM; otherwise, there's no point. + */ + if (needLock) + { + if (!use_fsm) + LockRelationForExtension(relation, ExclusiveLock); + else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock)) + { + /* Couldn't get the lock immediately; wait for it. */ + LockRelationForExtension(relation, ExclusiveLock); + + /* + * Check if some other backend has extended a block for us while + * we were waiting on the lock. + */ + targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); + + /* + * If some other waiter has already extended the relation, we + * don't need to do so; just use the existing freespace. + */ + if (targetBlock != InvalidBlockNumber) + { + UnlockRelationForExtension(relation, ExclusiveLock); + goto loop; + } + + /* Time to bulk-extend. */ + RelationAddExtraBlocks(relation, bistate); + } + } + + /* + * In addition to whatever extension we performed above, we always add at + * least one block to satisfy our own request. + * + * XXX This does an lseek - rather expensive - but at the moment it is the + * only way to accurately determine how many blocks are in a relation. Is + * it worth keeping an accurate file length in shared memory someplace, + * rather than relying on the kernel to do it for us? + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* + * We can be certain that locking the otherBuffer first is OK, since it + * must have a lower page number. We don't lock other buffer while holding + * extension lock. See comments below. + */ + if (otherBuffer != InvalidBuffer && !needLock) + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Now acquire lock on the new page. + */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Release the file-extension lock; it's now OK for someone else to extend + * the relation some more. Note that we cannot release this lock before + * we have buffer lock on the new page, or we risk a race condition + * against vacuumlazy.c --- see comments therein. + */ + if (needLock) + UnlockRelationForExtension(relation, ExclusiveLock); + + /* + * We need to initialize the empty new page. Double-check that it really + * is empty (this should never happen, but if it does we don't want to + * risk wiping out valid data). 
+ */ + page = BufferGetPage(buffer); + + if (!PageIsNew(page)) + elog(ERROR, "page %u of relation \"%s\" should be empty but is not", + BufferGetBlockNumber(buffer), + RelationGetRelationName(relation)); + + Assert(BufferGetBlockNumber(buffer) != ZHEAP_METAPAGE); + ZheapInitPage(page, BufferGetPageSize(buffer)); + + /* + * We don't acquire lock on otherBuffer while holding extension lock as it + * can create a deadlock against extending TPD entry where we take extension + * lock while holding the heap buffer lock. See TPDAllocatePageAndAddEntry. + */ + if (needLock && + otherBuffer != InvalidBuffer && + BufferGetBlockNumber(buffer) > otherBlock) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + recheck = true; + } + if (len > PageGetZHeapFreeSpace(page)) + { + if (recheck) + goto recheck; + + /* We should not get here given the test at the top */ + elog(PANIC, "tuple is too big: size %zu", len); + } + + /* + * Remember the new page as our target for future insertions. + * + * XXX should we enter the new page into the free space map immediately, + * or just keep it for this backend's exclusive use in the short run + * (until VACUUM sees it)? Seems to depend on whether you expect the + * current backend to make more insertions or not, which is probably a + * good bet most of the time. So for now, don't add it to FSM yet. + */ + RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); + + return buffer; +} diff --git a/src/backend/access/zheap/zmultilocker.c b/src/backend/access/zheap/zmultilocker.c new file mode 100644 index 0000000000..7b39917a32 --- /dev/null +++ b/src/backend/access/zheap/zmultilocker.c @@ -0,0 +1,853 @@ +/*------------------------------------------------------------------------- + * + * zmultilocker.c + * zheap multi locker code + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/zheap/zmultilocker.c + * + * NOTES + * This file contains functions for the multi locker facilit1y of zheap. + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/tpd.h" +#include "access/xact.h" +#include "access/zmultilocker.h" +#include "storage/bufmgr.h" +#include "storage/buf_internals.h" +#include "storage/proc.h" + +static bool IsZMultiLockListMember(List *members, ZMultiLockMember *mlmember); + +/* + * ZGetMultiLockMembersForCurrentXact - Return the strongest lock mode held by + * the current transaction on a given tuple. + */ +List * +ZGetMultiLockMembersForCurrentXact(ZHeapTuple zhtup, int trans_slot, + UndoRecPtr urec_ptr) +{ + ZHeapTuple undo_tup; + UnpackedUndoRecord *urec = NULL; + ZMultiLockMember *mlmember; + List *multilockmembers = NIL; + int trans_slot_id = -1; + uint8 uur_type; + + undo_tup = zhtup; + do + { + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* If undo is discarded, we can't proceed further. */ + if (!urec) + break; + + /* If we encounter a different transaction, we shouldn't go ahead. 
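+ * Once the chain reaches an undo record written by some other
+ * transaction we stop looking, since this function only reports what the
+ * current transaction has done to the tuple.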
*/ + if (!TransactionIdIsCurrentTransactionId(urec->uur_xid)) + break; + + + uur_type = urec->uur_type; + + if (uur_type == UNDO_INSERT || uur_type == UNDO_MULTI_INSERT) + { + /* + * We are done, once we are at the end of current chain. We + * consider the chain has ended when we reach the root tuple. + */ + break; + } + + /* don't free the tuple passed by caller */ + undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id, NULL, + (undo_tup) == (zhtup) ? false : true, + NULL); + + if (uur_type == UNDO_XID_LOCK_ONLY || + uur_type == UNDO_XID_LOCK_FOR_UPDATE || + uur_type == UNDO_XID_MULTI_LOCK_ONLY) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->trans_slot_id = trans_slot; + mlmember->mode = *((LockTupleMode *) urec->uur_payload.data); + multilockmembers = lappend(multilockmembers, mlmember); + } + else if (uur_type == UNDO_UPDATE || + uur_type == UNDO_INPLACE_UPDATE) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->trans_slot_id = trans_slot; + + if (ZHEAP_XID_IS_EXCL_LOCKED(undo_tup->t_data->t_infomask)) + mlmember->mode = LockTupleExclusive; + else + mlmember->mode = LockTupleNoKeyExclusive; + + multilockmembers = lappend(multilockmembers, mlmember); + } + else if (uur_type == UNDO_DELETE) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->trans_slot_id = trans_slot; + mlmember->mode = LockTupleExclusive; + multilockmembers = lappend(multilockmembers, mlmember); + } + else + { + /* Should not reach here */ + Assert(0); + } + + if (trans_slot_id == ZHTUP_SLOT_FROZEN) + { + /* + * We are done, once the the undo record suggests that prior + * record is already discarded. + * + * Note that we record the lock mode for all these cases because + * the lock mode stored in undo tuple is for the current + * transaction. + */ + break; + } + urec_ptr = urec->uur_blkprev; + + UndoRecordRelease(urec); + urec = NULL; + } while (UndoRecPtrIsValid(urec_ptr)); + + if (urec) + { + UndoRecordRelease(urec); + urec = NULL; + } + if (undo_tup && undo_tup != zhtup) + pfree(undo_tup); + + return multilockmembers; +} + +/* + * ZGetMultiLockMembers - Return the list of members that have locked a + * particular tuple. + * + * This function returns the list of in-progress, committed or aborted + * transactions. The purpose of returning committed or aborted transactions + * is that some of the callers want to take some specific action for + * such transactions if they have updated the tuple. + */ +List * +ZGetMultiLockMembers(Relation rel, ZHeapTuple zhtup, Buffer buf, + bool nobuflock) +{ + ZHeapTuple undo_tup; + UnpackedUndoRecord *urec = NULL; + UndoRecPtr urec_ptr; + ZMultiLockMember *mlmember; + List *multilockmembers = NIL; + TransInfo *trans_slots = NULL; + TransactionId xid; + SubTransactionId subxid = InvalidSubTransactionId; + uint64 epoch_xid; + uint32 epoch; + int prev_trans_slot_id, + trans_slot_id; + uint8 uur_type; + int slot_no; + int total_trans_slots = 0; + BlockNumber tpd_blkno = InvalidBlockNumber; + + if (nobuflock) + { + ItemId lp; + + LockBuffer(buf, BUFFER_LOCK_SHARE); + lp = PageGetItemId(BufferGetPage(buf), + ItemPointerGetOffsetNumber(&zhtup->t_self)); + /* + * It is quite possible that once we reacquire the lock on buffer, + * some other backend would have deleted the tuple and in such case, + * we don't need to do anything. 
However, the tuple can't be pruned + * because the current snapshot must predates the transaction that + * removes the tuple. + */ + Assert(!ItemIdIsDead(lp)); + if (ItemIdIsDeleted(lp)) + { + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + return NIL; + } + } + + trans_slots = GetTransactionsSlotsForPage(rel, buf, &total_trans_slots, + &tpd_blkno); + + if (nobuflock) + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + for (slot_no = 0; slot_no < total_trans_slots; slot_no++) + { + bool first_urp = true; + epoch = trans_slots[slot_no].xid_epoch; + xid = trans_slots[slot_no].xid; + + epoch_xid = MakeEpochXid((uint64)epoch, xid); + + /* + * We need to process the undo chain only for in-progress + * transactions. + */ + if (epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + continue; + + urec_ptr = trans_slots[slot_no].urec_ptr; + trans_slot_id = slot_no + 1; + undo_tup = zhtup; + + /* + * If the page contains TPD slots and it's not pruned, the last slot + * contains the information about the corresponding TPD entry. + * Hence, if current slot refers to some TPD slot, we should skip + * the last slot in the page by increasing the slot index by 1. + */ + if ((trans_slot_id >= ZHEAP_PAGE_TRANS_SLOTS) && + BlockNumberIsValid(tpd_blkno)) + trans_slot_id += 1; + + do + { + UndoLogControl *log = NULL; + + /* + * After we release the buffer lock, the transaction can be + * rolled-back and undo record poiner can be re-winded. Ensure + * that undo record pointer is sane by acquiring rewind lock so + * that undo worker can't rewind it concurrently. + * + * It is sufficient to verify the first undo record of slot as + * the previous one's can't be re-wounded. + * + * If we already have a buf LOCK, then there is no need to verify + * undo record pointer as rollback can't rewind till the undo actions + * are applied. + */ + if (nobuflock && first_urp) + { + log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr)); + + /* + * Acquire rewind lock to prevent rewinding the undo record + * pointer while we are fetching the undo record. + */ + LWLockAcquire(&log->rewind_lock, LW_SHARED); + + /* Lock the buffer */ + LockBuffer(buf, BUFFER_LOCK_SHARE); + + /* + * We can release the buffer lock after reading the slot + * information as we already hold the rewind lock, so the undo + * can't be re-winded. Although, it can be discarded but we have + * handling for the same. + */ + trans_slot_id = GetTransactionSlotInfo(buf, + InvalidOffsetNumber, + trans_slot_id, + &epoch, + &xid, + &urec_ptr, + true, + true); + + /* Release the buffer lock */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + epoch_xid = MakeEpochXid((uint64)epoch, xid); + + /* + * We need to process the undo chain only for in-progress + * transactions. + */ + if (epoch_xid < pg_atomic_read_u64( + &ProcGlobal->oldestXidWithEpochHavingUndo)) + { + LWLockRelease(&log->rewind_lock); + break; + } + } + + prev_trans_slot_id = trans_slot_id; + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&undo_tup->t_self), + ItemPointerGetOffsetNumber(&undo_tup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + if (nobuflock && first_urp) + { + first_urp = false; + LWLockRelease(&log->rewind_lock); + } + + /* If undo is discarded, we can't proceed further. */ + if (!urec) + break; + + ZHeapTupleGetSubXid(undo_tup, buf, urec_ptr, &subxid); + + /* + * Exclude undo records inserted by my own transaction. We neither + * need to check conflicts with them nor need to wait for them. 
+ */ + if (TransactionIdEquals(urec->uur_xid, GetTopTransactionIdIfAny())) + { + urec_ptr = urec->uur_blkprev; + UndoRecordRelease(urec); + urec = NULL; + continue; + } + + uur_type = urec->uur_type; + + if (uur_type == UNDO_INSERT || uur_type == UNDO_MULTI_INSERT) + { + /* + * We are done, once we are at the end of current chain. We + * consider the chain has ended when we reach the root tuple. + */ + break; + } + + /* don't free the tuple passed by caller */ + undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id, + NULL, (undo_tup) == (zhtup) ? false : true, + BufferGetPage(buf)); + + if (uur_type == UNDO_XID_LOCK_ONLY || + uur_type == UNDO_XID_LOCK_FOR_UPDATE || + uur_type == UNDO_XID_MULTI_LOCK_ONLY) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->subxid = subxid; + mlmember->trans_slot_id = prev_trans_slot_id; + mlmember->mode = *((LockTupleMode *) urec->uur_payload.data); + multilockmembers = lappend(multilockmembers, mlmember); + } + else if (uur_type == UNDO_UPDATE || + uur_type == UNDO_INPLACE_UPDATE) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->subxid = subxid; + mlmember->trans_slot_id = prev_trans_slot_id; + + if (ZHEAP_XID_IS_EXCL_LOCKED(undo_tup->t_data->t_infomask)) + mlmember->mode = LockTupleExclusive; + else + mlmember->mode = LockTupleNoKeyExclusive; + + multilockmembers = lappend(multilockmembers, mlmember); + } + else if (uur_type == UNDO_DELETE) + { + mlmember = (ZMultiLockMember *) palloc(sizeof(ZMultiLockMember)); + mlmember->xid = urec->uur_xid; + mlmember->subxid = subxid; + mlmember->trans_slot_id = prev_trans_slot_id; + mlmember->mode = LockTupleExclusive; + multilockmembers = lappend(multilockmembers, mlmember); + } + else + { + /* Should not reach here */ + Assert(0); + } + + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + trans_slot_id != prev_trans_slot_id) + { + /* + * We are done, once the the undo record suggests that prior + * record is already discarded or the prior record belongs to + * a different transaction slot chain. + */ + break; + } + + /* + * We allow to move backwards in the chain even when we + * encountered undo record of committed transaction + * (ZHeapTupleHasInvalidXact(undo_tup->t_data)). + */ + urec_ptr = urec->uur_blkprev; + + UndoRecordRelease(urec); + urec = NULL; + } while (UndoRecPtrIsValid(urec_ptr)); + + if (urec) + { + UndoRecordRelease(urec); + urec = NULL; + } + + if (undo_tup && undo_tup != zhtup) + pfree(undo_tup); + } + + /* be tidy */ + pfree(trans_slots); + + return multilockmembers; +} + +/* + * ZMultiLockMembersWait - Wait for all the members to end. + * + * This function also applies the undo actions for aborted transactions. 
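+ * It returns false only when 'nowait' is set and some conflicting member
+ * could not be waited upon. If 'remaining' is given, it reports how many
+ * members were skipped: the current transaction's own locks plus
+ * in-progress lockers whose mode does not conflict with 'required_mode'.
+ * A rough usage sketch (illustrative only, not taken from an actual
+ * caller; 'upd_xact_aborted' is a local bool):
+ *
+ *   members = ZGetMultiLockMembers(rel, zhtup, buf, true);
+ *   ZMultiLockMembersWait(rel, members, zhtup, buf, InvalidTransactionId,
+ *                         LockTupleExclusive, false, XLTW_Lock,
+ *                         NULL, &upd_xact_aborted);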
+ */ +bool +ZMultiLockMembersWait(Relation rel, List *mlmembers, ZHeapTuple zhtup, + Buffer buf, TransactionId update_xact, + LockTupleMode required_mode, bool nowait, + XLTW_Oper oper, int *remaining, bool *upd_xact_aborted) +{ + bool result = true; + ListCell *lc; + BufferDesc *bufhdr PG_USED_FOR_ASSERTS_ONLY; + int remain = 0; + + bufhdr = GetBufferDescriptor(buf - 1); + /* buffer must be unlocked */ + Assert(!LWLockHeldByMe(BufferDescriptorGetContentLock(bufhdr))); + + *upd_xact_aborted = false; + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + TransactionId memxid = mlmember->xid; + SubTransactionId memsubxid = mlmember->subxid; + LockTupleMode memmode = mlmember->mode; + + if (TransactionIdIsCurrentTransactionId(memxid)) + { + remain++; + continue; + } + + if (!DoLockModesConflict(HWLOCKMODE_from_locktupmode(memmode), + HWLOCKMODE_from_locktupmode(required_mode))) + { + if (remaining && TransactionIdIsInProgress(memxid)) + remain++; + continue; + } + + /* + * This member conflicts with our multi, so we have to sleep (or + * return failure, if asked to avoid waiting.) + */ + if (memsubxid != InvalidSubTransactionId) + { + if (nowait) + { + result = ConditionalSubXactLockTableWait(memxid, memsubxid); + if (!result) + break; + } + else + SubXactLockTableWait(memxid, memsubxid, rel, &zhtup->t_self, + oper); + } + else if (nowait) + { + result = ConditionalXactLockTableWait(memxid); + if (!result) + break; + } + else + XactLockTableWait(memxid, rel, &zhtup->t_self, oper); + + /* + * For aborted transaction, if the undo actions are not applied yet, + * then apply them before modifying the page. + */ + if (TransactionIdDidAbort(memxid)) + { + LockBuffer(buf, BUFFER_LOCK_SHARE); + zheap_exec_pending_rollback(rel, buf, mlmember->trans_slot_id, memxid); + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + if (TransactionIdIsValid(update_xact) && memxid == update_xact) + *upd_xact_aborted = true; + } + } + + if (remaining) + *remaining = remain; + + return result; +} + +/* + * ConditionalZMultiLockMembersWait + * As above, but only lock if we can get the lock without blocking. + */ +bool +ConditionalZMultiLockMembersWait(Relation rel, List *mlmembers, + Buffer buf, TransactionId update_xact, + LockTupleMode required_mode, int *remaining, + bool *upd_xact_aborted) +{ + return ZMultiLockMembersWait(rel, mlmembers, NULL, buf, update_xact, + required_mode, true, XLTW_None, remaining, + upd_xact_aborted); +} + +/* + * ZIsAnyMultiLockMemberRunning - Check if any multi lock member is running. + * + * Returns true, if any member of the multi lock is running, false otherwise. + * + * Unlike heap, we don't consider current transaction's lockers to decide + * if the lockers of multi lock are running. In heap, any lock taken by + * subtransaction is recorded separetly in the multixact, so that it can + * detect if the subtransaction is rolled back. Now as the lock information + * is tracked at subtransaction level, we can't ignore the lockers for + * subtransactions of current top-level transaction. For zheap, rollback to + * subtransaction will rewind the undo and the lockers information will + * be automatically removed, so we don't need to track subtransaction lockers + * separately and hence we can ignore lockers of current top-level + * transaction. 
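+ * + * The check itself is straightforward: lockers in local buffers are ignored + * (no other session can access them), and otherwise we simply walk the + * mlmembers list and return true as soon as TransactionIdIsInProgress() + * reports a running member.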
+ */ +bool +ZIsAnyMultiLockMemberRunning(List *mlmembers, ZHeapTuple zhtup, Buffer buf) +{ + ListCell *lc; + BufferDesc *bufhdr PG_USED_FOR_ASSERTS_ONLY; + + bufhdr = GetBufferDescriptor(buf - 1); + + /* + * Local buffers can't be accesed by other sessions. + */ + if (BufferIsLocal(buf)) + return false; + + /* buffer must be locked by caller */ + Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufhdr))); + + if (list_length(mlmembers) <= 0) + { + elog(DEBUG2, "ZIsRunning: no members"); + return false; + } + + foreach(lc, mlmembers) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + TransactionId memxid = mlmember->xid; + + if (TransactionIdIsInProgress(memxid)) + { + elog(DEBUG2, "ZIsRunning: member %d is running", memxid); + return true; + } + } + + elog(DEBUG2, "ZIsRunning: no members are running"); + + return false; +} + +/* + * IsZMultiLockListMember - Returns true iff mlmember is a member of list + * mlmembers. Equality is determined by comparing all the variables of + * member. + */ +static bool +IsZMultiLockListMember(List *members, ZMultiLockMember *mlmember) +{ + ListCell *lc; + + foreach(lc, members) + { + ZMultiLockMember *lc_member = (ZMultiLockMember *) lfirst(lc); + + if (lc_member->xid == mlmember->xid && + lc_member->trans_slot_id == mlmember->trans_slot_id && + lc_member->mode == mlmember->mode) + return true; + } + + return false; +} + +/* + * ZMultiLockMembersSame - Returns true, iff all the members in list2 list + * are present in list1 list + */ +bool +ZMultiLockMembersSame(List *list1, List *list2) +{ + ListCell *lc; + + if (list_length(list2) > list_length(list1)) + return false; + + foreach(lc, list2) + { + ZMultiLockMember *mlmember = (ZMultiLockMember *) lfirst(lc); + + if (!IsZMultiLockListMember(list1, mlmember)) + return false; + } + + return true; +} + +/* + * ZGetMultiLockInfo - Helper function for compute_new_xid_infomask to + * get the multi lockers information. + */ +void +ZGetMultiLockInfo(uint16 old_infomask, TransactionId tup_xid, + int tup_trans_slot, TransactionId add_to_xid, + uint16 *new_infomask, int *new_trans_slot, + LockTupleMode *mode, bool *old_tuple_has_update, + bool is_update) +{ + LockTupleMode old_mode; + + old_mode = get_old_lock_mode(old_infomask); + + /* We want to propagate the updaters information for lockers only. */ + if (!is_update && IsZHeapTupleModified(old_infomask) && + !ZHEAP_XID_IS_LOCKED_ONLY(old_infomask)) + { + *old_tuple_has_update = true; + + if (ZHeapTupleIsInPlaceUpdated(old_infomask)) + { + *new_infomask |= ZHEAP_INPLACE_UPDATED; + } + else + { + Assert(ZHeapTupleIsUpdated(old_infomask)); + *new_infomask |= ZHEAP_UPDATED; + } + } + + if (tup_xid == add_to_xid) + { + if (ZHeapTupleHasMultiLockers(old_infomask)) + *new_infomask |= ZHEAP_MULTI_LOCKERS; + + /* acquire the strongest of both */ + if (*mode < old_mode) + *mode = old_mode; + } + else + { + *new_infomask |= ZHEAP_MULTI_LOCKERS; + + /* + * Acquire the strongest of both and keep the transaction slot of + * the stronger lock. + */ + if (*mode < old_mode) + { + *mode = old_mode; + } + + /* For lockers, we want to store the updater's transaction slot. */ + if (!is_update) + *new_trans_slot = tup_trans_slot; + } +} + +/* + * GetLockerTransInfo - Retrieve the transaction information of single locker + * from undo. + * + * If the locker is already committed or too-old, we consider as if it didn't + * exist at all. + * + * The caller must have a lock on the buffer (buf). 
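+ * + * Returns true and fills whichever of the output parameters (trans_slot, + * epoch_xid_out, xid_out, cid_out, urec_ptr_out) are requested when such a + * locker is found; otherwise returns false and leaves the output parameters + * untouched.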
+ */ +bool +GetLockerTransInfo(Relation rel, ZHeapTuple zhtup, Buffer buf, + int *trans_slot, uint64 *epoch_xid_out, + TransactionId *xid_out, CommandId *cid_out, + UndoRecPtr *urec_ptr_out) +{ + UnpackedUndoRecord *urec = NULL; + UndoRecPtr urec_ptr; + UndoRecPtr save_urec_ptr = InvalidUndoRecPtr; + TransInfo *trans_slots = NULL; + TransactionId xid; + CommandId cid = InvalidCommandId; + uint64 epoch; + uint64 epoch_xid; + int trans_slot_id; + uint8 uur_type; + int slot_no; + int total_trans_slots = 0; + bool found = false; + BlockNumber tpd_blkno; + + trans_slots = GetTransactionsSlotsForPage(rel, buf, &total_trans_slots, + &tpd_blkno); + + for (slot_no = 0; slot_no < total_trans_slots; slot_no++) + { + epoch = trans_slots[slot_no].xid_epoch; + xid = trans_slots[slot_no].xid; + + epoch_xid = MakeEpochXid(epoch, xid); + + /* + * We need to process the undo chain only for in-progress + * transactions. + */ + if (epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo) || + (!TransactionIdIsInProgress(xid) && TransactionIdDidCommit(xid))) + continue; + + save_urec_ptr = urec_ptr = trans_slots[slot_no].urec_ptr; + + do + { + UndoRecPtr out_urec_ptr PG_USED_FOR_ASSERTS_ONLY; + + out_urec_ptr = InvalidUndoRecPtr; + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + InvalidTransactionId, + &out_urec_ptr, + ZHeapSatisfyUndoRecord); + + /* + * We couldn't find any undo record for the tuple corresponding + * to current slot. + */ + if (urec == NULL) + { + /* Make sure we've reached the end of current undo chain. */ + Assert(!out_urec_ptr); + break; + } + + cid = urec->uur_cid; + + /* + * If the current transaction has locked the tuple, then we don't need + * to process the undo records. + */ + if (TransactionIdEquals(urec->uur_xid, GetTopTransactionIdIfAny())) + { + found = true; + break; + } + + uur_type = urec->uur_type; + + if (uur_type == UNDO_INSERT || uur_type == UNDO_MULTI_INSERT) + { + /* + * We are done, once we are at the end of current chain. We + * consider the chain has ended when we reach the root tuple. + */ + break; + } + + if (uur_type == UNDO_XID_LOCK_ONLY || + uur_type == UNDO_XID_LOCK_FOR_UPDATE) + { + found = true; + break; + } + + if (xid != urec->uur_xid) + { + /* + * We are done, once the the undo record suggests that prior + * tuple version is modified by a different transaction. + */ + break; + } + + urec_ptr = urec->uur_blkprev; + + UndoRecordRelease(urec); + urec = NULL; + } while (UndoRecPtrIsValid(urec_ptr)); + + if (urec) + { + UndoRecordRelease(urec); + urec = NULL; + } + + if (found) + { + /* Transaction slots in the page start from 1. */ + trans_slot_id = slot_no + 1; + + /* + * If the page contains TPD slots and it's not pruned, the last slot + * contains the information about the corresponding TPD entry. + * Hence, if current slot refers to some TPD slot, we should skip + * the last slot in the page by increasing the slot index by 1. + */ + if ((trans_slot_id >= ZHEAP_PAGE_TRANS_SLOTS) && + BlockNumberIsValid(tpd_blkno)) + trans_slot_id += 1; + + break; + } + } + + /* be tidy */ + pfree(trans_slots); + + /* + * If found, we return the corresponding transaction information. Else, we + * return the same information as passed as arguments. + */ + if (found) + { + /* Set the value of required parameters. 
*/ + if (trans_slot) + *trans_slot = trans_slot_id; + if (epoch_xid_out) + *epoch_xid_out = MakeEpochXid(epoch, xid); + if (xid_out) + *xid_out = xid; + if (cid_out) + *cid_out = cid; + if (urec_ptr_out) + *urec_ptr_out = save_urec_ptr; + } + + return found; +} diff --git a/src/backend/access/zheap/ztqual.c b/src/backend/access/zheap/ztqual.c new file mode 100644 index 0000000000..0f615fa03b --- /dev/null +++ b/src/backend/access/zheap/ztqual.c @@ -0,0 +1,2582 @@ +/*------------------------------------------------------------------------- + * + * ztqual.c + * POSTGRES "time qualification" code, ie, ztuple visibility rules. + * + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/utils/time/ztqual.c + * + * The core idea to check if the tuple is all-visible is to see if it is + * modified by transaction smaller than oldestXidWithEpochHavingUndo (aka + * there is no undo pending for the transaction) or if the transaction slot + * is frozen. For undo tuples, we additionally check if the transaction id + * of a transaction that has modified the tuple is FrozenTransactionId. The + * idea is we will always check the visibility of latest tuple based on + * epoch+xid and undo tuple's visibility based on xid. If the heap tuple is + * not all-visible (epoch+xid is not older than oldestXidWithEpochHavingUndo), + * then the xid corresponding to undo tuple must be in the range of 2-billion + * transactions with oldestXidHavingUndo (xid part in + * oldestXidWithEpochHavingUndo). This is true because we don't allow undo + * records older than 2-billion transactions. + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/subtrans.h" +#include "access/xact.h" +#include "access/zheap.h" +#include "access/zheaputils.h" +#include "access/zmultilocker.h" +#include "storage/bufmgr.h" +#include "storage/proc.h" +#include "storage/procarray.h" +#include "utils/tqual.h" +#include "utils/ztqual.h" +#include "storage/proc.h" + + +static ZHeapTuple GetTupleFromUndo(UndoRecPtr urec_ptr, ZHeapTuple zhtup, + Snapshot snapshot, Buffer buffer, + ItemPointer ctid, int trans_slot_id, + TransactionId prev_undo_xid); +static ZHeapTuple +GetTupleFromUndoForAbortedXact(UndoRecPtr urec_ptr, Buffer buffer, int trans_slot, + ZHeapTuple ztuple,TransactionId *xid); + +/* + * FetchTransInfoFromUndo - Retrieve transaction information of transaction + * that has modified the undo tuple. + */ +void +FetchTransInfoFromUndo(ZHeapTuple undo_tup, uint64 *epoch, TransactionId *xid, + CommandId *cid, UndoRecPtr *urec_ptr, bool skip_lockers) +{ + UnpackedUndoRecord *urec; + UndoRecPtr urec_ptr_out = InvalidUndoRecPtr; + TransactionId undo_tup_xid; + +fetch_prior_undo: + undo_tup_xid = *xid; + + /* + * The transaction slot referred by the undo tuple could have been reused + * multiple times, so to ensure that we have fetched the right undo record + * we need to verify that the undo record contains xid same as the xid + * that has modified the tuple. + */ + urec = UndoFetchRecord(*urec_ptr, + ItemPointerGetBlockNumber(&undo_tup->t_self), + ItemPointerGetOffsetNumber(&undo_tup->t_self), + undo_tup_xid, + &urec_ptr_out, + ZHeapSatisfyUndoRecord); + + /* + * The undo tuple must be visible, if the undo record containing + * the information of the last transaction that has updated the + * tuple is discarded. 
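+ * In that case we report that there is no interesting modifier at all: the + * epoch is set to 0 and the xid, cid and undo record pointer are set to + * their Invalid values below, which callers treat as all-visible.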
+ */ + if (urec == NULL) + { + if (epoch) + *epoch = 0; + if (xid) + *xid = InvalidTransactionId; + if (cid) + *cid = InvalidCommandId; + if (urec_ptr) + *urec_ptr = InvalidUndoRecPtr; + return; + } + + /* + * If we reach here, this means the transaction id that has + * last modified this tuple must be in 2-billion xid range + * of oldestXidHavingUndo, so we can get compute its epoch + * as we do for current transaction. + */ + if (epoch) + *epoch = GetEpochForXid(urec->uur_xid); + *xid = urec->uur_xid; + *cid = urec->uur_cid; + *urec_ptr = urec_ptr_out; + + if (skip_lockers && + (urec->uur_type == UNDO_XID_LOCK_ONLY || + urec->uur_type == UNDO_XID_LOCK_FOR_UPDATE || + urec->uur_type == UNDO_XID_MULTI_LOCK_ONLY)) + { + *xid = InvalidTransactionId; + *urec_ptr = urec->uur_blkprev; + UndoRecordRelease(urec); + goto fetch_prior_undo; + } + + UndoRecordRelease(urec); +} + +/* + * ZHeapPageGetNewCtid + * + * This should be called for ctid which is already set deleted to get the new + * ctid, xid and cid which modified the given one. + */ +void +ZHeapPageGetNewCtid(Buffer buffer, ItemPointer ctid, TransactionId *xid, + CommandId *cid) +{ + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + int trans_slot; + int vis_info; + uint64 epoch; + ItemId lp; + Page page; + OffsetNumber offnum = ItemPointerGetOffsetNumber(ctid); + int out_slot_no PG_USED_FOR_ASSERTS_ONLY; + + page = BufferGetPage(buffer); + lp = PageGetItemId(page, offnum); + + Assert(ItemIdIsDeleted(lp)); + + trans_slot = ItemIdGetTransactionSlot(lp); + vis_info = ItemIdGetVisibilityInfo(lp); + + if (vis_info & ITEMID_XACT_INVALID) + { + ZHeapTupleData undo_tup; + ItemPointerSetBlockNumber(&undo_tup.t_self, + BufferGetBlockNumber(buffer)); + ItemPointerSetOffsetNumber(&undo_tup.t_self, offnum); + + /* + * We need undo record pointer to fetch the transaction information + * from undo. + */ + out_slot_no = GetTransactionSlotInfo(buffer, offnum, trans_slot, + (uint32 *) &epoch, xid, &urec_ptr, + true, false); + *xid = InvalidTransactionId; + FetchTransInfoFromUndo(&undo_tup, &epoch, xid, cid, &urec_ptr, false); + } + else + { + out_slot_no = GetTransactionSlotInfo(buffer, offnum, trans_slot, + (uint32 *) &epoch, xid, &urec_ptr, + true, false); + *cid = ZHeapPageGetCid(buffer, trans_slot, (uint32) epoch, *xid, + urec_ptr, offnum); + } + + /* + * We always expect non-frozen transaction slot here as the caller tries + * to fetch the ctid of tuples that are visible to the snapshot, so + * corresponding undo record can't be discarded. + */ + Assert(out_slot_no != ZHTUP_SLOT_FROZEN); + + ZHeapPageGetCtid(trans_slot, buffer, urec_ptr, ctid); +} + +/* + * GetVisibleTupleIfAny + * + * This is a helper function for GetTupleFromUndoWithOffset. 
+ */ +static ZHeapTuple +GetVisibleTupleIfAny(UndoRecPtr prev_urec_ptr, ZHeapTuple undo_tup, + Snapshot snapshot, Buffer buffer, TransactionId xid, + int trans_slot_id, CommandId cid) +{ + int undo_oper = -1; + TransactionId oldestXidHavingUndo; + + if (undo_tup->t_data->t_infomask & ZHEAP_INPLACE_UPDATED) + { + undo_oper = ZHEAP_INPLACE_UPDATED; + } + else if (undo_tup->t_data->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + undo_oper = ZHEAP_XID_LOCK_ONLY; + } + else + { + /* we can't further operate on deleted or non-inplace-updated tuple */ + Assert(!(undo_tup->t_data->t_infomask & ZHEAP_DELETED) || + !(undo_tup->t_data->t_infomask & ZHEAP_UPDATED)); + } + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + + /* + * We need to fetch all the transaction related information from undo + * record for the tuples that point to a slot that gets invalidated for + * reuse at some point of time. See PageFreezeTransSlots. + */ + if ((trans_slot_id != ZHTUP_SLOT_FROZEN) && + !TransactionIdEquals(xid, FrozenTransactionId) && + !TransactionIdPrecedes(xid, oldestXidHavingUndo)) + { + if (ZHeapTupleHasInvalidXact(undo_tup->t_data->t_infomask)) + { + FetchTransInfoFromUndo(undo_tup, NULL, &xid, &cid, &prev_urec_ptr, false); + } + /* + * If we already have a valid cid then don't fetch it from the undo. + * This is the case when an old locker got transferred to the newly + * inserted tuple of a non-inplace update. In that case the undo chain + * will not have a separate undo record for the locker, so we have to + * use the cid we got from the insert undo record because in this + * case the actual previous version of the locker is insert only and + * that is what we are interested in. + */ + else if (cid == InvalidCommandId) + { + /* + * We don't use prev_undo_xid to fetch the undo record for cid as it is + * required only when the transaction is the current transaction, in which + * case there is no risk of transaction chain switching, so we are safe. It + * might be better to move this check near its usage, but that will + * make the code look ugly, so keeping it here. + */ + cid = ZHeapTupleGetCid(undo_tup, buffer, prev_urec_ptr, trans_slot_id); + } + } + + /* + * The tuple must be all visible if the transaction slot is cleared or + * the latest xid that has changed the tuple is so old that it is all-visible + * or it precedes the smallest xid that has undo.
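+ * Concretely, the test below treats the tuple as all-visible when the slot + * is ZHTUP_SLOT_FROZEN, the xid is FrozenTransactionId, or the xid precedes + * oldestXidHavingUndo (so no undo can remain for it).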
+ */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + return undo_tup; + + if (undo_oper == ZHEAP_INPLACE_UPDATED || + undo_oper == ZHEAP_XID_LOCK_ONLY) + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (IsMVCCSnapshot(snapshot) && cid >= snapshot->curcid) + { + /* updated/locked after scan started */ + return GetTupleFromUndo(prev_urec_ptr, + undo_tup, + snapshot, + buffer, + NULL, + trans_slot_id, + xid); + } + else + return undo_tup; /* updated before scan started */ + } + else if (IsMVCCSnapshot(snapshot) && XidInMVCCSnapshot(xid, snapshot)) + return GetTupleFromUndo(prev_urec_ptr, + undo_tup, + snapshot, + buffer, + NULL, + trans_slot_id, + xid); + else if (!IsMVCCSnapshot(snapshot) && TransactionIdIsInProgress(xid)) + return GetTupleFromUndo(prev_urec_ptr, + undo_tup, + snapshot, + buffer, + NULL, + trans_slot_id, + xid); + else if (TransactionIdDidCommit(xid)) + return undo_tup; + else + return GetTupleFromUndo(prev_urec_ptr, + undo_tup, + snapshot, + buffer, + NULL, + trans_slot_id, + xid); + } + else /* undo tuple is the root tuple */ + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (IsMVCCSnapshot(snapshot) && cid >= snapshot->curcid) + return NULL; /* inserted after scan started */ + else + return undo_tup; /* inserted before scan started */ + } + else if (IsMVCCSnapshot(snapshot) && XidInMVCCSnapshot(xid, snapshot)) + return NULL; + else if (!IsMVCCSnapshot(snapshot) && TransactionIdIsInProgress(xid)) + return NULL; + else if (TransactionIdDidCommit(xid)) + return undo_tup; + else + return NULL; + } +} + +/* + * GetTupleFromUndoForAbortedXact + * + * This is used to fetch the prior committed version of the tuple which is + * modified by an aborted xact. + * + * It returns the prior committed version of the tuple, if available. Else, + * returns NULL. + * + * The caller must send a palloc'ed tuple. This function can get a tuple + * from undo to return in which case it will free the memory passed by + * the caller. + * + * xid is an output parameter. It is set to the latest committed xid that + * inserted/in-place-updated the tuple. If the aborted transaction inserted + * the tuple itself, we return the same transaction id. The caller *should* + * handle the same scenario. + */ +static ZHeapTuple +GetTupleFromUndoForAbortedXact(UndoRecPtr urec_ptr, Buffer buffer, int trans_slot, + ZHeapTuple ztuple,TransactionId *xid) +{ + ZHeapTuple undo_tup = ztuple; + UnpackedUndoRecord *urec; + UndoRecPtr prev_urec_ptr; + TransactionId prev_undo_xid PG_USED_FOR_ASSERTS_ONLY; + TransactionId oldestXidHavingUndo = InvalidTransactionId; + int trans_slot_id; + int prev_trans_slot_id = trans_slot; + + prev_undo_xid = InvalidTransactionId; +fetch_prior_undo_record: + prev_urec_ptr = InvalidUndoRecPtr; + trans_slot_id = InvalidXactSlotId; + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&undo_tup->t_self), + ItemPointerGetOffsetNumber(&undo_tup->t_self), + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* If undo is discarded, then current tuple is visible. */ + if (urec == NULL) + return undo_tup; + + /* Here, we free the previous version and palloc a new tuple from undo. 
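+ * (CopyTupleFromUndoRecord is called below with its free flag - the fifth + * argument - set to true, so the tuple passed in is released once the copy + * built from the undo record is available.)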
*/ + undo_tup = CopyTupleFromUndoRecord(urec, undo_tup, &trans_slot_id, NULL, + true, BufferGetPage(buffer)); + + prev_urec_ptr = urec->uur_blkprev; + *xid = urec->uur_prevxid; + + UndoRecordRelease(urec); + + /* we can't further operate on deleted or non-inplace-updated tuple */ + Assert(!((undo_tup->t_data->t_infomask & ZHEAP_DELETED) || + (undo_tup->t_data->t_infomask & ZHEAP_UPDATED))); + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(*xid, FrozenTransactionId) || + TransactionIdPrecedes(*xid, oldestXidHavingUndo)) + { + return undo_tup; + } + + /* + * If we got a tuple modified by a committed transaction, return it. + */ + if (TransactionIdDidCommit(*xid)) + return undo_tup; + + /* + * If the tuple points to a slot that gets invalidated for reuse at some + * point of time, then undo_tup is the latest committed version of the tuple. + */ + if (ZHeapTupleHasInvalidXact(undo_tup->t_data->t_infomask)) + return undo_tup; + + /* + * If the undo tuple is stamped with a different transaction, then either + * the previous transaction is committed or tuple must be locked only. In both + * cases, we can return the tuple fetched from undo. + */ + if (trans_slot_id != prev_trans_slot_id) + { + (void) GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &prev_urec_ptr, + true, + true); + FetchTransInfoFromUndo(undo_tup, NULL, xid, NULL, &prev_urec_ptr, false); + + Assert(TransactionIdDidCommit(*xid) || + ZHEAP_XID_IS_LOCKED_ONLY(undo_tup->t_data->t_infomask)); + + return undo_tup; + } + + /* transaction must be aborted. */ + Assert(!TransactionIdIsCurrentTransactionId(*xid)); + Assert(!TransactionIdIsInProgress(*xid)); + Assert(TransactionIdDidAbort(*xid)); + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(*xid, FrozenTransactionId) || + TransactionIdPrecedes(*xid, oldestXidHavingUndo)) + { + return undo_tup; + } + + /* + * We can't have two aborted transaction with pending rollback state for + * the same tuple. + */ + Assert(!TransactionIdIsValid(prev_undo_xid) || + TransactionIdEquals(prev_undo_xid, *xid)); + + /* + * If undo tuple is the root tuple inserted by the aborted transaction, + * we don't have to process any further. The tuple is not visible to us. + */ + if (!IsZHeapTupleModified(undo_tup->t_data->t_infomask)) + { + /* before leaving, free the allocated memory */ + pfree(undo_tup); + return NULL; + } + + urec_ptr = prev_urec_ptr; + prev_undo_xid = *xid; + prev_trans_slot_id = trans_slot_id; + + goto fetch_prior_undo_record; + + /* not reachable */ + Assert(0); + return NULL; +} + +/* + * GetTupleFromUndo + * + * Fetch the record from undo and determine if previous version of tuple + * is visible for the given snapshot. If there exists a visible version + * of tuple in undo, then return the same, else return NULL. 
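+ * + * The traversal is iterative: each undo record yields the prior version of + * the tuple via CopyTupleFromUndoRecord, and we loop (goto + * fetch_prior_undo_record) until a version visible to the snapshot is found + * or the root version of the tuple is reached.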
+ * + * During undo chain traversal, we need to ensure that we switch the undo + * chain if the current version of undo tuple is modified by a transaction + * that is different from transaction that has modified the previous version + * of undo tuple. This is primarily done because undo chain for a particular + * tuple is formed based on the transaction id that has modified the tuple. + * + * Also we don't need to process the chain if the latest xid that has changed + * the tuple precedes smallest xid that has undo. + */ +static ZHeapTuple +GetTupleFromUndo(UndoRecPtr urec_ptr, ZHeapTuple zhtup, + Snapshot snapshot, Buffer buffer, + ItemPointer ctid, int trans_slot, + TransactionId prev_undo_xid) +{ + UnpackedUndoRecord *urec; + ZHeapTuple undo_tup; + UndoRecPtr prev_urec_ptr; + TransactionId xid; + CommandId cid; + int undo_oper; + TransactionId oldestXidHavingUndo; + int trans_slot_id; + int prev_trans_slot_id = trans_slot; + + + /* + * tuple is modified after the scan is started, fetch the prior record + * from undo to see if it is visible. + */ +fetch_prior_undo_record: + prev_urec_ptr = InvalidUndoRecPtr; + cid = InvalidCommandId; + undo_oper = -1; + trans_slot_id = InvalidXactSlotId; + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + prev_undo_xid, + NULL, + ZHeapSatisfyUndoRecord); + + /* If undo is discarded, then current tuple is visible. */ + if (urec == NULL) + return zhtup; + + undo_tup = CopyTupleFromUndoRecord(urec, zhtup, &trans_slot_id, &cid, true, + BufferGetPage(buffer)); + prev_urec_ptr = urec->uur_blkprev; + xid = urec->uur_prevxid; + + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. + */ + if (ctid) + { + if (urec->uur_type == UNDO_UPDATE) + *ctid = *((ItemPointer) urec->uur_payload.data); + else + *ctid = undo_tup->t_self; + } + + UndoRecordRelease(urec); + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + return undo_tup; + + /* + * Change the undo chain if the undo tuple is stamped with the different + * transaction. + */ + if (trans_slot_id != prev_trans_slot_id) + { + /* + * It is quite possible that the tuple is showing some valid + * transaction slot, but actual slot has been frozen. This can happen + * when the slot belongs to TPD entry and the corresponding TPD entry + * is pruned. 
+ */ + trans_slot_id = GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &prev_urec_ptr, + true, + true); + } + + if (undo_tup->t_data->t_infomask & ZHEAP_INPLACE_UPDATED) + { + undo_oper = ZHEAP_INPLACE_UPDATED; + } + else if (undo_tup->t_data->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + undo_oper = ZHEAP_XID_LOCK_ONLY; + } + else + { + /* we can't further operate on deleted or non-inplace-updated tuple */ + Assert(!((undo_tup->t_data->t_infomask & ZHEAP_DELETED) || + (undo_tup->t_data->t_infomask & ZHEAP_UPDATED))); + } + + /* + * We need to fetch all the transaction related information from undo + * record for the tuples that point to a slot that gets invalidated for + * reuse at some point of time. See PageFreezeTransSlots. + */ + if (ZHeapTupleHasInvalidXact(undo_tup->t_data->t_infomask)) + { + FetchTransInfoFromUndo(undo_tup, NULL, &xid, &cid, &prev_urec_ptr, false); + } + else if (cid == InvalidCommandId) + { + CommandId cur_cid = GetCurrentCommandId(false); + + /* + * If the current command doesn't need to modify any tuple and the + * snapshot used is not of any previous command, then it can see all the + * modifications made by current transactions till now. So, we don't even + * attempt to fetch CID from undo in such cases. + */ + if (!GetCurrentCommandIdUsed() && cur_cid == snapshot->curcid) + { + cid = InvalidCommandId; + } + else + { + /* + * we don't use prev_undo_xid to fetch the undo record for cid as it is + * required only when transaction is current transaction in which case + * there is no risk of transaction chain switching, so we are safe. It + * might be better to move this check near to it's usage, but that will + * make code look ugly, so keeping it here. + */ + cid = ZHeapTupleGetCid(undo_tup, buffer, prev_urec_ptr, trans_slot_id); + } + } + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + return undo_tup; + + if (undo_oper == ZHEAP_INPLACE_UPDATED || + undo_oper == ZHEAP_XID_LOCK_ONLY) + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (IsMVCCSnapshot(snapshot) && cid >= snapshot->curcid) + { + /* + * Updated after scan started, need to fetch prior tuple + * in undo chain. 
+ */ + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + + goto fetch_prior_undo_record; + } + else + return undo_tup; /* updated before scan started */ + } + else if (IsMVCCSnapshot(snapshot) && XidInMVCCSnapshot(xid, snapshot)) + { + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + + goto fetch_prior_undo_record; + } + else if (!IsMVCCSnapshot(snapshot) && TransactionIdIsInProgress(xid)) + { + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + + goto fetch_prior_undo_record; + } + else if (TransactionIdDidCommit(xid)) + return undo_tup; + else + { + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + + goto fetch_prior_undo_record; + } + } + else /* undo tuple is the root tuple */ + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (IsMVCCSnapshot(snapshot) && cid >= snapshot->curcid) + return NULL; /* inserted after scan started */ + else + return undo_tup; /* inserted before scan started */ + } + else if (IsMVCCSnapshot(snapshot) && XidInMVCCSnapshot(xid, snapshot)) + return NULL; + else if (!IsMVCCSnapshot(snapshot) && TransactionIdIsInProgress(xid)) + return NULL; + else if (TransactionIdDidCommit(xid)) + return undo_tup; + else + return NULL; + } + + /* we should never reach here */ + return NULL; +} + +/* + * GetTupleFromUndoWithOffset + * + * This is similar to GetTupleFromUndo with a difference that it takes + * line offset as an input. This is a special purpose function that + * is written to fetch visible version of deleted tuple that has been + * pruned to a deleted line pointer. + */ +static ZHeapTuple +GetTupleFromUndoWithOffset(UndoRecPtr urec_ptr, Snapshot snapshot, + Buffer buffer, OffsetNumber off, int trans_slot) +{ + UnpackedUndoRecord *urec; + ZHeapTuple undo_tup; + UndoRecPtr prev_urec_ptr = InvalidUndoRecPtr; + TransactionId xid, oldestXidHavingUndo; + CommandId cid = InvalidCommandId; + int trans_slot_id = InvalidXactSlotId; + int prev_trans_slot_id = trans_slot; + + + /* + * tuple is modified after the scan is started, fetch the prior record + * from undo to see if it is visible. + */ + urec = UndoFetchRecord(urec_ptr, + BufferGetBlockNumber(buffer), + off, + InvalidTransactionId, + NULL, + ZHeapSatisfyUndoRecord); + + /* need to ensure that undo record contains complete tuple */ + Assert(urec->uur_type == UNDO_DELETE || urec->uur_type == UNDO_UPDATE); + undo_tup = CopyTupleFromUndoRecord(urec, NULL, &trans_slot_id, &cid, false, + BufferGetPage(buffer)); + prev_urec_ptr = urec->uur_blkprev; + xid = urec->uur_prevxid; + + UndoRecordRelease(urec); + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + return undo_tup; + + /* + * Change the undo chain if the undo tuple is stamped with the different + * transaction. 
+ */ + if (trans_slot_id != prev_trans_slot_id) + { + trans_slot_id = GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &prev_urec_ptr, + true, + true); + } + + return GetVisibleTupleIfAny(prev_urec_ptr, undo_tup, + snapshot, buffer, xid, trans_slot_id, cid); +} + +/* + * UndoTupleSatisfiesUpdate + * + * Returns true, if there exists a visible version of zhtup in undo, + * false otherwise. + * + * This function returns ctid for the undo tuple which will be always + * same as the ctid of zhtup except for non-in-place update case. + * + * The Undo chain traversal follows similar protocol as mentioned atop + * GetTupleFromUndo. + */ +static bool +UndoTupleSatisfiesUpdate(UndoRecPtr urec_ptr, ZHeapTuple zhtup, + CommandId curcid, Buffer buffer, + ItemPointer ctid, int trans_slot, + TransactionId prev_undo_xid, bool free_zhtup, + bool *in_place_updated_or_locked) +{ + UnpackedUndoRecord *urec; + ZHeapTuple undo_tup; + UndoRecPtr prev_urec_ptr; + TransactionId xid, oldestXidHavingUndo; + CommandId cid; + int trans_slot_id; + int prev_trans_slot_id = trans_slot; + int undo_oper; + bool result; + + /* + * tuple is modified after the scan is started, fetch the prior record + * from undo to see if it is visible. + */ +fetch_prior_undo_record: + undo_tup = NULL; + prev_urec_ptr = InvalidUndoRecPtr; + cid = InvalidCommandId; + trans_slot_id = InvalidXactSlotId; + undo_oper = -1; + result = false; + + urec = UndoFetchRecord(urec_ptr, + ItemPointerGetBlockNumber(&zhtup->t_self), + ItemPointerGetOffsetNumber(&zhtup->t_self), + prev_undo_xid, + NULL, + ZHeapSatisfyUndoRecord); + + /* If undo is discarded, then current tuple is visible. */ + if (urec == NULL) + { + result = true; + goto result_available; + } + + undo_tup = CopyTupleFromUndoRecord(urec, zhtup, &trans_slot_id, &cid, + free_zhtup, BufferGetPage(buffer)); + prev_urec_ptr = urec->uur_blkprev; + xid = urec->uur_prevxid; + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. + */ + if (ctid) + { + if (urec->uur_type == UNDO_UPDATE) + *ctid = *((ItemPointer) urec->uur_payload.data); + else + *ctid = undo_tup->t_self; + } + + if (undo_tup->t_data->t_infomask & ZHEAP_INPLACE_UPDATED) + { + undo_oper = ZHEAP_INPLACE_UPDATED; + *in_place_updated_or_locked = true; + } + else if (undo_tup->t_data->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + undo_oper = ZHEAP_XID_LOCK_ONLY; + *in_place_updated_or_locked = true; + } + else + { + /* we can't further operate on deleted or non-inplace-updated tuple */ + Assert(!(undo_tup->t_data->t_infomask & ZHEAP_DELETED) || + !(undo_tup->t_data->t_infomask & ZHEAP_UPDATED)); + } + + UndoRecordRelease(urec); + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + { + result = true; + goto result_available; + } + + /* + * Change the undo chain if the undo tuple is stamped with the different + * transaction slot. + */ + if (trans_slot_id != prev_trans_slot_id) + { + /* + * It is quite possible that the tuple is showing some valid + * transaction slot, but actual slot has been frozen. 
This can happen + * when the slot belongs to TPD entry and the corresponding TPD entry + * is pruned. + */ + trans_slot_id = GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&undo_tup->t_self), + trans_slot_id, + NULL, + NULL, + &prev_urec_ptr, + true, + true); + } + + /* + * We need to fetch all the transaction related information from undo + * record for the tuples that point to a slot that gets invalidated for + * reuse at some point of time. See PageFreezeTransSlots. + */ + if (ZHeapTupleHasInvalidXact(undo_tup->t_data->t_infomask)) + { + FetchTransInfoFromUndo(undo_tup, NULL, &xid, &cid, &prev_urec_ptr, false); + } + else if (cid == InvalidCommandId) + { + CommandId cur_comm_cid = GetCurrentCommandId(false); + + /* + * If the current command doesn't need to modify any tuple and the + * snapshot used is not of any previous command, then it can see all the + * modifications made by current transactions till now. So, we don't even + * attempt to fetch CID from undo in such cases. + */ + if (!GetCurrentCommandIdUsed() && cur_comm_cid == curcid) + { + cid = InvalidCommandId; + } + else + { + /* + * we don't use prev_undo_xid to fetch the undo record for cid as it is + * required only when transaction is current transaction in which case + * there is no risk of transaction chain switching, so we are safe. It + * might be better to move this check near to it's usage, but that will + * make code look ugly, so keeping it here. + */ + cid = ZHeapTupleGetCid(undo_tup, buffer, prev_urec_ptr, trans_slot_id); + } + } + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple is too old that it is all-visible + * or it precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + TransactionIdEquals(xid, FrozenTransactionId) || + TransactionIdPrecedes(xid, oldestXidHavingUndo)) + { + result = true; + goto result_available; + } + + if (undo_oper == ZHEAP_INPLACE_UPDATED || + undo_oper == ZHEAP_XID_LOCK_ONLY) + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid >= curcid) + { + /* + * Updated after scan started, need to fetch prior tuple + * in undo chain. + */ + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + free_zhtup = true; + + goto fetch_prior_undo_record; + } + else + result = true; /* updated before scan started */ + } + else if (TransactionIdIsInProgress(xid)) + { + /* Note the values required to fetch prior tuple in undo chain. */ + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + free_zhtup = true; + + goto fetch_prior_undo_record; + } + else if (TransactionIdDidCommit(xid)) + result = true; + else + { + /* Note the values required to fetch prior tuple in undo chain. 
*/ + urec_ptr = prev_urec_ptr; + zhtup = undo_tup; + prev_undo_xid = xid; + prev_trans_slot_id = trans_slot_id; + free_zhtup = true; + + goto fetch_prior_undo_record; + } + } + else /* undo tuple is the root tuple */ + { + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid >= curcid) + result = false; /* inserted after scan started */ + else + result = true; /* inserted before scan started */ + } + else if (TransactionIdIsInProgress(xid)) + result = false; + else if (TransactionIdDidCommit(xid)) + result = true; + else + result = false; + } + +result_available: + if (undo_tup) + pfree(undo_tup); + return result; +} + +/* + * ZHeapTupleSatisfiesMVCC + * + * Returns the visible version of tuple if any, NULL otherwise. We need to + * traverse undo record chains to determine the visibility of tuple. In + * this function we first need to determine the visibility of the modified + * tuple and if it is not visible, then we need to fetch the prior version + * of the tuple from the undo chain and decide based on its visibility. The undo + * chain needs to be traversed till we reach the root version of the tuple. + * + * Here, we consider the effects of: + * all transactions committed as of the time of the given snapshot + * previous commands of this transaction + * + * Does _not_ include: + * transactions shown as in-progress by the snapshot + * transactions started after the snapshot was taken + * changes made by the current command + * + * The tuple will be considered visible iff the latest operation on the tuple is + * an insert, an in-place update, or a lock, and the transaction that performed + * that operation is the current transaction (and the operation was performed + * by some previous command) or is committed. + * + * We traverse the undo chain to get the visible tuple if any, in case the + * latest transaction that has operated on the tuple is shown as in-progress + * by the snapshot, or started after the snapshot was taken, or is the current + * transaction and the changes were made by the current command. + * + * For aborted transactions, we need to fetch the visible tuple from undo. + * Now, it is possible that the actions corresponding to an aborted transaction + * have been applied, but the xid is still present in the slot; however, we + * should never get such an xid here. + * + * For multilockers, the strongest locker information is always present on + * the tuple. So for updaters, we don't need anything special as the tuple + * visibility will be determined based on the transaction information present + * on the tuple. For the lockers-only case, we need to determine if the original + * inserter is visible to the snapshot. + */ +ZHeapTuple +ZHeapTupleSatisfiesMVCC(ZHeapTuple zhtup, Snapshot snapshot, + Buffer buffer, ItemPointer ctid) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + TransactionId xid; + CommandId *cid; + CommandId cur_cid = GetCurrentCommandId(false); + CommandId tmp_cid; + uint64 epoch_xid; + int trans_slot; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* + * If the current command doesn't need to modify any tuple and the + * snapshot used is not of any previous command, then it can see all the + * modifications made by current transactions till now. So, we don't even + * attempt to fetch CID from undo in such cases.
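+ * (That is the '!GetCurrentCommandIdUsed() && cur_cid == snapshot->curcid' + * test below; when it holds we pass cid = NULL and all of the curcid + * comparisons are skipped.)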
+ */ + if (!GetCurrentCommandIdUsed() && cur_cid == snapshot->curcid) + { + cid = NULL; + } + else + { + cid = &tmp_cid; + *cid = InvalidCommandId; + } + + /* Get transaction info */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot, &epoch_xid, &xid, cid, + &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. Transaction slot can also be considered + * frozen if it belongs to previous epoch. + */ + if (trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return NULL; + + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid && *cid >= snapshot->curcid) + { + /* deleted after scan started, get previous tuple from undo */ + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + } + else + { + /* + * For non-inplace-updates, ctid needs to be retrieved from + * undo record if required. If the tuple is moved to another + * partition, then we don't need ctid. + */ + if (ctid && + tuple->t_infomask & ZHEAP_UPDATED && + !ZHeapTupleIsMoved(tuple->t_infomask)) + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + + return NULL; /* deleted before scan started */ + } + } + else if (XidInMVCCSnapshot(xid, snapshot)) + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + else if (TransactionIdDidCommit(xid)) + { + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. If the tuple is moved to another + * partition, then we don't need ctid. + */ + if (ctid && + !ZHeapTupleIsMoved(tuple->t_infomask) && + tuple->t_infomask & ZHEAP_UPDATED) + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + + return NULL; /* tuple is deleted */ + } + else /* transaction is aborted */ + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + } + else if (tuple->t_infomask & ZHEAP_INPLACE_UPDATED || + tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + /* + * The tuple is updated/locked and must be all visible if the + * transaction slot is cleared or latest xid that has changed the + * tuple precedes smallest xid that has undo. + */ + if (trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; /* tuple is updated */ + + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid && *cid >= snapshot->curcid) + { + /* + * updated/locked after scan started, get previous tuple from + * undo. + */ + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + } + else + return zhtup; /* updated before scan started */ + } + else if (XidInMVCCSnapshot(xid, snapshot)) + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + else if (TransactionIdDidCommit(xid)) + return zhtup; /* tuple is updated */ + else /* transaction is aborted */ + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot, + InvalidTransactionId); + } + + /* + * The tuple must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. 
+ */ + if (trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; + + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid && *cid >= snapshot->curcid) + return NULL; /* inserted after scan started */ + else + return zhtup; /* inserted before scan started */ + } + else if (XidInMVCCSnapshot(xid, snapshot)) + return NULL; + else if (TransactionIdDidCommit(xid)) + return zhtup; + else + return NULL; + + return NULL; +} + +/* + * ZHeapGetVisibleTuple + * + * This function is called for tuple that is deleted but not all-visible. It + * returns NULL, if the last transaction that has modified the tuple is + * visible to snapshot or if none of the versions of tuple is visible, + * otherwise visible version tuple if any. + * + * The caller must ensure that it passes the line offset for a tuple that is + * marked as deleted. + */ +ZHeapTuple +ZHeapGetVisibleTuple(OffsetNumber off, Snapshot snapshot, Buffer buffer, bool *all_dead) +{ + Page page; + UndoRecPtr urec_ptr; + TransactionId xid; + CommandId cid; + ItemId lp; + uint64 epoch, epoch_xid; + uint32 tmp_epoch; + int trans_slot; + int vis_info; + + if (all_dead) + *all_dead = false; + + page = BufferGetPage(buffer); + lp = PageGetItemId(page, off); + Assert(ItemIdIsDeleted(lp)); + + trans_slot = ItemIdGetTransactionSlot(lp); + vis_info = ItemIdGetVisibilityInfo(lp); + + /* + * We need to fetch all the transaction related information from undo + * record for the tuples that point to a slot that gets invalidated for + * reuse at some point of time. See PageFreezeTransSlots. + */ +check_trans_slot: + if (trans_slot != ZHTUP_SLOT_FROZEN) + { + if (vis_info & ITEMID_XACT_INVALID) + { + ZHeapTupleData undo_tup; + ItemPointerSetBlockNumber(&undo_tup.t_self, + BufferGetBlockNumber(buffer)); + ItemPointerSetOffsetNumber(&undo_tup.t_self, off); + + /* + * We need undo record pointer to fetch the transaction information + * from undo. + */ + trans_slot = GetTransactionSlotInfo(buffer, off, trans_slot, + &tmp_epoch, &xid, &urec_ptr, + true, false); + /* + * It is quite possible that the tuple is showing some valid + * transaction slot, but actual slot has been frozen. This can happen + * when the slot belongs to TPD entry and the corresponding TPD entry + * is pruned. + */ + if (trans_slot == ZHTUP_SLOT_FROZEN) + goto check_trans_slot; + + xid = InvalidTransactionId; + FetchTransInfoFromUndo(&undo_tup, &epoch, &xid, &cid, &urec_ptr, false); + } + else + { + trans_slot = GetTransactionSlotInfo(buffer, off, trans_slot, + &tmp_epoch, &xid, &urec_ptr, + true, false); + if (trans_slot == ZHTUP_SLOT_FROZEN) + goto check_trans_slot; + + epoch = (uint64) tmp_epoch; + cid = ZHeapPageGetCid(buffer, trans_slot, tmp_epoch, xid, urec_ptr, off); + } + } + else + { + epoch = 0; + xid = InvalidTransactionId; + cid = InvalidCommandId; + urec_ptr = InvalidUndoRecPtr; + } + + epoch_xid = MakeEpochXid(epoch, xid); + + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. Transaction slot can also be considered + * frozen if it belongs to previous epoch. 
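+ * The single 64-bit comparison below works because MakeEpochXid combines the + * epoch and the xid into one value (conceptually, the epoch in the high 32 + * bits and the xid in the low 32 bits), so ordering against + * oldestXidWithEpochHavingUndo stays meaningful across epochs.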
+ */ + if (trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + { + if (all_dead) + *all_dead = true; + return NULL; + } + + if (TransactionIdIsCurrentTransactionId(xid)) + { + if (cid >= snapshot->curcid) + { + /* deleted after scan started, get previous tuple from undo */ + return GetTupleFromUndoWithOffset(urec_ptr, + snapshot, + buffer, + off, + trans_slot); + } + else + return NULL; /* deleted before scan started */ + } + else if (IsMVCCSnapshot(snapshot) && XidInMVCCSnapshot(xid, snapshot)) + return GetTupleFromUndoWithOffset(urec_ptr, + snapshot, + buffer, + off, + trans_slot); + else if (!IsMVCCSnapshot(snapshot) && TransactionIdIsInProgress(xid)) + return GetTupleFromUndoWithOffset(urec_ptr, + snapshot, + buffer, + off, + trans_slot); + else if (TransactionIdDidCommit(xid)) + return NULL; /* tuple is deleted */ + else /* transaction is aborted */ + return GetTupleFromUndoWithOffset(urec_ptr, + snapshot, + buffer, + off, + trans_slot); + + return NULL; +} + +/* + * ZHeapTupleSatisfiesUpdate + * + * The return values for this API are the same as for HeapTupleSatisfiesUpdate. + * However, there is a notable difference in the way the visibility of tuples + * is determined: we need to traverse undo record chains to determine the + * visibility of a tuple. + * + * For multilockers, the visibility can be determined by the information + * present on the tuple. See ZHeapTupleSatisfiesMVCC. Also, this API returns + * HeapTupleMayBeUpdated if the strongest locker is committed, which means + * the caller needs to take care of waiting for other lockers in such a case. + * + * ctid - returns the ctid of visible tuple if the tuple is either deleted or + * updated. ctid needs to be retrieved from undo tuple. + * trans_slot - returns the transaction slot of the transaction that has + * modified the visible tuple. + * xid - returns the xid that has modified the visible tuple. + * subxid - returns the subtransaction id, if any, that has modified the + * visible tuple. We fetch the subxid from undo only when it is required, + * i.e. when the caller would wait on it to finish. + * cid - returns the cid of visible tuple. + * single_locker_xid - returns the xid of a single in-progress locker, if any. + * single_locker_trans_slot - returns the transaction slot of a single + * in-progress locker, if any. + * lock_allowed - allow caller to lock the tuple if it is in-place updated + * in_place_updated - returns whether the current visible version of tuple is + * updated in place. + */ +HTSU_Result +ZHeapTupleSatisfiesUpdate(Relation rel, ZHeapTuple zhtup, CommandId curcid, + Buffer buffer, ItemPointer ctid, int *trans_slot, + TransactionId *xid, SubTransactionId *subxid, + CommandId *cid, TransactionId *single_locker_xid, + int *single_locker_trans_slot, bool free_zhtup, + bool lock_allowed, Snapshot snapshot, + bool *in_place_updated_or_locked) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + uint64 epoch_xid; + CommandId cur_comm_cid = GetCurrentCommandId(false); + bool visible; + + *single_locker_xid = InvalidTransactionId; + *single_locker_trans_slot = InvalidXactSlotId; + *in_place_updated_or_locked = false; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* + * If the current command doesn't need to modify any tuple and the + * snapshot used is not of any previous command, then it can see all the + * modifications made by current transactions till now.
So, we don't even + * attempt to fetch CID from undo in such cases. + */ + if (!GetCurrentCommandIdUsed() && cur_comm_cid == curcid) + { + cid = NULL; + } + + /* Get transaction info */ + ZHeapTupleGetTransInfo(zhtup, buffer, trans_slot, &epoch_xid, xid, cid, + &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted or non-inplace-updated and must be all visible + * if the transaction slot is cleared or latest xid that has changed + * the tuple precedes smallest xid that has undo. However, that is + * not possible at this stage as the tuple has already passed snapshot + * check. + */ + Assert(!(*trans_slot == ZHTUP_SLOT_FROZEN && + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo))); + + if (TransactionIdIsCurrentTransactionId(*xid)) + { + if (cid && *cid >= curcid) + { + /* deleted after scan started, check previous tuple from undo */ + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + if (visible) + return HeapTupleSelfUpdated; + else + return HeapTupleInvisible; + } + else + return HeapTupleInvisible; /* deleted before scan started */ + } + else if (TransactionIdIsInProgress(*xid)) + { + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + + if (visible) + { + if (subxid) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, subxid); + + return HeapTupleBeingUpdated; + } + else + return HeapTupleInvisible; + } + else if (TransactionIdDidCommit(*xid)) + { + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. If the tuple is moved to another + * partition, then we don't need ctid. + */ + if (ctid && + !ZHeapTupleIsMoved(tuple->t_infomask) && + tuple->t_infomask & ZHEAP_UPDATED) + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + + /* tuple is deleted or non-inplace-updated */ + return HeapTupleUpdated; + } + else /* transaction is aborted */ + { + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + + /* + * If updating transaction id is aborted and the tuple is visible + * then return HeapTupleBeingUpdated, so that caller can apply the + * undo before modifying the page. Here, we don't need to fetch + * subtransaction id as it is only possible for top-level xid to + * have pending undo actions. + */ + if (visible) + return HeapTupleBeingUpdated; + else + return HeapTupleInvisible; + } + } + else if (tuple->t_infomask & ZHEAP_INPLACE_UPDATED || + tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + *in_place_updated_or_locked = true; + + /* + * The tuple is updated/locked and must be all visible if the + * transaction slot is cleared or latest xid that has touched the + * tuple precedes smallest xid that has undo. If there is a single + * locker on the tuple, then we fetch the lockers transaction info + * from undo as we never store lockers slot on tuple. See + * compute_new_xid_infomask for more details about lockers. 
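+ * GetLockerTransInfo below performs that lookup; if an in-progress or + * aborted single locker is found we report HeapTupleBeingUpdated so that + * the caller can resolve the lock conflict (or apply pending undo) first.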
+ */ + if (*trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + { + bool found = false; + + if (ZHEAP_XID_IS_LOCKED_ONLY(tuple->t_infomask) && + !ZHeapTupleHasMultiLockers(tuple->t_infomask)) + found = GetLockerTransInfo(rel, zhtup, buffer, single_locker_trans_slot, + NULL, single_locker_xid, NULL, NULL); + if (!found) + return HeapTupleMayBeUpdated; + else + { + /* + * If there is a single locker in-progress/aborted locker, + * it's safe to return being updated so that the caller + * check for lock conflicts or perform rollback if necessary. + * + * If the single locker is our current transaction, then also + * we return beging updated. + */ + return HeapTupleBeingUpdated; + } + } + + if (TransactionIdIsCurrentTransactionId(*xid)) + { + if (cid && *cid >= curcid) + { + /* + * updated/locked after scan started, check previous tuple + * from undo + */ + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + if (visible) + { + if (ZHEAP_XID_IS_LOCKED_ONLY(tuple->t_infomask)) + return HeapTupleBeingUpdated; + else + return HeapTupleSelfUpdated; + } + } + else + { + if (ZHEAP_XID_IS_LOCKED_ONLY(tuple->t_infomask)) + { + /* + * Locked before scan; caller can check if it is locked + * in lock mode higher or equal to the required mode, then + * it can skip locking the tuple. + */ + return HeapTupleBeingUpdated; + } + else + return HeapTupleMayBeUpdated; /* updated before scan started */ + } + } + else if (TransactionIdIsInProgress(*xid)) + { + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + + if (visible) + { + if (subxid) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, subxid); + + return HeapTupleBeingUpdated; + } + else + return HeapTupleInvisible; + } + else if (TransactionIdDidCommit(*xid)) + { + /* if tuple is updated and not in our snapshot, then allow to update it. */ + if (lock_allowed || !XidInMVCCSnapshot(*xid, snapshot)) + return HeapTupleMayBeUpdated; + else + return HeapTupleUpdated; + } + else /* transaction is aborted */ + { + visible = UndoTupleSatisfiesUpdate(urec_ptr, + zhtup, + curcid, + buffer, + ctid, + *trans_slot, + InvalidTransactionId, + free_zhtup, + in_place_updated_or_locked); + + /* + * If updating transaction id is aborted and the tuple is visible + * then return HeapTupleBeingUpdated, so that caller can apply the + * undo before modifying the page. Here, we don't need to fetch + * subtransaction id as it is only possible for top-level xid to + * have pending undo actions. + */ + if (visible) + return HeapTupleBeingUpdated; + else + return HeapTupleInvisible; + } + } + + /* + * The tuple must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. 
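+	 *
+	 * As a rough sketch (the exact composition of epoch_xid is an assumption
+	 * here, not something this comment spells out): the 64-bit value packs
+	 * the epoch into the high 32 bits and the xid into the low 32 bits, e.g.
+	 *
+	 *		uint64		epoch_xid = ((uint64) epoch << 32) | xid;
+	 *
+	 * so the unsigned comparison against oldestXidWithEpochHavingUndo below
+	 * stays correct across xid wraparound.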
+ */ + if (*trans_slot == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return HeapTupleMayBeUpdated; + + if (TransactionIdIsCurrentTransactionId(*xid)) + { + if (cid && *cid >= curcid) + return HeapTupleInvisible; /* inserted after scan started */ + else + return HeapTupleMayBeUpdated; /* inserted before scan started */ + } + else if (TransactionIdIsInProgress(*xid)) + return HeapTupleInvisible; + else if (TransactionIdDidCommit(*xid)) + return HeapTupleMayBeUpdated; + else + return HeapTupleInvisible; + + return HeapTupleInvisible; +} + +/* + * ZHeapTupleIsSurelyDead + * + * Similar to HeapTupleIsSurelyDead, but for zheap tuples. + */ +bool +ZHeapTupleIsSurelyDead(ZHeapTuple zhtup, uint64 OldestXmin, Buffer buffer) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + TransactionId xid; + uint64 epoch_xid; + int trans_slot_id; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* Get transaction id */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot_id, &epoch_xid, &xid, NULL, + NULL, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || epoch_xid < OldestXmin) + return true; + } + + return false; /* Tuple is still alive */ +} + +/* + * ZHeapTupleSatisfiesSelf + * Returns the visible version of tuple (including effects of previous + * commands in current transactions) if any, NULL otherwise. + * + * Here, we consider the effects of: + * all committed transactions (as of the current instant) + * previous commands of this transaction + * changes made by the current command + * + * The tuple will be considered visible iff: + * Latest operation on tuple is Insert, In-Place update or tuple is + * locked and the transaction that has performed operation is current + * transaction or is committed. + * + * If the transaction is in progress, then we fetch the tuple from undo. + */ +ZHeapTuple +ZHeapTupleSatisfiesSelf(ZHeapTuple zhtup, Snapshot snapshot, + Buffer buffer, ItemPointer ctid) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + TransactionId xid; + UndoRecPtr urec_ptr = InvalidUndoRecPtr; + uint64 epoch_xid; + int trans_slot_id; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* Get transaction id */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot_id, &epoch_xid, &xid, + NULL, &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. 
+ */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return NULL; + + if (TransactionIdIsCurrentTransactionId(xid)) + return NULL; + else if (TransactionIdIsInProgress(xid)) + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot_id, + InvalidTransactionId); + else if (TransactionIdDidCommit(xid)) + { + /* tuple is deleted or non-inplace-updated */ + return NULL; + } + else /* transaction is aborted */ + { + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot_id, + InvalidTransactionId); + } + } + else if (tuple->t_infomask & ZHEAP_INPLACE_UPDATED || + tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + /* + * The tuple is updated/locked and must be all visible if the + * transaction slot is cleared or latest xid that has changed the + * tuple precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; + + if (TransactionIdIsCurrentTransactionId(xid)) + { + return zhtup; + } + else if (TransactionIdIsInProgress(xid)) + { + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot_id, + InvalidTransactionId); + } + else if (TransactionIdDidCommit(xid)) + { + return zhtup; + } + else /* transaction is aborted */ + { + return GetTupleFromUndo(urec_ptr, + zhtup, + snapshot, + buffer, + ctid, + trans_slot_id, + InvalidTransactionId); + } + } + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple precedes smallest xid that has + * undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; + + if (TransactionIdIsCurrentTransactionId(xid)) + return zhtup; + else if (TransactionIdIsInProgress(xid)) + { + return NULL; + } + else if (TransactionIdDidCommit(xid)) + return zhtup; + else + { + /* Inserting transaction is aborted. */ + return NULL; + } + + return NULL; +} + +/* + * ZHeapTupleSatisfiesDirty + * Returns the visible version of tuple (including effects of open + * transactions) if any, NULL otherwise. + * + * Here, we consider the effects of: + * all committed and in-progress transactions (as of the current instant) + * previous commands of this transaction + * changes made by the current command + * + * This is essentially like ZHeapTupleSatisfiesSelf as far as effects of + * the current transaction and committed/aborted xacts are concerned. + * However, we also include the effects of other xacts still in progress. + * + * The tuple will be considered visible iff: + * (a) Latest operation on tuple is Delete or non-inplace-update and the + * current transaction is in progress. + * (b) Latest operation on tuple is Insert, In-Place update or tuple is + * locked and the transaction that has performed operation is current + * transaction or is in-progress or is committed. 
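+ *
+ * As an illustrative sketch only (variable names are hypothetical and real
+ * callers do more bookkeeping), code that must wait out a concurrent writer
+ * typically consumes the snapshot fields this routine fills in:
+ *
+ *		SnapshotData DirtySnapshot;
+ *
+ *		InitDirtySnapshot(DirtySnapshot);
+ *		tup = ZHeapTupleSatisfiesDirty(zhtup, &DirtySnapshot, buf, NULL);
+ *		if (tup != NULL && TransactionIdIsValid(DirtySnapshot.xmax))
+ *			XactLockTableWait(DirtySnapshot.xmax, rel, &tup->t_self,
+ *							  XLTW_Update);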
+ */ +ZHeapTuple +ZHeapTupleSatisfiesDirty(ZHeapTuple zhtup, Snapshot snapshot, + Buffer buffer, ItemPointer ctid) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + TransactionId xid; + uint64 epoch_xid; + UndoRecPtr urec_ptr; + int trans_slot_id; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + snapshot->xmin = snapshot->xmax = InvalidTransactionId; + snapshot->subxid = InvalidSubTransactionId; + snapshot->speculativeToken = 0; + + /* Get transaction id */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot_id, &epoch_xid, &xid, NULL, + &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return NULL; + + if (TransactionIdIsCurrentTransactionId(xid)) + { + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. If the tuple is moved to another + * partition, then we don't need ctid. + */ + if (ctid && + !ZHeapTupleIsMoved(tuple->t_infomask) && + tuple->t_infomask & ZHEAP_UPDATED) + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + return NULL; + } + else if (TransactionIdIsInProgress(xid)) + { + snapshot->xmax = xid; + if (UndoRecPtrIsValid(urec_ptr)) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, &snapshot->subxid); + return zhtup; /* in deletion by other */ + } + else if (TransactionIdDidCommit(xid)) + { + /* + * For non-inplace-updates, ctid needs to be retrieved from undo + * record if required. If the tuple is moved to another + * partition, then we don't need ctid. + */ + if (ctid && + !ZHeapTupleIsMoved(tuple->t_infomask) && + tuple->t_infomask & ZHEAP_UPDATED) + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + + /* tuple is deleted or non-inplace-updated */ + return NULL; + } + else /* transaction is aborted */ + { + return GetTupleFromUndo(urec_ptr, zhtup, snapshot, buffer, ctid, + trans_slot_id, InvalidTransactionId); + } + } + else if (tuple->t_infomask & ZHEAP_INPLACE_UPDATED || + tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + /* + * The tuple is updated/locked and must be all visible if the + * transaction slot is cleared or latest xid that has changed the + * tuple precedes smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; /* tuple is updated */ + + if (TransactionIdIsCurrentTransactionId(xid)) + return zhtup; + else if (TransactionIdIsInProgress(xid)) + { + if (!ZHEAP_XID_IS_LOCKED_ONLY(tuple->t_infomask)) + { + snapshot->xmax = xid; + if (UndoRecPtrIsValid(urec_ptr)) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, &snapshot->subxid); + } + return zhtup; /* being updated */ + } + else if (TransactionIdDidCommit(xid)) + return zhtup; /* tuple is updated by someone else */ + else /* transaction is aborted */ + { + /* Here we need to fetch the tuple from undo. */ + return GetTupleFromUndo(urec_ptr, zhtup, snapshot, buffer, ctid, + trans_slot_id, InvalidTransactionId); + } + } + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple precedes smallest xid that has + * undo. 
+ */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return zhtup; + + if (TransactionIdIsCurrentTransactionId(xid)) + return zhtup; + else if (TransactionIdIsInProgress(xid)) + { + /* Return the speculative token to caller. */ + if (ZHeapTupleHeaderIsSpeculative(tuple)) + { + ZHeapTupleGetSpecToken(zhtup, buffer, urec_ptr, + &snapshot->speculativeToken); + + Assert(snapshot->speculativeToken != 0); + } + + snapshot->xmin = xid; + if (UndoRecPtrIsValid(urec_ptr)) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, &snapshot->subxid); + return zhtup; /* in insertion by other */ + } + else if (TransactionIdDidCommit(xid)) + return zhtup; + else + { + /* + * Since the transaction that inserted the tuple is aborted. So, it's + * not visible to any transaction. + */ + return NULL; + } + + return NULL; +} + +/* + * ZHeapTupleSatisfiesAny + * Dummy "satisfies" routine: any tuple satisfies SnapshotAny. + */ +ZHeapTuple +ZHeapTupleSatisfiesAny(ZHeapTuple zhtup, Snapshot snapshot, Buffer buffer, + ItemPointer ctid) +{ + /* Callers can expect ctid to be populated. */ + if (ctid && + !ZHeapTupleIsMoved(zhtup->t_data->t_infomask) && + ZHeapTupleIsUpdated(zhtup->t_data->t_infomask)) + { + UndoRecPtr urec_ptr; + int out_slot_no PG_USED_FOR_ASSERTS_ONLY; + + out_slot_no = GetTransactionSlotInfo(buffer, + ItemPointerGetOffsetNumber(&zhtup->t_self), + ZHeapTupleHeaderGetXactSlot(zhtup->t_data), + NULL, + NULL, + &urec_ptr, + true, + false); + /* + * We always expect non-frozen transaction slot here as the caller tries + * to fetch the ctid of tuples that are visible to the snapshot, so + * corresponding undo record can't be discarded. + */ + Assert(out_slot_no != ZHTUP_SLOT_FROZEN); + + ZHeapTupleGetCtid(zhtup, buffer, urec_ptr, ctid); + } + return zhtup; +} + +/* + * ZHeapTupleSatisfiesOldestXmin + * The tuple will be considered visible if it is visible to any open + * transaction. + * + * ztuple is an input/output parameter. The caller must send the palloc'ed + * data. This function can get a tuple from undo to return in which case it + * will free the memory passed by the caller. + * + * xid is an output parameter. It is set to the latest committed/in-progress + * xid that inserted/modified the tuple. + * If the latest transaction for the tuple aborted, we fetch a prior committed + * version of the tuple and return the prior comitted xid and status as + * HEAPTUPLE_LIVE. + * If the latest transaction for the tuple aborted and it also inserted + * the tuple, we return the aborted transaction id and status as + * HEAPTUPLE_DEAD. In this case, the caller *should* never mark the + * corresponding item id as dead. Because, when undo action for the same will + * be performed, we need the item pointer. 
+ */ +HTSV_Result +ZHeapTupleSatisfiesOldestXmin(ZHeapTuple *ztuple, TransactionId OldestXmin, + Buffer buffer, TransactionId *xid, + SubTransactionId *subxid) +{ + ZHeapTuple zhtup = *ztuple; + ZHeapTupleHeader tuple = zhtup->t_data; + UndoRecPtr urec_ptr; + uint64 epoch_xid; + int trans_slot_id; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* Get transaction id */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot_id, &epoch_xid, xid, NULL, + &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return HEAPTUPLE_DEAD; + + if (TransactionIdIsCurrentTransactionId(*xid)) + return HEAPTUPLE_DELETE_IN_PROGRESS; + else if (TransactionIdIsInProgress(*xid)) + { + /* Get Sub transaction id */ + if (subxid) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, subxid); + + return HEAPTUPLE_DELETE_IN_PROGRESS; + } + else if (TransactionIdDidCommit(*xid)) + { + /* + * Deleter committed, but perhaps it was recent enough that some open + * transactions could still see the tuple. + */ + if (!TransactionIdPrecedes(*xid, OldestXmin)) + return HEAPTUPLE_RECENTLY_DEAD; + + /* Otherwise, it's dead and removable */ + return HEAPTUPLE_DEAD; + } + else /* transaction is aborted */ + { + /* + * For aborted transactions, we need to fetch the tuple from undo + * chain. + */ + *ztuple = GetTupleFromUndoForAbortedXact(urec_ptr, buffer, + trans_slot_id, zhtup, xid); + if (*ztuple != NULL) + return HEAPTUPLE_LIVE; + else + { + /* + * If the transaction that inserted the tuple got aborted, + * we should return the aborted transaction id. + */ + return HEAPTUPLE_DEAD; + } + } + } + else if (tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + /* + * We can't take any decision if the tuple is marked as locked-only. + * It's possible that inserted transaction took a lock on the tuple + * Later, if it rolled back, we should return HEAPTUPLE_DEAD, or if + * it's still in progress, we should return HEAPTUPLE_INSERT_IN_PROGRESS. + * Similarly, if the inserted transaction got committed, we should return + * HEAPTUPLE_LIVE. + * The subsequent checks already takes care of all these possible + * scenarios, so we don't need any extra checks here. + */ + } + + /* The tuple is either a newly inserted tuple or is in-place updated. */ + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple precedes smallest xid that has + * undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return HEAPTUPLE_LIVE; + + if (TransactionIdIsCurrentTransactionId(*xid)) + return HEAPTUPLE_INSERT_IN_PROGRESS; + else if (TransactionIdIsInProgress(*xid)) + { + /* Get Sub transaction id */ + if (subxid) + ZHeapTupleGetSubXid(zhtup, buffer, urec_ptr, subxid); + return HEAPTUPLE_INSERT_IN_PROGRESS; /* in insertion by other */ + } + else if (TransactionIdDidCommit(*xid)) + return HEAPTUPLE_LIVE; + else /* transaction is aborted */ + { + if (tuple->t_infomask & ZHEAP_INPLACE_UPDATED) + { + /* + * For aborted transactions, we need to fetch the tuple from undo + * chain. 
+ */ + *ztuple = GetTupleFromUndoForAbortedXact(urec_ptr, buffer, + trans_slot_id, zhtup, xid); + if (*ztuple != NULL) + return HEAPTUPLE_LIVE; + } + /* + * If the transaction that inserted the tuple got aborted, we should + * return the aborted transaction id. + */ + return HEAPTUPLE_DEAD; + } + + return HEAPTUPLE_LIVE; +} + +/* + * ZHeapTupleSatisfiesNonVacuumable + * + * True if tuple might be visible to some transaction; false if it's + * surely dead to everyone, ie, vacuumable. + * + * This is an interface to ZHeapTupleSatisfiesOldestXmin that meets the + * SnapshotSatisfiesFunc API, so it can be used through a Snapshot. + * snapshot->xmin must have been set up with the xmin horizon to use. + */ +ZHeapTuple +ZHeapTupleSatisfiesNonVacuumable(ZHeapTuple ztup, Snapshot snapshot, + Buffer buffer, ItemPointer ctid) +{ + TransactionId xid; + + return (ZHeapTupleSatisfiesOldestXmin(&ztup, snapshot->xmin, buffer, &xid, NULL) + != HEAPTUPLE_DEAD) ? ztup : NULL; +} + +/* + * ZHeapTupleSatisfiesVacuum + * Similar to ZHeapTupleSatisfiesOldestXmin, but it behaves differently for + * handling aborted transaction. + * + * For aborted transactions, we don't fetch any prior committed version of the + * tuple. Instead, we return ZHEAPTUPLE_ABORT_IN_PROGRESS and return the aborted + * xid. The caller should avoid such tuple for any kind of prunning/vacuuming. + */ +ZHTSV_Result +ZHeapTupleSatisfiesVacuum(ZHeapTuple zhtup, TransactionId OldestXmin, + Buffer buffer, TransactionId *xid) +{ + ZHeapTupleHeader tuple = zhtup->t_data; + UndoRecPtr urec_ptr; + uint64 epoch_xid; + int trans_slot_id; + + Assert(ItemPointerIsValid(&zhtup->t_self)); + Assert(zhtup->t_tableOid != InvalidOid); + + /* Get transaction id */ + ZHeapTupleGetTransInfo(zhtup, buffer, &trans_slot_id, &epoch_xid, xid, NULL, + &urec_ptr, false); + + if (tuple->t_infomask & ZHEAP_DELETED || + tuple->t_infomask & ZHEAP_UPDATED) + { + /* + * The tuple is deleted and must be all visible if the transaction slot + * is cleared or latest xid that has changed the tuple precedes + * smallest xid that has undo. + */ + if (trans_slot_id == ZHTUP_SLOT_FROZEN || + epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)) + return ZHEAPTUPLE_DEAD; + + if (TransactionIdIsCurrentTransactionId(*xid)) + return ZHEAPTUPLE_DELETE_IN_PROGRESS; + else if (TransactionIdIsInProgress(*xid)) + { + return ZHEAPTUPLE_DELETE_IN_PROGRESS; + } + else if (TransactionIdDidCommit(*xid)) + { + /* + * Deleter committed, but perhaps it was recent enough that some open + * transactions could still see the tuple. + */ + if (!TransactionIdPrecedes(*xid, OldestXmin)) + return ZHEAPTUPLE_RECENTLY_DEAD; + + /* Otherwise, it's dead and removable */ + return ZHEAPTUPLE_DEAD; + } + else /* transaction is aborted */ + { + return ZHEAPTUPLE_ABORT_IN_PROGRESS; + } + } + else if (tuple->t_infomask & ZHEAP_XID_LOCK_ONLY) + { + /* + * "Deleting" xact really only locked it, so the tuple is live in any + * case. + */ + return ZHEAPTUPLE_LIVE; + } + + /* The tuple is either a newly inserted tuple or is in-place updated. */ + + /* + * The tuple must be all visible if the transaction slot is cleared or + * latest xid that has changed the tuple precedes smallest xid that has + * undo. 
+	 */
+	if (trans_slot_id == ZHTUP_SLOT_FROZEN ||
+		epoch_xid < pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo))
+		return ZHEAPTUPLE_LIVE;
+
+	if (TransactionIdIsCurrentTransactionId(*xid))
+		return ZHEAPTUPLE_INSERT_IN_PROGRESS;
+	else if (TransactionIdIsInProgress(*xid))
+		return ZHEAPTUPLE_INSERT_IN_PROGRESS;	/* in insertion by other */
+	else if (TransactionIdDidCommit(*xid))
+		return ZHEAPTUPLE_LIVE;
+	else	/* transaction is aborted */
+	{
+		return ZHEAPTUPLE_ABORT_IN_PROGRESS;
+	}
+
+	return ZHEAPTUPLE_LIVE;
+}
+
+/*
+ * ZHeapTupleSatisfiesToast
+ *
+ * True iff zheap tuple is valid as a TOAST row.
+ *
+ * Unlike heap, we don't need checks for VACUUM moving conditions as those
+ * are for pre-9.0 and don't apply to zheap.  For aborted speculative
+ * inserts, we always mark the row as dead, so we don't need any check for
+ * that.  So, here we can rely on the fact that if you can see the main
+ * table row that contains a TOAST reference, you should be able to see the
+ * TOASTed value.
+ */
+ZHeapTuple
+ZHeapTupleSatisfiesToast(ZHeapTuple zhtup, Snapshot snapshot,
+						 Buffer buffer, ItemPointer ctid)
+{
+	Assert(ItemPointerIsValid(&zhtup->t_self));
+	Assert(zhtup->t_tableOid != InvalidOid);
+
+	return zhtup;
+}
+
+
+ZHeapTuple
+ZHeapTupleSatisfies(ZHeapTuple stup, Snapshot snapshot, Buffer buffer, ItemPointer ctid)
+{
+	switch (snapshot->visibility_type)
+	{
+		case MVCC_VISIBILITY:
+			return ZHeapTupleSatisfiesMVCC(stup, snapshot, buffer, ctid);
+			break;
+		case SELF_VISIBILITY:
+			return ZHeapTupleSatisfiesSelf(stup, snapshot, buffer, ctid);
+			break;
+		case ANY_VISIBILITY:
+			return ZHeapTupleSatisfiesAny(stup, snapshot, buffer, ctid);
+			break;
+		case TOAST_VISIBILITY:
+			return ZHeapTupleSatisfiesToast(stup, snapshot, buffer, ctid);
+			break;
+		case DIRTY_VISIBILITY:
+			return ZHeapTupleSatisfiesDirty(stup, snapshot, buffer, ctid);
+			break;
+		case HISTORIC_MVCC_VISIBILITY:
+			// ZBORKED: need a better error message
+			elog(PANIC, "unimplemented");
+			break;
+		case NON_VACUUMABLE_VISIBILTY:
+			return ZHeapTupleSatisfiesNonVacuumable(stup, snapshot, buffer, ctid);
+			break;
+		default:
+			Assert(0);
+			break;
+	}
+}
+
+/*
+ * This is a helper function for CheckForSerializableConflictOut.
+ *
+ * Check to see whether the tuple has been written to by a concurrent
+ * transaction, either to create it not visible to us, or to delete it
+ * while it is visible to us.  The "visible" bool indicates whether the
+ * tuple is visible to us, while ZHeapTupleSatisfiesOldestXmin checks what
+ * else is going on with it.  The caller should have a share lock on the
+ * buffer.
+ */
+bool
+ZHeapTupleHasSerializableConflictOut(bool visible, Relation relation,
+									 ItemPointer tid, Buffer buffer,
+									 TransactionId *xid)
+{
+	HTSV_Result htsvResult;
+	ItemId		lp;
+	OffsetNumber offnum;
+	Page		dp;
+	ZHeapTuple	tuple;
+	Size		tuple_len;
+	bool		tuple_inplace_updated = false;
+	Snapshot	snap;
+
+	Assert(ItemPointerGetBlockNumber(tid) == BufferGetBlockNumber(buffer));
+	offnum = ItemPointerGetOffsetNumber(tid);
+	dp = BufferGetPage(buffer);
+
+	/* check for bogus TID */
+	Assert (offnum >= FirstOffsetNumber &&
+			offnum <= PageGetMaxOffsetNumber(dp));
+
+	lp = PageGetItemId(dp, offnum);
+
+	/* check for unused or dead items */
+	Assert (ItemIdIsNormal(lp) || ItemIdIsDeleted(lp));
+
+	/*
+	 * If the record is deleted and pruned, its place in the page might have
+	 * been taken by another of its kind.
+	 */
+	if (ItemIdIsDeleted(lp))
+	{
+		/*
+		 * If the tuple is still visible to us, then we have a conflict,
+		 * because the transaction that deleted the tuple has already
+		 * committed.
+		 */
+		if (visible)
+		{
+			snap = GetTransactionSnapshot();
+			tuple = ZHeapGetVisibleTuple(offnum, snap, buffer, NULL);
+			ZHeapTupleGetTransInfo(tuple, buffer, NULL, NULL, xid,
+								   NULL, NULL, false);
+			pfree(tuple);
+			return true;
+		}
+		else
+			return false;
+	}
+
+	tuple_len = ItemIdGetLength(lp);
+	tuple = palloc(ZHEAPTUPLESIZE + tuple_len);
+	tuple->t_data = (ZHeapTupleHeader) ((char *) tuple + ZHEAPTUPLESIZE);
+	tuple->t_tableOid = RelationGetRelid(relation);
+	tuple->t_len = tuple_len;
+	ItemPointerSet(&tuple->t_self, ItemPointerGetBlockNumber(tid), offnum);
+	memcpy(tuple->t_data,
+		   ((ZHeapTupleHeader) PageGetItem((Page) dp, lp)), tuple_len);
+
+	if (tuple->t_data->t_infomask & ZHEAP_INPLACE_UPDATED)
+		tuple_inplace_updated = true;
+
+	htsvResult = ZHeapTupleSatisfiesOldestXmin(&tuple, TransactionXmin, buffer, xid, NULL);
+	pfree(tuple);
+	switch (htsvResult)
+	{
+		case HEAPTUPLE_LIVE:
+			if (tuple_inplace_updated)
+			{
+				/*
+				 * We can't rely on the caller's visibility information for
+				 * in-place updated tuples because it considers the tuple as
+				 * visible if any version of the tuple is visible, whereas we
+				 * want to know the status of the current tuple.  In case of
+				 * aborted transactions, it is quite possible that the
+				 * rollback actions aren't yet applied and we need the status
+				 * of the last committed transaction;
+				 * ZHeapTupleSatisfiesOldestXmin returns us that information.
+				 */
+				if (XidIsConcurrent(*xid))
+					visible = false;
+			}
+			if (visible)
+				return false;
+			break;
+		case HEAPTUPLE_RECENTLY_DEAD:
+			if (!visible)
+				return false;
+			break;
+		case HEAPTUPLE_DELETE_IN_PROGRESS:
+			break;
+		case HEAPTUPLE_INSERT_IN_PROGRESS:
+			break;
+		case HEAPTUPLE_DEAD:
+			return false;
+		default:
+
+			/*
+			 * The only way to get to this default clause is if a new value is
+			 * added to the enum type without adding it to this switch
+			 * statement.  That's a bug, so elog.
+			 */
+			elog(ERROR, "unrecognized return value from ZHeapTupleSatisfiesOldestXmin: %u", htsvResult);
+
+			/*
+			 * In spite of having all enum values covered and calling elog on
+			 * this default, some compilers think this is a code path which
+			 * allows xid to be used below without initialization.  Silence
+			 * that warning.
+			 */
+			*xid = InvalidTransactionId;
+	}
+	Assert(TransactionIdIsValid(*xid));
+	Assert(TransactionIdFollowsOrEquals(*xid, TransactionXmin));
+
+	/*
+	 * Find top level xid.  Bail out if xid is too early to be a conflict, or
+	 * if it's our own xid.
+	 */
+	if (TransactionIdEquals(*xid, GetTopTransactionIdIfAny()))
+		return false;
+	if (TransactionIdPrecedes(*xid, TransactionXmin))
+		return false;
+
+	return true;
+}
diff --git a/src/backend/access/zheap/ztuptoaster.c b/src/backend/access/zheap/ztuptoaster.c
new file mode 100644
index 0000000000..2ce75f387d
--- /dev/null
+++ b/src/backend/access/zheap/ztuptoaster.c
@@ -0,0 +1,990 @@
+/*-------------------------------------------------------------------------
+ *
+ * ztuptoaster.c
+ *	  Support routines for external and compressed storage of
+ *	  variable size attributes.
+ *
+ * This file shares most of its functionality with tuptoaster.c, except
+ * that the tuples are in zheap format and are stored in the zheap storage
+ * engine.  Even if we keep this as separate code, the common parts still
+ * need to be extracted.
+ *
+ * The benefit of storing toast data in zheap is that it avoids bloat in
+ * toast storage.
The tuple space can be immediately reclaimed once the + * deleting transaction is committed. + * + * Copyright (c) 2000-2018, PostgreSQL Global Development Group + * + * + * IDENTIFICATION + * src/backend/access/heap/ztuptoaster.c + * + * + * INTERFACE ROUTINES + * ztoast_insert_or_update - + * Try to make a given tuple fit into one page by compressing + * or moving off attributes + * + * ztoast_delete - + * Reclaim toast storage when a tuple is deleted + * + * + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include +#include + +#include "access/genam.h" +#include "access/heapam.h" +#include "access/tuptoaster.h" +#include "access/xact.h" +#include "catalog/catalog.h" +#include "common/pg_lzcompress.h" +#include "miscadmin.h" +#include "utils/expandeddatum.h" +#include "utils/fmgroids.h" +#include "utils/rel.h" +#include "utils/snapmgr.h" +#include "utils/typcache.h" +#include "utils/tqual.h" +#include "access/zheap.h" +#include "access/zheaputils.h" + +static void ztoast_delete_datum(Relation rel, Datum value, bool is_speculative); +static Datum ztoast_save_datum(Relation rel, Datum value, + struct varlena *oldexternal, int options); + +/* ---------- + * ztoast_insert_or_update - + * Just like toast_insert_or_update but for zheap relations. + */ + +ZHeapTuple +ztoast_insert_or_update(Relation rel, ZHeapTuple newtup, ZHeapTuple oldtup, + int options) +{ + ZHeapTuple result_tuple; + TupleDesc tupleDesc; + int numAttrs; + int i; + + bool need_change = false; + bool need_free = false; + bool need_delold = false; + bool has_nulls = false; + + Size maxDataLen; + Size hoff; + + char toast_action[MaxHeapAttributeNumber]; + bool toast_isnull[MaxHeapAttributeNumber]; + bool toast_oldisnull[MaxHeapAttributeNumber]; + Datum toast_values[MaxHeapAttributeNumber]; + Datum toast_oldvalues[MaxHeapAttributeNumber]; + struct varlena *toast_oldexternal[MaxHeapAttributeNumber]; + int32 toast_sizes[MaxHeapAttributeNumber]; + bool toast_free[MaxHeapAttributeNumber]; + bool toast_delold[MaxHeapAttributeNumber]; + + /* + * Ignore the INSERT_SPECULATIVE option. Speculative insertions/super + * deletions just normally insert/delete the toast values. It seems + * easiest to deal with that here, instead on, potentially, multiple + * callers. + */ + options &= ~HEAP_INSERT_SPECULATIVE; + + /* + * We should only ever be called for tuples of plain relations or + * materialized views --- recursing on a toast rel is bad news. + */ + Assert(rel->rd_rel->relkind == RELKIND_RELATION || + rel->rd_rel->relkind == RELKIND_MATVIEW); + + /* + * Get the tuple descriptor and break down the tuple(s) into fields. + */ + tupleDesc = rel->rd_att; + numAttrs = tupleDesc->natts; + + Assert(numAttrs <= MaxHeapAttributeNumber); + zheap_deform_tuple(newtup, tupleDesc, toast_values, toast_isnull); + if (oldtup != NULL) + zheap_deform_tuple(oldtup, tupleDesc, toast_oldvalues, toast_oldisnull); + + /* ---------- + * Then collect information about the values given + * + * NOTE: toast_action[i] can have these values: + * ' ' default handling + * 'p' already processed --- don't touch it + * 'x' incompressible, but OK to move off + * + * NOTE: toast_sizes[i] is only made valid for varlena attributes with + * toast_action[i] different from 'p'. 
+ * ---------- + */ + memset(toast_action, ' ', numAttrs * sizeof(char)); + memset(toast_oldexternal, 0, numAttrs * sizeof(struct varlena *)); + memset(toast_free, 0, numAttrs * sizeof(bool)); + memset(toast_delold, 0, numAttrs * sizeof(bool)); + + for (i = 0; i < numAttrs; i++) + { + Form_pg_attribute att = TupleDescAttr(tupleDesc, i); + struct varlena *old_value; + struct varlena *new_value; + + if (oldtup != NULL) + { + /* + * For UPDATE get the old and new values of this attribute + */ + old_value = (struct varlena *) DatumGetPointer(toast_oldvalues[i]); + new_value = (struct varlena *) DatumGetPointer(toast_values[i]); + + /* + * If the old value is stored on disk, check if it has changed so + * we have to delete it later. + */ + if (att->attlen == -1 && !toast_oldisnull[i] && + VARATT_IS_EXTERNAL_ONDISK(old_value)) + { + if (toast_isnull[i] || !VARATT_IS_EXTERNAL_ONDISK(new_value) || + memcmp((char *) old_value, (char *) new_value, + VARSIZE_EXTERNAL(old_value)) != 0) + { + /* + * The old external stored value isn't needed any more + * after the update + */ + toast_delold[i] = true; + need_delold = true; + } + else + { + /* + * This attribute isn't changed by this update so we reuse + * the original reference to the old value in the new + * tuple. + */ + toast_action[i] = 'p'; + continue; + } + } + } + else + { + /* + * For INSERT simply get the new value + */ + new_value = (struct varlena *) DatumGetPointer(toast_values[i]); + } + + /* + * Handle NULL attributes + */ + if (toast_isnull[i]) + { + toast_action[i] = 'p'; + has_nulls = true; + continue; + } + + /* + * Now look at varlena attributes + */ + if (att->attlen == -1) + { + /* + * If the table's attribute says PLAIN always, force it so. + */ + if (att->attstorage == 'p') + toast_action[i] = 'p'; + + /* + * We took care of UPDATE above, so any external value we find + * still in the tuple must be someone else's that we cannot reuse + * (this includes the case of an out-of-line in-memory datum). + * Fetch it back (without decompression, unless we are forcing + * PLAIN storage). If necessary, we'll push it out as a new + * external value below. + */ + if (VARATT_IS_EXTERNAL(new_value)) + { + toast_oldexternal[i] = new_value; + if (att->attstorage == 'p') + new_value = heap_tuple_untoast_attr(new_value); + else + new_value = heap_tuple_fetch_attr(new_value); + toast_values[i] = PointerGetDatum(new_value); + toast_free[i] = true; + need_change = true; + need_free = true; + } + + /* + * Remember the size of this attribute + */ + toast_sizes[i] = VARSIZE_ANY(new_value); + } + else + { + /* + * Not a varlena attribute, plain storage always + */ + toast_action[i] = 'p'; + } + } + + /* ---------- + * Compress and/or save external until data fits into target length + * + * 1: Inline compress attributes with attstorage 'x', and store very + * large attributes with attstorage 'x' or 'e' external immediately + * 2: Store attributes with attstorage 'x' or 'e' external + * 3: Inline compress attributes with attstorage 'm' + * 4: Store attributes with attstorage 'm' external + * ---------- + */ + + /* compute header overhead --- this should match heap_form_tuple() */ + hoff = SizeofZHeapTupleHeader; + if (has_nulls) + hoff += BITMAPLEN(numAttrs); + + /* now convert to a limit on the tuple data size */ + maxDataLen = RelationGetToastTupleTarget(rel, TOAST_TUPLE_TARGET) - hoff; + + /* + * Look for attributes with attstorage 'x' to compress. Also find large + * attributes with attstorage 'x' or 'e', and store them external. 
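+	 *
+	 * As a rough illustration (assuming the default 8kB block size, where the
+	 * toast tuple target works out to roughly 2kB): the passes below keep
+	 * shrinking the row until
+	 *
+	 *		zheap_compute_data_size(tupleDesc, toast_values, toast_isnull,
+	 *								hoff) <= maxDataLen
+	 *
+	 * so a single multi-megabyte text column is compressed first and, if the
+	 * row still does not fit, pushed out to the toast relation.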
+ */ + while (zheap_compute_data_size(tupleDesc, + toast_values, toast_isnull, hoff) > maxDataLen) + { + int biggest_attno = -1; + int32 biggest_size = MAXALIGN(TOAST_POINTER_SIZE); + Datum old_value; + Datum new_value; + + /* + * Search for the biggest yet unprocessed internal attribute + */ + for (i = 0; i < numAttrs; i++) + { + Form_pg_attribute att = TupleDescAttr(tupleDesc, i); + + if (toast_action[i] != ' ') + continue; + if (VARATT_IS_EXTERNAL(DatumGetPointer(toast_values[i]))) + continue; /* can't happen, toast_action would be 'p' */ + if (VARATT_IS_COMPRESSED(DatumGetPointer(toast_values[i]))) + continue; + if (att->attstorage != 'x' && att->attstorage != 'e') + continue; + if (toast_sizes[i] > biggest_size) + { + biggest_attno = i; + biggest_size = toast_sizes[i]; + } + } + + if (biggest_attno < 0) + break; + + /* + * Attempt to compress it inline, if it has attstorage 'x' + */ + i = biggest_attno; + if (TupleDescAttr(tupleDesc, i)->attstorage == 'x') + { + old_value = toast_values[i]; + new_value = toast_compress_datum(old_value); + + if (DatumGetPointer(new_value) != NULL) + { + /* successful compression */ + if (toast_free[i]) + pfree(DatumGetPointer(old_value)); + toast_values[i] = new_value; + toast_free[i] = true; + toast_sizes[i] = VARSIZE(DatumGetPointer(toast_values[i])); + need_change = true; + need_free = true; + } + else + { + /* incompressible, ignore on subsequent compression passes */ + toast_action[i] = 'x'; + } + } + else + { + /* has attstorage 'e', ignore on subsequent compression passes */ + toast_action[i] = 'x'; + } + + /* + * If this value is by itself more than maxDataLen (after compression + * if any), push it out to the toast table immediately, if possible. + * This avoids uselessly compressing other fields in the common case + * where we have one long field and several short ones. + * + * XXX maybe the threshold should be less than maxDataLen? + */ + if (toast_sizes[i] > maxDataLen && + rel->rd_rel->reltoastrelid != InvalidOid) + { + old_value = toast_values[i]; + toast_action[i] = 'p'; + toast_values[i] = ztoast_save_datum(rel, toast_values[i], + toast_oldexternal[i], options); + if (toast_free[i]) + pfree(DatumGetPointer(old_value)); + toast_free[i] = true; + need_change = true; + need_free = true; + } + } + + /* + * Second we look for attributes of attstorage 'x' or 'e' that are still + * inline. But skip this if there's no toast table to push them to. 
+ */ + while (zheap_compute_data_size(tupleDesc, + toast_values, toast_isnull, hoff) > maxDataLen && + rel->rd_rel->reltoastrelid != InvalidOid) + { + int biggest_attno = -1; + int32 biggest_size = MAXALIGN(TOAST_POINTER_SIZE); + Datum old_value; + + /*------ + * Search for the biggest yet inlined attribute with + * attstorage equals 'x' or 'e' + *------ + */ + for (i = 0; i < numAttrs; i++) + { + Form_pg_attribute att = TupleDescAttr(tupleDesc, i); + + if (toast_action[i] == 'p') + continue; + if (VARATT_IS_EXTERNAL(DatumGetPointer(toast_values[i]))) + continue; /* can't happen, toast_action would be 'p' */ + if (att->attstorage != 'x' && att->attstorage != 'e') + continue; + if (toast_sizes[i] > biggest_size) + { + biggest_attno = i; + biggest_size = toast_sizes[i]; + } + } + + if (biggest_attno < 0) + break; + + /* + * Store this external + */ + i = biggest_attno; + old_value = toast_values[i]; + toast_action[i] = 'p'; + toast_values[i] = ztoast_save_datum(rel, toast_values[i], + toast_oldexternal[i], options); + if (toast_free[i]) + pfree(DatumGetPointer(old_value)); + toast_free[i] = true; + + need_change = true; + need_free = true; + } + + /* + * Round 3 - this time we take attributes with storage 'm' into + * compression + */ + while (zheap_compute_data_size(tupleDesc, + toast_values, toast_isnull, hoff) > maxDataLen) + { + int biggest_attno = -1; + int32 biggest_size = MAXALIGN(TOAST_POINTER_SIZE); + Datum old_value; + Datum new_value; + + /* + * Search for the biggest yet uncompressed internal attribute + */ + for (i = 0; i < numAttrs; i++) + { + if (toast_action[i] != ' ') + continue; + if (VARATT_IS_EXTERNAL(DatumGetPointer(toast_values[i]))) + continue; /* can't happen, toast_action would be 'p' */ + if (VARATT_IS_COMPRESSED(DatumGetPointer(toast_values[i]))) + continue; + if (TupleDescAttr(tupleDesc, i)->attstorage != 'm') + continue; + if (toast_sizes[i] > biggest_size) + { + biggest_attno = i; + biggest_size = toast_sizes[i]; + } + } + + if (biggest_attno < 0) + break; + + /* + * Attempt to compress it inline + */ + i = biggest_attno; + old_value = toast_values[i]; + new_value = toast_compress_datum(old_value); + + if (DatumGetPointer(new_value) != NULL) + { + /* successful compression */ + if (toast_free[i]) + pfree(DatumGetPointer(old_value)); + toast_values[i] = new_value; + toast_free[i] = true; + toast_sizes[i] = VARSIZE(DatumGetPointer(toast_values[i])); + need_change = true; + need_free = true; + } + else + { + /* incompressible, ignore on subsequent compression passes */ + toast_action[i] = 'x'; + } + } + + /* + * Finally we store attributes of type 'm' externally. At this point we + * increase the target tuple size, so that 'm' attributes aren't stored + * externally unless really necessary. 
+ */ + maxDataLen = TOAST_TUPLE_TARGET_MAIN - hoff; + + while (zheap_compute_data_size(tupleDesc, + toast_values, toast_isnull, hoff) > maxDataLen && + rel->rd_rel->reltoastrelid != InvalidOid) + { + int biggest_attno = -1; + int32 biggest_size = MAXALIGN(TOAST_POINTER_SIZE); + Datum old_value; + + /*-------- + * Search for the biggest yet inlined attribute with + * attstorage = 'm' + *-------- + */ + for (i = 0; i < numAttrs; i++) + { + if (toast_action[i] == 'p') + continue; + if (VARATT_IS_EXTERNAL(DatumGetPointer(toast_values[i]))) + continue; /* can't happen, toast_action would be 'p' */ + if (TupleDescAttr(tupleDesc, i)->attstorage != 'm') + continue; + if (toast_sizes[i] > biggest_size) + { + biggest_attno = i; + biggest_size = toast_sizes[i]; + } + } + + if (biggest_attno < 0) + break; + + /* + * Store this external + */ + i = biggest_attno; + old_value = toast_values[i]; + toast_action[i] = 'p'; + toast_values[i] = ztoast_save_datum(rel, toast_values[i], + toast_oldexternal[i], options); + if (toast_free[i]) + pfree(DatumGetPointer(old_value)); + toast_free[i] = true; + + need_change = true; + need_free = true; + } + + /* + * In the case we toasted any values, we need to build a new heap tuple + * with the changed values. + */ + if (need_change) + { + ZHeapTupleHeader olddata = newtup->t_data; + ZHeapTupleHeader new_data; + int32 new_header_len; + int32 new_data_len; + int32 new_tuple_len; + + /* + * Calculate the new size of the tuple. + * + * Note: we used to assume here that the old tuple's t_hoff must equal + * the new_header_len value, but that was incorrect. The old tuple + * might have a smaller-than-current natts, if there's been an ALTER + * TABLE ADD COLUMN since it was stored; and that would lead to a + * different conclusion about the size of the null bitmap, or even + * whether there needs to be one at all. + */ + new_header_len = SizeofZHeapTupleHeader; + if (has_nulls) + new_header_len += BITMAPLEN(numAttrs); + new_data_len = zheap_compute_data_size(tupleDesc, + toast_values, toast_isnull, hoff); + new_tuple_len = new_header_len + new_data_len; + + /* + * Allocate and zero the space needed, and fill ZHeapTupleData fields. + */ + result_tuple = (ZHeapTuple) palloc0(ZHEAPTUPLESIZE + new_tuple_len); + result_tuple->t_len = new_tuple_len; + result_tuple->t_self = newtup->t_self; + result_tuple->t_tableOid = newtup->t_tableOid; + new_data = (ZHeapTupleHeader) ((char *) result_tuple + ZHEAPTUPLESIZE); + result_tuple->t_data = new_data; + + /* + * Copy the existing tuple header, but adjust natts and t_hoff. + */ + memcpy(new_data, olddata, SizeofZHeapTupleHeader); + ZHeapTupleHeaderSetNatts(new_data, numAttrs); + new_data->t_hoff = new_header_len; + + /* Copy over the data, and fill the null bitmap if needed */ + zheap_fill_tuple(tupleDesc, + toast_values, + toast_isnull, + (char *) new_data + new_header_len, + new_data_len, + &(new_data->t_infomask), + has_nulls ? new_data->t_bits : NULL); + } + else + result_tuple = newtup; + + /* + * Free allocated temp values + */ + if (need_free) + for (i = 0; i < numAttrs; i++) + if (toast_free[i]) + pfree(DatumGetPointer(toast_values[i])); + + /* + * Delete external values from the old tuple + */ + if (need_delold) + for (i = 0; i < numAttrs; i++) + if (toast_delold[i]) + ztoast_delete_datum(rel, toast_oldvalues[i], false); + + return result_tuple; +} + +/* + * ztoast_save_datum + * Just like toast_save_datum but for zheap relations. 
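+ *
+ * As in the heap version, the value is sliced into rows of the toast
+ * relation holding at most TOAST_MAX_CHUNK_SIZE payload bytes each
+ * (chunk_id, chunk_seq, chunk_data), so a sketch of the amount of work for
+ * a payload of data_todo bytes is
+ *
+ *		nchunks = (data_todo + TOAST_MAX_CHUNK_SIZE - 1) / TOAST_MAX_CHUNK_SIZE;
+ *
+ * i.e. one zheap_insert() into the toast relation plus one index_insert()
+ * per chunk, in the loop below.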
+ */ +static Datum +ztoast_save_datum(Relation rel, Datum value, + struct varlena *oldexternal, int options) +{ + Relation toastrel; + Relation *toastidxs; + ZHeapTuple toasttup; + TupleDesc toasttupDesc; + Datum t_values[3]; + bool t_isnull[3]; + CommandId mycid = GetCurrentCommandId(true); + struct varlena *result; + struct varatt_external toast_pointer; + union + { + struct varlena hdr; + /* this is to make the union big enough for a chunk: */ + char data[TOAST_MAX_CHUNK_SIZE + VARHDRSZ]; + /* ensure union is aligned well enough: */ + int32 align_it; + } chunk_data; + int32 chunk_size; + int32 chunk_seq = 0; + char *data_p; + int32 data_todo; + Pointer dval = DatumGetPointer(value); + int num_indexes; + int validIndex; + + Assert(!VARATT_IS_EXTERNAL(value)); + + /* + * Open the toast relation and its indexes. We can use the index to check + * uniqueness of the OID we assign to the toasted item, even though it has + * additional columns besides OID. + */ + toastrel = heap_open(rel->rd_rel->reltoastrelid, RowExclusiveLock); + toasttupDesc = toastrel->rd_att; + + /* The toast table of zheap table should also be of zheap type */ + Assert (RelationStorageIsZHeap(toastrel)); + + /* Open all the toast indexes and look for the valid one */ + validIndex = toast_open_indexes(toastrel, + RowExclusiveLock, + &toastidxs, + &num_indexes); + + /* + * Get the data pointer and length, and compute va_rawsize and va_extsize. + * + * va_rawsize is the size of the equivalent fully uncompressed datum, so + * we have to adjust for short headers. + * + * va_extsize is the actual size of the data payload in the toast records. + */ + if (VARATT_IS_SHORT(dval)) + { + data_p = VARDATA_SHORT(dval); + data_todo = VARSIZE_SHORT(dval) - VARHDRSZ_SHORT; + toast_pointer.va_rawsize = data_todo + VARHDRSZ; /* as if not short */ + toast_pointer.va_extsize = data_todo; + } + else if (VARATT_IS_COMPRESSED(dval)) + { + data_p = VARDATA(dval); + data_todo = VARSIZE(dval) - VARHDRSZ; + /* rawsize in a compressed datum is just the size of the payload */ + toast_pointer.va_rawsize = VARRAWSIZE_4B_C(dval) + VARHDRSZ; + toast_pointer.va_extsize = data_todo; + /* Assert that the numbers look like it's compressed */ + Assert(VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer)); + } + else + { + data_p = VARDATA(dval); + data_todo = VARSIZE(dval) - VARHDRSZ; + toast_pointer.va_rawsize = VARSIZE(dval); + toast_pointer.va_extsize = data_todo; + } + + /* + * Insert the correct table OID into the result TOAST pointer. + * + * Normally this is the actual OID of the target toast table, but during + * table-rewriting operations such as CLUSTER, we have to insert the OID + * of the table's real permanent toast table instead. rd_toastoid is set + * if we have to substitute such an OID. + */ + if (OidIsValid(rel->rd_toastoid)) + toast_pointer.va_toastrelid = rel->rd_toastoid; + else + toast_pointer.va_toastrelid = RelationGetRelid(toastrel); + + /* + * Choose an OID to use as the value ID for this toast value. + * + * Normally we just choose an unused OID within the toast table. But + * during table-rewriting operations where we are preserving an existing + * toast table OID, we want to preserve toast value OIDs too. So, if + * rd_toastoid is set and we had a prior external value from that same + * toast table, re-use its value ID. 
If we didn't have a prior external + * value (which is a corner case, but possible if the table's attstorage + * options have been changed), we have to pick a value ID that doesn't + * conflict with either new or existing toast value OIDs. + */ + if (!OidIsValid(rel->rd_toastoid)) + { + /* normal case: just choose an unused OID */ + toast_pointer.va_valueid = + GetNewOidWithIndex(toastrel, + RelationGetRelid(toastidxs[validIndex]), + (AttrNumber) 1); + } + else + { + /* rewrite case: check to see if value was in old toast table */ + toast_pointer.va_valueid = InvalidOid; + if (oldexternal != NULL) + { + struct varatt_external old_toast_pointer; + + Assert(VARATT_IS_EXTERNAL_ONDISK(oldexternal)); + /* Must copy to access aligned fields */ + VARATT_EXTERNAL_GET_POINTER(old_toast_pointer, oldexternal); + if (old_toast_pointer.va_toastrelid == rel->rd_toastoid) + { + /* This value came from the old toast table; reuse its OID */ + toast_pointer.va_valueid = old_toast_pointer.va_valueid; + + /* + * There is a corner case here: the table rewrite might have + * to copy both live and recently-dead versions of a row, and + * those versions could easily reference the same toast value. + * When we copy the second or later version of such a row, + * reusing the OID will mean we select an OID that's already + * in the new toast table. Check for that, and if so, just + * fall through without writing the data again. + * + * While annoying and ugly-looking, this is a good thing + * because it ensures that we wind up with only one copy of + * the toast value when there is only one copy in the old + * toast table. Before we detected this case, we'd have made + * multiple copies, wasting space; and what's worse, the + * copies belonging to already-deleted heap tuples would not + * be reclaimed by VACUUM. + */ + if (toastrel_valueid_exists(toastrel, + toast_pointer.va_valueid)) + { + /* Match, so short-circuit the data storage loop below */ + data_todo = 0; + } + } + } + if (toast_pointer.va_valueid == InvalidOid) + { + /* + * new value; must choose an OID that doesn't conflict in either + * old or new toast table + */ + do + { + toast_pointer.va_valueid = + GetNewOidWithIndex(toastrel, + RelationGetRelid(toastidxs[validIndex]), + (AttrNumber) 1); + } while (toastid_valueid_exists(rel->rd_toastoid, + toast_pointer.va_valueid)); + } + } + + /* + * Initialize constant parts of the tuple data + */ + t_values[0] = ObjectIdGetDatum(toast_pointer.va_valueid); + t_values[2] = PointerGetDatum(&chunk_data); + t_isnull[0] = false; + t_isnull[1] = false; + t_isnull[2] = false; + + /* + * Split up the item into chunks + */ + while (data_todo > 0) + { + int i; + + CHECK_FOR_INTERRUPTS(); + + /* + * Calculate the size of this chunk + */ + chunk_size = Min(TOAST_MAX_CHUNK_SIZE, data_todo); + + /* + * Build a tuple and store it + */ + t_values[1] = Int32GetDatum(chunk_seq++); + SET_VARSIZE(&chunk_data, chunk_size + VARHDRSZ); + memcpy(VARDATA(&chunk_data), data_p, chunk_size); + toasttup = zheap_form_tuple(toasttupDesc, t_values, t_isnull); + + zheap_insert(toastrel, toasttup, mycid, options, NULL); + + /* + * Create the index entry. We cheat a little here by not using + * FormIndexDatum: this relies on the knowledge that the index columns + * are the same as the initial columns of the table for all the + * indexes. We also cheat by not providing an IndexInfo: this is okay + * for now because btree doesn't need one, but we might have to be + * more honest someday. 
+ * + * Note also that there had better not be any user-created index on + * the TOAST table, since we don't bother to update anything else. + */ + for (i = 0; i < num_indexes; i++) + { + /* Only index relations marked as ready can be updated */ + if (IndexIsReady(toastidxs[i]->rd_index)) + index_insert(toastidxs[i], t_values, t_isnull, + &(toasttup->t_self), + toastrel, + toastidxs[i]->rd_index->indisunique ? + UNIQUE_CHECK_YES : UNIQUE_CHECK_NO, + NULL); + } + + /* + * Free memory + */ + zheap_freetuple(toasttup); + + /* + * Move on to next chunk + */ + data_todo -= chunk_size; + data_p += chunk_size; + } + + /* + * Done - close toast relation and its indexes + */ + toast_close_indexes(toastidxs, num_indexes, RowExclusiveLock); + heap_close(toastrel, RowExclusiveLock); + + /* + * Create the TOAST pointer value that we'll return + */ + result = (struct varlena *) palloc(TOAST_POINTER_SIZE); + SET_VARTAG_EXTERNAL(result, VARTAG_ONDISK); + memcpy(VARDATA_EXTERNAL(result), &toast_pointer, sizeof(toast_pointer)); + + return PointerGetDatum(result); +} + +/* + * ztoast_delete_datum + * Just like toast_delete_datum but for zheap relations. + */ +static void +ztoast_delete_datum(Relation rel, Datum value, bool is_speculative) +{ + struct varlena *attr = (struct varlena *) DatumGetPointer(value); + struct varatt_external toast_pointer; + Relation toastrel; + Relation *toastidxs; + ScanKeyData toastkey; + SysScanDesc toastscan; + HeapTuple toasttup; + int num_indexes; + int validIndex; + SnapshotData SnapshotToast; + + if (!VARATT_IS_EXTERNAL_ONDISK(attr)) + return; + + /* Must copy to access aligned fields */ + VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr); + + /* + * Open the toast relation and its indexes + */ + toastrel = heap_open(toast_pointer.va_toastrelid, RowExclusiveLock); + + /* The toast table of zheap table should also be of zheap type */ + Assert (RelationStorageIsZHeap(toastrel)); + + /* Fetch valid relation used for process */ + validIndex = toast_open_indexes(toastrel, + RowExclusiveLock, + &toastidxs, + &num_indexes); + + /* + * Setup a scan key to find chunks with matching va_valueid + */ + ScanKeyInit(&toastkey, + (AttrNumber) 1, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(toast_pointer.va_valueid)); + + /* + * Find all the chunks. (We don't actually care whether we see them in + * sequence or not, but since we've already locked the index we might as + * well use systable_beginscan_ordered.) 
+	 */
+	init_toast_snapshot(&SnapshotToast);
+	toastscan = systable_beginscan_ordered(toastrel, toastidxs[validIndex],
+										   &SnapshotToast, 1, &toastkey);
+	while ((toasttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
+	{
+		/*
+		 * Have a chunk, delete it
+		 */
+		if (!is_speculative)
+			simple_zheap_delete(toastrel, &toasttup->t_self, &SnapshotToast);
+		else
+		{
+			TupleDesc	tupdesc = toastrel->rd_att;
+			zheap_abort_speculative(toastrel, heap_to_zheap(toasttup, tupdesc));
+		}
+	}
+
+	/*
+	 * End scan and close relations
+	 */
+	systable_endscan_ordered(toastscan);
+	toast_close_indexes(toastidxs, num_indexes, RowExclusiveLock);
+	heap_close(toastrel, RowExclusiveLock);
+}
+
+/* ----------
+ * ztoast_delete -
+ *
+ *	Cascaded delete toast-entries on DELETE
+ * ----------
+ */
+void
+ztoast_delete(Relation rel, ZHeapTuple oldtup, bool is_speculative)
+{
+	TupleDesc	tupleDesc;
+	int			numAttrs;
+	int			i;
+	Datum		toast_values[MaxHeapAttributeNumber];
+	bool		toast_isnull[MaxHeapAttributeNumber];
+
+	/*
+	 * We should only ever be called for tuples of plain relations or
+	 * materialized views --- recursing on a toast rel is bad news.
+	 */
+	Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+		   rel->rd_rel->relkind == RELKIND_MATVIEW);
+
+	/*
+	 * Get the tuple descriptor and break down the tuple into fields.
+	 *
+	 * NOTE: it's debatable whether to use heap_deform_tuple() here or just
+	 * heap_getattr() only the varlena columns.  The latter could win if there
+	 * are few varlena columns and many non-varlena ones.  However,
+	 * heap_deform_tuple costs only O(N) while the heap_getattr way would cost
+	 * O(N^2) if there are many varlena columns, so it seems better to err on
+	 * the side of linear cost.  (We won't even be here unless there's at
+	 * least one varlena column, by the way.)
+	 */
+	tupleDesc = rel->rd_att;
+	numAttrs = tupleDesc->natts;
+
+	Assert(numAttrs <= MaxHeapAttributeNumber);
+	zheap_deform_tuple(oldtup, tupleDesc, toast_values, toast_isnull);
+
+	/*
+	 * Check for external stored attributes and delete them from the secondary
+	 * relation.
+	 */
+	for (i = 0; i < numAttrs; i++)
+	{
+		if (TupleDescAttr(tupleDesc, i)->attlen == -1)
+		{
+			Datum		value = toast_values[i];
+
+			if (toast_isnull[i])
+				continue;
+			else if (VARATT_IS_EXTERNAL_ONDISK(PointerGetDatum(value)))
+				ztoast_delete_datum(rel, value, is_speculative);
+		}
+	}
+}
diff --git a/src/backend/access/zheap/zvacuumlazy.c b/src/backend/access/zheap/zvacuumlazy.c
new file mode 100644
index 0000000000..7b60591b21
--- /dev/null
+++ b/src/backend/access/zheap/zvacuumlazy.c
@@ -0,0 +1,1462 @@
+/*-------------------------------------------------------------------------
+ *
+ * zvacuumlazy.c
+ *	  Concurrent ("lazy") vacuuming.
+ *
+ *
+ * The lazy vacuum in zheap uses two passes to clean up dead tuples in the
+ * heap and the indexes.  It reclaims all the dead items in the heap in the
+ * first pass, writing an undo record for each such item, and then cleans
+ * the indexes in the second pass.  The undo is written so that, if there is
+ * any error while cleaning the indexes, we can roll back the operation and
+ * mark the entries in the heap as dead.
+ *
+ * The other important aspect that is ensured in this system is that we
+ * don't allow item ids that are marked as unused to be reused until the
+ * transaction that has marked them unused is committed.
+ *
+ * The dead tuple tracking works in the same way as in heap.  See
+ * vacuumlazy.c.
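+ *
+ * In outline (an illustrative sketch, not the exact control flow of the code
+ * below):
+ *
+ *		for each zheap page containing dead tuples:
+ *			collect the dead item pointers;
+ *			write undo and mark those items unused
+ *				(see lazy_vacuum_zpage_with_undo);
+ *		for each index on the relation:
+ *			delete the index entries pointing to the reclaimed items;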
+ * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/commands/zvacuumlazy.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include + +#include "access/genam.h" +#include "access/tpd.h" +#include "access/visibilitymap.h" +#include "access/xact.h" +#include "access/zhtup.h" +#include "utils/ztqual.h" +#include "access/zheapam_xlog.h" +#include "access/zheaputils.h" +#include "commands/dbcommands.h" +#include "commands/vacuum.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "portability/instr_time.h" +#include "postmaster/autovacuum.h" +#include "storage/bufmgr.h" +#include "storage/freespace.h" +#include "storage/lmgr.h" +#include "storage/procarray.h" +#include "utils/lsyscache.h" +#include "utils/memutils.h" +#include "utils/pg_rusage.h" + +/* + * Before we consider skipping a page that's marked as clean in + * visibility map, we must've seen at least this many clean pages. + */ +#define SKIP_PAGES_THRESHOLD ((BlockNumber) 32) + +/* A few variables that don't seem worth passing around as parameters */ +static int elevel = -1; +static TransactionId OldestXmin; +static BufferAccessStrategy vac_strategy; + +/* + * Guesstimation of number of dead tuples per page. This is used to + * provide an upper limit to memory allocated when vacuuming small + * tables. + */ +#define LAZY_ALLOC_TUPLES MaxZHeapTuplesPerPage + +/* non-export function prototypes */ +static int +lazy_vacuum_zpage(Relation onerel, BlockNumber blkno, Buffer buffer, + int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer); +static int +lazy_vacuum_zpage_with_undo(Relation onerel, BlockNumber blkno, Buffer buffer, + int tupindex, LVRelStats *vacrelstats, + Buffer *vmbuffer, + TransactionId *global_visibility_cutoff_xid); +static void +lazy_space_zalloc(LVRelStats *vacrelstats, BlockNumber relblocks); +static void +lazy_scan_zheap(Relation onerel, int options, LVRelStats *vacrelstats, + Relation *Irel, int nindexes, + BufferAccessStrategy vac_strategy, bool aggressive); +static bool +zheap_page_is_all_visible(Relation rel, Buffer buf, + TransactionId *visibility_cutoff_xid); + +/* + * lazy_vacuum_zpage() -- free dead tuples on a page + * and repair its fragmentation. + * + * Caller must hold pin and buffer exclusive lock on the buffer. + * + * tupindex is the index in vacrelstats->dead_tuples of the first dead + * tuple for this page. We assume the rest follow sequentially. + * The return value is the first tupindex after the tuples of this page. + */ +static int +lazy_vacuum_zpage(Relation onerel, BlockNumber blkno, Buffer buffer, + int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer) +{ + Page page = BufferGetPage(buffer); + Page tmppage; + OffsetNumber unused[MaxOffsetNumber]; + int uncnt = 0; + TransactionId visibility_cutoff_xid; + bool pruned = false; + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + * See comments atop zheap_page_prune_guts. + */ + tmppage = PageGetTempPageCopy(page); + + /* + * Lock the TPD page before starting critical section. We might need + * to access it during page repair fragmentation. 
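+ *
+ * The TPD buffers locked here are released again by the call to
+ * UnlockReleaseTPDBuffers() once the critical section is over.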
+ */ + if (ZHeapPageHasTPDSlot((PageHeader) page)) + TPDPageLock(onerel, buffer); + + START_CRIT_SECTION(); + + for (; tupindex < vacrelstats->num_dead_tuples; tupindex++) + { + BlockNumber tblk; + OffsetNumber toff; + ItemId itemid; + + tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]); + if (tblk != blkno) + break; /* past end of tuples for this block */ + toff = ItemPointerGetOffsetNumber(&vacrelstats->dead_tuples[tupindex]); + itemid = PageGetItemId(page, toff); + ItemIdSetUnused(itemid); + unused[uncnt++] = toff; + } + + ZPageRepairFragmentation(buffer, tmppage, InvalidOffsetNumber, 0, false, + &pruned); + + /* + * Mark buffer dirty before we write WAL. + */ + MarkBufferDirty(buffer); + + /* XLOG stuff */ + if (RelationNeedsWAL(onerel)) + { + XLogRecPtr recptr; + + recptr = log_zheap_clean(onerel, buffer, InvalidOffsetNumber, 0, + NULL, 0, NULL, 0, + unused, uncnt, + vacrelstats->latestRemovedXid, pruned); + PageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + /* be tidy */ + pfree(tmppage); + UnlockReleaseTPDBuffers(); + + /* + * Now that we have removed the dead tuples from the page, once again + * check if the page has become all-visible. The page is already marked + * dirty, exclusively locked. + */ + if (zheap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid)) + { + uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer); + uint8 flags = 0; + + /* Set the VM all-visible bit to flag, if needed */ + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0) + flags |= VISIBILITYMAP_ALL_VISIBLE; + + Assert(BufferIsValid(*vmbuffer)); + if (flags != 0) + visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, + *vmbuffer, visibility_cutoff_xid, flags); + } + + return tupindex; +} + +/* + * lazy_vacuum_zpage_with_undo() -- free dead tuples on a page + * and repair its fragmentation. + * + * Caller must hold pin and buffer exclusive lock on the buffer. + */ +static int +lazy_vacuum_zpage_with_undo(Relation onerel, BlockNumber blkno, Buffer buffer, + int tupindex, LVRelStats *vacrelstats, + Buffer *vmbuffer, + TransactionId *global_visibility_cutoff_xid) +{ + TransactionId visibility_cutoff_xid; + TransactionId xid = GetTopTransactionId(); + uint32 epoch = GetEpochForXid(xid); + Page page = BufferGetPage(buffer); + Page tmppage; + UnpackedUndoRecord undorecord; + OffsetNumber unused[MaxOffsetNumber]; + UndoRecPtr urecptr, prev_urecptr; + int i, uncnt = 0; + int trans_slot_id; + xl_undolog_meta undometa; + XLogRecPtr RedoRecPtr; + bool doPageWrites; + bool lock_reacquired; + bool pruned = false; + + for (; tupindex < vacrelstats->num_dead_tuples; tupindex++) + { + BlockNumber tblk PG_USED_FOR_ASSERTS_ONLY; + OffsetNumber toff; + + tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]); + + /* + * We should never pass the end of tuples for this block as we clean + * the tuples in the current block before moving to next block. + */ + Assert(tblk == blkno); + + toff = ItemPointerGetOffsetNumber(&vacrelstats->dead_tuples[tupindex]); + unused[uncnt++] = toff; + } + + if (uncnt <= 0) + return tupindex; + +reacquire_slot: + /* + * The transaction information of tuple needs to be set in transaction + * slot, so needs to reserve the slot before proceeding with the actual + * operation. It will be costly to wait for getting the slot, but we do + * that by releasing the buffer lock. 
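+ *
+ * If the buffer lock had to be released and reacquired in the meantime, we
+ * retry immediately; if no slot is free, we sleep briefly (10 ms) and then
+ * retry from reacquire_slot.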
+ */ + trans_slot_id = PageReserveTransactionSlot(onerel, + buffer, + PageGetMaxOffsetNumber(page), + epoch, + xid, + &prev_urecptr, + &lock_reacquired); + if (lock_reacquired) + goto reacquire_slot; + + if (trans_slot_id == InvalidXactSlotId) + { + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + + pgstat_report_wait_start(PG_WAIT_PAGE_TRANS_SLOT); + pg_usleep(10000L); /* 10 ms */ + pgstat_report_wait_end(); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + goto reacquire_slot; + } + + /* prepare an undo record */ + undorecord.uur_type = UNDO_ITEMID_UNUSED; + undorecord.uur_info = 0; + undorecord.uur_prevlen = 0; + undorecord.uur_reloid = onerel->rd_id; + undorecord.uur_prevxid = xid; + undorecord.uur_xid = xid; + undorecord.uur_cid = InvalidCommandId; + undorecord.uur_fork = MAIN_FORKNUM; + undorecord.uur_blkprev = prev_urecptr; + undorecord.uur_block = blkno; + undorecord.uur_offset = 0; + undorecord.uur_tuple.len = 0; + undorecord.uur_payload.len = uncnt * sizeof(OffsetNumber); + undorecord.uur_payload.data = (char *) palloc(uncnt * sizeof(OffsetNumber)); + + /* + * XXX Unlike other undo records, we don't set the TPD slot number in undo + * record as this record is just skipped during processing of undo. + */ + + urecptr = PrepareUndoInsert(&undorecord, + InvalidTransactionId, + UndoPersistenceForRelation(onerel), + &undometa); + + /* + * We prepare the temporary copy of the page so that during page + * repair fragmentation we can use it to copy the actual tuples. + * See comments atop zheap_page_prune_guts. + */ + tmppage = PageGetTempPageCopy(page); + + /* + * Lock the TPD page before starting critical section. We might need + * to access it during page repair fragmentation. Note that if the + * transaction slot belongs to TPD entry, then the TPD page must be + * locked during slot reservation. + */ + if (trans_slot_id <= ZHEAP_PAGE_TRANS_SLOTS && + ZHeapPageHasTPDSlot((PageHeader) page)) + TPDPageLock(onerel, buffer); + + START_CRIT_SECTION(); + + memcpy(undorecord.uur_payload.data, unused, uncnt * sizeof(OffsetNumber)); + InsertPreparedUndo(); + /* + * We're sending the undo record for debugging purpose. So, just send + * the last one. + */ + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + { + PageSetUNDO(undorecord, + buffer, + trans_slot_id, + true, + epoch, + xid, + urecptr, + unused, + uncnt); + } + else + { + PageSetUNDO(undorecord, + buffer, + trans_slot_id, + true, + epoch, + xid, + urecptr, + NULL, + 0); + } + + for (i = 0; i < uncnt; i++) + { + ItemId itemid; + + itemid = PageGetItemId(page, unused[i]); + ItemIdSetUnusedExtended(itemid, trans_slot_id); + } + + ZPageRepairFragmentation(buffer, tmppage, InvalidOffsetNumber, 0, false, + &pruned); + + /* + * Mark buffer dirty before we write WAL. + */ + MarkBufferDirty(buffer); + + /* XLOG stuff */ + if (RelationNeedsWAL(onerel)) + { + xl_zheap_unused xl_rec; + xl_undo_header xlundohdr; + XLogRecPtr recptr; + + /* + * Store the information required to generate undo record during + * replay. + */ + xlundohdr.reloid = undorecord.uur_reloid; + xlundohdr.urec_ptr = urecptr; + xlundohdr.blkprev = prev_urecptr; + + xl_rec.latestRemovedXid = vacrelstats->latestRemovedXid; + xl_rec.nunused = uncnt; + xl_rec.trans_slot_id = trans_slot_id; + xl_rec.flags = 0; + if (pruned) + xl_rec.flags |= XLZ_UNUSED_ALLOW_PRUNING; + +prepare_xlog: + /* + * WAL-LOG undolog meta data if this is the fisrt WAL after the + * checkpoint. 
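+ *
+ * If XLogInsertExtended() below returns InvalidXLogRecPtr (for instance
+ * because the redo pointer it was computed against has become stale), we
+ * loop back to prepare_xlog and log the undo metadata again.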
+ */ + LogUndoMetaData(&undometa); + + GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites); + + XLogBeginInsert(); + XLogRegisterData((char *) &xlundohdr, SizeOfUndoHeader); + XLogRegisterData((char *) &xl_rec, SizeOfZHeapUnused); + + XLogRegisterData((char *) unused, uncnt * sizeof(OffsetNumber)); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + (void) RegisterTPDBuffer(page, 1); + + recptr = XLogInsertExtended(RM_ZHEAP2_ID, XLOG_ZHEAP_UNUSED, RedoRecPtr, + doPageWrites); + if (recptr == InvalidXLogRecPtr) + goto prepare_xlog; + + PageSetLSN(page, recptr); + if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS) + TPDPageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + + UnlockReleaseUndoBuffers(); + UnlockReleaseTPDBuffers(); + + /* be tidy */ + pfree(tmppage); + + /* + * Now that we have removed the dead tuples from the page, once again + * check if the page has become potentially all-visible. The page is + * already marked dirty, exclusively locked. We can't mark the page + * as all-visible here because we have yet to remove index entries + * corresponding dead tuples. So, we mark them potentially all-visible + * and later after removing index entries, if still the bit is set, we + * mark them as all-visible. + */ + if (zheap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid)) + { + uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer); + uint8 flags = 0; + + /* Set the VM to become potentially all-visible, if needed */ + if ((vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE) == 0) + flags |= VISIBILITYMAP_POTENTIAL_ALL_VISIBLE; + + if (TransactionIdFollows(visibility_cutoff_xid, + *global_visibility_cutoff_xid)) + *global_visibility_cutoff_xid = visibility_cutoff_xid; + + Assert(BufferIsValid(*vmbuffer)); + if (flags != 0) + visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, + *vmbuffer, InvalidTransactionId, flags); + } + + return tupindex; +} + +/* + * MarkPagesAsAllVisible() -- Mark all the pages corresponding to dead tuples + * as all-visible. + * + * We mark the page as all-visible, if it is already marked as potential + * all-visible. + */ +static void +MarkPagesAsAllVisible(Relation rel, LVRelStats *vacrelstats, + TransactionId visibility_cutoff_xid) +{ + int idx = 0; + + for (; idx < vacrelstats->num_dead_tuples; idx++) + { + BlockNumber tblk; + BlockNumber prev_tblk = InvalidBlockNumber; + Buffer vmbuffer = InvalidBuffer; + Buffer buf = InvalidBuffer; + uint8 vm_status; + + tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[idx]); + buf = ReadBufferExtended(rel, MAIN_FORKNUM, tblk, + RBM_NORMAL, NULL); + + /* Avoid processing same block again and again. 
*/ + if (tblk == prev_tblk) + continue; + + visibilitymap_pin(rel, tblk, &vmbuffer); + vm_status = visibilitymap_get_status(rel, tblk, &vmbuffer); + + /* Set the VM all-visible bit, if needed */ + if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0 && + (vm_status & VISIBILITYMAP_POTENTIAL_ALL_VISIBLE)) + { + visibilitymap_clear(rel, tblk, vmbuffer, + VISIBILITYMAP_VALID_BITS); + + Assert(BufferIsValid(buf)); + LockBuffer(buf, BUFFER_LOCK_SHARE); + + visibilitymap_set(rel, tblk, buf, InvalidXLogRecPtr, vmbuffer, + visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE); + + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + } + + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + + if (BufferIsValid(buf)) + { + ReleaseBuffer(buf); + buf = InvalidBuffer; + } + + prev_tblk = tblk; + } +} + +/* + * lazy_scan_zheap() -- scan an open heap relation + * + * This routine prunes each page in the zheap, which will among other + * things truncate dead tuples to dead line pointers, truncate recently + * dead tuples to deleted line pointers and defragment the page + * (see zheap_page_prune). It also builds lists of dead tuples and pages + * with free space, calculates statistics on the number of live tuples in + * the zheap. It then reclaim all dead line pointers and write undo for + * each of them, so that if there is any error later, we can rollback the + * operation. When done, or when we run low on space for dead-tuple + * TIDs, invoke vacuuming of indexes. + * + * We also need to ensure that the heap-TIDs won't get reused till the + * transaction that has performed this vacuum is committed. To achieve + * that, we need to store transaction slot information in the line + * pointers that are marked unused in the first-pass of heap. + * + * If there are no indexes then we can reclaim line pointers without + * writting any undo; + */ +static void +lazy_scan_zheap(Relation onerel, int options, LVRelStats *vacrelstats, + Relation *Irel, int nindexes, + BufferAccessStrategy vac_strategy, bool aggressive) +{ + BlockNumber nblocks, + blkno; + ZHeapTupleData tuple; + char *relname; + BlockNumber empty_pages, + vacuumed_pages, + next_fsm_block_to_vacuum; + double num_tuples, + tups_vacuumed, + nkeep, + nunused; + IndexBulkDeleteResult **indstats; + StringInfoData infobuf; + int i; + int tupindex = 0; + PGRUsage ru0; + BlockNumber next_unskippable_block; + bool skipping_blocks; + Buffer vmbuffer = InvalidBuffer; + TransactionId visibility_cutoff_xid = InvalidTransactionId; + + pg_rusage_init(&ru0); + + relname = RelationGetRelationName(onerel); + if (aggressive) + ereport(elevel, + (errmsg("aggressively vacuuming \"%s.%s\"", + get_namespace_name(RelationGetNamespace(onerel)), + relname))); + else + ereport(elevel, + (errmsg("vacuuming \"%s.%s\"", + get_namespace_name(RelationGetNamespace(onerel)), + relname))); + + empty_pages = vacuumed_pages = 0; + next_fsm_block_to_vacuum = (BlockNumber) 0; + num_tuples = tups_vacuumed = nkeep = nunused = 0; + + indstats = (IndexBulkDeleteResult **) + palloc0(nindexes * sizeof(IndexBulkDeleteResult *)); + + nblocks = RelationGetNumberOfBlocks(onerel); + vacrelstats->rel_pages = nblocks; + vacrelstats->scanned_pages = 0; + vacrelstats->tupcount_pages = 0; + vacrelstats->nonempty_pages = 0; + vacrelstats->latestRemovedXid = InvalidTransactionId; + + lazy_space_zalloc(vacrelstats, nblocks); + next_unskippable_block = ZHEAP_METAPAGE + 1; + if (!aggressive) + { + + Assert((options & VACOPT_DISABLE_PAGE_SKIPPING) == 0); + while (next_unskippable_block < 
nblocks) + { + uint8 vmstatus; + + vmstatus = visibilitymap_get_status(onerel, next_unskippable_block, + &vmbuffer); + + if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0) + break; + + vacuum_delay_point(); + next_unskippable_block++; + } + } + + if (next_unskippable_block >= SKIP_PAGES_THRESHOLD) + skipping_blocks = true; + else + skipping_blocks = false; + + for (blkno = ZHEAP_METAPAGE + 1; blkno < nblocks; blkno++) + { + Buffer buf; + Page page; + TransactionId xid; + OffsetNumber offnum, + maxoff; + Size freespace; + bool tupgone, + hastup; + bool all_visible_according_to_vm = false; + bool all_visible; + bool has_dead_tuples; + + if (blkno == next_unskippable_block) + { + /* Time to advance next_unskippable_block */ + next_unskippable_block++; + if (!aggressive) + { + while (next_unskippable_block < nblocks) + { + uint8 vmskipflags; + + vmskipflags = visibilitymap_get_status(onerel, + next_unskippable_block, + &vmbuffer); + if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0) + break; + + vacuum_delay_point(); + next_unskippable_block++; + } + } + + /* + * We know we can't skip the current block. But set up + * skipping_blocks to do the right thing at the following blocks. + */ + if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD) + skipping_blocks = true; + else + skipping_blocks = false; + } + else + { + /* + * The current block is potentially skippable; if we've seen a + * long enough run of skippable blocks to justify skipping it. + */ + if (skipping_blocks) + continue; + all_visible_according_to_vm = true; + } + + vacuum_delay_point(); + + /* + * If we are close to overrunning the available space for dead-tuple + * TIDs, pause and do a cycle of vacuuming before we tackle this page. + */ + if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxZHeapTuplesPerPage && + vacrelstats->num_dead_tuples > 0) + { + /* + * Before beginning index vacuuming, we release any pin we may + * hold on the visibility map page. This isn't necessary for + * correctness, but we do it anyway to avoid holding the pin + * across a lengthy, unrelated operation. + */ + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + + /* + * Remove index entries. Unlike, heap we don't need to log special + * cleanup info which includes latest latestRemovedXid for standby. + * This is because we have covered all the dead tuples in the first + * pass itself and we don't need another pass on heap after index. + */ + for (i = 0; i < nindexes; i++) + lazy_vacuum_index(Irel[i], + &indstats[i], + vacrelstats, + vac_strategy); + /* + * XXX - The cutoff xid used here is the highest xmin of all the heap + * pages scanned. This can lead to more query cancellations on + * standby. However, alternative is that we track cutoff_xid for + * each page in first-pass of vacuum and then use it after removing + * index entries. We didn't pursue the alternative because it would + * require more work memory which means it can lead to more index + * passes. + */ + MarkPagesAsAllVisible(onerel, vacrelstats, visibility_cutoff_xid); + + /* + * Forget the now-vacuumed tuples, and press on, but be careful + * not to reset latestRemovedXid since we want that value to be + * valid. + */ + tupindex = 0; + vacrelstats->num_dead_tuples = 0; + vacrelstats->num_index_scans++; + + /* + * Vacuum the Free Space Map to make newly-freed space visible on + * upper-level FSM pages. Note we have not yet processed blkno. 
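+ *
+ * The range vacuumed below is [next_fsm_block_to_vacuum, blkno); we then
+ * advance next_fsm_block_to_vacuum to blkno so that the same range is not
+ * scanned again on the next cycle.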
+ */ + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno); + next_fsm_block_to_vacuum = blkno; + } + + /* + * Pin the visibility map page in case we need to mark the page + * all-visible. In most cases this will be very cheap, because we'll + * already have the correct page pinned anyway. However, it's + * possible that (a) next_unskippable_block is covered by a different + * VM page than the current block or (b) we released our pin and did a + * cycle of index vacuuming. + * + */ + visibilitymap_pin(onerel, blkno, &vmbuffer); + + buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno, + RBM_NORMAL, vac_strategy); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + vacrelstats->scanned_pages++; + vacrelstats->tupcount_pages++; + + page = BufferGetPage(buf); + + if (PageIsNew(page)) + { + /* + * An all-zeroes page could be left over if a backend extends the + * relation but crashes before initializing the page. Reclaim such + * pages for use. See the similar code in lazy_scan_heap to know + * why we have used relation extension lock. + */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + LockRelationForExtension(onerel, ExclusiveLock); + UnlockRelationForExtension(onerel, ExclusiveLock); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + if (PageIsNew(page)) + { + ereport(WARNING, + (errmsg("relation \"%s\" page %u is uninitialized --- fixing", + relname, blkno))); + Assert(BufferGetBlockNumber(buf) != ZHEAP_METAPAGE); + ZheapInitPage(page, BufferGetPageSize(buf)); + empty_pages++; + } + freespace = PageGetZHeapFreeSpace(page); + MarkBufferDirty(buf); + UnlockReleaseBuffer(buf); + + RecordPageWithFreeSpace(onerel, blkno, freespace); + continue; + } + + /* + * Skip TPD pages. This needs to be checked before PageIsEmpty as TPD + * pages can also be empty, but we don't want to deal with it like a + * heap page. + */ + /* + * Prune the TPD pages and if all the entries are removed, then record + * it in FSM, so that it can be reused as a zheap page. + */ + if (PageGetSpecialSize(page) == sizeof(TPDPageOpaqueData)) + { + /* If the page is already pruned, skip it. */ + if (!PageIsEmpty(page)) + TPDPagePrune(onerel, buf, vac_strategy, InvalidOffsetNumber, 0, + true, NULL, NULL); + UnlockReleaseBuffer(buf); + continue; + } + + if (PageIsEmpty(page)) + { + uint8 vmstatus; + empty_pages++; + freespace = PageGetZHeapFreeSpace(page); + + vmstatus = visibilitymap_get_status(onerel, + blkno, + &vmbuffer); + if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0) + { + START_CRIT_SECTION(); + + /* mark buffer dirty before writing a WAL record */ + MarkBufferDirty(buf); + + /* + * It's possible that another backend has extended the heap, + * initialized the page, and then failed to WAL-log the page + * due to an ERROR. Since heap extension is not WAL-logged, + * recovery might try to replay our record setting the page + * all-visible and find that the page isn't initialized, which + * will cause a PANIC. To prevent that, check whether the + * page has been previously WAL-logged, and if not, do that + * now. + */ + if (RelationNeedsWAL(onerel) && + PageGetLSN(page) == InvalidXLogRecPtr) + log_newpage_buffer(buf, true); + + visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, + vmbuffer, InvalidTransactionId, + VISIBILITYMAP_ALL_VISIBLE); + + END_CRIT_SECTION(); + } + + UnlockReleaseBuffer(buf); + RecordPageWithFreeSpace(onerel, blkno, freespace); + continue; + } + + /* + * We count tuples removed by the pruning step as removed by VACUUM. 
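+ *
+ * zheap_page_prune_guts() returns the number of tuples it removed, and
+ * that count is added straight into tups_vacuumed below.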
+ */ + tups_vacuumed += zheap_page_prune_guts(onerel, buf, OldestXmin, + InvalidOffsetNumber, 0, false, + false, + &vacrelstats->latestRemovedXid, + NULL); + + /* Now scan the page to collect vacuumable items. */ + hastup = false; + freespace = 0; + maxoff = PageGetMaxOffsetNumber(page); + all_visible = true; + has_dead_tuples = false; + + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemid; + + itemid = PageGetItemId(page, offnum); + + /* Unused items require no processing, but we count 'em */ + if (!ItemIdIsUsed(itemid)) + { + nunused += 1; + continue; + } + + /* Deleted items mustn't be touched */ + if (ItemIdIsDeleted(itemid)) + { + hastup = true; /* this page won't be truncatable */ + all_visible = false; + continue; + } + + ItemPointerSet(&(tuple.t_self), blkno, offnum); + + /* + * DEAD item pointers are to be vacuumed normally; but we don't + * count them in tups_vacuumed, else we'd be double-counting (at + * least in the common case where zheap_page_prune_guts() just + * freed up a tuple). + */ + if (ItemIdIsDead(itemid)) + { + all_visible = false; + lazy_record_dead_tuple(vacrelstats, &(tuple.t_self)); + continue; + } + + Assert(ItemIdIsNormal(itemid)); + + tuple.t_data = (ZHeapTupleHeader) PageGetItem(page, itemid); + tuple.t_len = ItemIdGetLength(itemid); + tuple.t_tableOid = RelationGetRelid(onerel); + + tupgone = false; + + switch (ZHeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf, &xid)) + { + case ZHEAPTUPLE_DEAD: + + /* + * Ordinarily, DEAD tuples would have been removed by + * zheap_page_prune_guts(), but it's possible that the + * tuple state changed since heap_page_prune() looked. + * In particular an INSERT_IN_PROGRESS tuple could have + * changed to DEAD if the inserter aborted. So this + * cannot be considered an error condition. + */ + tupgone = true; /* we can delete the tuple */ + all_visible = false; + break; + case ZHEAPTUPLE_LIVE: + if (all_visible) + { + if (!TransactionIdPrecedes(xid, OldestXmin)) + { + all_visible = false; + break; + } + } + + /* Track newest xmin on page. */ + if (TransactionIdFollows(xid, visibility_cutoff_xid)) + visibility_cutoff_xid = xid; + break; + case ZHEAPTUPLE_RECENTLY_DEAD: + + /* + * If tuple is recently deleted then we must not remove it + * from relation. + */ + nkeep += 1; + all_visible = false; + break; + case ZHEAPTUPLE_INSERT_IN_PROGRESS: + case ZHEAPTUPLE_DELETE_IN_PROGRESS: + /* This is an expected case during concurrent vacuum */ + all_visible = false; + break; + case ZHEAPTUPLE_ABORT_IN_PROGRESS: + /* + * We can simply skip the tuple if it has inserted/operated by + * some aborted transaction and its rollback is still pending. It'll + * be taken care of by future vacuum calls. + */ + all_visible = false; + break; + default: + elog(ERROR, "unexpected ZHeapTupleSatisfiesVacuum result"); + break; + } + + if (tupgone) + { + lazy_record_dead_tuple(vacrelstats, &(tuple.t_self)); + ZHeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data, xid, + &vacrelstats->latestRemovedXid); + tups_vacuumed += 1; + has_dead_tuples = true; + } + else + { + num_tuples += 1; + hastup = true; + } + } /* scan along page */ + + /* + * If there are no indexes then we can vacuum the page right now + * instead of doing a second scan. 
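+ *
+ * With indexes present we instead go through lazy_vacuum_zpage_with_undo(),
+ * which also writes undo so that the reclaimed item ids cannot be reused
+ * before this transaction commits (see the comments at the top of this
+ * file).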
+ */ + if (vacrelstats->num_dead_tuples > 0) + { + if (nindexes == 0) + { + /* Remove tuples from zheap */ + tupindex = lazy_vacuum_zpage(onerel, blkno, buf, tupindex, + vacrelstats, &vmbuffer); + has_dead_tuples = false; + + /* + * Forget the now-vacuumed tuples, and press on, but be careful + * not to reset latestRemovedXid since we want that value to be + * valid. + */ + vacrelstats->num_dead_tuples = 0; + vacuumed_pages++; + /* + * Periodically do incremental FSM vacuuming to make newly-freed + * space visible on upper FSM pages. Note: although we've cleaned + * the current block, we haven't yet updated its FSM entry (that + * happens further down), so passing end == blkno is correct. + */ + if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES) + { + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, + blkno); + next_fsm_block_to_vacuum = blkno; + } + } + else + { + Assert(nindexes > 0); + + /* Remove tuples from zheap and write the undo for it. */ + tupindex = lazy_vacuum_zpage_with_undo(onerel, blkno, buf, + tupindex, vacrelstats, + &vmbuffer, + &visibility_cutoff_xid); + } + } + + /* Now that we are done with the page, get its available space */ + freespace = PageGetZHeapFreeSpace(page); + + /* mark page all-visible, if appropriate */ + if (all_visible && !all_visible_according_to_vm) + { + uint8 flags = VISIBILITYMAP_ALL_VISIBLE; + + visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, + vmbuffer, visibility_cutoff_xid, flags); + } + else if (has_dead_tuples && all_visible_according_to_vm) + { + visibilitymap_clear(onerel, blkno, vmbuffer, + VISIBILITYMAP_VALID_BITS); + } + + UnlockReleaseBuffer(buf); + + /* Remember the location of the last page with nonremovable tuples */ + if (hastup) + vacrelstats->nonempty_pages = blkno + 1; + + /* We're done with this page, so remember its free space as-is. */ + if (freespace) + RecordPageWithFreeSpace(onerel, blkno, freespace); + } + + /* save stats for use later */ + vacrelstats->tuples_deleted = tups_vacuumed; + vacrelstats->new_dead_tuples = nkeep; + + /* + * Now we can compute the new value for pg_class.reltuples. To compensate + * for metapage pass one less than the actual nblocks. + */ + vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, + nblocks - 1, + vacrelstats->tupcount_pages, + num_tuples); + + /* + * Release any remaining pin on visibility map page. + */ + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + + if (vacrelstats->num_dead_tuples > 0) + { + /* + * Remove index entries. Unlike, heap we don't need to log special + * cleanup info which includes latest latestRemovedXid for standby. + * This is because we have covered all the dead tuples in the first + * pass itself and we don't need another pass on heap after index. + */ + for (i = 0; i < nindexes; i++) + lazy_vacuum_index(Irel[i], + &indstats[i], + vacrelstats, + vac_strategy); + + /* + * XXX - The cutoff xid used here is the highest xmin of all the heap + * pages scanned. This can lead to more query cancellations on + * standby. However, alternative is that we track cutoff_xid for + * each page in first-pass of vacuum and then use it after removing + * index entries. We didn't pursue the alternative because it would + * require more work memory which means it can lead to more index + * passes. 
+ */ + MarkPagesAsAllVisible(onerel, vacrelstats, visibility_cutoff_xid); + + vacrelstats->num_index_scans++; + + /* + * Vacuum the Free Space Map to make newly-freed space visible on + * upper-level FSM pages. + */ + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno); + next_fsm_block_to_vacuum = blkno; + } + + /* + * Vacuum the remainder of the Free Space Map. We must do this whether or + * not there were indexes. + */ + if (blkno > next_fsm_block_to_vacuum) + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno); + + /* Do post-vacuum cleanup and statistics update for each index */ + for (i = 0; i < nindexes; i++) + lazy_cleanup_index(Irel[i], indstats[i], vacrelstats, vac_strategy); + + /* + * This is pretty messy, but we split it up so that we can skip emitting + * individual parts of the message when not applicable. + */ + initStringInfo(&infobuf); + appendStringInfo(&infobuf, + _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"), + nkeep, OldestXmin); + appendStringInfo(&infobuf, _("There were %.0f unused item pointers.\n"), + nunused); + appendStringInfo(&infobuf, ngettext("%u page is entirely empty.\n", + "%u pages are entirely empty.\n", + empty_pages), + empty_pages); + appendStringInfo(&infobuf, _("%s."), pg_rusage_show(&ru0)); + + ereport(elevel, + (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages", + RelationGetRelationName(onerel), + tups_vacuumed, num_tuples, + vacrelstats->scanned_pages, nblocks), + errdetail_internal("%s", infobuf.data))); + pfree(infobuf.data); +} + +/* + * lazy_vacuum_zheap_rel() -- perform LAZY VACUUM for one zheap relation + */ +void +lazy_vacuum_zheap_rel(Relation onerel, int options, VacuumParams *params, + BufferAccessStrategy bstrategy) +{ + LVRelStats *vacrelstats; + Relation *Irel; + int nindexes; + PGRUsage ru0; + TimestampTz starttime = 0; + long secs; + int usecs; + double read_rate, + write_rate; + bool aggressive = false; /* should we scan all unfrozen pages? */ + BlockNumber new_rel_pages; + double new_rel_tuples; + double new_live_tuples; + + Assert(params != NULL); + + /* + * For zheap, since vacuum process also reserves transaction slot + * in page, other backend can't ignore this while calculating + * OldestXmin/RecentXmin. See GetSnapshotData for details. + */ + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + MyPgXact->vacuumFlags &= ~PROC_IN_VACUUM; + LWLockRelease(ProcArrayLock); + + /* measure elapsed time iff autovacuum logging requires it */ + if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0) + { + pg_rusage_init(&ru0); + starttime = GetCurrentTimestamp(); + } + + if (options & VACOPT_VERBOSE) + elevel = INFO; + else + elevel = DEBUG2; + + vac_strategy = bstrategy; + + /* + * We can't ignore processes running lazy vacuum on zheap relations because + * like other backends operating on zheap, lazy vacuum also reserves a + * transaction slot in the page for pruning purpose. + */ + OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_DEFAULT); + + Assert(TransactionIdIsNormal(OldestXmin)); + + /* + * We request an aggressive scan if DISABLE_PAGE_SKIPPING was specified. 
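+ *
+ * (There is no relfrozenxid-driven aggressiveness for zheap, so at present
+ * this appears to be the only way a zheap vacuum becomes aggressive.)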
+ */
+ if (options & VACOPT_DISABLE_PAGE_SKIPPING)
+ aggressive = true;
+
+ vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
+
+ vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
+ vacrelstats->old_live_tuples = onerel->rd_rel->reltuples;
+ vacrelstats->num_index_scans = 0;
+ vacrelstats->pages_removed = 0;
+ vacrelstats->lock_waiter_detected = false;
+
+ /* Open all indexes of the relation */
+ vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
+ vacrelstats->hasindex = (nindexes > 0);
+
+ /* Do the vacuuming */
+ lazy_scan_zheap(onerel, options, vacrelstats, Irel, nindexes,
+ vac_strategy, aggressive);
+
+ /* Done with indexes */
+ vac_close_indexes(nindexes, Irel, NoLock);
+
+ /*
+ * Optionally truncate the relation.
+ */
+ if (should_attempt_truncation(vacrelstats))
+ lazy_truncate_heap(onerel, vacrelstats, vac_strategy);
+
+ /*
+ * Update statistics in pg_class.
+ *
+ * A corner case here is that if we scanned no pages at all because every
+ * page is all-visible, we should not update relpages/reltuples, because
+ * we have no new information to contribute. In particular this keeps us
+ * from replacing relpages=reltuples=0 (which means "unknown tuple
+ * density") with nonzero relpages and reltuples=0 (which means "zero
+ * tuple density") unless there's some actual evidence for the latter.
+ *
+ * We can use either tupcount_pages or scanned_pages for the check
+ * described above as both the values should be the same. However, we use
+ * the former so as to be consistent with heap.
+ *
+ * Fixme: We do need to update relallvisible as in heap once we start
+ * using visibilitymap or something equivalent to it.
+ *
+ * relfrozenxid/relminmxid are invalid as we don't perform the freeze
+ * operation in zheap.
+ */
+ new_rel_pages = vacrelstats->rel_pages;
+ new_rel_tuples = vacrelstats->new_rel_tuples;
+ if (vacrelstats->tupcount_pages == 0 && new_rel_pages > 0)
+ {
+ new_rel_pages = vacrelstats->old_rel_pages;
+ new_rel_tuples = vacrelstats->old_live_tuples;
+ }
+
+ vac_update_relstats(onerel,
+ new_rel_pages,
+ new_rel_tuples,
+ new_rel_pages,
+ vacrelstats->hasindex,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ false);
+
+ /* report results to the stats collector, too */
+ new_live_tuples = new_rel_tuples - vacrelstats->new_dead_tuples;
+ if (new_live_tuples < 0)
+ new_live_tuples = 0; /* just in case */
+
+ pgstat_report_vacuum(RelationGetRelid(onerel),
+ onerel->rd_rel->relisshared,
+ new_live_tuples,
+ vacrelstats->new_dead_tuples);
+
+ /* and log the action if appropriate */
+ if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
+ {
+ TimestampTz endtime = GetCurrentTimestamp();
+
+ if (params->log_min_duration == 0 ||
+ TimestampDifferenceExceeds(starttime, endtime,
+ params->log_min_duration))
+ {
+ StringInfoData buf;
+ char *msgfmt;
+
+ TimestampDifference(starttime, endtime, &secs, &usecs);
+
+ read_rate = 0;
+ write_rate = 0;
+ if ((secs > 0) || (usecs > 0))
+ {
+ read_rate = (double) BLCKSZ * VacuumPageMiss / (1024 * 1024) /
+ (secs + usecs / 1000000.0);
+ write_rate = (double) BLCKSZ * VacuumPageDirty / (1024 * 1024) /
+ (secs + usecs / 1000000.0);
+ }
+
+ /*
+ * This is pretty messy, but we split it up so that we can skip
+ * emitting individual parts of the message when not applicable.
+ */ + initStringInfo(&buf); + if (aggressive) + msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n"); + else + msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n"); + appendStringInfo(&buf, msgfmt, + get_database_name(MyDatabaseId), + get_namespace_name(RelationGetNamespace(onerel)), + RelationGetRelationName(onerel), + vacrelstats->num_index_scans); + appendStringInfo(&buf, _("pages: %u removed, %u remain\n"), + vacrelstats->pages_removed, + vacrelstats->rel_pages); + appendStringInfo(&buf, + _("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable, oldest xmin: %u\n"), + vacrelstats->tuples_deleted, + vacrelstats->new_rel_tuples, + vacrelstats->new_dead_tuples, + OldestXmin); + appendStringInfo(&buf, + _("buffer usage: %d hits, %d misses, %d dirtied\n"), + VacuumPageHit, + VacuumPageMiss, + VacuumPageDirty); + appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"), + read_rate, write_rate); + appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0)); + + ereport(LOG, + (errmsg_internal("%s", buf.data))); + pfree(buf.data); + } + } +} + +/* + * lazy_space_zalloc - space allocation decisions for lazy vacuum + * + * See the comments at the head of this file for rationale. + */ +static void +lazy_space_zalloc(LVRelStats *vacrelstats, BlockNumber relblocks) +{ + long maxtuples; + int vac_work_mem = IsAutoVacuumWorkerProcess() && + autovacuum_work_mem != -1 ? + autovacuum_work_mem : maintenance_work_mem; + + if (vacrelstats->hasindex) + { + maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData); + maxtuples = Min(maxtuples, INT_MAX); + maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData)); + + /* curious coding here to ensure the multiplication can't overflow */ + if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks) + maxtuples = relblocks * LAZY_ALLOC_TUPLES; + + /* stay sane if small maintenance_work_mem */ + maxtuples = Max(maxtuples, MaxZHeapTuplesPerPage); + } + else + { + maxtuples = MaxZHeapTuplesPerPage; + } + + vacrelstats->num_dead_tuples = 0; + vacrelstats->max_dead_tuples = (int) maxtuples; + vacrelstats->dead_tuples = (ItemPointer) + palloc(maxtuples * sizeof(ItemPointerData)); +} + +/* + * Check if every tuple in the given page is visible to all current and future + * transactions. Also return the visibility_cutoff_xid which is the highest + * xmin amongst the visible tuples. + */ +static bool +zheap_page_is_all_visible(Relation rel, Buffer buf, + TransactionId *visibility_cutoff_xid) +{ + Page page = BufferGetPage(buf); + BlockNumber blockno = BufferGetBlockNumber(buf); + OffsetNumber offnum, + maxoff; + bool all_visible = true; + + *visibility_cutoff_xid = InvalidTransactionId; + + /* + * This is a stripped down version of the line pointer scan in + * lazy_scan_zheap(). So if you change anything here, also check that code. + */ + maxoff = PageGetMaxOffsetNumber(page); + for (offnum = FirstOffsetNumber; + offnum <= maxoff && all_visible; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemid; + TransactionId xid; + ZHeapTupleData tuple; + + itemid = PageGetItemId(page, offnum); + + /* Unused or redirect line pointers are of no interest */ + if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid)) + continue; + + ItemPointerSet(&(tuple.t_self), blockno, offnum); + + /* + * Dead line pointers can have index pointers pointing to them. 
So + * they can't be treated as visible + */ + if (ItemIdIsDead(itemid)) + { + all_visible = false; + break; + } + + Assert(ItemIdIsNormal(itemid)); + + tuple.t_data = (ZHeapTupleHeader) PageGetItem(page, itemid); + tuple.t_len = ItemIdGetLength(itemid); + tuple.t_tableOid = RelationGetRelid(rel); + + switch (ZHeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf, &xid)) + { + case ZHEAPTUPLE_LIVE: + { + /* + * The inserter definitely committed. But is it old enough + * that everyone sees it as committed? + */ + if (!TransactionIdPrecedes(xid, OldestXmin)) + { + all_visible = false; + break; + } + + /* Track newest xmin on page. */ + if (TransactionIdFollows(xid, *visibility_cutoff_xid)) + *visibility_cutoff_xid = xid; + } + break; + + case ZHEAPTUPLE_DEAD: + case ZHEAPTUPLE_RECENTLY_DEAD: + case ZHEAPTUPLE_INSERT_IN_PROGRESS: + case ZHEAPTUPLE_DELETE_IN_PROGRESS: + case ZHEAPTUPLE_ABORT_IN_PROGRESS: + { + all_visible = false; + break; + } + default: + elog(ERROR, "unexpected ZHeapTupleSatisfiesVacuum result"); + break; + } + } /* scan along page */ + + return all_visible; +} diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index a3015333f3..3d06ba3013 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -31,11 +31,13 @@ #include "access/htup_details.h" #include "access/multixact.h" +#include "access/reloptions.h" #include "access/sysattr.h" #include "access/tableam.h" #include "access/transam.h" #include "access/xact.h" #include "access/xlog.h" +#include "access/zheap.h" #include "catalog/binary_upgrade.h" #include "catalog/catalog.h" #include "catalog/dependency.h" @@ -918,10 +920,14 @@ AddNewRelationTuple(Relation pg_class_desc, break; } - /* Initialize relfrozenxid and relminmxid */ - if (relkind == RELKIND_RELATION || - relkind == RELKIND_MATVIEW || - relkind == RELKIND_TOASTVALUE) + /* + * Initialize relfrozenxid and relminmxid. The relations stored in zheap + * doesn't need to perform freeze. + */ + if ((relkind == RELKIND_RELATION || + relkind == RELKIND_MATVIEW || + relkind == RELKIND_TOASTVALUE) && + new_rel_desc->rd_rel->relam != ZHEAP_TABLE_AM_OID) { /* * Initialize to the minimum XID that could put tuples in the table. @@ -1391,6 +1397,15 @@ heap_create_with_catalog(const char *relname, if (oncommit != ONCOMMIT_NOOP) register_on_commit_action(relid, oncommit); + /* + * Initialize the metapage for zheap, except for partitioned relations as + * they do not have any storage + * PBORKED: abstract + */ + if (accessmtd == ZHEAP_TABLE_AM_OID && + relkind != 'p') + ZheapInitMetaPage(new_rel_desc, MAIN_FORKNUM); + /* * Unlogged objects need an init fork, except for partitioned tables which * have no storage at all. diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 5df4382b7e..37062f39cb 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -24,6 +24,8 @@ #include "access/xlog.h" #include "access/xloginsert.h" #include "access/xlogutils.h" +#include "access/tableam.h" +#include "access/zheap.h" #include "catalog/storage.h" #include "catalog/storage_xlog.h" #include "storage/freespace.h" @@ -288,6 +290,17 @@ RelationTruncate(Relation rel, BlockNumber nblocks) /* Do the real work */ smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks); + + /* + * Re-Initialize the meta page for zheap when the relation is completely + * truncated. + * + * ZBORKED: Is this really sufficient / necessary? Why can't we just stop + * truncating so far in the smgrtruncate above? 
And if we can't do that, + * why isn't the metapage outdated regardless? + */ + if (RelationStorageIsZHeap(rel) && nblocks <= 0) + ZheapInitMetaPage(rel, MAIN_FORKNUM); } /* diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 8630542bb3..8f83ce196c 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -940,6 +940,10 @@ GRANT SELECT (subdbid, subname, subowner, subenabled, subslotname, subpublicatio ON pg_subscription TO public; +CREATE VIEW pg_stat_undo_logs AS + SELECT * + FROM pg_stat_get_undo_logs(); + -- -- We have a few function definitions in here, too. -- At some point there might be enough to justify breaking them out into diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index 1b8d03642c..0acd9c885e 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -858,6 +858,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose, if (MultiXactIdPrecedes(MultiXactCutoff, OldHeap->rd_rel->relminmxid)) MultiXactCutoff = OldHeap->rd_rel->relminmxid; + // ZBORKED / PBORKED: change API so table_copy_for_cluster can set + /* return selected values to caller */ *pFreezeXid = FreezeXid; *pCutoffMulti = MultiXactCutoff; @@ -1096,9 +1098,24 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class, /* set rel1's frozen Xid and minimum MultiXid */ if (relform1->relkind != RELKIND_INDEX) { - Assert(TransactionIdIsNormal(frozenXid)); + Relation rel; + + /* + * ZBORKED: + * + * We don't have multixact or frozenXid concept for zheap. This is a + * hack to keep Asserts, probably we need some pluggable API here to + * set frozen and multixact cutoff xid's. + */ + rel = heap_open(r1, NoLock); + if (!RelationStorageIsZHeap(rel)) + { + Assert(TransactionIdIsNormal(frozenXid)); + Assert(MultiXactIdIsValid(cutoffMulti)); + } + heap_close(rel, NoLock); + relform1->relfrozenxid = frozenXid; - Assert(MultiXactIdIsValid(cutoffMulti)); relform1->relminmxid = cutoffMulti; } diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 9851695fca..c36f3379ff 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2395,7 +2395,12 @@ CopyFrom(CopyState cstate) cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)) { hi_options |= HEAP_INSERT_SKIP_FSM; - if (!XLogIsNeeded()) + /* + * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL. + * See zheap_prepare_insert for details. + * PBORKED / ZBORKED: abstract + */ + if (!RelationStorageIsZHeap(cstate->rel) && !XLogIsNeeded()) hi_options |= HEAP_INSERT_SKIP_WAL; } diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c index d346bf0749..503a319808 100644 --- a/src/backend/commands/createas.c +++ b/src/backend/commands/createas.c @@ -560,9 +560,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) /* * We can skip WAL-logging the insertions, unless PITR or streaming * replication is in use. We can skip the FSM in any case. + * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL. + * See zheap_prepare_insert for details. + * PBORKED / ZBORKED: Move logic into table_getbulkinsertstate, somehow? */ myState->hi_options = HEAP_INSERT_SKIP_FSM | - (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL); + (RelationStorageIsZHeap(myState->rel) || XLogIsNeeded() ? 
0 + : HEAP_INSERT_SKIP_WAL); myState->bistate = GetBulkInsertState(); /* Not using WAL requires smgr_targblock be initially invalid */ diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 03a9d22162..4ece73ca3f 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -23,8 +23,10 @@ #include "access/tableam.h" #include "access/sysattr.h" #include "access/tupconvert.h" +#include "access/tpd.h" #include "access/xact.h" #include "access/xlog.h" +#include "access/zheap.h" #include "catalog/catalog.h" #include "catalog/dependency.h" #include "catalog/heap.h" @@ -470,6 +472,7 @@ static void ATExecForceNoForceRowSecurity(Relation rel, bool force_rls); static void copy_relation_data(SMgrRelation rel, SMgrRelation dst, ForkNumber forkNum, char relpersistence); +static void copy_zrelation_data(Relation srcRel, SMgrRelation dst, ForkNumber forkNum); static const char *storage_name(char c); static void RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid, @@ -1621,8 +1624,12 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged, * * PBORKED: needs to be a callback */ - RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence, - RecentXmin, minmulti); + if (RelationStorageIsZHeap(rel)) + RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence, + InvalidTransactionId, InvalidMultiXactId); + else + RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence, + RecentXmin, minmulti); if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED) table_create_init_fork(rel); @@ -1637,9 +1644,14 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged, Relation toastrel = relation_open(toast_relid, AccessExclusiveLock); - RelationSetNewRelfilenode(toastrel, - toastrel->rd_rel->relpersistence, - RecentXmin, minmulti); + if (RelationStorageIsZHeap(toastrel)) + RelationSetNewRelfilenode(toastrel, + toastrel->rd_rel->relpersistence, + InvalidTransactionId, InvalidMultiXactId); + else + RelationSetNewRelfilenode(toastrel, + toastrel->rd_rel->relpersistence, + RecentXmin, minmulti); if (toastrel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED) table_create_init_fork(toastrel); heap_close(toastrel, NoLock); @@ -4588,7 +4600,13 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) bistate = GetBulkInsertState(); hi_options = HEAP_INSERT_SKIP_FSM; - if (!XLogIsNeeded()) + /* + * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL. + * See zheap_prepare_insert for details. + * + * ZBORKED / PBORKED: We probably need a different abstraction for this. 
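+ *
+ * CopyFrom() and intorel_startup() apply the same RelationStorageIsZHeap()
+ * test before requesting HEAP_INSERT_SKIP_WAL, so none of these bulk-insert
+ * paths use that optimization for a zheap relation.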
+ */ + if (!RelationStorageIsZHeap(newrel) && !XLogIsNeeded()) hi_options |= HEAP_INSERT_SKIP_WAL; } else @@ -10920,8 +10938,11 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode) RelationCreateStorage(newrnode, rel->rd_rel->relpersistence); /* copy main fork */ - copy_relation_data(rel->rd_smgr, dstrel, MAIN_FORKNUM, - rel->rd_rel->relpersistence); + if (RelationStorageIsZHeap(rel)) + copy_zrelation_data(rel, dstrel, MAIN_FORKNUM); + else + copy_relation_data(rel->rd_smgr, dstrel, MAIN_FORKNUM, + rel->rd_rel->relpersistence); /* copy those extra forks that exist */ for (forkNum = MAIN_FORKNUM + 1; forkNum <= MAX_FORKNUM; forkNum++) @@ -11287,6 +11308,137 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst, smgrimmedsync(dst, forkNum); } +/* + * ZBORKED: this breaks abstraction + * + * copy_zrelation_data - same as copy_relation_data but for zheap + * + * In this method, we copy a zheap relation block by block. Here is the algorithm + * for the same: + * For each zheap page, + * a. If it's a meta page, copy it as it is. + * b. If it's a TPD page, copy it as it is. + * c. If it's a zheap data page, apply pending aborts, copy the page and + * the corresponding TPD page (if any). + * + * Please note that we may copy a tpd page multiple times. The reason is one + * tpd page can be referred by multiple zheap pages. While applying pending + * aborts on a zheap page, we also need to modify the transaction and undo + * information in the corresponding TPD page, hence, we need to copy it again + * to reflect the changes. + */ +static void +copy_zrelation_data(Relation srcRel, SMgrRelation dst, ForkNumber forkNum) +{ + Page page; + bool use_wal; + bool copying_initfork; + BlockNumber nblocks; + BlockNumber blkno; + SMgrRelation src = srcRel->rd_smgr; + char relpersistence = srcRel->rd_rel->relpersistence; + + /* + * The init fork for an unlogged relation in many respects has to be + * treated the same as normal relation, changes need to be WAL logged and + * it needs to be synced to disk. + */ + copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED && + forkNum == INIT_FORKNUM; + + /* + * We need to log the copied data in WAL iff WAL archiving/streaming is + * enabled AND it's a permanent relation. + */ + use_wal = XLogIsNeeded() && + (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork); + + nblocks = smgrnblocks(src, forkNum); + + for (blkno = 0; blkno < nblocks; blkno++) + { + BlockNumber target_blkno = InvalidBlockNumber; + BlockNumber tpd_blkno = InvalidBlockNumber; + Buffer buffer = InvalidBuffer; + + /* If we got a cancel signal during the copy of the data, quit */ + CHECK_FOR_INTERRUPTS(); + + if (blkno != ZHEAP_METAPAGE) + { + buffer = ReadBuffer(srcRel, blkno); + + /* If it's a zheap page, apply the pending undo actions */ + if (PageGetSpecialSize(BufferGetPage(buffer)) != + MAXALIGN(sizeof(TPDPageOpaqueData))) + zbuffer_exec_pending_rollback(srcRel, buffer, &tpd_blkno); + } + + target_blkno = blkno; + +copy_buffer: + /* Read the buffer if not already done. */ + if (!BufferIsValid(buffer)) + buffer = ReadBuffer(srcRel, target_blkno); + page = (Page) BufferGetPage(buffer); + + if (!PageIsVerified(page, target_blkno)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("invalid page in block %u of relation %s", + target_blkno, + relpathbackend(src->smgr_rnode.node, + src->smgr_rnode.backend, + forkNum)))); + + /* + * WAL-log the copied page. 
Unfortunately we don't know what kind of a + * page this is, so we have to log the full page including any unused + * space. + */ + if (use_wal) + log_newpage(&dst->smgr_rnode.node, forkNum, target_blkno, page, false); + + PageSetChecksumInplace(page, target_blkno); + + /* + * Now write the page. We say isTemp = true even if it's not a temp + * rel, because there's no need for smgr to schedule an fsync for this + * write; we'll do it ourselves below. + */ + smgrextend(dst, forkNum, target_blkno, page, true); + + ReleaseBuffer(buffer); + + /* If there is a TPD page corresponding to the current page, copy it. */ + if (BlockNumberIsValid(tpd_blkno)) + { + target_blkno = tpd_blkno; + tpd_blkno = InvalidBlockNumber; + buffer = InvalidBuffer; + goto copy_buffer; + } + } + + /* + * If the rel is WAL-logged, must fsync before commit. We use heap_sync + * to ensure that the toast table gets fsync'd too. (For a temp or + * unlogged rel we don't care since the data will be gone after a crash + * anyway.) + * + * It's obvious that we must do this when not WAL-logging the copy. It's + * less obvious that we have to do it even if we did WAL-log the copied + * pages. The reason is that since we're copying outside shared buffers, a + * CHECKPOINT occurring during the copy has no way to flush the previously + * written data to disk (indeed it won't know the new rel even exists). A + * crash later on would replay WAL from the checkpoint, therefore it + * wouldn't replay our earlier WAL entries. If we do not fsync those pages + * here, they might still not be on disk when the crash occurs. + */ + if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork) + smgrimmedsync(dst, forkNum); +} + /* * ALTER TABLE ENABLE/DISABLE TRIGGER * @@ -13163,6 +13315,10 @@ PreCommit_on_commit_actions(void) case ONCOMMIT_DROP: oids_to_drop = lappend_oid(oids_to_drop, oc->relid); break; + case ONCOMMIT_TEMP_DISCARD: + /* Discard temp table undo logs for temp tables. */ + TempUndoDiscard(oc->relid); + break; } } diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c index ca429731d4..73bdd539fe 100644 --- a/src/backend/commands/tablespace.c +++ b/src/backend/commands/tablespace.c @@ -488,6 +488,20 @@ DropTableSpace(DropTableSpaceStmt *stmt) */ LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE); + /* + * Drop the undo logs in this tablespace. This will fail (without + * dropping anything) if there are undo logs that we can't afford to drop + * because they contain non-discarded data or a transaction is in + * progress. Since we hold TablespaceCreateLock, no other session will be + * able to attach to an undo log in this tablespace (or any tablespace + * except default) concurrently. + */ + if (!DropUndoLogsInTablespace(tablespaceoid)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs", + tablespacename))); + /* * Try to remove the physical infrastructure. */ @@ -1488,6 +1502,14 @@ tblspc_redo(XLogReaderState *record) { xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record); + /* This shouldn't be able to fail in recovery. 
*/ + LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE); + if (!DropUndoLogsInTablespace(xlrec->ts_id)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("tablespace cannot be dropped because it contains non-empty undo logs"))); + LWLockRelease(TablespaceCreateLock); + /* * If we issued a WAL record for a drop tablespace it implies that * there were no files in it at all when the DROP was done. That means diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index fcae282044..bd15d04cf0 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -1262,12 +1262,15 @@ vac_update_datfrozenxid(void) Form_pg_class classForm = (Form_pg_class) GETSTRUCT(classTup); /* - * Only consider relations able to hold unfrozen XIDs (anything else - * should have InvalidTransactionId in relfrozenxid anyway.) + * Only consider relations able to hold unfrozen XIDs */ - if (classForm->relkind != RELKIND_RELATION && - classForm->relkind != RELKIND_MATVIEW && - classForm->relkind != RELKIND_TOASTVALUE) + if ((classForm->relkind != RELKIND_RELATION && + classForm->relkind != RELKIND_MATVIEW && + classForm->relkind != RELKIND_TOASTVALUE)) + continue; + + /* some AMs might not use frozen xids */ + if (!TransactionIdIsValid(classForm->relfrozenxid)) continue; Assert(TransactionIdIsNormal(classForm->relfrozenxid)); @@ -1382,6 +1385,7 @@ vac_truncate_clog(TransactionId frozenXID, MultiXactId lastSaneMinMulti) { TransactionId nextXID = ReadNewTransactionId(); + TransactionId oldestXidHavingUndo; Relation relation; TableScanDesc scan; HeapTuple tuple; @@ -1475,6 +1479,16 @@ vac_truncate_clog(TransactionId frozenXID, if (bogus) return; + /* + * We can't truncate the clog for transactions that still have undo. The + * oldestXidHavingUndo will be only valid for zheap storage engine, so it + * won't impact any other storage engine. + */ + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + if (TransactionIdIsValid(oldestXidHavingUndo)) + frozenXID = Min(frozenXID, oldestXidHavingUndo); + /* * Advance the oldest value for commit timestamps before truncating, so * that if a user requests a timestamp for a transaction we're truncating diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c index 66d838dbce..154271a464 100644 --- a/src/backend/executor/execIndexing.c +++ b/src/backend/executor/execIndexing.c @@ -726,7 +726,7 @@ retry: while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot)) { - TransactionId xwait; + TransactionId xwait, xid; XLTW_Oper reason_wait; Datum existing_values[INDEX_MAX_KEYS]; bool existing_isnull[INDEX_MAX_KEYS]; @@ -778,11 +778,22 @@ retry: xwait = TransactionIdIsValid(DirtySnapshot.xmin) ? DirtySnapshot.xmin : DirtySnapshot.xmax; + /* For zheap, we always use Top Transaction Id. */ + // ZBORKED: What does this even mean? + if (RelationStorageIsZHeap(heap)) + { + xid = GetTopTransactionId(); + } + else + { + xid = GetCurrentTransactionId(); + } + if (TransactionIdIsValid(xwait) && (waitMode == CEOUC_WAIT || (waitMode == CEOUC_LIVELOCK_PREVENTING_WAIT && DirtySnapshot.speculativeToken && - TransactionIdPrecedes(GetCurrentTransactionId(), xwait)))) + TransactionIdPrecedes(xid, xwait)))) { /* * PBORKED? 
When waiting, we used to use t_ctid, rather than @@ -794,6 +805,9 @@ retry: if (DirtySnapshot.speculativeToken) SpeculativeInsertionWait(DirtySnapshot.xmin, DirtySnapshot.speculativeToken); + else if (DirtySnapshot.subxid != InvalidSubTransactionId) + SubXactLockTableWait(xwait, DirtySnapshot.subxid, heap, + &existing_slot->tts_tid, reason_wait); else XactLockTableWait(xwait, heap, &existing_slot->tts_tid, reason_wait); diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c index d91a71a7c1..1085fb4e9b 100644 --- a/src/backend/executor/execTuples.c +++ b/src/backend/executor/execTuples.c @@ -60,6 +60,9 @@ #include "access/htup_details.h" #include "access/tupdesc_details.h" #include "access/tuptoaster.h" +#include "access/zheap.h" +#include "access/zheaputils.h" +#include "access/zhtup.h" #include "funcapi.h" #include "catalog/pg_type.h" #include "nodes/nodeFuncs.h" @@ -1011,6 +1014,176 @@ slot_deform_heap_tuple(TupleTableSlot *slot, HeapTuple tuple, uint32 *offp, } + +/* + * TupleTableSlotOps implementation for ZheapHeapTupleTableSlot. + */ + +static void +tts_zheap_init(TupleTableSlot *slot) +{ +} + +static void +tts_zheap_release(TupleTableSlot *slot) +{ +} + +static void +tts_zheap_clear(TupleTableSlot *slot) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + + /* + * Free the memory for heap tuple if allowed. A tuple coming from zheap + * can never be freed. But we may have materialized a tuple from zheap. + * Such a tuple can be freed. + */ + if (TTS_SHOULDFREE(slot)) + { + zheap_freetuple(zslot->tuple); + slot->tts_flags &= ~TTS_FLAG_SHOULDFREE; + } + +#if 0 + if (ZheapIsValid(bslot->zheap)) + ReleaseZheap(bslot->zheap); +#endif + + slot->tts_nvalid = 0; + slot->tts_flags |= TTS_FLAG_EMPTY; + zslot->tuple = NULL; +} + +static void +tts_zheap_getsomeattrs(TupleTableSlot *slot, int natts) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + + Assert(!TTS_EMPTY(slot)); + slot_deform_ztuple(slot, zslot->tuple, &zslot->off, natts); +} + +static Datum +tts_zheap_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + + return zheap_getsysattr(zslot->tuple, InvalidBuffer, attnum, + slot->tts_tupleDescriptor, isnull); +} + +/* + * Materialize the heap tuple contained in the given slot into its own memory + * context. + */ +static void +tts_zheap_materialize(TupleTableSlot *slot) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + MemoryContext oldContext; + + Assert(!TTS_EMPTY(slot)); + + /* If already materialized nothing to do. */ + if (TTS_SHOULDFREE(slot)) + return; + + slot->tts_flags |= TTS_FLAG_SHOULDFREE; + + oldContext = MemoryContextSwitchTo(slot->tts_mcxt); + + if (zslot->tuple) + zslot->tuple = zheap_copytuple(zslot->tuple); + else + { + /* + * The tuple contained in this slot is not allocated in the memory + * context of the given slot (else it would have TTS_SHOULDFREE set). + * Copy the tuple into the given slot's memory context. + */ + zslot->tuple = zheap_form_tuple(slot->tts_tupleDescriptor, + slot->tts_values, + slot->tts_isnull); + } + MemoryContextSwitchTo(oldContext); + +#if 0 + /* + * TODO: I expect a ZheapHeapTupleTableSlot to always have a zheap to be + * associated with it OR the tuple is materialized. In the later case we + * won't come here. So, we should always see a valid zheap here to be + * unpinned. 
+ */ + if (zslot->tuple) + { + ReleaseZheap(bslot->zheap); + bslot->zheap = InvalidZheap; + } +#endif + + /* + * Have to deform from scratch, otherwise tts_values[] entries could point + * into the non-materialized tuple (which might be gone when accessed). + */ + slot->tts_nvalid = 0; + zslot->off = 0; +} + +static void +tts_zheap_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot) +{ + HeapTuple tuple; + MemoryContext oldcontext; + + // PBORKED: This is a horrible implementation + + oldcontext = MemoryContextSwitchTo(dstslot->tts_mcxt); + tuple = ExecCopySlotHeapTuple(srcslot); + MemoryContextSwitchTo(oldcontext); + + ExecForceStoreHeapTuple(tuple, dstslot); + ExecMaterializeSlot(dstslot); + + pfree(tuple); +} + +static HeapTuple +tts_zheap_copy_heap_tuple(TupleTableSlot *slot) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + + Assert(!TTS_EMPTY(slot)); + + if (!zslot->tuple) + tts_zheap_materialize(slot); + + return zheap_to_heap(zslot->tuple, slot->tts_tupleDescriptor); +} + +/* + * Return a minimal tuple constructed from the contents of the slot. + * + * We always return a new minimal tuple, so no copy, per se, is needed. + * + * TODO: + * This function is an exact copy of tts_zheap_get_minimal_tuple() and thus the + * callback should point to that one instead of a new implementation. But + * there's one TODO there which might change tts_heap_get_minimal_tuple(). + */ +static MinimalTuple +tts_zheap_copy_minimal_tuple(TupleTableSlot *slot) +{ + slot_getallattrs(slot); + + return heap_form_minimal_tuple(slot->tts_tupleDescriptor, + slot->tts_values, slot->tts_isnull); +} + +/* + * TupleTableSlotOps for each of TupleTableSlotTypes. These are used to + * identify the type of slot. + */ const TupleTableSlotOps TTSOpsVirtual = { .base_slot_size = sizeof(VirtualTupleTableSlot), .init = tts_virtual_init, @@ -1082,6 +1255,22 @@ const TupleTableSlotOps TTSOpsBufferHeapTuple = { .copy_minimal_tuple = tts_buffer_heap_copy_minimal_tuple }; +const TupleTableSlotOps TTSOpsZHeapTuple = { + .base_slot_size = sizeof(ZHeapTupleTableSlot), + .init = tts_zheap_init, + .release = tts_zheap_release, + .clear = tts_zheap_clear, + .getsomeattrs = tts_zheap_getsomeattrs, + .getsysattr = tts_zheap_getsysattr, + .materialize = tts_zheap_materialize, + .copyslot = tts_zheap_copyslot, + + .get_heap_tuple = NULL, + .get_minimal_tuple = NULL, + + .copy_heap_tuple = tts_zheap_copy_heap_tuple, + .copy_minimal_tuple = tts_zheap_copy_minimal_tuple +}; /* ---------------------------------------------------------------- * tuple table create/delete functions @@ -1476,6 +1665,43 @@ ExecStoreMinimalTuple(MinimalTuple mtup, return slot; } +/* -------------------------------- + * ExecStoreZTuple + * + * This function is the same as ExecStoreTuple except that it is used to store a + * physical zheap tuple into a specified slot in the tuple table. + * + * NOTE: Unlike ExecStoreTuple, it's possible that buffer is valid and + * should_free is true. That is because slot->tts_ztuple may be a copy of the + * tuple allocated locally, so we want to free the tuple even while + * keeping a pin/lock on the previously valid buffer.
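A minimal usage sketch for ExecStoreZTuple (defined just below), assuming tupdesc, values, isnull and slot are locals already set up by the caller; the zheap_form_tuple call is purely illustrative:

    ZHeapTuple  ztup;

    /* Build a local, palloc'd zheap tuple. */
    ztup = zheap_form_tuple(tupdesc, values, isnull);

    /* shouldFree = true: the slot takes ownership and frees it on clear. */
    ExecStoreZTuple(ztup, slot, InvalidBuffer, true);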
+ */ +TupleTableSlot * +ExecStoreZTuple(ZHeapTuple tuple, + TupleTableSlot *slot, + Buffer buffer, + bool shouldFree) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + /* + * sanity checks + */ + Assert(slot != NULL); + Assert(TTS_IS_ZHEAP(slot)); + tts_zheap_clear(slot); + + slot->tts_nvalid = 0; + zslot->tuple = tuple; + zslot->off = 0; + slot->tts_flags &= ~TTS_FLAG_EMPTY; + slot->tts_tid = tuple->t_self; + + if (shouldFree) + slot->tts_flags |= TTS_FLAG_SHOULDFREE; + + return slot; +} + /* * Store a HeapTuple into any kind of slot, performing conversion if * necessary. @@ -1559,6 +1785,23 @@ ExecForceStoreHeapTupleDatum(Datum data, TupleTableSlot *slot) ExecMaterializeSlot(slot); } +ZHeapTuple +ExecGetZHeapTupleFromSlot(TupleTableSlot *slot) +{ + ZHeapTupleTableSlot *zslot = (ZHeapTupleTableSlot *) slot; + + if (!TTS_IS_ZHEAP(slot)) + elog(ERROR, "unsupported"); + + if (TTS_EMPTY(slot)) + return NULL; + + if (!zslot->tuple) + slot->tts_ops->materialize(slot); + + return zslot->tuple; +} + /* -------------------------------- * ExecStoreVirtualTuple * Mark a slot as containing a virtual tuple. diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index d1ac9fc2e9..78b0607406 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -680,7 +680,7 @@ ldelete:; estate->es_snapshot, slot, estate->es_output_cid, LockTupleExclusive, LockWaitBlock, - TUPLE_LOCK_FLAG_FIND_LAST_VERSION, + TUPLE_LOCK_FLAG_FIND_LAST_VERSION | TUPLE_LOCK_FLAG_WEIRD, &hufd); /*hari FIXME*/ /*Assert(result != HeapTupleUpdated && hufd.traversed);*/ @@ -1174,7 +1174,7 @@ lreplace:; estate->es_snapshot, inputslot, estate->es_output_cid, lockmode, LockWaitBlock, - TUPLE_LOCK_FLAG_FIND_LAST_VERSION, + TUPLE_LOCK_FLAG_FIND_LAST_VERSION | TUPLE_LOCK_FLAG_WEIRD, &hufd); /* hari FIXME*/ /*Assert(result != HeapTupleUpdated && hufd.traversed);*/ @@ -1360,7 +1360,8 @@ ExecOnConflictUpdate(ModifyTableState *mtstate, test = table_lock_tuple(relation, conflictTid, estate->es_snapshot, mtstate->mt_existing, estate->es_output_cid, - lockmode, LockWaitBlock, TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS, + lockmode, LockWaitBlock, + TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS | TUPLE_LOCK_FLAG_WEIRD, &hufd); switch (test) { @@ -1403,6 +1404,15 @@ ExecOnConflictUpdate(ModifyTableState *mtstate, break; case HeapTupleSelfUpdated: +#ifdef ZBORKED + /* + * ZHEAP accepts this, but this isn't ok from a layering POV (and + * I'm doubtful about the correctness). See 1e9d17cc240. + * + * Unlike heap, we expect HeapTupleSelfUpdated in the same scenario + * as the new tuple could have been in-place updated. + */ +#endif /* * This state should never be reached. As a dirty snapshot is used diff --git a/src/backend/lib/stringinfo.c b/src/backend/lib/stringinfo.c index df7e01f76d..73e57d4d47 100644 --- a/src/backend/lib/stringinfo.c +++ b/src/backend/lib/stringinfo.c @@ -238,15 +238,47 @@ appendBinaryStringInfo(StringInfo str, const char *data, int datalen) */ void appendBinaryStringInfoNT(StringInfo str, const char *data, int datalen) +{ + Assert(str != NULL); + + /* Make more room if needed */ + enlargeStringInfo(str, datalen); + + /* OK, append the data */ + memcpy(str->data + str->len, data, datalen); + str->len += datalen; +} + +/* appendBinaryStringInfoNoExtend + * + * Append arbitrary binary data to a StringInfo. + * + * Returns false, if more space is required to append the string, true + * otherwise. + * + * This can be used in critical section. 
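A minimal sketch of the calling pattern the no-extend variant below is meant for, assuming the caller knows an upper bound on the data it will append (the bound name and the PANIC reaction are illustrative, not part of this patch):

    StringInfoData buf;

    initStringInfo(&buf);
    /* Reserve space while allocation is still allowed. */
    enlargeStringInfo(&buf, max_record_size);

    START_CRIT_SECTION();
    /* No palloc may happen here; the append must fit in the reserved space. */
    if (!appendBinaryStringInfoNoExtend(&buf, data, datalen))
        elog(PANIC, "reserved StringInfo space too small");
    END_CRIT_SECTION();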
+ */ +bool +appendBinaryStringInfoNoExtend(StringInfo str, const char *data, int datalen) { Assert(str != NULL); - /* Make more room if needed */ - enlargeStringInfo(str, datalen); + /* fail, if more space is required */ + if (datalen > str->maxlen) + return false; /* OK, append the data */ memcpy(str->data + str->len, data, datalen); str->len += datalen; + + /* + * Keep a trailing null in place, even though it's probably useless for + * binary data. (Some callers are dealing with text but call this because + * their input isn't null-terminated.) + */ + str->data[str->len] = '\0'; + + return true; } /* diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c index 17dc53898f..c7b98edf7a 100644 --- a/src/backend/nodes/tidbitmap.c +++ b/src/backend/nodes/tidbitmap.c @@ -41,6 +41,7 @@ #include #include "access/htup_details.h" +#include "access/zheap.h" #include "nodes/bitmapset.h" #include "nodes/tidbitmap.h" #include "storage/lwlock.h" @@ -53,7 +54,7 @@ * the per-page bitmaps variable size. We just legislate that the size * is this: */ -#define MAX_TUPLES_PER_PAGE MaxHeapTuplesPerPage +#define MAX_TUPLES_PER_PAGE Max(MaxHeapTuplesPerPage, MaxZHeapTuplesPerPage) /* * When we have to switch over to lossy storage, we use a data structure diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index 5e1b27be75..6b33814f39 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -1000,7 +1000,9 @@ estimate_rel_size(Relation rel, int32 *attr_widths, * minimum to indexes. */ if (curpages < 10 && - rel->rd_rel->relpages == 0 && + (rel->rd_rel->relpages == 0 || + (RelationStorageIsZHeap(rel) && + rel->rd_rel->relpages == ZHEAP_METAPAGE + 1)) && !rel->rd_rel->relhassubclass && rel->rd_rel->relkind != RELKIND_INDEX) curpages = 10; @@ -1008,7 +1010,8 @@ estimate_rel_size(Relation rel, int32 *attr_widths, /* report estimated # pages */ *pages = curpages; /* quick exit if rel is clearly empty */ - if (curpages == 0) + if (curpages == 0 || (RelationStorageIsZHeap(rel) && + curpages == ZHEAP_METAPAGE + 1)) { *tuples = 0; *allvisfrac = 0; @@ -1019,6 +1022,15 @@ estimate_rel_size(Relation rel, int32 *attr_widths, reltuples = (double) rel->rd_rel->reltuples; relallvisible = (BlockNumber) rel->rd_rel->relallvisible; + /* + * If it's a zheap relation, then subtract the pages + * to account for the metapage. + */ + if (relpages > 0 && RelationStorageIsZHeap(rel)) + { + curpages--; + relpages--; + } /* * If it's an index, discount the metapage while estimating the * number of tuples. This is a kluge because it assumes more than diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile index 71c23211b2..e44c563b45 100644 --- a/src/backend/postmaster/Makefile +++ b/src/backend/postmaster/Makefile @@ -13,6 +13,7 @@ top_builddir = ../../.. 
include $(top_builddir)/src/Makefile.global OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \ - pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o + discardworker.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \ + undoworker.o walwriter.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c index d2b695e146..49df516957 100644 --- a/src/backend/postmaster/bgworker.c +++ b/src/backend/postmaster/bgworker.c @@ -20,7 +20,9 @@ #include "pgstat.h" #include "port/atomics.h" #include "postmaster/bgworker_internals.h" +#include "postmaster/discardworker.h" #include "postmaster/postmaster.h" +#include "postmaster/undoworker.h" #include "replication/logicallauncher.h" #include "replication/logicalworker.h" #include "storage/dsm.h" @@ -129,6 +131,15 @@ static const struct }, { "ApplyWorkerMain", ApplyWorkerMain + }, + { + "UndoLauncherMain", UndoLauncherMain + }, + { + "UndoWorkerMain", UndoWorkerMain + }, + { + "DiscardWorkerMain", DiscardWorkerMain } }; diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c index b9c118e156..b2505c8f23 100644 --- a/src/backend/postmaster/checkpointer.c +++ b/src/backend/postmaster/checkpointer.c @@ -1314,7 +1314,7 @@ AbsorbFsyncRequests(void) LWLockRelease(CheckpointerCommLock); for (request = requests; n > 0; request++, n--) - RememberFsyncRequest(request->rnode, request->forknum, request->segno); + smgrrequestsync(request->rnode, request->forknum, request->segno); END_CRIT_SECTION(); diff --git a/src/backend/postmaster/discardworker.c b/src/backend/postmaster/discardworker.c new file mode 100644 index 0000000000..121779009d --- /dev/null +++ b/src/backend/postmaster/discardworker.c @@ -0,0 +1,169 @@ +/*------------------------------------------------------------------------- + * + * discardworker.c + * The undo discard worker for asynchronous undo management. + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/postmaster/discardworker.c + *------------------------------------------------------------------------- + */ + +#include "postgres.h" +#include + +/* These are always necessary for a bgworker. */ +#include "miscadmin.h" +#include "postmaster/bgworker.h" +#include "storage/ipc.h" +#include "storage/latch.h" +#include "storage/lwlock.h" +#include "storage/proc.h" +#include "storage/shmem.h" + +#include "access/undodiscard.h" +#include "pgstat.h" +#include "postmaster/discardworker.h" +#include "storage/procarray.h" +#include "tcop/tcopprot.h" +#include "utils/guc.h" +#include "utils/resowner.h" + +static void undoworker_sigterm_handler(SIGNAL_ARGS); + +/* max sleep time between cycles (100 milliseconds) */ +#define MIN_NAPTIME_PER_CYCLE 100L +#define DELAYED_NAPTIME 10 * MIN_NAPTIME_PER_CYCLE +#define MAX_NAPTIME_PER_CYCLE 100 * MIN_NAPTIME_PER_CYCLE + +static bool got_SIGTERM = false; +static bool hibernate = false; +static long wait_time = MIN_NAPTIME_PER_CYCLE; + +/* SIGTERM: set flag to exit at next convenient time */ +static void +undoworker_sigterm_handler(SIGNAL_ARGS) +{ + got_SIGTERM = true; + + /* Waken anything waiting on the process latch */ + SetLatch(MyLatch); +} + +/* + * DiscardWorkerRegister -- Register a undo discard worker. + */ +void +DiscardWorkerRegister(void) +{ + BackgroundWorker bgw; + + /* TODO: This should be configurable. 
*/ + + memset(&bgw, 0, sizeof(bgw)); + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | + BGWORKER_BACKEND_DATABASE_CONNECTION; + bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; + snprintf(bgw.bgw_name, BGW_MAXLEN, "discard worker"); + sprintf(bgw.bgw_library_name, "postgres"); + sprintf(bgw.bgw_function_name, "DiscardWorkerMain"); + bgw.bgw_restart_time = 5; + bgw.bgw_notify_pid = 0; + bgw.bgw_main_arg = (Datum) 0; + + RegisterBackgroundWorker(&bgw); +} + +/* + * DiscardWorkerMain -- Main loop for the undo discard worker. + */ +void +DiscardWorkerMain(Datum main_arg) +{ + ereport(LOG, + (errmsg("discard worker started"))); + + /* Establish signal handlers. */ + pqsignal(SIGTERM, undoworker_sigterm_handler); + BackgroundWorkerUnblockSignals(); + + /* Make it easy to identify our processes. */ + SetConfigOption("application_name", MyBgworkerEntry->bgw_name, + PGC_USERSET, PGC_S_SESSION); + + /* + * Create a resource owner for the discard worker, as it needs to read undo + * records outside of transaction blocks, which in turn accesses the buffer + * read routines. + */ + CreateAuxProcessResourceOwner(); + + /* Enter main loop */ + while (!got_SIGTERM) + { + int rc; + TransactionId OldestXmin, oldestXidHavingUndo; + + OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT); + + oldestXidHavingUndo = GetXidFromEpochXid( + pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo)); + + /* + * Call the discard routine if oldestXidHavingUndo is lagging + * behind OldestXmin. + */ + if (OldestXmin != InvalidTransactionId && + TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin)) + { + UndoDiscard(OldestXmin, &hibernate); + + /* + * If we found undo logs to discard or discarded something, + * then reset the wait_time, as we have work to do. + * Note that if there are undo logs that cannot be discarded, + * the above condition will remain unsatisfied as long as oldestXmin + * remains unchanged, and the wait_time will not be reset in that case. + */ + if (!hibernate) + wait_time = MIN_NAPTIME_PER_CYCLE; + } + + /* Wait for more work. */ + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + wait_time, + WAIT_EVENT_UNDO_DISCARD_WORKER_MAIN); + + ResetLatch(&MyProc->procLatch); + + /* + * Increase the wait_time based on the length of inactivity. While wait_time + * is below one second, increment it by 100 ms at a time. Thereafter, + * increment it one second at a time, until it reaches ten seconds. Never + * increase the wait_time beyond ten seconds; that would be too long a + * wait otherwise. + */ + if (rc & WL_TIMEOUT && hibernate) + { + wait_time += (wait_time < DELAYED_NAPTIME ?
+ MIN_NAPTIME_PER_CYCLE : DELAYED_NAPTIME); + if (wait_time > MAX_NAPTIME_PER_CYCLE) + wait_time = MAX_NAPTIME_PER_CYCLE; + } + + /* emergency bailout if postmaster has died */ + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); + } + + ReleaseAuxProcessResources(true); + + /* we're done */ + ereport(LOG, + (errmsg("discard worker shutting down"))); + + proc_exit(0); +} diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 7762dbc44b..b72795825f 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -1926,6 +1926,28 @@ pgstat_count_heap_insert(Relation rel, PgStat_Counter n) } } +/* + * pgstat_count_zheap_update - count a inplace tuple update + */ +void +pgstat_count_zheap_update(Relation rel) +{ + PgStat_TableStatus *pgstat_info = rel->pgstat_info; + + if (pgstat_info != NULL) + { + /* We have to log the effect at the proper transactional level */ + int nest_level = GetCurrentTransactionNestLevel(); + + if (pgstat_info->trans == NULL || + pgstat_info->trans->nest_level != nest_level) + add_tabstat_xact_level(pgstat_info, nest_level); + + /* t_tuples_hot_updated is nontransactional, so just advance it */ + pgstat_info->t_counts.t_tuples_hot_updated++; + } +} + /* * pgstat_count_heap_update - count a tuple update */ @@ -3376,6 +3398,9 @@ pgstat_get_wait_event_type(uint32 wait_event_info) case PG_WAIT_IO: event_type = "IO"; break; + case PG_WAIT_PAGE_TRANS_SLOT: + event_type = "TransSlot"; + break; default: event_type = "???"; break; @@ -3453,6 +3478,9 @@ pgstat_get_wait_event(uint32 wait_event_info) event_name = pgstat_get_wait_io(w); break; } + case PG_WAIT_PAGE_TRANS_SLOT: + event_name = "TransSlot"; + break; default: event_name = "unknown wait event"; break; @@ -3516,7 +3544,13 @@ pgstat_get_wait_activity(WaitEventActivity w) case WAIT_EVENT_WAL_WRITER_MAIN: event_name = "WalWriterMain"; break; - /* no default case, so that compiler will warn */ + case WAIT_EVENT_UNDO_DISCARD_WORKER_MAIN: + event_name = "UndoDiscardWorkerMain"; + break; + case WAIT_EVENT_UNDO_LAUNCHER_MAIN: + event_name = "UndoLauncherMain"; + break; + /* no default case, so that compiler will warn */ } return event_name; @@ -3898,6 +3932,28 @@ pgstat_get_wait_io(WaitEventIO w) case WAIT_EVENT_TWOPHASE_FILE_WRITE: event_name = "TwophaseFileWrite"; break; + case WAIT_EVENT_UNDO_CHECKPOINT_READ: + event_name = "UndoCheckpointRead"; + break; + case WAIT_EVENT_UNDO_CHECKPOINT_WRITE: + event_name = "UndoCheckpointWrite"; + break; + case WAIT_EVENT_UNDO_CHECKPOINT_SYNC: + event_name = "UndoCheckpointSync"; + break; + case WAIT_EVENT_UNDO_FILE_READ: + event_name = "UndoFileRead"; + break; + case WAIT_EVENT_UNDO_FILE_WRITE: + event_name = "UndoFileWrite"; + break; + case WAIT_EVENT_UNDO_FILE_FLUSH: + event_name = "UndoFileFlush"; + break; + case WAIT_EVENT_UNDO_FILE_SYNC: + event_name = "UndoFileSync"; + break; + case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ: event_name = "WALSenderTimelineHistoryRead"; break; diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index a33a131182..9f6ba1a65f 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -111,10 +111,12 @@ #include "port/pg_bswap.h" #include "postmaster/autovacuum.h" #include "postmaster/bgworker_internals.h" +#include "postmaster/discardworker.h" #include "postmaster/fork_process.h" #include "postmaster/pgarch.h" #include "postmaster/postmaster.h" #include "postmaster/syslogger.h" +#include "postmaster/undoworker.h" #include 
"replication/logicallauncher.h" #include "replication/walsender.h" #include "storage/fd.h" @@ -246,6 +248,8 @@ bool enable_bonjour = false; char *bonjour_name; bool restart_after_crash = true; +bool disable_undo_launcher; + /* PIDs of special child processes; 0 when not running */ static pid_t StartupPID = 0, BgWriterPID = 0, @@ -991,6 +995,13 @@ PostmasterMain(int argc, char *argv[]) */ ApplyLauncherRegister(); + /* Register the Undo worker launcher. */ + if (!disable_undo_launcher) + UndoLauncherRegister(); + + /* Register the Undo Discard worker. */ + DiscardWorkerRegister(); + /* * process any libraries that should be preloaded at postmaster start */ diff --git a/src/backend/postmaster/undoworker.c b/src/backend/postmaster/undoworker.c new file mode 100644 index 0000000000..55ac139a91 --- /dev/null +++ b/src/backend/postmaster/undoworker.c @@ -0,0 +1,665 @@ +/*------------------------------------------------------------------------- + * + * undoworker.c + * undo launcher and undo worker process. + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/postmaster/undoworker.c + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "funcapi.h" +#include "miscadmin.h" +#include "pgstat.h" + +#include "access/heapam.h" +#include "access/htup.h" +#include "access/htup_details.h" +#include "access/sysattr.h" +#include "access/xact.h" + +#include "catalog/indexing.h" +#include "catalog/pg_database.h" + +#include "libpq/pqsignal.h" + +#include "postmaster/bgworker.h" +#include "postmaster/fork_process.h" +#include "postmaster/postmaster.h" +#include "postmaster/undoloop.h" +#include "postmaster/undoworker.h" + +#include "replication/slot.h" +#include "replication/worker_internal.h" + +#include "storage/ipc.h" +#include "storage/lmgr.h" +#include "storage/proc.h" +#include "storage/procarray.h" +#include "storage/procsignal.h" + +#include "tcop/tcopprot.h" + +#include "utils/fmgroids.h" +#include "utils/hsearch.h" +#include "utils/memutils.h" +#include "utils/resowner.h" + +/* max sleep time between cycles (100 milliseconds) */ +#define DEFAULT_NAPTIME_PER_CYCLE 100L +#define DEFAULT_RETRY_NAPTIME 50L + +int max_undo_workers = 5; + +typedef struct UndoApplyWorker +{ + /* Indicates if this slot is used or free. */ + bool in_use; + + /* Increased everytime the slot is taken by new worker. */ + uint16 generation; + + /* Pointer to proc array. NULL if not running. */ + PGPROC *proc; + + /* Database id to connect to. */ + Oid dbid; +} UndoApplyWorker; + +UndoApplyWorker *MyUndoWorker = NULL; + +typedef struct UndoApplyCtxStruct +{ + /* Supervisor process. */ + pid_t launcher_pid; + + /* Background workers. */ + UndoApplyWorker workers[FLEXIBLE_ARRAY_MEMBER]; +} UndoApplyCtxStruct; + +UndoApplyCtxStruct *UndoApplyCtx; + +static void undo_worker_onexit(int code, Datum arg); +static void undo_worker_cleanup(UndoApplyWorker *worker); + +static volatile sig_atomic_t got_SIGHUP = false; + +/* + * Wait for a background worker to start up and attach to the shmem context. + * + * This is only needed for cleaning up the shared memory in case the worker + * fails to attach. 
+ */ +static void +WaitForUndoWorkerAttach(UndoApplyWorker *worker, + uint16 generation, + BackgroundWorkerHandle *handle) +{ + BgwHandleStatus status; + int rc; + + for (;;) + { + pid_t pid; + + CHECK_FOR_INTERRUPTS(); + + LWLockAcquire(UndoWorkerLock, LW_SHARED); + + /* Worker either died or has started; no need to do anything. */ + if (!worker->in_use || worker->proc) + { + LWLockRelease(UndoWorkerLock); + return; + } + + LWLockRelease(UndoWorkerLock); + + /* Check if worker has died before attaching, and clean up after it. */ + status = GetBackgroundWorkerPid(handle, &pid); + + if (status == BGWH_STOPPED) + { + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + /* Ensure that this was indeed the worker we waited for. */ + if (generation == worker->generation) + undo_worker_cleanup(worker); + LWLockRelease(UndoWorkerLock); + return; + } + + /* + * We need timeout because we generally don't get notified via latch + * about the worker attach. But we don't expect to have to wait long. + */ + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + 10L, WAIT_EVENT_BGWORKER_STARTUP); + + /* emergency bailout if postmaster has died */ + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); + + if (rc & WL_LATCH_SET) + { + ResetLatch(MyLatch); + CHECK_FOR_INTERRUPTS(); + } + } + + return; +} + +/* + * Get dbid from the worker slot. + */ +static Oid +slot_get_dbid(int slot) +{ + Oid dbid; + + /* Block concurrent access. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + + MyUndoWorker = &UndoApplyCtx->workers[slot]; + + if (!MyUndoWorker->in_use) + { + LWLockRelease(UndoWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("undo worker slot %d is empty,", + slot))); + } + + dbid = MyUndoWorker->dbid; + + LWLockRelease(UndoWorkerLock); + + return dbid; +} + +/* + * Attach to a slot. + */ +static void +undo_worker_attach(int slot) +{ + /* Block concurrent access. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + + MyUndoWorker = &UndoApplyCtx->workers[slot]; + + if (!MyUndoWorker->in_use) + { + LWLockRelease(UndoWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("undo worker slot %d is empty, cannot attach", + slot))); + } + + if (MyUndoWorker->proc) + { + LWLockRelease(UndoWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("undo worker slot %d is already used by " + "another worker, cannot attach", slot))); + } + + MyUndoWorker->proc = MyProc; + before_shmem_exit(undo_worker_onexit, (Datum) 0); + + LWLockRelease(UndoWorkerLock); +} + +/* + * Walks the workers array and searches for one that matches given + * dbid. + */ +static UndoApplyWorker * +undo_worker_find(Oid dbid) +{ + int i; + UndoApplyWorker *res = NULL; + + Assert(LWLockHeldByMe(UndoWorkerLock)); + + /* Search for attached worker for a given db id. */ + for (i = 0; i < max_undo_workers; i++) + { + UndoApplyWorker *w = &UndoApplyCtx->workers[i]; + + if (w->in_use && w->dbid == dbid) + { + res = w; + break; + } + } + + return res; +} + +/* + * Check whether the dbid exist or not. + * + * Refer comments from GetDatabaseTupleByOid. + * FIXME: Should we expose GetDatabaseTupleByOid and directly use it. 
+ */ +static bool +dbid_exist(Oid dboid) +{ + HeapTuple tuple; + Relation relation; + SysScanDesc scan; + ScanKeyData key[1]; + bool result = false; + + /* + * form a scan key + */ + ScanKeyInit(&key[0], + Anum_pg_database_oid, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(dboid)); + + relation = heap_open(DatabaseRelationId, AccessShareLock); + scan = systable_beginscan(relation, DatabaseOidIndexId, + criticalSharedRelcachesBuilt, + NULL, + 1, key); + + tuple = systable_getnext(scan); + + if (HeapTupleIsValid(tuple)) + result = true; + + /* all done */ + systable_endscan(scan); + heap_close(relation, AccessShareLock); + + return result; +} + +/* + * Start new undo apply background worker, if possible otherwise return false. + */ +static bool +undo_worker_launch(Oid dbid) +{ + BackgroundWorker bgw; + BackgroundWorkerHandle *bgw_handle; + uint16 generation; + int i; + int slot = 0; + UndoApplyWorker *worker = NULL; + + /* + * We need to do the modification of the shared memory under lock so that + * we have consistent view. + */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + + /* Find unused worker slot. */ + for (i = 0; i < max_undo_workers; i++) + { + UndoApplyWorker *w = &UndoApplyCtx->workers[i]; + + if (!w->in_use) + { + worker = w; + slot = i; + break; + } + } + + /* There are no more free worker slots */ + if (worker == NULL) + return false; + + /* Prepare the worker slot. */ + worker->in_use = true; + worker->proc = NULL; + worker->dbid = dbid; + worker->generation++; + + generation = worker->generation; + LWLockRelease(UndoWorkerLock); + + /* Register the new dynamic worker. */ + memset(&bgw, 0, sizeof(bgw)); + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | + BGWORKER_BACKEND_DATABASE_CONNECTION; + bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; + snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres"); + snprintf(bgw.bgw_function_name, BGW_MAXLEN, "UndoWorkerMain"); + snprintf(bgw.bgw_type, BGW_MAXLEN, "undo apply worker"); + snprintf(bgw.bgw_name, BGW_MAXLEN, "undo apply worker"); + + bgw.bgw_restart_time = BGW_NEVER_RESTART; + bgw.bgw_notify_pid = MyProcPid; + bgw.bgw_main_arg = Int32GetDatum(slot); + + StartTransactionCommand(); + /* Check the database exists or not. */ + if (!dbid_exist(dbid)) + { + CommitTransactionCommand(); + + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + undo_worker_cleanup(worker); + LWLockRelease(UndoWorkerLock); + return true; + } + + /* + * Acquire database object lock before launching the worker so that it + * doesn't get dropped while worker is connecting to the database. + */ + LockSharedObject(DatabaseRelationId, dbid, 0, RowExclusiveLock); + + /* Recheck whether database still exists or not. */ + if (!dbid_exist(dbid)) + { + CommitTransactionCommand(); + + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + undo_worker_cleanup(worker); + LWLockRelease(UndoWorkerLock); + return true; + } + + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle)) + { + /* Failed to start worker, so clean up the worker slot. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + undo_worker_cleanup(worker); + LWLockRelease(UndoWorkerLock); + + UnlockSharedObject(DatabaseRelationId, dbid, 0, RowExclusiveLock); + CommitTransactionCommand(); + + return false; + } + + /* Now wait until it attaches. */ + WaitForUndoWorkerAttach(worker, generation, bgw_handle); + + /* + * By this point the undo-worker has already connected to the database so we + * can release the database lock. 
+ */ + UnlockSharedObject(DatabaseRelationId, dbid, 0, RowExclusiveLock); + CommitTransactionCommand(); + + return true; +} + +/* + * Detach the worker (cleans up the worker info). + */ +static void +undo_worker_detach(void) +{ + /* Block concurrent access. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + + undo_worker_cleanup(MyUndoWorker); + + LWLockRelease(UndoWorkerLock); +} + +/* + * Clean up worker info. + */ +static void +undo_worker_cleanup(UndoApplyWorker *worker) +{ + Assert(LWLockHeldByMeInMode(UndoWorkerLock, LW_EXCLUSIVE)); + + worker->in_use = false; + worker->proc = NULL; + worker->dbid = InvalidOid; +} + +/* + * Cleanup function for undo worker launcher. + * + * Called on undo worker launcher exit. + */ +static void +undo_launcher_onexit(int code, Datum arg) +{ + UndoApplyCtx->launcher_pid = 0; +} + +/* SIGHUP: set flag to reload configuration at next convenient time */ +static void +undo_launcher_sighup(SIGNAL_ARGS) +{ + int save_errno = errno; + + got_SIGHUP = true; + + /* Waken anything waiting on the process latch */ + SetLatch(MyLatch); + + errno = save_errno; +} + +/* + * Cleanup function. + * + * Called on logical replication worker exit. + */ +static void +undo_worker_onexit(int code, Datum arg) +{ + undo_worker_detach(); +} + +/* + * UndoLauncherShmemSize + * Compute space needed for undo launcher shared memory + */ +Size +UndoLauncherShmemSize(void) +{ + Size size; + + /* + * Need the fixed struct and the array of LogicalRepWorker. + */ + size = sizeof(UndoApplyCtxStruct); + size = MAXALIGN(size); + size = add_size(size, mul_size(max_undo_workers, + sizeof(UndoApplyWorker))); + return size; +} + +/* + * UndoLauncherRegister + * Register a background worker running the undo worker launcher. + */ +void +UndoLauncherRegister(void) +{ + BackgroundWorker bgw; + + if (max_undo_workers == 0) + return; + + memset(&bgw, 0, sizeof(bgw)); + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | + BGWORKER_BACKEND_DATABASE_CONNECTION; + bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; + snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres"); + snprintf(bgw.bgw_function_name, BGW_MAXLEN, "UndoLauncherMain"); + snprintf(bgw.bgw_name, BGW_MAXLEN, + "undo worker launcher"); + snprintf(bgw.bgw_type, BGW_MAXLEN, + "undo worker launcher"); + bgw.bgw_restart_time = 5; + bgw.bgw_notify_pid = 0; + bgw.bgw_main_arg = (Datum) 0; + + RegisterBackgroundWorker(&bgw); +} + +/* + * UndoLauncherShmemInit + * Allocate and initialize undo worker launcher shared memory + */ +void +UndoLauncherShmemInit(void) +{ + bool found; + + UndoApplyCtx = (UndoApplyCtxStruct *) + ShmemInitStruct("Undo Worker Launcher Data", + UndoLauncherShmemSize(), + &found); + + if (!found) + memset(UndoApplyCtx, 0, UndoLauncherShmemSize()); +} + +/* + * Main loop for the undo worker launcher process. + */ +void +UndoLauncherMain(Datum main_arg) +{ + MemoryContext tmpctx; + MemoryContext oldctx; + + ereport(DEBUG1, + (errmsg("undo launcher started"))); + + before_shmem_exit(undo_launcher_onexit, (Datum) 0); + + Assert(UndoApplyCtx->launcher_pid == 0); + UndoApplyCtx->launcher_pid = MyProcPid; + + /* Establish signal handlers. */ + pqsignal(SIGHUP, undo_launcher_sighup); + pqsignal(SIGTERM, die); + BackgroundWorkerUnblockSignals(); + + /* + * Establish connection to nailed catalogs (we only ever access + * pg_subscription). + */ + BackgroundWorkerInitializeConnection(NULL, NULL, 0); + + /* Use temporary context for the database list and worker info. 
*/ + tmpctx = AllocSetContextCreate(TopMemoryContext, + "Undo worker Launcher context", + ALLOCSET_DEFAULT_SIZES); + /* Enter main loop */ + for (;;) + { + int rc; + List *dblist; + ListCell *l; + + CHECK_FOR_INTERRUPTS(); + + /* Switch to the temp context. */ + oldctx = MemoryContextSwitchTo(tmpctx); + dblist = RollbackHTGetDBList(); + + foreach(l, dblist) + { + UndoApplyWorker *w; + Oid dbid = lfirst_oid(l); + + LWLockAcquire(UndoWorkerLock, LW_SHARED); + w = undo_worker_find(dbid); + LWLockRelease(UndoWorkerLock); + + if (w == NULL) + { +retry: + if (!undo_worker_launch(dbid)) + { + /* Could not launch the worker; retry after some time. */ + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + DEFAULT_RETRY_NAPTIME, + WAIT_EVENT_UNDO_LAUNCHER_MAIN); + goto retry; + } + } + } + + /* Switch back to original memory context. */ + MemoryContextSwitchTo(oldctx); + + /* Clean the temporary memory. */ + MemoryContextReset(tmpctx); + + /* Wait for more work. */ + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + DEFAULT_NAPTIME_PER_CYCLE, + WAIT_EVENT_UNDO_LAUNCHER_MAIN); + + /* emergency bailout if postmaster has died */ + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); + + if (rc & WL_LATCH_SET) + { + ResetLatch(MyLatch); + CHECK_FOR_INTERRUPTS(); + } + + if (got_SIGHUP) + { + got_SIGHUP = false; + ProcessConfigFile(PGC_SIGHUP); + } + } +} + +/* + * UndoWorkerMain -- Main loop for the undo apply worker. + */ +void +UndoWorkerMain(Datum main_arg) +{ + int worker_slot = DatumGetInt32(main_arg); + Oid dbid; + + dbid = slot_get_dbid(worker_slot); + + /* Setup signal handling */ + pqsignal(SIGTERM, die); + BackgroundWorkerUnblockSignals(); + + /* Connect to the database. */ + BackgroundWorkerInitializeConnectionByOid(dbid, 0, 0); + + /* Attach to slot */ + undo_worker_attach(worker_slot); + + /* + * Create a resource owner for the undo worker. The undo worker needs this as + * it has to read undo records outside of transaction blocks, which in turn + * accesses the buffer read routines. + */ + CreateAuxProcessResourceOwner(); + + RollbackFromHT(dbid); + + ReleaseAuxProcessResources(true); + + proc_exit(0); +} diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index e3b05657f8..95153f4e29 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -154,10 +154,27 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor case RM_COMMIT_TS_ID: case RM_REPLORIGIN_ID: case RM_GENERIC_ID: + case RM_UNDOLOG_ID: /* just deal with xid, and done */ ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record), buf.origptr); break; + case RM_ZHEAP_ID: + /* Logical decoding is not yet implemented for zheap. */ + Assert(0); + break; + case RM_ZHEAP2_ID: + /* Logical decoding is not yet implemented for zheap. */ + Assert(0); + break; + case RM_UNDOACTION_ID: + /* Logical decoding is not yet implemented for undoactions. */ + Assert(0); + break; + case RM_TPD_ID: + /* Logical decoding is not yet implemented for TPD.
*/ + Assert(0); + break; case RM_NEXT_ID: elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record)); } diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 9817770aff..4c8088eb2d 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -176,6 +176,7 @@ static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move); static inline int32 GetPrivateRefCount(Buffer buffer); static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref); +static void InvalidateBuffer(BufferDesc *buf); /* * Ensure that the PrivateRefCountArray has sufficient space to store one more @@ -618,10 +619,12 @@ ReadBuffer(Relation reln, BlockNumber blockNum) * valid, the page is zeroed instead of throwing an error. This is intended * for non-critical data, where the caller is prepared to repair errors. * - * In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's + * In RBM_ZERO mode, if the page isn't in buffer cache already, it's * filled with zeros instead of reading it from disk. Useful when the caller * is going to fill the page from scratch, since this saves I/O and avoids * unnecessary failure if the page-on-disk has corrupt page headers. + * + * In RBM_ZERO_AND_LOCK mode, the page is zeroed and also locked. * The page is returned locked to ensure that the caller has a chance to * initialize the page before it's made visible to others. * Caution: do not use this mode to read a page that is beyond the relation's @@ -672,24 +675,20 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum, /* * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require * a relcache entry for the relation. - * - * NB: At present, this function may only be used on permanent relations, which - * is OK, because we only use it during XLOG replay. If in the future we - * want to use it on temporary or unlogged relations, we could pass additional - * parameters. */ Buffer ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum, ReadBufferMode mode, - BufferAccessStrategy strategy) + BufferAccessStrategy strategy, + char relpersistence) { bool hit; - SMgrRelation smgr = smgropen(rnode, InvalidBackendId); - - Assert(InRecovery); + SMgrRelation smgr = smgropen(rnode, + relpersistence == RELPERSISTENCE_TEMP + ? MyBackendId : InvalidBackendId); - return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum, + return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum, mode, strategy, &hit); } @@ -883,7 +882,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, * Read in the page, unless the caller intends to overwrite it and * just wants us to allocate a buffer. */ - if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) + if (mode == RBM_ZERO || + mode == RBM_ZERO_AND_LOCK || + mode == RBM_ZERO_AND_CLEANUP_LOCK) MemSet((char *) bufBlock, 0, BLCKSZ); else { @@ -1337,6 +1338,61 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, return buf; } +/* + * ForgetBuffer -- drop a buffer from shared buffers + * + * If the buffer isn't present in shared buffers, nothing happens. If it is + * present, it is discarded without making any attempt to write it back out to + * the operating system. 
The caller must therefore somehow be sure that the + * data won't be needed for anything now or in the future. It assumes that + * there is no concurrent access to the block, except that it might be being + * concurrently written. + */ +void +ForgetBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum) +{ + SMgrRelation smgr = smgropen(rnode, InvalidBackendId); + BufferTag tag; /* identity of target block */ + uint32 hash; /* hash value for tag */ + LWLock *partitionLock; /* buffer partition lock for it */ + int buf_id; + BufferDesc *bufHdr; + uint32 buf_state; + + /* create a tag so we can look up the buffer */ + INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum); + + /* determine its hash code and partition lock ID */ + hash = BufTableHashCode(&tag); + partitionLock = BufMappingPartitionLock(hash); + + /* see if the block is in the buffer pool */ + LWLockAcquire(partitionLock, LW_SHARED); + buf_id = BufTableLookup(&tag, hash); + LWLockRelease(partitionLock); + + /* didn't find it, so nothing to do */ + if (buf_id < 0) + return; + + /* take the buffer header lock */ + bufHdr = GetBufferDescriptor(buf_id); + buf_state = LockBufHdr(bufHdr); + + /* + * The buffer might have been evicted after we released the partition lock and + * before we acquired the buffer header lock. If so, the buffer we've + * locked might contain some other data which we shouldn't touch. If the + * buffer hasn't been recycled, we proceed to invalidate it. + */ + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && + bufHdr->tag.blockNum == blockNum && + bufHdr->tag.forkNum == forkNum) + InvalidateBuffer(bufHdr); /* releases spinlock */ + else + UnlockBufHdr(bufHdr, buf_state); +} + /* * InvalidateBuffer -- mark a shared buffer invalid and return it to the * freelist. diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index e4146a260a..553a416ab4 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -272,6 +272,49 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, return bufHdr; } +/* + * ForgetLocalBuffer - drop a buffer from local buffers + * + * This is similar to bufmgr.c's ForgetBuffer, except that we do not need + * to do any locking since this is all local. As with that function, this + * must be used very carefully, since we'll cheerfully throw away dirty + * buffers without any attempt to write them. + */ +void +ForgetLocalBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum) +{ + SMgrRelation smgr = smgropen(rnode, BackendIdForTempRelations()); + BufferTag tag; /* identity of target block */ + LocalBufferLookupEnt *hresult; + BufferDesc *bufHdr; + uint32 buf_state; + + /* + * If somehow this is the first request in the session, there's nothing to + * do. (This probably shouldn't happen, though.)
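Taken together, ForgetBuffer and ForgetLocalBuffer give callers a persistence-aware way to drop a block whose contents are known to be irrelevant; a sketch, where rnode, forknum, blkno and the persistence test are assumptions of this example:

    if (relpersistence == RELPERSISTENCE_TEMP)
        ForgetLocalBuffer(rnode, forknum, blkno);   /* backend-local buffers */
    else
        ForgetBuffer(rnode, forknum, blkno);        /* shared buffers */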
+ */ + if (LocalBufHash == NULL) + return; + + /* create a tag so we can lookup the buffer */ + INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum); + + /* see if the block is in the local buffer pool */ + hresult = (LocalBufferLookupEnt *) + hash_search(LocalBufHash, (void *) &tag, HASH_REMOVE, NULL); + + /* didn't find it, so nothing to do */ + if (!hresult) + return; + + /* mark buffer invalid */ + bufHdr = GetLocalBufferDescriptor(hresult->id); + CLEAR_BUFFERTAG(bufHdr->tag); + buf_state = pg_atomic_read_u32(&bufHdr->state); + buf_state &= ~(BM_VALID | BM_TAG_VALID | BM_DIRTY); + pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state); +} + /* * MarkLocalBufferDirty - * mark a local buffer dirty diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 0c86a581c0..8f6122ae17 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -21,6 +21,8 @@ #include "access/nbtree.h" #include "access/subtrans.h" #include "access/twophase.h" +#include "access/undolog.h" +#include "access/undodiscard.h" #include "commands/async.h" #include "miscadmin.h" #include "pgstat.h" @@ -28,6 +30,7 @@ #include "postmaster/bgworker_internals.h" #include "postmaster/bgwriter.h" #include "postmaster/postmaster.h" +#include "postmaster/undoworker.h" #include "replication/logicallauncher.h" #include "replication/slot.h" #include "replication/walreceiver.h" @@ -127,6 +130,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) size = add_size(size, ProcGlobalShmemSize()); size = add_size(size, XLOGShmemSize()); size = add_size(size, CLOGShmemSize()); + size = add_size(size, UndoLogShmemSize()); size = add_size(size, CommitTsShmemSize()); size = add_size(size, SUBTRANSShmemSize()); size = add_size(size, TwoPhaseShmemSize()); @@ -150,6 +154,8 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) size = add_size(size, SyncScanShmemSize()); size = add_size(size, AsyncShmemSize()); size = add_size(size, BackendRandomShmemSize()); + size = add_size(size, RollbackHTSize()); + size = add_size(size, UndoLauncherShmemSize()); #ifdef EXEC_BACKEND size = add_size(size, ShmemBackendArraySize()); #endif @@ -219,10 +225,12 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) */ XLOGShmemInit(); CLOGShmemInit(); + UndoLogShmemInit(); CommitTsShmemInit(); SUBTRANSShmemInit(); MultiXactShmemInit(); InitBufferPool(); + InitRollbackHashTable(); /* * Set up lock manager @@ -261,6 +269,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) WalSndShmemInit(); WalRcvShmemInit(); ApplyLauncherShmemInit(); + UndoLauncherShmemInit(); /* * Set up other modules that need some shared memory space diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c index 3f57507bce..5d2cb3eaef 100644 --- a/src/backend/storage/lmgr/lmgr.c +++ b/src/backend/storage/lmgr/lmgr.c @@ -721,6 +721,143 @@ ConditionalXactLockTableWait(TransactionId xid) return true; } +/* + * SubXactLockTableInsert + * + * Insert a lock showing that the current subtransaction is running --- + * this is done when a subtransaction performs the operation. The lock can + * then be used to wait for the subtransaction to finish. + */ +void +SubXactLockTableInsert(SubTransactionId subxid) +{ + LOCKTAG tag; + TransactionId xid; + ResourceOwner currentOwner; + + /* Acquire lock only if we doesn't already hold that lock. */ + if (HasCurrentSubTransactionLock()) + return; + + xid = GetTopTransactionId(); + + /* + * Acquire lock on the transaction XID. 
(We assume this cannot block.) We + * have to ensure that the lock is assigned to the transaction's own + * ResourceOwner. + */ + currentOwner = CurrentResourceOwner; + CurrentResourceOwner = GetCurrentTransactionResOwner(); + + SET_LOCKTAG_SUBTRANSACTION(tag, xid, subxid); + (void) LockAcquire(&tag, ExclusiveLock, false, false); + + CurrentResourceOwner = currentOwner; + + SetCurrentSubTransactionLocked(); +} + +/* + * SubXactLockTableDelete + * + * Delete the lock showing that the given subtransaction is running. + * (This is never used for main transaction IDs; those locks are only + * released implicitly at transaction end. But we do use it for + * subtransactions in zheap.) + */ +void +SubXactLockTableDelete(SubTransactionId subxid) +{ + LOCKTAG tag; + TransactionId xid = GetTopTransactionId(); + + SET_LOCKTAG_SUBTRANSACTION(tag, xid, subxid); + + LockRelease(&tag, ExclusiveLock, false); +} + +/* + * SubXactLockTableWait + * + * Wait for the specified subtransaction to commit or abort. Here, instead of + * waiting on xid, we wait on xid + subTransactionId. Whenever a concurrent + * transaction finds a conflict, it creates a lock tag from (slot xid + + * subtransaction id from the undo) and waits on that. + * + * Unlike XactLockTableWait, we don't need to wait for the topmost transaction to + * finish, as we release the lock only when the transaction (committed/aborted) + * is recorded in clog. This has some overhead in terms of maintaining unique + * xid locks for subtransactions during commit, but that shouldn't be much, as + * we release the locks immediately after the transaction is recorded in clog. + * This function is designed for zheap, where we don't have xids assigned to + * subtransactions, so we can't really figure out whether the subtransaction is + * still in progress. + */ +void +SubXactLockTableWait(TransactionId xid, SubTransactionId subxid, Relation rel, + ItemPointer ctid, XLTW_Oper oper) +{ + LOCKTAG tag; + XactLockTableWaitInfo info; + ErrorContextCallback callback; + + /* + * If an operation is specified, set up our verbose error context + * callback. + */ + if (oper != XLTW_None) + { + Assert(RelationIsValid(rel)); + Assert(ItemPointerIsValid(ctid)); + + info.rel = rel; + info.ctid = ctid; + info.oper = oper; + + callback.callback = XactLockTableWaitErrorCb; + callback.arg = &info; + callback.previous = error_context_stack; + error_context_stack = &callback; + } + + Assert(TransactionIdIsValid(xid)); + Assert(!TransactionIdEquals(xid, GetTopTransactionIdIfAny())); + Assert(subxid != InvalidSubTransactionId); + + SET_LOCKTAG_SUBTRANSACTION(tag, xid, subxid); + + (void) LockAcquire(&tag, ShareLock, false, false); + + LockRelease(&tag, ShareLock, false); + + if (oper != XLTW_None) + error_context_stack = callback.previous; +} + +/* + * ConditionalSubXactLockTableWait + * + * As above, but only lock if we can get the lock without blocking. + * Returns true if the lock was acquired.
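A sketch of the overall protocol these routines implement; xid, subxid, rel and the tuple are assumed to come from the conflicting undo record, and the names are illustrative:

    /* In the backend running the subtransaction, before it modifies tuples: */
    SubXactLockTableInsert(GetCurrentSubTransactionId());

    /* In a backend that finds that in-progress subtransaction in its way: */
    SubXactLockTableWait(xid, subxid, rel, &tuple->t_self, XLTW_Update);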
+ */ +bool +ConditionalSubXactLockTableWait(TransactionId xid, SubTransactionId subxid) +{ + LOCKTAG tag; + + Assert(TransactionIdIsValid(xid)); + Assert(!TransactionIdEquals(xid, GetTopTransactionIdIfAny())); + + SET_LOCKTAG_SUBTRANSACTION(tag, xid, subxid); + + if (LockAcquire(&tag, ShareLock, false, true) == LOCKACQUIRE_NOT_AVAIL) + return false; + + LockRelease(&tag, ShareLock, false); + + return true; +} + /* * SpeculativeInsertionLockAcquire * @@ -768,6 +905,17 @@ SpeculativeInsertionLockRelease(TransactionId xid) LockRelease(&tag, ExclusiveLock, false); } +/* + * GetSpeculativeInsertionToken + * + * Return the value of speculative insertion token. + */ +uint32 +GetSpeculativeInsertionToken(void) +{ + return speculativeInsertionToken; +} + /* * SpeculativeInsertionWait * diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index a6fda81feb..b6c0b00ed0 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -521,6 +521,8 @@ RegisterLWLockTranches(void) LWLockRegisterTranche(LWTRANCHE_TBM, "tbm"); LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append"); LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join"); + LWLockRegisterTranche(LWTRANCHE_UNDOLOG, "undo_log"); + LWLockRegisterTranche(LWTRANCHE_UNDODISCARD, "undo_discard"); /* Register named tranches. */ for (i = 0; i < NamedLWLockTrancheRequests; i++) diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt index e6025ecedb..cde0daef7b 100644 --- a/src/backend/storage/lmgr/lwlocknames.txt +++ b/src/backend/storage/lmgr/lwlocknames.txt @@ -50,3 +50,6 @@ OldSnapshotTimeMapLock 42 BackendRandomLock 43 LogicalRepWorkerLock 44 CLogTruncationLock 45 +UndoLogLock 46 +RollbackHTLock 47 +UndoWorkerLock 48 \ No newline at end of file diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c index 2960e21340..a5730a9bba 100644 --- a/src/backend/storage/lmgr/predicate.c +++ b/src/backend/storage/lmgr/predicate.c @@ -461,7 +461,6 @@ static void SetNewSxactGlobalXmin(void); static void ClearOldPredicateLocks(void); static void ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial, bool summarize); -static bool XidIsConcurrent(TransactionId xid); static void CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag); static void FlagRWConflict(SERIALIZABLEXACT *reader, SERIALIZABLEXACT *writer); static void OnConflict_CheckForSerializationFailure(const SERIALIZABLEXACT *reader, @@ -1049,6 +1048,12 @@ CheckPointPredicate(void) /*------------------------------------------------------------------------*/ +bool +IsSerializableXact() +{ + return (MySerializableXact != InvalidSerializableXact); +} + /* * InitPredicateLocks -- Initialize the predicate locking data structures. * @@ -2495,11 +2500,10 @@ PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot) * Skip if this is a temporary table. 
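With the signature change shown just below, heap callers pass the TID and the tuple's xmin explicitly instead of handing over the whole HeapTuple; a sketch of a converted call site (tuple, relation and snapshot are assumed locals):

    PredicateLockTid(relation, &(tuple->t_self), snapshot,
                     HeapTupleHeaderGetXmin(tuple->t_data));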
*/ void -PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot) +PredicateLockTid(Relation relation, ItemPointer tid, Snapshot snapshot, + TransactionId targetxmin) { PREDICATELOCKTARGETTAG tag; - ItemPointer tid; - TransactionId targetxmin; if (!SerializationNeededForRead(relation, snapshot)) return; @@ -2511,8 +2515,6 @@ PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot) { TransactionId myxid; - targetxmin = HeapTupleHeaderGetXmin(tuple->t_data); - myxid = GetTopTransactionIdIfAny(); if (TransactionIdIsValid(myxid)) { @@ -2541,7 +2543,6 @@ PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot) if (PredicateLockExists(&tag)) return; - tid = &(tuple->t_self); SET_PREDICATELOCKTARGETTAG_TUPLE(tag, relation->rd_node.dbNode, relation->rd_id, @@ -3853,7 +3854,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial, * that to this function to save the overhead of checking the snapshot's * subxip array. */ -static bool +bool XidIsConcurrent(TransactionId xid) { Snapshot snap; @@ -3898,14 +3899,13 @@ XidIsConcurrent(TransactionId xid) */ void CheckForSerializableConflictOut(bool visible, Relation relation, - HeapTuple tuple, Buffer buffer, + void *stup, Buffer buffer, Snapshot snapshot) { TransactionId xid; SERIALIZABLEXIDTAG sxidtag; SERIALIZABLEXID *sxid; SERIALIZABLEXACT *sxact; - HTSV_Result htsvResult; if (!SerializationNeededForRead(relation, snapshot)) return; @@ -3920,65 +3920,17 @@ CheckForSerializableConflictOut(bool visible, Relation relation, errhint("The transaction might succeed if retried."))); } - /* - * Check to see whether the tuple has been written to by a concurrent - * transaction, either to create it not visible to us, or to delete it - * while it is visible to us. The "visible" bool indicates whether the - * tuple is visible to us, while HeapTupleSatisfiesVacuum checks what else - * is going on with it. - */ - htsvResult = HeapTupleSatisfiesVacuum(tuple, TransactionXmin, buffer); - switch (htsvResult) + if (RelationStorageIsZHeap(relation)) { - case HEAPTUPLE_LIVE: - if (visible) - return; - xid = HeapTupleHeaderGetXmin(tuple->t_data); - break; - case HEAPTUPLE_RECENTLY_DEAD: - if (!visible) - return; - xid = HeapTupleHeaderGetUpdateXid(tuple->t_data); - break; - case HEAPTUPLE_DELETE_IN_PROGRESS: - xid = HeapTupleHeaderGetUpdateXid(tuple->t_data); - break; - case HEAPTUPLE_INSERT_IN_PROGRESS: - xid = HeapTupleHeaderGetXmin(tuple->t_data); - break; - case HEAPTUPLE_DEAD: + if (!ZHeapTupleHasSerializableConflictOut(visible, relation, + (ItemPointer) stup, buffer, &xid)) + return; + } + else + { + if (!HeapTupleHasSerializableConflictOut(visible, (HeapTuple) stup, buffer, &xid)) return; - default: - - /* - * The only way to get to this default clause is if a new value is - * added to the enum type without adding it to this switch - * statement. That's a bug, so elog. - */ - elog(ERROR, "unrecognized return value from HeapTupleSatisfiesVacuum: %u", htsvResult); - - /* - * In spite of having all enum values covered and calling elog on - * this default, some compilers think this is a code path which - * allows xid to be used below without initialization. Silence - * that warning. - */ - xid = InvalidTransactionId; } - Assert(TransactionIdIsValid(xid)); - Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin)); - - /* - * Find top level xid. Bail out if xid is too early to be a conflict, or - * if it's our own xid. 
- */ - if (TransactionIdEquals(xid, GetTopTransactionIdIfAny())) - return; - xid = SubTransGetTopmostTransaction(xid); - if (TransactionIdPrecedes(xid, TransactionXmin)) - return; - if (TransactionIdEquals(xid, GetTopTransactionIdIfAny())) - return; /* * Find sxact or summarized info for the top level xid. @@ -4278,7 +4230,7 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag) * tuple itself. */ void -CheckForSerializableConflictIn(Relation relation, HeapTuple tuple, +CheckForSerializableConflictIn(Relation relation, ItemPointer tid, Buffer buffer) { PREDICATELOCKTARGETTAG targettag; @@ -4309,13 +4261,13 @@ CheckForSerializableConflictIn(Relation relation, HeapTuple tuple, * It is not possible to take and hold a lock across the checks for all * granularities because each target could be in a separate partition. */ - if (tuple != NULL) + if (ItemPointerIsValid(tid)) { SET_PREDICATELOCKTARGETTAG_TUPLE(targettag, relation->rd_node.dbNode, relation->rd_id, - ItemPointerGetBlockNumber(&(tuple->t_self)), - ItemPointerGetOffsetNumber(&(tuple->t_self))); + ItemPointerGetBlockNumber(tid), + ItemPointerGetOffsetNumber(tid)); CheckTargetForConflictsIn(&targettag); } diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index 33387fb71b..69c7b6a781 100644 --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -286,6 +286,8 @@ InitProcGlobal(void) /* Create ProcStructLock spinlock, too */ ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t)); SpinLockInit(ProcStructLock); + + pg_atomic_init_u64(&ProcGlobal->oldestXidWithEpochHavingUndo, 0); } /* diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index dfbda5458f..890c7a337e 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -17,6 +17,9 @@ #include "access/htup_details.h" #include "access/itup.h" #include "access/xlog.h" +#include "access/zhtup.h" +#include "access/zheap.h" +#include "storage/bufmgr.h" #include "storage/checksum.h" #include "utils/memdebug.h" #include "utils/memutils.h" @@ -107,7 +110,8 @@ PageIsVerified(Page page, BlockNumber blkno) * the block can still reveal problems, which is why we offer the * checksum option. */ - if ((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 && + if (((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 || + (p->pd_flags & ~PD_ZHEAP_VALID_FLAG_BITS) == 0) && p->pd_lower <= p->pd_upper && p->pd_upper <= p->pd_special && p->pd_special <= BLCKSZ && @@ -414,17 +418,6 @@ PageRestoreTempPage(Page tempPage, Page oldPage) pfree(tempPage); } -/* - * sorting support for PageRepairFragmentation and PageIndexMultiDelete - */ -typedef struct itemIdSortData -{ - uint16 offsetindex; /* linp array index */ - int16 itemoff; /* page offset of item data */ - uint16 alignedlen; /* MAXALIGN(item data len) */ -} itemIdSortData; -typedef itemIdSortData *itemIdSort; - static int itemoffcompare(const void *itemidp1, const void *itemidp2) { @@ -437,7 +430,7 @@ itemoffcompare(const void *itemidp1, const void *itemidp2) * After removing or marking some line pointers unused, move the tuples to * remove the gaps caused by the removed items. 
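For illustration, with compactify_tuples() exported below (and the itemIdSortData definition moved out of bufpage.c), an outside caller such as zheap/TPD page compaction could drive it roughly as in this sketch. It is modelled on PageRepairFragmentation(); the function name compact_page_sketch() and the use of MaxHeapTuplesPerPage as the array bound are illustrative assumptions, not part of the patch.

	/* Sketch: collect the line pointers that still have storage and let
	 * compactify_tuples() squeeze out the holes left by removed items. */
	static void
	compact_page_sketch(Page page)
	{
		itemIdSortData itemidbase[MaxHeapTuplesPerPage];
		itemIdSort	itemidptr = itemidbase;
		OffsetNumber offnum;
		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

		for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
		{
			ItemId		lp = PageGetItemId(page, offnum);

			if (!ItemIdIsUsed(lp) || !ItemIdHasStorage(lp))
				continue;		/* nothing stored for this line pointer */

			itemidptr->offsetindex = offnum - 1;		/* linp array index */
			itemidptr->itemoff = ItemIdGetOffset(lp);	/* current data offset */
			itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
			itemidptr++;
		}

		/* Moves the surviving tuple data and resets pd_upper accordingly. */
		compactify_tuples(itemidbase, itemidptr - itemidbase, page);
	}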
*/ -static void +void compactify_tuples(itemIdSort itemidbase, int nitems, Page page) { PageHeader phdr = (PageHeader) page; diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile index 2b95cb0df1..b657eb275f 100644 --- a/src/backend/storage/smgr/Makefile +++ b/src/backend/storage/smgr/Makefile @@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global -OBJS = md.o smgr.o smgrtype.o +OBJS = md.o smgr.o smgrtype.o undofile.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/storage/smgr/README b/src/backend/storage/smgr/README index 37ed40b645..641926f876 100644 --- a/src/backend/storage/smgr/README +++ b/src/backend/storage/smgr/README @@ -10,16 +10,14 @@ memory, but these were never supported in any externally released Postgres, nor in any version of PostgreSQL.) The "magnetic disk" manager is itself seriously misnamed, because actually it supports any kind of device for which the operating system provides standard filesystem operations; which -these days is pretty much everything of interest. However, we retain the -notion of a storage manager switch in case anyone ever wants to reintroduce -other kinds of storage managers. Removing the switch layer would save -nothing noticeable anyway, since storage-access operations are surely far -more expensive than one extra layer of C function calls. +these days is pretty much everything of interest. However, we retained the +notion of a storage manager switch, and it has turned out to be useful for +plugging in a new storage manager to support buffered undo logs. In Berkeley Postgres each relation was tagged with the ID of the storage -manager to use for it. This is gone. It would be probably more reasonable -to associate storage managers with tablespaces, should we ever re-introduce -multiple storage managers into the system catalogs. +manager to use for it. This is gone. While earlier PostgreSQL releases were +hard-coded to use md.c unconditionally, PostgreSQL 12 routes I/O for the undo +pseudo-database to undofile.c. The files in this directory, and their contents, are @@ -31,6 +29,12 @@ The files in this directory, and their contents, are md.c The "magnetic disk" storage manager, which is really just an interface to the kernel's filesystem operations. + undofile.c The undo log storage manager. This provides + buffer-pool-based access to the contents of undo log + segment files. It implements only a limited subset of + the smgr interface: it can only read and write blocks + of existing files. + smgrtype.c Storage manager type -- maps string names to storage manager IDs and provides simple comparison operators. This is the regproc support for type "smgr" in the system catalogs. @@ -38,6 +42,9 @@ The files in this directory, and their contents, are in the catalogs anymore.) Note that md.c in turn relies on src/backend/storage/file/fd.c. +undofile.c also uses fd.c to read and write blocks, but it expects +src/backend/access/undo/undolog.c to manage the files holding those +blocks. Relation Forks diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 4c6a50509f..4c489a2e59 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -45,7 +45,7 @@ #define UNLINKS_PER_ABSORB 10 /* - * Special values for the segno arg to RememberFsyncRequest. + * Special values for the segno arg to mdrequestsync.
* * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an * fsync request from the queue if an identical, subsequent request is found. @@ -1420,7 +1420,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg) if (pendingOpsTable) { /* push it into local pending-ops table */ - RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno); + mdrequestsync(reln->smgr_rnode.node, forknum, seg->mdfd_segno); } else { @@ -1456,8 +1456,7 @@ register_unlink(RelFileNodeBackend rnode) if (pendingOpsTable) { /* push it into local pending-ops table */ - RememberFsyncRequest(rnode.node, MAIN_FORKNUM, - UNLINK_RELATION_REQUEST); + mdrequestsync(rnode.node, MAIN_FORKNUM, UNLINK_RELATION_REQUEST); } else { @@ -1476,7 +1475,7 @@ register_unlink(RelFileNodeBackend rnode) } /* - * RememberFsyncRequest() -- callback from checkpointer side of fsync request + * mdrequestsync() -- callback from checkpointer side of fsync request * * We stuff fsync requests into the local hash table for execution * during the checkpointer's next checkpoint. UNLINK requests go into a @@ -1497,7 +1496,7 @@ register_unlink(RelFileNodeBackend rnode) * heavyweight operation anyhow, so we'll live with it.) */ void -RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno) +mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno) { Assert(pendingOpsTable); @@ -1640,7 +1639,7 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum) if (pendingOpsTable) { /* standalone backend or startup process: fsync state is local */ - RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC); + mdrequestsync(rnode, forknum, FORGET_RELATION_FSYNC); } else if (IsUnderPostmaster) { @@ -1679,7 +1678,7 @@ ForgetDatabaseFsyncRequests(Oid dbid) if (pendingOpsTable) { /* standalone backend or startup process: fsync state is local */ - RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC); + mdrequestsync(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC); } else if (IsUnderPostmaster) { diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 189342ef86..57e1668b5d 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -58,6 +58,8 @@ typedef struct f_smgr BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum); void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); + void (*smgr_requestsync) (RelFileNode rnode, ForkNumber forknum, + int segno); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); void (*smgr_pre_ckpt) (void); /* may be NULL */ void (*smgr_sync) (void); /* may be NULL */ @@ -81,15 +83,33 @@ static const f_smgr smgrsw[] = { .smgr_writeback = mdwriteback, .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, + .smgr_requestsync = mdrequestsync, .smgr_immedsync = mdimmedsync, .smgr_pre_ckpt = mdpreckpt, .smgr_sync = mdsync, .smgr_post_ckpt = mdpostckpt + }, + /* undo logs */ + {undofile_init, undofile_shutdown, undofile_close, undofile_create, + undofile_exists, undofile_unlink, undofile_extend, undofile_prefetch, + undofile_read, undofile_write, undofile_writeback, undofile_nblocks, + undofile_truncate, + undofile_requestsync, + undofile_immedsync, undofile_preckpt, undofile_sync, + undofile_postckpt } }; static const int NSmgr = lengthof(smgrsw); +/* + * In ancient Postgres the catalog entry for each relation controlled the + * choice of storage manager implementation. 
Now we have only md.c for + * regular relations, and undofile.c for undo log storage in the undolog + * pseudo-database. + */ +#define SmgrWhichForRelFileNode(rfn) \ + ((rfn).dbNode == 9 ? 1 : 0) /* * Each backend has a hashtable that stores all extant SMgrRelation objects. @@ -185,11 +205,18 @@ smgropen(RelFileNode rnode, BackendId backend) reln->smgr_targblock = InvalidBlockNumber; reln->smgr_fsm_nblocks = InvalidBlockNumber; reln->smgr_vm_nblocks = InvalidBlockNumber; - reln->smgr_which = 0; /* we only have md.c at present */ + + /* Which storage manager implementation? */ + reln->smgr_which = SmgrWhichForRelFileNode(rnode); /* mark it not open */ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) + { reln->md_num_open_segs[forknum] = 0; + reln->md_seg_fds[forknum] = NULL; + } + + reln->private_data = NULL; /* it has no owner yet */ add_to_unowned_list(reln); @@ -722,6 +749,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks) smgrsw[reln->smgr_which].smgr_truncate(reln, forknum, nblocks); } +/* + * smgrrequestsync() -- Enqueue a request for smgrsync() to flush data. + */ +void smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno) +{ + smgrsw[SmgrWhichForRelFileNode(rnode)].smgr_requestsync(rnode, forknum, segno); +} + /* * smgrimmedsync() -- Force the specified relation to stable storage. * diff --git a/src/backend/storage/smgr/undofile.c b/src/backend/storage/smgr/undofile.c new file mode 100644 index 0000000000..afba64eb9b --- /dev/null +++ b/src/backend/storage/smgr/undofile.c @@ -0,0 +1,546 @@ +/* + * undofile.h + * + * PostgreSQL undo file manager. This module provides SMGR-compatible + * interface to the files that back undo logs on the filesystem, so that undo + * log data can use the shared buffer pool. Other aspects of undo log + * management are provided by undolog.c, so the SMGR interfaces not directly + * concerned with reading, writing and flushing data are unimplemented. + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/storage/smgr/undofile.c + */ + +#include "postgres.h" + +#include "access/undolog.h" +#include "access/xlog.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "postmaster/bgwriter.h" +#include "storage/fd.h" +#include "storage/undofile.h" +#include "utils/memutils.h" + +/* intervals for calling AbsorbFsyncRequests in undofile_sync */ +#define FSYNCS_PER_ABSORB 10 + +/* + * Special values for the fork arg to undofile_requestsync. + */ +#define FORGET_UNDO_SEGMENT_FSYNC (InvalidBlockNumber) + +/* + * While md.c expects random access and has a small number of huge + * segments, undofile.c manages a potentially very large number of smaller + * segments and has a less random access pattern. Therefore, instead of + * keeping a potentially huge array of vfds we'll just keep the most + * recently accessed N. + * + * For now, N == 1, so we just need to hold onto one 'File' handle. + */ +typedef struct UndoFileState +{ + int mru_segno; + File mru_file; +} UndoFileState; + +static MemoryContext UndoFileCxt; + +typedef uint16 CycleCtr; + +/* + * An entry recording the segments that need to be fsynced by undofile_sync(). + * This is a bit simpler than md.c's version, though it could perhaps be + * merged into a common struct. One difference is that we can have much + * larger segment numbers, so we'll adjust for that to avoid having a lot of + * leading zero bits. 
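In other words, each per-RelFileNode entry accumulates plain segment numbers in a Bitmapset and later drains them, as in the sketch below. Only the Bitmapset API from nodes/bitmapset.h is assumed; the helper names remember_segment()/drain_segments() and sync_one_segment() are illustrative, not from the patch.

	/* Accumulate: remember that segment 'segno' of this undo log needs fsync. */
	static void
	remember_segment(PendingOperationEntry *entry, int segno)
	{
		if (entry->requests == NULL)
			entry->requests = bms_make_singleton(segno);
		else
			entry->requests = bms_add_member(entry->requests, segno);
	}

	/* Drain: visit the remembered segment numbers in ascending order. */
	static void
	drain_segments(PendingOperationEntry *entry)
	{
		Bitmapset  *requests = entry->requests;
		int			segno = -1;

		entry->requests = NULL;
		while ((segno = bms_next_member(requests, segno)) >= 0)
			sync_one_segment(entry->rnode, segno);	/* illustrative helper */
		bms_free(requests);
	}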
+ */ +typedef struct +{ + RelFileNode rnode; + Bitmapset *requests; + CycleCtr cycle_ctr; +} PendingOperationEntry; + +static HTAB *pendingOpsTable = NULL; +static MemoryContext pendingOpsCxt; + +static CycleCtr undofile_sync_cycle_ctr = 0; + +static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno, + bool missing_ok); +static File undofile_get_segment_file(SMgrRelation reln, int segno); + +void +undofile_init(void) +{ + UndoFileCxt = AllocSetContextCreate(TopMemoryContext, + "UndoFileSmgr", + ALLOCSET_DEFAULT_SIZES); + + if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess()) + { + HASHCTL hash_ctl; + + pendingOpsCxt = AllocSetContextCreate(UndoFileCxt, + "Pending ops context", + ALLOCSET_DEFAULT_SIZES); + MemoryContextAllowInCriticalSection(pendingOpsCxt, true); + + MemSet(&hash_ctl, 0, sizeof(hash_ctl)); + hash_ctl.keysize = sizeof(RelFileNode); + hash_ctl.entrysize = sizeof(PendingOperationEntry); + hash_ctl.hcxt = pendingOpsCxt; + pendingOpsTable = hash_create("Pending Ops Table", + 100L, + &hash_ctl, + HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); + } +} + +void +undofile_shutdown(void) +{ +} + +void +undofile_close(SMgrRelation reln, ForkNumber forknum) +{ +} + +void +undofile_create(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + elog(ERROR, "undofile_create is not supported"); +} + +bool +undofile_exists(SMgrRelation reln, ForkNumber forknum) +{ + elog(ERROR, "undofile_exists is not supported"); +} + +void +undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo) +{ + elog(ERROR, "undofile_unlink is not supported"); +} + +void +undofile_extend(SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum, char *buffer, + bool skipFsync) +{ + elog(ERROR, "undofile_extend is not supported"); +} + +void +undofile_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) +{ + elog(ERROR, "undofile_prefetch is not supported"); +} + +void +undofile_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, + char *buffer) +{ + File file; + off_t seekpos; + int nbytes; + + Assert(forknum == MAIN_FORKNUM); + file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE); + seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE)); + Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE); + nbytes = FileRead(file, buffer, BLCKSZ, seekpos, WAIT_EVENT_UNDO_FILE_READ); + if (nbytes != BLCKSZ) + { + if (nbytes < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read block %u in file \"%s\": %m", + blocknum, FilePathName(file)))); + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("could not read block %u in file \"%s\": read only %d of %d bytes", + blocknum, FilePathName(file), + nbytes, BLCKSZ))); + } +} + +static void +register_dirty_segment(SMgrRelation reln, ForkNumber forknum, int segno, File file) +{ + /* Temp relations should never be fsync'd */ + Assert(!SmgrIsTemp(reln)); + + if (pendingOpsTable) + { + /* push it into local pending-ops table */ + undofile_requestsync(reln->smgr_rnode.node, forknum, segno); + } + else + { + if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, segno)) + return; /* passed it off successfully */ + + ereport(DEBUG1, + (errmsg("could not forward fsync request because request queue is full"))); + + if (FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not fsync file \"%s\": %m", + FilePathName(file)))); + } +} + +void +undofile_write(SMgrRelation reln, ForkNumber forknum, + BlockNumber 
blocknum, char *buffer, + bool skipFsync) +{ + File file; + off_t seekpos; + int nbytes; + + Assert(forknum == MAIN_FORKNUM); + file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE); + seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE)); + Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE); + nbytes = FileWrite(file, buffer, BLCKSZ, seekpos, WAIT_EVENT_UNDO_FILE_WRITE); + if (nbytes != BLCKSZ) + { + if (nbytes < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write block %u in file \"%s\": %m", + blocknum, FilePathName(file)))); + /* + * short write: unexpected, because this should be overwriting an + * entirely pre-allocated segment file + */ + ereport(ERROR, + (errcode(ERRCODE_DISK_FULL), + errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes", + blocknum, FilePathName(file), + nbytes, BLCKSZ))); + } + + if (!skipFsync && !SmgrIsTemp(reln)) + register_dirty_segment(reln, forknum, blocknum / UNDOSEG_SIZE, file); +} + +void +undofile_writeback(SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum, BlockNumber nblocks) +{ + while (nblocks > 0) + { + File file; + int nflush; + + file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE); + + /* compute number of desired writes within the current segment */ + nflush = Min(nblocks, + 1 + UNDOSEG_SIZE - (blocknum % UNDOSEG_SIZE)); + + FileWriteback(file, + (blocknum % UNDOSEG_SIZE) * BLCKSZ, + nflush * BLCKSZ, WAIT_EVENT_UNDO_FILE_FLUSH); + + nblocks -= nflush; + blocknum += nflush; + } +} + +BlockNumber +undofile_nblocks(SMgrRelation reln, ForkNumber forknum) +{ + elog(ERROR, "undofile_nblocks is not supported"); + return 0; +} + +void +undofile_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks) +{ + elog(ERROR, "undofile_truncate is not supported"); +} + +void +undofile_immedsync(SMgrRelation reln, ForkNumber forknum) +{ + elog(ERROR, "undofile_immedsync is not supported"); +} + +void +undofile_preckpt(void) +{ +} + +void +undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno) +{ + MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt); + PendingOperationEntry *entry; + bool found; + + Assert(pendingOpsTable); + + if (forknum == FORGET_UNDO_SEGMENT_FSYNC) + { + entry = (PendingOperationEntry *) hash_search(pendingOpsTable, + &rnode, + HASH_FIND, + NULL); + if (entry) + entry->requests = bms_del_member(entry->requests, segno); + } + else + { + entry = (PendingOperationEntry *) hash_search(pendingOpsTable, + &rnode, + HASH_ENTER, + &found); + if (!found) + { + entry->cycle_ctr = undofile_sync_cycle_ctr; + entry->requests = bms_make_singleton(segno); + } + else + entry->requests = bms_add_member(entry->requests, segno); + } + + MemoryContextSwitchTo(oldcxt); +} + +void +undofile_forgetsync(Oid logno, Oid tablespace, int segno) +{ + RelFileNode rnode; + + rnode.dbNode = 9; + rnode.spcNode = tablespace; + rnode.relNode = logno; + + if (pendingOpsTable) + undofile_requestsync(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno); + else if (IsUnderPostmaster) + { + while (!ForwardFsyncRequest(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno)) + pg_usleep(10000L); + } +} + +void +undofile_sync(void) +{ + static bool undofile_sync_in_progress = false; + + HASH_SEQ_STATUS hstat; + PendingOperationEntry *entry; + int absorb_counter; + int segno; + + if (!pendingOpsTable) + elog(ERROR, "cannot sync without a pendingOpsTable"); + + AbsorbFsyncRequests(); + + if (undofile_sync_in_progress) + { + /* prior try failed, so update any stale cycle_ctr values */ + 
hash_seq_init(&hstat, pendingOpsTable); + while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) + entry->cycle_ctr = undofile_sync_cycle_ctr; + } + + undofile_sync_cycle_ctr++; + undofile_sync_in_progress = true; + + absorb_counter = FSYNCS_PER_ABSORB; + hash_seq_init(&hstat, pendingOpsTable); + while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) + { + Bitmapset *requests; + + /* Skip entries that were added after this sync cycle began. */ + if (entry->cycle_ctr == undofile_sync_cycle_ctr) + continue; + + Assert((CycleCtr) (entry->cycle_ctr + 1) == undofile_sync_cycle_ctr); + + if (!enableFsync) + continue; + + requests = entry->requests; + entry->requests = NULL; + + segno = -1; + while ((segno = bms_next_member(requests, segno)) >= 0) + { + File file; + + if (!enableFsync) + continue; + + file = undofile_open_segment_file(entry->rnode.relNode, + entry->rnode.spcNode, + segno, true /* missing_ok */); + + /* + * The file may be gone due to concurrent discard. We'll ignore + * that, but only if we find a cancel request for this segment in + * the queue. + * + * It's also possible that we succeed in opening a segment file + * that is subsequently recycled (renamed to represent a new range + * of undo log), in which case we'll fsync that later file + * instead. That is rare and harmless. + */ + if (file <= 0) + { + char name[MAXPGPATH]; + + /* + * Put the request back into the bitset in a way that can't + * fail due to memory allocation. + */ + entry->requests = bms_join(entry->requests, requests); + /* + * Check if a forgetsync request has arrived to delete that + * segment. + */ + AbsorbFsyncRequests(); + if (bms_is_member(segno, entry->requests)) + { + UndoLogSegmentPath(entry->rnode.relNode, + segno, + entry->rnode.spcNode, + name); + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not fsync file \"%s\": %m", name))); + } + /* It must have been removed, so we can safely skip it. */ + continue; + } + + elog(LOG, "fsync()ing %s", FilePathName(file)); /* TODO: remove me */ + if (FileSync(file, WAIT_EVENT_UNDO_FILE_SYNC) < 0) + { + char name[MAXPGPATH]; + + strcpy(name, FilePathName(file)); + FileClose(file); + + /* + * Keep the failed requests, but merge with any new ones. The + * requirement to be able to do this without risk of failure + * prevents us from using a smaller bitmap that doesn't bother + * tracking leading zeros. Perhaps another data structure + * would be better. + */ + entry->requests = bms_join(entry->requests, requests); + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not fsync file \"%s\": %m", name))); + } + requests = bms_del_member(requests, segno); + FileClose(file); + + if (--absorb_counter <= 0) + { + AbsorbFsyncRequests(); + absorb_counter = FSYNCS_PER_ABSORB; + } + } + + bms_free(requests); + } + + /* Flag successful completion of undofile_sync. */ + undofile_sync_in_progress = false; +} + +void undofile_postckpt(void) +{ +} + +static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno, + bool missing_ok) +{ + File file; + char path[MAXPGPATH]; + + UndoLogSegmentPath(relNode, segno, spcNode, path); + file = PathNameOpenFile(path, O_RDWR | PG_BINARY); + + if (file <= 0 && (!missing_ok || errno != ENOENT)) + elog(ERROR, "cannot open undo segment file '%s': %m", path); + + return file; +} + +/* + * Get a File for a particular segment of a SMgrRelation representing an undo + * log. + */ +static File undofile_get_segment_file(SMgrRelation reln, int segno) +{ + UndoFileState *state; + + + /* + * Create private state space on demand.
+ * + * XXX There should probably be a smgr 'open' or 'init' interface that + * would do this. smgr.c currently initializes reln->md_XXX stuff + * directly... + */ + state = (UndoFileState *) reln->private_data; + if (unlikely(state == NULL)) + { + state = MemoryContextAllocZero(UndoFileCxt, sizeof(UndoFileState)); + reln->private_data = state; + } + + /* If we have a file open already, check if we need to close it. */ + if (state->mru_file > 0 && state->mru_segno != segno) + { + /* These are not the blocks we're looking for. */ + FileClose(state->mru_file); + state->mru_file = 0; + } + + /* Check if we need to open a new file. */ + if (state->mru_file <= 0) + { + state->mru_file = + undofile_open_segment_file(reln->smgr_rnode.node.relNode, + reln->smgr_rnode.node.spcNode, + segno, InRecovery); + if (InRecovery && state->mru_file <= 0) + { + /* + * If in recovery, we may be trying to access a file that will + * later be unlinked. Tolerate missing files, creating a new + * zero-filled file as required. + */ + UndoLogNewSegment(reln->smgr_rnode.node.relNode, + reln->smgr_rnode.node.spcNode, + segno); + state->mru_file = + undofile_open_segment_file(reln->smgr_rnode.node.relNode, + reln->smgr_rnode.node.spcNode, + segno, false); + Assert(state->mru_file > 0); + } + state->mru_segno = segno; + } + + return state->mru_file; +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 5ab7d3cd8d..0054baa35a 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -4129,6 +4129,7 @@ PostgresMain(int argc, char *argv[], * not preventing advance of global xmin while we wait for the client. */ InvalidateCatalogSnapshotConditionally(); + XactPerformUndoActionsIfPending(); /* * (1) If we've reached idle state, tell the frontend we're ready for diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 970c94ee80..3b35c9d62d 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -1016,7 +1016,7 @@ ProcessUtilitySlow(ParseState *pstate, /* * parse and validate reloptions for the toast - * table + * table. */ toast_options = transformRelOptions((Datum) 0, ((CreateStmt *) stmt)->options, diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c index 525decb6f1..b5687fea8f 100644 --- a/src/backend/utils/adt/lockfuncs.c +++ b/src/backend/utils/adt/lockfuncs.c @@ -29,9 +29,11 @@ const char *const LockTagTypeNames[] = { "page", "tuple", "transactionid", + "subtransactionid", "virtualxid", "speculative token", "object", + "undoaction", "userlock", "advisory" }; diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index f955f1912a..219e98e1ad 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -28,6 +28,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/rel.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -137,12 +138,44 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS) Oid relid = PG_GETARG_OID(0); int64 result; PgStat_StatTabEntry *tabentry; + Relation rel = heap_open(relid, AccessShareLock); - if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL) + /* + * Counter tuples_hot_updated stores number of hot updates for heap table + * and the number of inplace updates for zheap table. 
+ */ + if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL || + RelationStorageIsZHeap(rel)) result = 0; else result = (int64) (tabentry->tuples_hot_updated); + heap_close(rel, AccessShareLock); + + PG_RETURN_INT64(result); +} + + +Datum +pg_stat_get_tuples_inplace_updated(PG_FUNCTION_ARGS) +{ + Oid relid = PG_GETARG_OID(0); + int64 result; + PgStat_StatTabEntry *tabentry; + Relation rel = heap_open(relid, AccessShareLock); + + /* + * Counter tuples_hot_updated stores number of hot updates for heap table + * and the number of inplace updates for zheap table. + */ + if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL || + !RelationStorageIsZHeap(rel)) + result = 0; + else + result = (int64) (tabentry->tuples_hot_updated); + + heap_close(rel, AccessShareLock); + PG_RETURN_INT64(result); } @@ -1685,12 +1718,43 @@ pg_stat_get_xact_tuples_hot_updated(PG_FUNCTION_ARGS) Oid relid = PG_GETARG_OID(0); int64 result; PgStat_TableStatus *tabentry; + Relation rel = heap_open(relid, AccessShareLock); - if ((tabentry = find_tabstat_entry(relid)) == NULL) + /* + * Counter t_tuples_hot_updated stores number of hot updates for heap + * table and the number of inplace updates for zheap table. + */ + if ((tabentry = find_tabstat_entry(relid)) == NULL || + RelationStorageIsZHeap(rel)) + result = 0; + else + result = (int64) (tabentry->t_counts.t_tuples_hot_updated); + + heap_close(rel, AccessShareLock); + + PG_RETURN_INT64(result); +} + +Datum +pg_stat_get_xact_tuples_inplace_updated(PG_FUNCTION_ARGS) +{ + Oid relid = PG_GETARG_OID(0); + int64 result; + PgStat_TableStatus *tabentry; + Relation rel = heap_open(relid, AccessShareLock); + + /* + * Counter t_tuples_hot_updated stores number of hot updates for heap table + * and the number of inplace updates for zheap table. + */ + if ((tabentry = find_tabstat_entry(relid)) == NULL || + !RelationStorageIsZHeap(rel)) result = 0; else result = (int64) (tabentry->t_counts.t_tuples_hot_updated); + heap_close(rel, AccessShareLock); + PG_RETURN_INT64(result); } diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 120550f526..71f22b0382 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -3432,9 +3432,13 @@ RelationSetNewRelfilenode(Relation relation, char persistence, } #endif - /* Indexes, sequences must have Invalid frozenxid; other rels must not */ + /* + * Indexes, sequences, zheap relations must have Invalid frozenxid; other + * rels must not + */ Assert((relation->rd_rel->relkind == RELKIND_INDEX || - relation->rd_rel->relkind == RELKIND_SEQUENCE) ? + relation->rd_rel->relkind == RELKIND_SEQUENCE || + RelationStorageIsZHeap(relation)) ? freezeXid == InvalidTransactionId : TransactionIdIsNormal(freezeXid)); Assert(TransactionIdIsNormal(freezeXid) == MultiXactIdIsValid(minmulti)); @@ -3517,6 +3521,10 @@ RelationSetNewRelfilenode(Relation relation, char persistence, /* Flag relation as needing eoxact cleanup (to remove the hint) */ EOXactListAdd(relation); + + /* Initialize the metapage for zheap relation. 
*/ + if (RelationStorageIsZHeap(relation)) + ZheapInitMetaPage(relation, MAIN_FORKNUM); } diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index c6939779b9..12e7704fda 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -121,6 +121,7 @@ bool allowSystemTableMods = false; int work_mem = 1024; int maintenance_work_mem = 16384; int max_parallel_maintenance_workers = 2; +int rollback_overflow_size = 64; /* * Primary determinants of sizes of shared-memory structures. diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index 1d57177cb5..2d90513714 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -557,6 +557,7 @@ BaseInit(void) InitFileAccess(); smgrinit(); InitBufferPoolAccess(); + UndoLogInit(); } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 11b6df209a..2cb7767fba 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -63,6 +63,7 @@ #include "postmaster/bgwriter.h" #include "postmaster/postmaster.h" #include "postmaster/syslogger.h" +#include "postmaster/undoworker.h" #include "postmaster/walwriter.h" #include "replication/logicallauncher.h" #include "replication/slot.h" @@ -120,6 +121,7 @@ extern int CommitDelay; extern int CommitSiblings; extern char *default_tablespace; extern char *temp_tablespaces; +extern char *undo_tablespaces; extern bool ignore_checksum_failure; extern bool synchronize_seqscans; @@ -1876,6 +1878,17 @@ static struct config_bool ConfigureNamesBool[] = NULL, NULL, NULL }, + { + {"disable_undo_launcher", PGC_POSTMASTER, DEVELOPER_OPTIONS, + gettext_noop("Decides whether to launch an undo worker."), + NULL, + GUC_NOT_IN_SAMPLE + }, + &disable_undo_launcher, + false, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL @@ -2860,6 +2873,16 @@ static struct config_int ConfigureNamesInt[] = 5000, 1, INT_MAX, NULL, NULL, NULL }, + { + {"rollback_overflow_size", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Rollbacks greater than this size are done lazily"), + NULL, + GUC_UNIT_MB + }, + &rollback_overflow_size, + 64, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, { {"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS, @@ -3545,6 +3568,17 @@ static struct config_string ConfigureNamesString[] = check_temp_tablespaces, assign_temp_tablespaces, NULL }, + { + {"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT, + gettext_noop("Sets the tablespace(s) to use for undo logs."), + NULL, + GUC_LIST_INPUT | GUC_LIST_QUOTE + }, + &undo_tablespaces, + "", + check_undo_tablespaces, assign_undo_tablespaces, NULL + }, + { {"dynamic_library_path", PGC_SUSET, CLIENT_CONN_OTHER, gettext_noop("Sets the path for dynamically loadable modules."), diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c index a376875269..9a5a0b17ba 100644 --- a/src/backend/utils/misc/pg_controldata.c +++ b/src/backend/utils/misc/pg_controldata.c @@ -78,8 +78,8 @@ pg_control_system(PG_FUNCTION_ARGS) Datum pg_control_checkpoint(PG_FUNCTION_ARGS) { - Datum values[19]; - bool nulls[19]; + Datum values[20]; + bool nulls[20]; TupleDesc tupdesc; HeapTuple htup; ControlFileData *ControlFile; @@ -91,7 +91,7 @@ pg_control_checkpoint(PG_FUNCTION_ARGS) * Construct a tuple descriptor for the result row. This must match this * function's pg_proc entry! 
*/ - tupdesc = CreateTemplateTupleDesc(18); + tupdesc = CreateTemplateTupleDesc(19); TupleDescInitEntry(tupdesc, (AttrNumber) 1, "checkpoint_lsn", LSNOID, -1, 0); TupleDescInitEntry(tupdesc, (AttrNumber) 2, "redo_lsn", @@ -128,6 +128,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS) XIDOID, -1, 0); TupleDescInitEntry(tupdesc, (AttrNumber) 18, "checkpoint_time", TIMESTAMPTZOID, -1, 0); + TupleDescInitEntry(tupdesc, (AttrNumber) 19, "oldest_xid_with_epoch_having_undo", + INT8OID, -1, 0); tupdesc = BlessTupleDesc(tupdesc); /* Read the control file. */ @@ -202,6 +204,9 @@ pg_control_checkpoint(PG_FUNCTION_ARGS) time_t_to_timestamptz(ControlFile->checkPointCopy.time)); nulls[17] = false; + values[18] = Int64GetDatum(ControlFile->checkPointCopy.oldestXidWithEpochHavingUndo); + nulls[18] = false; + htup = heap_form_tuple(tupdesc, values, nulls); PG_RETURN_DATUM(HeapTupleGetDatum(htup)); diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 1fa02d2c93..9190c3f9b2 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -737,4 +737,10 @@ # CUSTOMIZED OPTIONS #------------------------------------------------------------------------------ +# If large transactions requiring rollbacks are frequent, we can push them +# to undo workers for better performance. The size specified by the parameter +# below determines the minimum size of a rollback request that is sent to an +# undo worker. + +#rollback_overflow_size = 64 # Add settings for extensions here diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 211a96380e..ea0221060b 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -209,11 +209,13 @@ static const char *const subdirs[] = { "pg_snapshots", "pg_subtrans", "pg_twophase", + "pg_undo", "pg_multixact", "pg_multixact/members", "pg_multixact/offsets", "base", "base/1", + "base/undo", "pg_replslot", "pg_tblspc", "pg_stat", diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c index 895a51f89d..55ddbb29ac 100644 --- a/src/bin/pg_controldata/pg_controldata.c +++ b/src/bin/pg_controldata/pg_controldata.c @@ -278,6 +278,8 @@ main(int argc, char *argv[]) ControlFile->checkPointCopy.oldestCommitTsXid); printf(_("Latest checkpoint's newestCommitTsXid:%u\n"), ControlFile->checkPointCopy.newestCommitTsXid); + printf(_("Latest checkpoint's oldestXidWithEpochHavingUndo:" UINT64_FORMAT "\n"), + ControlFile->checkPointCopy.oldestXidWithEpochHavingUndo); printf(_("Time of latest checkpoint: %s\n"), ckpttime_str); printf(_("Fake LSN counter for unlogged rels: %X/%X\n"), @@ -329,6 +331,8 @@ main(int argc, char *argv[]) ControlFile->toast_max_chunk_size); printf(_("Size of a large-object chunk: %u\n"), ControlFile->loblksize); + printf(_("Transaction slots per zheap page: %u\n"), + ControlFile->zheap_page_trans_slots); /* This is no longer configurable, but users may still expect to see it: */ printf(_("Date/time type storage: %s\n"), _("64-bit integers")); diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c index 6fb403a5a8..673ac29785 100644 --- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -58,6 +58,11 @@ #include "pg_getopt.h" #include "getopt_long.h" +#ifndef WIN32 +#define pg_mv_file rename +#else +#define pg_mv_file pgrename +#endif static ControlFileData ControlFile; /* pg_control values */ static XLogSegNo newXlogSegNo; /* new XLOG segment # */ @@
-85,6 +90,7 @@ static void FindEndOfXLOG(void); static void KillExistingXLOG(void); static void KillExistingArchiveStatus(void); static void WriteEmptyXLOG(void); +static bool FindLatestUndoCheckPointFile(char *latest_undo_checkpoint_file); static void usage(void); @@ -115,6 +121,9 @@ main(int argc, char *argv[]) char *DataDir = NULL; char *log_fname = NULL; int fd; + char latest_undo_checkpoint_file[MAXPGPATH]; + char new_undo_checkpoint_file[MAXPGPATH]; + bool found = false; set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_resetwal")); @@ -448,6 +457,7 @@ main(int argc, char *argv[]) if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId) ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId; ControlFile.checkPointCopy.oldestXidDB = InvalidOid; + ControlFile.checkPointCopy.oldestXidWithEpochHavingUndo = 0; } if (set_oldest_commit_ts_xid != 0) @@ -514,6 +524,25 @@ main(int argc, char *argv[]) * Else, do the dirty deed. */ RewriteControlFile(); + + /* + * Find the newest undo checkpoint file under pg_undo directory and rename + * it as per the latest checkpoint redo location in control file. + */ + found = FindLatestUndoCheckPointFile(latest_undo_checkpoint_file); + if (!found) + fprintf(stderr, _("Could not find the latest undo checkpoint file.\n")); + + snprintf(new_undo_checkpoint_file, sizeof(new_undo_checkpoint_file), + "pg_undo/%016" INT64_MODIFIER "X", ControlFile.checkPointCopy.redo); + + if (pg_mv_file(latest_undo_checkpoint_file, new_undo_checkpoint_file) != 0) + { + fprintf(stderr, _("Unable to rename %s to %s.\n"), latest_undo_checkpoint_file, + new_undo_checkpoint_file); + exit(1); + } + KillExistingXLOG(); KillExistingArchiveStatus(); WriteEmptyXLOG(); @@ -716,6 +745,8 @@ GuessControlValues(void) ControlFile.checkPointCopy.oldestMultiDB = InvalidOid; ControlFile.checkPointCopy.time = (pg_time_t) time(NULL); ControlFile.checkPointCopy.oldestActiveXid = InvalidTransactionId; + ControlFile.checkPointCopy.nextXid = 0; + ControlFile.checkPointCopy.oldestXidWithEpochHavingUndo = 0; ControlFile.state = DB_SHUTDOWNED; ControlFile.time = (pg_time_t) time(NULL); @@ -808,6 +839,8 @@ PrintControlValues(bool guessed) ControlFile.checkPointCopy.oldestCommitTsXid); printf(_("Latest checkpoint's newestCommitTsXid:%u\n"), ControlFile.checkPointCopy.newestCommitTsXid); + printf(_("Latest checkpoint's oldestXidWithEpochHavingUndo:" UINT64_FORMAT "\n"), + ControlFile.checkPointCopy.oldestXidWithEpochHavingUndo); printf(_("Maximum data alignment: %u\n"), ControlFile.maxAlign); /* we don't print floatFormat since can't say much useful about it */ @@ -884,6 +917,8 @@ PrintNewControlValues(void) ControlFile.checkPointCopy.oldestXid); printf(_("OldestXID's DB: %u\n"), ControlFile.checkPointCopy.oldestXidDB); + printf(_("OldestXidWithEpochHavingUndo:" UINT64_FORMAT "\n"), + ControlFile.checkPointCopy.oldestXidWithEpochHavingUndo); } if (set_xid_epoch != -1) @@ -1303,6 +1338,55 @@ WriteEmptyXLOG(void) close(fd); } +/* + * Find the latest modified undo checkpoint file under pg_undo directory and + * delete all other files. 
+ */ +static bool +FindLatestUndoCheckPointFile(char *latest_undo_checkpoint_file) +{ + char **filenames; + char **filename; + char latest[UNDO_CHECKPOINT_FILENAME_LENGTH + 1]; + bool result = false; + + memset(latest, 0, sizeof(latest)); + + /* Copy all the files from pg_undo directory into filenames */ + filenames = pgfnames("pg_undo"); + + /* + * Start reading each file under pg_undo to identify the latest + * modified file and remove the older files that are not required. + */ + for (filename = filenames; *filename; filename++) + { + if (!(strlen(*filename) == UNDO_CHECKPOINT_FILENAME_LENGTH)) + continue; + + if (UndoCheckPointFilenamePrecedes(latest, *filename)) + { + if (latest[0] != '\0') + { + snprintf(latest_undo_checkpoint_file, MAXPGPATH, "pg_undo/%s", + latest); + if (unlink(latest_undo_checkpoint_file) != 0) + fprintf(stderr, _("could not unlink file \"%s\": %s\n"), + *filename, strerror(errno)); + } + memcpy(latest, *filename, UNDO_CHECKPOINT_FILENAME_LENGTH); + latest[UNDO_CHECKPOINT_FILENAME_LENGTH] = '\0'; + result = true; + } + } + + if (result) + snprintf(latest_undo_checkpoint_file, MAXPGPATH, "pg_undo/%s", latest); + + pgfnames_cleanup(filenames); + + return result; +} static void usage(void) diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c index 47119dc42d..036f094904 100644 --- a/src/bin/pg_upgrade/pg_upgrade.c +++ b/src/bin/pg_upgrade/pg_upgrade.c @@ -468,6 +468,15 @@ copy_xact_xlog_xid(void) GET_MAJOR_VERSION(new_cluster.major_version) < 1000 ? "pg_clog" : "pg_xact"); + /* copy old undo checkpoint files to new data dir */ + copy_subdir_files("pg_undo", "pg_undo"); + + /* + * copy old undo logs to new data dir assuming that the + * undo logs exist in default location i.e. 'base/undo'. + */ + copy_subdir_files("base/undo", "base/undo"); + /* set the next transaction id and epoch of the new cluster */ prep_status("Setting next transaction ID and epoch for new cluster"); exec_prog(UTILITY_LOG_FILE, NULL, true, true, diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c index 852d8ca4b1..d29e76a637 100644 --- a/src/bin/pg_waldump/rmgrdesc.c +++ b/src/bin/pg_waldump/rmgrdesc.c @@ -20,8 +20,12 @@ #include "access/nbtxlog.h" #include "access/rmgr.h" #include "access/spgxlog.h" +#include "access/tpd_xlog.h" +#include "access/undoaction_xlog.h" +#include "access/undolog_xlog.h" #include "access/xact.h" #include "access/xlog_internal.h" +#include "access/zheapam_xlog.h" #include "catalog/storage_xlog.h" #include "commands/dbcommands_xlog.h" #include "commands/sequence.h" diff --git a/src/include/access/genham.h b/src/include/access/genham.h new file mode 100644 index 0000000000..92122ea5c5 --- /dev/null +++ b/src/include/access/genham.h @@ -0,0 +1,143 @@ +/*------------------------------------------------------------------------- + * + * genham.h + * POSTGRES generalized heap access method definitions. 
+ * + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/genham.h + * + *------------------------------------------------------------------------- + */ +#ifndef GENHAM_H +#define GENHAM_H + +#include "access/multixact.h" +#include "access/sdir.h" +#include "access/skey.h" +#include "nodes/lockoptions.h" +#include "storage/buf.h" +#include "storage/itemptr.h" +#include "storage/lockdefs.h" +#include "utils/relcache.h" + +typedef struct BulkInsertStateData *BulkInsertState; + +/* struct definitions appear in relscan.h */ +typedef struct HeapScanDescData *HeapScanDesc; +typedef struct ParallelTableScanDescData *ParallelTableScanDesc; + +/* + * When heap_update, heap_delete, or heap_lock_tuple fail because the target + * tuple is already outdated, they fill in this struct to provide information + * to the caller about what happened. + * ctid is the target's ctid link: it is the same as the target's TID if the + * target was deleted, or the location of the replacement tuple if the target + * was updated. + * xmax is the outdating transaction's XID. If the caller wants to visit the + * replacement tuple, it must check that this matches before believing the + * replacement is really a match. + * cmax is the outdating command's CID, but only when the failure code is + * HeapTupleSelfUpdated (i.e., something in the current transaction outdated + * the tuple); otherwise cmax is zero. (We make this restriction because + * HeapTupleHeaderGetCmax doesn't work for tuples outdated in other + * transactions.) + * in_place_updated_or_locked indicates whether the tuple is updated or locked. + * We need to re-verify the tuple even if it is just marked as locked, because + * previously someone could have updated it in place. + */ +typedef struct HeapUpdateFailureData +{ + ItemPointerData ctid; + TransactionId xmax; + CommandId cmax; + bool traversed; + bool in_place_updated_or_locked; +} HeapUpdateFailureData; + +/* Result codes for HeapTupleSatisfiesVacuum */ +typedef enum +{ + HEAPTUPLE_DEAD, /* tuple is dead and deletable */ + HEAPTUPLE_LIVE, /* tuple is live (committed, no deleter) */ + HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */ + HEAPTUPLE_INSERT_IN_PROGRESS, /* inserting xact is still in progress */ + HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */ +} HTSV_Result; + +/* Result codes for ZHeapTupleSatisfiesVacuum */ +typedef enum +{ + ZHEAPTUPLE_DEAD, /* tuple is dead and deletable */ + ZHEAPTUPLE_LIVE, /* tuple is live (committed, no deleter) */ + ZHEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */ + ZHEAPTUPLE_INSERT_IN_PROGRESS, /* inserting xact is still in progress */ + ZHEAPTUPLE_DELETE_IN_PROGRESS, /* deleting xact is still in progress */ + ZHEAPTUPLE_ABORT_IN_PROGRESS /* rollback is still pending */ +} ZHTSV_Result; + +/* + * Possible lock modes for a tuple. 
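From the caller's side, the HeapUpdateFailureData fields documented above are consumed roughly as in the sketch below. This is a hedged illustration of the existing heap_update()/EvalPlanQual convention; the function report_update_conflict() is not part of the patch.

	/* Sketch: interpret a HeapUpdateFailureData filled in by a failed update.
	 * 'hufd' is assumed to have been populated by an update that returned
	 * HeapTupleUpdated for the tuple at 'tid'. */
	static void
	report_update_conflict(ItemPointer tid, HeapUpdateFailureData *hufd)
	{
		if (ItemPointerEquals(tid, &hufd->ctid))
		{
			/* Target was deleted: ctid still points at the original tuple. */
			elog(DEBUG1, "tuple was concurrently deleted by xid %u", hufd->xmax);
		}
		else
		{
			/*
			 * Target was updated: ctid points at the replacement version.  A
			 * caller that wants to retry must re-fetch that version and check
			 * that its inserting XID matches hufd->xmax before believing it
			 * really is the successor.
			 */
			elog(DEBUG1, "tuple was concurrently updated; successor at (%u,%u)",
				 ItemPointerGetBlockNumber(&hufd->ctid),
				 ItemPointerGetOffsetNumber(&hufd->ctid));
		}
	}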
+ */ +typedef enum LockTupleMode +{ + /* SELECT FOR KEY SHARE */ + LockTupleKeyShare, + /* SELECT FOR SHARE */ + LockTupleShare, + /* SELECT FOR NO KEY UPDATE, and UPDATEs that don't modify key columns */ + LockTupleNoKeyExclusive, + /* SELECT FOR UPDATE, UPDATEs that modify key columns, and DELETE */ + LockTupleExclusive +} LockTupleMode; + +#define MaxLockTupleMode LockTupleExclusive + + +static const struct +{ + LOCKMODE hwlock; + int lockstatus; + int updstatus; +} + + tupleLockExtraInfo[MaxLockTupleMode + 1] = +{ + { /* LockTupleKeyShare */ + AccessShareLock, + MultiXactStatusForKeyShare, + -1 /* KeyShare does not allow updating tuples */ + }, + { /* LockTupleShare */ + RowShareLock, + MultiXactStatusForShare, + -1 /* Share does not allow updating tuples */ + }, + { /* LockTupleNoKeyExclusive */ + ExclusiveLock, + MultiXactStatusForNoKeyUpdate, + MultiXactStatusNoKeyUpdate + }, + { /* LockTupleExclusive */ + AccessExclusiveLock, + MultiXactStatusForUpdate, + MultiXactStatusUpdate + } +}; + +#define UnlockTupleTuplock(rel, tup, mode) \ + UnlockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) + +extern bool heap_acquire_tuplock(Relation relation, ItemPointer tid, + LockTupleMode mode, LockWaitPolicy wait_policy, + bool *have_tuple_lock); +extern void GetVisibilityMapPins(Relation relation, Buffer buffer1, + Buffer buffer2, BlockNumber block1, BlockNumber block2, + Buffer *vmbuffer1, Buffer *vmbuffer2); +extern void RelationAddExtraBlocks(Relation relation, BulkInsertState bistate); +extern Buffer ReadBufferBI(Relation relation, BlockNumber targetBlock, + BulkInsertState bistate); + +#endif /* GENHAM_H */ diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h index 527138440b..7ca5058555 100644 --- a/src/include/access/hash_xlog.h +++ b/src/include/access/hash_xlog.h @@ -260,17 +260,24 @@ typedef struct xl_hash_init_bitmap_page * * Backup Blk 0: bucket page * Backup Blk 1: meta page + * + * In Hot Standby, we need to scan the entire relation to verify whether any + * hash delete index item conflicts with any standby query. For that, we need to + * know the relation type which is stored in xlog record. */ +#define XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP 0x0001 + typedef struct xl_hash_vacuum_one_page { RelFileNode hnode; int ntuples; + uint8 flags; /* See XLOG_HASH_VACUUM_* flags for details */ /* TARGET OFFSET NUMBERS FOLLOW AT THE END */ } xl_hash_vacuum_one_page; #define SizeOfHashVacuumOnePage \ - (offsetof(xl_hash_vacuum_one_page, ntuples) + sizeof(int)) + (offsetof(xl_hash_vacuum_one_page, flags) + sizeof(uint8)) extern void hash_redo(XLogReaderState *record); extern void hash_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index a309db1a1c..c16a3526a7 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -14,8 +14,7 @@ #ifndef HEAPAM_H #define HEAPAM_H -#include "access/sdir.h" -#include "access/skey.h" +#include "access/genham.h" #include "nodes/lockoptions.h" #include "nodes/primnodes.h" #include "storage/bufpage.h" @@ -35,57 +34,6 @@ typedef struct BulkInsertStateData *BulkInsertState; struct TupleTableSlot; -/* - * Possible lock modes for a tuple. 
- */ -typedef enum LockTupleMode -{ - /* SELECT FOR KEY SHARE */ - LockTupleKeyShare, - /* SELECT FOR SHARE */ - LockTupleShare, - /* SELECT FOR NO KEY UPDATE, and UPDATEs that don't modify key columns */ - LockTupleNoKeyExclusive, - /* SELECT FOR UPDATE, UPDATEs that modify key columns, and DELETE */ - LockTupleExclusive -} LockTupleMode; - -#define MaxLockTupleMode LockTupleExclusive - -/* - * When heap_update, heap_delete, or heap_lock_tuple fail because the target - * tuple is already outdated, they fill in this struct to provide information - * to the caller about what happened. - * ctid is the target's ctid link: it is the same as the target's TID if the - * target was deleted, or the location of the replacement tuple if the target - * was updated. - * xmax is the outdating transaction's XID. If the caller wants to visit the - * replacement tuple, it must check that this matches before believing the - * replacement is really a match. - * cmax is the outdating command's CID, but only when the failure code is - * HeapTupleSelfUpdated (i.e., something in the current transaction outdated - * the tuple); otherwise cmax is zero. (We make this restriction because - * HeapTupleHeaderGetCmax doesn't work for tuples outdated in other - * transactions.) - */ -typedef struct HeapUpdateFailureData -{ - ItemPointerData ctid; - TransactionId xmax; - CommandId cmax; - bool traversed; -} HeapUpdateFailureData; - -/* Result codes for HeapTupleSatisfiesVacuum */ -typedef enum -{ - HEAPTUPLE_DEAD, /* tuple is dead and deletable */ - HEAPTUPLE_LIVE, /* tuple is live (committed, no deleter) */ - HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */ - HEAPTUPLE_INSERT_IN_PROGRESS, /* inserting xact is still in progress */ - HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */ -} HTSV_Result; - /* struct definition is private to rewriteheap.c */ typedef struct RewriteStateData *RewriteState; @@ -139,6 +87,7 @@ extern void heap_rescan(TableScanDesc scan, ScanKey key, bool set_params, bool allow_strat, bool allow_sync, bool allow_pagemode); extern void heap_rescan_set_params(TableScanDesc scan, ScanKey key, bool allow_strat, bool allow_sync, bool allow_pagemode); + extern void heap_endscan(TableScanDesc scan); extern HeapTuple heap_getnext(TableScanDesc scan, ScanDirection direction); extern struct TupleTableSlot *heap_getnextslot(TableScanDesc sscan, ScanDirection direction, diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h index 708f73f0ea..2c91378b14 100644 --- a/src/include/access/htup_details.h +++ b/src/include/access/htup_details.h @@ -816,5 +816,6 @@ extern MinimalTuple minimal_tuple_from_heap_tuple(HeapTuple htup); extern size_t varsize_any(void *p); extern HeapTuple heap_expand_tuple(HeapTuple sourceTuple, TupleDesc tupleDesc); extern MinimalTuple minimal_expand_tuple(HeapTuple sourceTuple, TupleDesc tupleDesc); +extern Datum getmissingattr(TupleDesc tupleDesc, int attnum, bool *isnull); #endif /* HTUP_DETAILS_H */ diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h index 819373031c..67d83d1e37 100644 --- a/src/include/access/nbtxlog.h +++ b/src/include/access/nbtxlog.h @@ -120,17 +120,24 @@ typedef struct xl_btree_split * single index page when *not* executed by VACUUM. * * Backup Blk 0: index page + * + * In Hot Standby, we need to scan the entire relation to verify whether any + * btree delete record conflicts with any standby query. For that, we need to + * know the relation type which is stored in xlog record. 
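A sketch of how hot-standby redo is expected to consume the flag defined just below: the helper zheap_compute_conflict_horizon() is hypothetical, while btree_xlog_delete_get_latestRemovedXid() and ResolveRecoveryConflictWithSnapshot() are the existing heap-based paths.

	static void
	btree_xlog_delete_sketch(XLogReaderState *record)
	{
		xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);

		if (InHotStandby)
		{
			TransactionId latestRemovedXid;

			if (xlrec->flags & XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP)
			{
				/*
				 * zheap tuples do not carry xmin/xmax in the page itself, so
				 * the conflict horizon cannot be read off the deleted index
				 * entries; it must be derived from the zheap relation.
				 */
				latestRemovedXid = zheap_compute_conflict_horizon(record);	/* hypothetical */
			}
			else
				latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);

			ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->hnode);
		}

		/* ... then apply the page changes as before ... */
	}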
*/ +#define XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP 0x0001 + typedef struct xl_btree_delete { RelFileNode hnode; /* RelFileNode of the heap the index currently * points at */ int nitems; + uint8 flags; /* See XLOG_BTREE_DELETE_* flags for details */ /* TARGET OFFSET NUMBERS FOLLOW AT THE END */ } xl_btree_delete; -#define SizeOfBtreeDelete (offsetof(xl_btree_delete, nitems) + sizeof(int)) +#define SizeOfBtreeDelete (offsetof(xl_btree_delete, flags) + sizeof(uint8)) /* * This is what we need to know about page reuse within btree. diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index 51a3ad74fa..fa26ffa16b 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -105,6 +105,14 @@ typedef struct IndexFetchHeapData /* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */ } IndexFetchHeapData; +typedef struct IndexFetchZHeapData +{ + IndexFetchTableData xs_base; + + Buffer xs_cbuf; /* current heap buffer in scan, if any */ + /* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */ +} IndexFetchZHeapData; + /* * We use the same IndexScanDescData structure for both amgettuple-based * and amgetbitmap-based index scans. Some fields are only relevant in diff --git a/src/include/access/rewritezheap.h b/src/include/access/rewritezheap.h new file mode 100644 index 0000000000..5e9e243336 --- /dev/null +++ b/src/include/access/rewritezheap.h @@ -0,0 +1,32 @@ +/*------------------------------------------------------------------------- + * + * rewritezheap.h + * Declarations for zheap rewrite support functions + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994-5, Regents of the University of California + * + * src/include/access/rewritezheap.h + * + *------------------------------------------------------------------------- + */ +#ifndef REWRITE_ZHEAP_H +#define REWRITE_ZHEAP_H + +#include "access/zhtup.h" +#include "utils/relcache.h" + +/* struct definition is private to rewritezheap.c */ +typedef struct RewriteZheapStateData *RewriteZheapState; + +extern RewriteZheapState begin_zheap_rewrite(Relation OldHeap, Relation NewHeap, + TransactionId OldestXmin, TransactionId FreezeXid, + MultiXactId MultiXactCutoff, bool use_wal); +extern void end_zheap_rewrite(RewriteZheapState state); +extern void reform_and_rewrite_ztuple(ZHeapTuple tuple, TupleDesc oldTupDesc, + TupleDesc newTupDesc, Datum *values, bool *isnull, + RewriteZheapState rwstate); +extern void rewrite_zheap_tuple(RewriteZheapState state, ZHeapTuple oldTuple, + ZHeapTuple newTuple); + +#endif /* REWRITE_ZHEAP_H */ diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h index 0bbe9879ca..2328a1cc48 100644 --- a/src/include/access/rmgrlist.h +++ b/src/include/access/rmgrlist.h @@ -47,3 +47,8 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL) PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask) PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL) +PG_RMGR(RM_UNDOLOG_ID, "UndoLog", undolog_redo, undolog_desc, undolog_identify, NULL, NULL, NULL) +PG_RMGR(RM_ZHEAP_ID, "Zheap", zheap_redo, zheap_desc, zheap_identify, NULL, NULL, zheap_mask) +PG_RMGR(RM_ZHEAP2_ID, "Zheap2", zheap2_redo, zheap2_desc, zheap2_identify, NULL, NULL, 
zheap_mask) +PG_RMGR(RM_UNDOACTION_ID, "UndoAction", undoaction_redo, undoaction_desc, undoaction_identify, NULL, NULL, NULL) +PG_RMGR(RM_TPD_ID, "TPD", tpd_redo, tpd_desc, tpd_identify, NULL, NULL, zheap_mask) diff --git a/src/include/access/tpd.h b/src/include/access/tpd.h new file mode 100644 index 0000000000..e5de47a0e4 --- /dev/null +++ b/src/include/access/tpd.h @@ -0,0 +1,135 @@ +/*------------------------------------------------------------------------- + * + * tpd.h + * POSTGRES TPD definitions. + * + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/tpd.h + * + *------------------------------------------------------------------------- + */ +#ifndef TPD_H +#define TPD_H + +#include "postgres.h" + +#include "access/xlogutils.h" +#include "access/zheap.h" +#include "storage/block.h" +#include "utils/rel.h" + +/* TPD page information */ +typedef struct TPDPageOpaqueData +{ + BlockNumber tpd_prevblkno; + BlockNumber tpd_nextblkno; + uint32 tpd_latest_xid_epoch; + TransactionId tpd_latest_xid; +} TPDPageOpaqueData; + +typedef TPDPageOpaqueData *TPDPageOpaque; + +#define SizeofTPDPageOpaque (offsetof(TPDPageOpaqueData, tpd_latest_xid) + sizeof(TransactionId)) + +/* TPD entry information */ +#define INITIAL_TRANS_SLOTS_IN_TPD_ENTRY 8 +/* + * Number of item-to-transaction-slot mapping entries in addition to the + * maximum number of itemids on a heap page. This is required to support new + * inserts on the page; otherwise, we might immediately need to allocate a + * new, bigger TPD entry. + */ +#define ADDITIONAL_MAP_ELEM_IN_TPD_ENTRY 8 + +typedef struct TPDEntryHeaderData +{ + BlockNumber blkno; /* Heap block number to which this TPD entry + * belongs. */ + uint16 tpe_num_map_entries; + uint16 tpe_num_slots; + uint16 tpe_flags; +} TPDEntryHeaderData; + +typedef TPDEntryHeaderData *TPDEntryHeader; + +#define SizeofTPDEntryHeader (offsetof(TPDEntryHeaderData, tpe_flags) + sizeof(uint16)) + +#define TPE_ONE_BYTE 0x0001 +#define TPE_FOUR_BYTE 0x0002 +#define TPE_DELETED 0x0004 + +#define OFFSET_MASK 0x3FFFFF + +#define TPDEntryIsDeleted(tpd_e_hdr) \ +( \ + (tpd_e_hdr.tpe_flags & TPE_DELETED) != 0 \ +) + +/* Maximum size of one TPD entry. */ +#define MaxTPDEntrySize \ + ((int) (BLCKSZ - SizeOfPageHeaderData - SizeofTPDPageOpaque - sizeof(ItemIdData))) + +/* + * MaxTPDTuplesPerPage is an upper bound on the number of tuples that can + * fit on one zheap page.
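+ *
+ * As a rough worked example, with the standard 8 kB BLCKSZ, a 24-byte page
+ * header, and no padding beyond the offsets used above, the two limits come
+ * out to approximately:
+ *
+ *     MaxTPDEntrySize     = 8192 - 24 - 16 - 4          = 8148 bytes
+ *     MaxTPDTuplesPerPage = (8192 - 24 - 16) / (10 + 4) = 582
+ *
+ * The exact numbers depend on the platform's alignment rules, so treat these
+ * figures as illustrative only.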
+ */ +#define MaxTPDTuplesPerPage \ + ((int) ((BLCKSZ - SizeOfPageHeaderData - SizeofTPDPageOpaque) / \ + (SizeofTPDEntryHeader + sizeof(ItemIdData)))) + +extern OffsetNumber TPDPageAddEntry(Page tpdpage, char *tpd_entry, Size size, + OffsetNumber offset); +extern void SetTPDLocation(Buffer heapbuffer, Buffer tpdbuffer, uint16 offset); +extern void ClearTPDLocation(Buffer heapbuf); +extern void TPDInitPage(Page page, Size pageSize); +extern bool TPDFreePage(Relation rel, Buffer buf, BufferAccessStrategy bstrategy); +extern int TPDAllocateAndReserveTransSlot(Relation relation, Buffer buf, + OffsetNumber offnum, UndoRecPtr *urec_ptr); +extern TransInfo *TPDPageGetTransactionSlots(Relation relation, Buffer heapbuf, + OffsetNumber offnum, bool keepTPDBufLock, + bool checkOffset, int *num_map_entries, + int *num_trans_slots, int *tpd_buf_id, + bool *tpd_e_pruned, bool *alloc_bigger_map); +extern int TPDPageReserveTransSlot(Relation relation, Buffer heapbuf, + OffsetNumber offset, UndoRecPtr *urec_ptr, bool *lock_reacquired); +extern int TPDPageGetSlotIfExists(Relation relation, Buffer heapbuf, OffsetNumber offnum, + uint32 epoch, TransactionId xid, UndoRecPtr *urec_ptr, + bool keepTPDBufLock, bool checkOffset); +extern int TPDPageGetTransactionSlotInfo(Buffer heapbuf, int trans_slot, + OffsetNumber offset, uint32 *epoch, TransactionId *xid, + UndoRecPtr *urec_ptr, bool NoTPDBufLock, bool keepTPDBufLock); +extern void TPDPageSetTransactionSlotInfo(Buffer heapbuf, int trans_slot_id, + uint32 epoch, TransactionId xid, UndoRecPtr urec_ptr); +extern void TPDPageSetUndo(Buffer heapbuf, int trans_slot_id, + bool set_tpd_map_slot, uint32 epoch, TransactionId xid, + UndoRecPtr urec_ptr, OffsetNumber *usedoff, int ucnt); +extern void TPDPageSetOffsetMapSlot(Buffer heapbuf, int trans_slot_id, + OffsetNumber offset); +extern void TPDPageGetOffsetMap(Buffer heapbuf, char *tpd_entry_data, + int map_size); +extern int TPDPageGetOffsetMapSize(Buffer heapbuf); +extern void TPDPageSetOffsetMap(Buffer heapbuf, char *tpd_offset_map); +extern bool TPDPageLock(Relation relation, Buffer heapbuf); +extern XLogRedoAction XLogReadTPDBuffer(XLogReaderState *record, + uint8 block_id); +extern uint8 RegisterTPDBuffer(Page heappage, uint8 block_id); +extern void TPDPageSetLSN(Page heappage, XLogRecPtr recptr); +extern void UnlockReleaseTPDBuffers(void); +extern Size PageGetTPDFreeSpace(Page page); +extern void ResetRegisteredTPDBuffers(void); + +/* interfaces exposed via prunetpd.c */ +extern int TPDPagePrune(Relation rel, Buffer tpdbuf, BufferAccessStrategy strategy, + OffsetNumber target_offnum, Size space_required, bool can_free, + bool *update_tpd_inplace, bool *tpd_e_pruned); +extern void TPDPagePruneExecute(Buffer tpdbuf, OffsetNumber *nowunused, + int nunused); +extern void TPDPageRepairFragmentation(Page page, Page tmppage, + OffsetNumber target_offnum, Size space_required); + +/* Reset globals related to TPD buffers. */ +extern void ResetTPDBuffers(void); +#endif /* TPD_H */ diff --git a/src/include/access/tpd_xlog.h b/src/include/access/tpd_xlog.h new file mode 100644 index 0000000000..f16cd63c9a --- /dev/null +++ b/src/include/access/tpd_xlog.h @@ -0,0 +1,81 @@ +/*------------------------------------------------------------------------- + * + * tpd_xlog.h + * POSTGRES tpd XLOG definitions. 
+ * + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/tpd_xlog.h + * + *------------------------------------------------------------------------- + */ +#ifndef TPD_XLOG_H +#define TPD_XLOG_H + +#include "postgres.h" + +#include "access/xlogreader.h" +#include "lib/stringinfo.h" +#include "storage/off.h" + +/* + * WAL record definitions for tpd.c's WAL operations + */ +#define XLOG_ALLOCATE_TPD_ENTRY 0x00 +#define XLOG_TPD_CLEAN 0x10 +#define XLOG_TPD_CLEAR_LOCATION 0x20 +#define XLOG_INPLACE_UPDATE_TPD_ENTRY 0x30 +#define XLOG_TPD_FREE_PAGE 0x40 +#define XLOG_TPD_CLEAN_ALL_ENTRIES 0x50 + +#define XLOG_TPD_OPMASK 0x70 + +/* + * When we insert the first TPD entry on a new page while reserving a slot, + * we can (and do) restore the entire page during redo. + */ +#define XLOG_TPD_INIT_PAGE 0x80 + +#define XLOG_OLD_TPD_BUF_EQ_LAST_TPD_BUF 0x01 + +/* This is what we need to know about tpd entry allocation */ +typedef struct xl_tpd_allocate_entry +{ + /* tpd entry related info */ + BlockNumber prevblk; + BlockNumber nextblk; + OffsetNumber offnum; /* inserted entry's offset */ + + uint8 flags; + /* TPD entry data in backup block 0 */ +} xl_tpd_allocate_entry; + +#define SizeOfTPDAllocateEntry (offsetof(xl_tpd_allocate_entry, flags) + sizeof(uint8)) + +/* This is what we need to know about tpd entry cleanup */ +#define XL_TPD_CONTAINS_OFFSET (1<<0) + +typedef struct xl_tpd_clean +{ + uint8 flags; +} xl_tpd_clean; + +#define SizeOfTPDClean (offsetof(xl_tpd_clean, flags) + sizeof(uint8)) + +/* This is what we need to know about tpd free page */ + +typedef struct xl_tpd_free_page +{ + BlockNumber prevblkno; + BlockNumber nextblkno; +} xl_tpd_free_page; + +#define SizeOfTPDFreePage (offsetof(xl_tpd_free_page, nextblkno) + sizeof(BlockNumber)) + +extern void tpd_redo(XLogReaderState *record); +extern void tpd_desc(StringInfo buf, XLogReaderState *record); +extern const char *tpd_identify(uint8 info); + +#endif /* TPD_XLOG_H */ diff --git a/src/include/access/transam.h b/src/include/access/transam.h index 83ec3f1979..7b983efba4 100644 --- a/src/include/access/transam.h +++ b/src/include/access/transam.h @@ -68,6 +68,10 @@ (AssertMacro(TransactionIdIsNormal(id1) && TransactionIdIsNormal(id2)), \ (int32) ((id1) - (id2)) > 0) +/* Extract the xid from a value composed of epoch and xid */ +#define GetXidFromEpochXid(epochxid) \ + ((uint32) (epochxid) & 0xFFFFFFFF) + /* ---------- * Object ID (OID) zero is InvalidOid. * diff --git a/src/include/access/tupmacs.h b/src/include/access/tupmacs.h index 1c3741da65..70ce407d04 100644 --- a/src/include/access/tupmacs.h +++ b/src/include/access/tupmacs.h @@ -14,6 +14,7 @@ #ifndef TUPMACS_H #define TUPMACS_H +#include "access/genham.h" /* * check to see if the ATT'th bit of an array of 8-bit bytes is set. diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h index f99291e30d..7c0bc4f1e6 100644 --- a/src/include/access/tuptoaster.h +++ b/src/include/access/tuptoaster.h @@ -16,6 +16,8 @@ #include "access/htup_details.h" #include "storage/lockdefs.h" #include "utils/relcache.h" +#include "access/zheap.h" +#include "access/zhtup.h" /* * This enables de-toasting of index entries. Needed until VACUUM is @@ -136,6 +138,25 @@ extern HeapTuple toast_insert_or_update(Relation rel, HeapTuple newtup, HeapTuple oldtup, int options); +/* ---------- + * ztoast_insert_or_update - + * + * Called by zheap_insert() and zheap_update().
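+ *
+ * A sketch of the intended call pattern from zheap_update(), mirroring what
+ * heap_update() does with toast_insert_or_update(); ZHeapTupleHasExternal()
+ * is assumed here for illustration and is not declared in this header:
+ *
+ *     if (ZHeapTupleHasExternal(oldtup) ||
+ *         ZHeapTupleHasExternal(newtup) ||
+ *         newtup->t_len > TOAST_TUPLE_THRESHOLD)
+ *         newtup = ztoast_insert_or_update(relation, newtup, oldtup, options);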
+ * ---------- + */ + +extern ZHeapTuple ztoast_insert_or_update(Relation rel, + ZHeapTuple newtup, ZHeapTuple oldtup, + int options); + +/* ---------- + * ztoast_delete - + * + * Called by zheap_delete(). + * ---------- + */ +extern void ztoast_delete(Relation rel, ZHeapTuple oldtup, bool is_speculative); + /* ---------- * toast_delete - * @@ -236,4 +257,14 @@ extern Size toast_datum_size(Datum value); */ extern Oid toast_get_valid_index(Oid toastoid, LOCKMODE lock); +extern int toast_open_indexes(Relation toastrel, + LOCKMODE lock, + Relation **toastidxs, + int *num_indexes); +extern bool toastrel_valueid_exists(Relation toastrel, Oid valueid); +extern bool toastid_valueid_exists(Oid toastrelid, Oid valueid); +extern void toast_close_indexes(Relation *toastidxs, int num_indexes, + LOCKMODE lock); +extern void init_toast_snapshot(Snapshot toast_snapshot); + #endif /* TUPTOASTER_H */ diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h index 0e932daa48..d43d873442 100644 --- a/src/include/access/twophase.h +++ b/src/include/access/twophase.h @@ -18,6 +18,8 @@ #include "access/xact.h" #include "datatype/timestamp.h" #include "storage/lock.h" +#include "postmaster/undoloop.h" +#include "access/undolog.h" /* * GlobalTransactionData is defined in twophase.c; other places have no @@ -41,7 +43,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid, TimestampTz prepared_at, Oid owner, Oid databaseid); -extern void StartPrepare(GlobalTransaction gxact); +extern void StartPrepare(GlobalTransaction gxact, UndoRecPtr *, UndoRecPtr *); extern void EndPrepare(GlobalTransaction gxact); extern bool StandbyTransactionIdIsPrepared(TransactionId xid); diff --git a/src/include/access/undoaction_xlog.h b/src/include/access/undoaction_xlog.h new file mode 100644 index 0000000000..bfc64182eb --- /dev/null +++ b/src/include/access/undoaction_xlog.h @@ -0,0 +1,74 @@ +/*------------------------------------------------------------------------- + * + * undoaction_xlog.h + * undo action XLOG definitions + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/undoaction_xlog.h + * + *------------------------------------------------------------------------- + */ +#ifndef UNDOACTION_XLOG_H +#define UNDOACTION_XLOG_H + +#include "access/undolog.h" +#include "access/xlogreader.h" +#include "lib/stringinfo.h" +#include "storage/off.h" + +/* + * WAL record definitions for undoactions.c's WAL operations + */ +#define XLOG_UNDO_PAGE 0x00 +#define XLOG_UNDO_RESET_SLOT 0x10 +#define XLOG_UNDO_APPLY_PROGRESS 0x20 + +/* + * xl_undoaction_page flag values, 8 bits are available. 
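+ *
+ * A writer-side sketch of how these flags are meant to be combined before
+ * being attached to the WAL record (the slot-count constant and the
+ * visibility-map condition are placeholders, not taken from this patch):
+ *
+ *     uint8 flags = 0;
+ *
+ *     if (trans_slot_id > ZHEAP_PAGE_TRANS_SLOTS)   -- placeholder constant
+ *         flags |= XLU_PAGE_CONTAINS_TPD_SLOT;
+ *     if (vm_bit_needs_clearing)                    -- placeholder condition
+ *         flags |= XLU_PAGE_CLEAR_VISIBILITY_MAP;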
+ */ +#define XLU_PAGE_CONTAINS_TPD_SLOT (1<<0) +#define XLU_PAGE_CLEAR_VISIBILITY_MAP (1<<1) +#define XLU_CONTAINS_TPD_OFFSET_MAP (1<<2) +#define XLU_INIT_PAGE (1<<3) + +/* This is what we need to know about applying undo actions to a page */ +typedef struct xl_undoaction_page +{ + UndoRecPtr urec_ptr; + TransactionId xid; + int trans_slot_id; /* transaction slot id */ +} xl_undoaction_page; + +#define SizeOfUndoActionPage (offsetof(xl_undoaction_page, trans_slot_id) + sizeof(int)) + +/* This is what we need to know about undo apply progress */ +typedef struct xl_undoapply_progress +{ + UndoRecPtr urec_ptr; + uint32 progress; +} xl_undoapply_progress; + +#define SizeOfUndoActionProgress (offsetof(xl_undoapply_progress, progress) + sizeof(uint32)) + +/* + * xl_undoaction_reset_slot flag values, 8 bits are available. + */ +#define XLU_RESET_CONTAINS_TPD_SLOT (1<<0) + +/* This is what we need to know about resetting a transaction slot */ +typedef struct xl_undoaction_reset_slot +{ + UndoRecPtr urec_ptr; + int trans_slot_id; /* transaction slot id */ + uint8 flags; +} xl_undoaction_reset_slot; + +#define SizeOfUndoActionResetSlot (offsetof(xl_undoaction_reset_slot, flags) + sizeof(uint8)) + +extern void undoaction_redo(XLogReaderState *record); +extern void undoaction_desc(StringInfo buf, XLogReaderState *record); +extern const char *undoaction_identify(uint8 info); + +#endif /* UNDOACTION_XLOG_H */ diff --git a/src/include/access/undodiscard.h b/src/include/access/undodiscard.h new file mode 100644 index 0000000000..4234c0cb54 --- /dev/null +++ b/src/include/access/undodiscard.h @@ -0,0 +1,38 @@ +/*------------------------------------------------------------------------- + * + * undodiscard.h + * undo discard definitions + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/undodiscard.h + * + *------------------------------------------------------------------------- + */ +#ifndef UNDODISCARD_H +#define UNDODISCARD_H + +#include "access/undolog.h" +#include "access/xlogdefs.h" +#include "catalog/pg_class.h" +#include "storage/lwlock.h" + +/* + * Discard the undo for all transactions whose xid is smaller than xmin. + * + * Check the DiscardInfo memory array for each slot (every undo log), and + * process the undo log for every slot that has an xid smaller than xmin or an + * invalid xid. Fetch records from the undo log transaction by transaction + * until we find an xid that is not smaller than xmin. + */ +extern void UndoDiscard(TransactionId xmin, bool *hibernate); + +/* To calculate the size of the hash table for rollbacks. */ +extern int RollbackHTSize(void); + +/* To initialize the hash table in shared memory for rollbacks.
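+ *
+ * A sketch of the usual shared-memory wiring for RollbackHTSize() and
+ * InitRollbackHashTable(); the exact call sites are an assumption here, not
+ * taken from this patch:
+ *
+ *     size = add_size(size, RollbackHTSize());   -- while sizing shared memory
+ *     ...
+ *     InitRollbackHashTable();                   -- while initializing it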
*/ +extern void InitRollbackHashTable(void); + +#endif /* UNDODISCARD_H */ + diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h new file mode 100644 index 0000000000..f0b5a24099 --- /dev/null +++ b/src/include/access/undoinsert.h @@ -0,0 +1,110 @@ +/*------------------------------------------------------------------------- + * + * undoinsert.h + * entry points for inserting undo records + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/undoinsert.h + * + *------------------------------------------------------------------------- + */ +#ifndef UNDOINSERT_H +#define UNDOINSERT_H + +#include "access/undolog.h" +#include "access/undorecord.h" +#include "access/xlogdefs.h" +#include "catalog/pg_class.h" + +/* + * Typedef for callback function for UndoFetchRecord. + * + * This checks whether an undo record satisfies the given conditions. + */ +typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec, + BlockNumber blkno, + OffsetNumber offset, + TransactionId xid); + +/* + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you + * intend to insert. Upon return, the necessary undo buffers are pinned and + * locked. + * This should be done before any critical section is established, since it + * can fail. + * + * If not in recovery, 'xid' should refer to the top transaction id because + * the undo log only stores mappings for top-level transactions. + * If in recovery, 'xid' refers to the transaction id stored in WAL. + */ +extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid, + UndoPersistence, xl_undolog_meta *); + +/* + * Insert a previously-prepared undo record. This will write the actual undo + * record into the buffers already pinned and locked by PrepareUndoInsert, + * and mark them dirty. For persistent undo, this step should be performed + * after entering a critical section; it should never fail. + */ +extern void InsertPreparedUndo(void); + +/* + * Unlock and release undo buffers. This step is performed after exiting any + * critical section where we have prepared the undo record. + */ +extern void UnlockReleaseUndoBuffers(void); + +/* + * Forget about any previously-prepared undo record. Error recovery calls + * this, but it can also be used by other code that changes its mind about + * inserting undo after having prepared a record for insertion. + */ +extern void CancelPreparedUndo(void); + +/* + * Fetch the next undo record for the given blkno and offset. Start the + * search from urp. The caller needs to call UndoRecordRelease to release the + * resources allocated by this function. + */ +extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, + BlockNumber blkno, + OffsetNumber offset, + TransactionId xid, + UndoRecPtr *urec_ptr_out, + SatisfyUndoRecordCallback callback); + +/* + * Release the resources allocated by UndoFetchRecord. + */ +extern void UndoRecordRelease(UnpackedUndoRecord *urec); + +/* + * Set the value of PrevUndoLen. + */ +extern void UndoRecordSetPrevUndoLen(uint16 len); + +/* + * Call UndoSetPrepareSize to set the maximum number of undo records that can + * be prepared before they are inserted. If the size is greater than + * MAX_PREPARED_UNDO, extra memory is allocated to hold the additional + * prepared undo records.
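+ *
+ * Putting the pieces above together, a caller that emits undo records
+ * typically follows this pattern (a sketch only; WAL logging, error handling
+ * and the page modifications themselves are omitted):
+ *
+ *     UndoSetPrepareSize(undorecords, nrecords, xid, persistence, &undometa);
+ *     for (i = 0; i < nrecords; i++)
+ *         urecptr = PrepareUndoInsert(&undorecords[i], xid, persistence,
+ *                                     &undometa);
+ *
+ *     START_CRIT_SECTION();
+ *     InsertPreparedUndo();
+ *     ... apply the corresponding page changes, XLogInsert(), etc. ...
+ *     END_CRIT_SECTION();
+ *
+ *     UnlockReleaseUndoBuffers();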
+ */ +extern void UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords, + TransactionId xid, UndoPersistence upersistence, + xl_undolog_meta *undometa); + +/* + * return the previous undo record pointer. + */ +extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen); + +extern void UndoRecordOnUndoLogChange(UndoPersistence persistence); + +extern void PrepareUpdateUndoActionProgress(UndoRecPtr urecptr, int progress); +extern void UndoRecordUpdateTransInfo(void); + +/* Reset globals related to undo buffers */ +extern void ResetUndoBuffers(void); + +#endif /* UNDOINSERT_H */ diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h new file mode 100644 index 0000000000..e36faaa75d --- /dev/null +++ b/src/include/access/undolog.h @@ -0,0 +1,332 @@ +/*------------------------------------------------------------------------- + * + * undolog.h + * + * PostgreSQL undo log manager. This module is responsible for lifecycle + * management of undo logs and backing files, associating undo logs with + * backends, allocating and managing space within undo logs. + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/undolog.h + * + *------------------------------------------------------------------------- + */ +#ifndef UNDOLOG_H +#define UNDOLOG_H + +#include "access/xlogreader.h" +#include "catalog/pg_class.h" +#include "common/relpath.h" +#include "storage/bufpage.h" + +#ifndef FRONTEND +#include "storage/lwlock.h" +#endif + +/* The type used to identify an undo log and position within it. */ +typedef uint64 UndoRecPtr; + +/* The type used for undo record lengths. */ +typedef uint16 UndoRecordSize; + +/* Undo log statuses. */ +typedef enum +{ + UNDO_LOG_STATUS_UNUSED = 0, + UNDO_LOG_STATUS_ACTIVE, + UNDO_LOG_STATUS_EXHAUSTED, + UNDO_LOG_STATUS_DISCARDED +} UndoLogStatus; + +/* + * Undo log persistence levels. These have a one-to-one correspondence with + * relpersistence values, but are small integers so that we can use them as an + * index into the "logs" and "lognos" arrays. + */ +typedef enum +{ + UNDO_PERMANENT = 0, + UNDO_UNLOGGED = 1, + UNDO_TEMP = 2 +} UndoPersistence; + +#define UndoPersistenceLevels 3 + +/* + * Convert from relpersistence ('p', 'u', 't') to an UndoPersistence + * enumerator. + */ +#define UndoPersistenceForRelPersistence(rp) \ + ((rp) == RELPERSISTENCE_PERMANENT ? UNDO_PERMANENT : \ + (rp) == RELPERSISTENCE_UNLOGGED ? UNDO_UNLOGGED : UNDO_TEMP) + +/* + * Convert from UndoPersistence to a relpersistence value. + */ +#define RelPersistenceForUndoPersistence(up) \ + ((up) == UNDO_PERMANENT ? RELPERSISTENCE_PERMANENT : \ + (up) == UNDO_UNLOGGED ? RELPERSISTENCE_UNLOGGED : \ + RELPERSISTENCE_TEMP) + +/* + * Get the appropriate UndoPersistence value from a Relation. + */ +#define UndoPersistenceForRelation(rel) \ + (UndoPersistenceForRelPersistence((rel)->rd_rel->relpersistence)) + +/* Type for offsets within undo logs */ +typedef uint64 UndoLogOffset; + +/* printf-family format string for UndoRecPtr. */ +#define UndoRecPtrFormat "%016" INT64_MODIFIER "X" + +/* printf-family format string for UndoLogOffset. */ +#define UndoLogOffsetFormat UINT64_FORMAT + +/* Number of blocks of BLCKSZ in an undo log segment file. 128 = 1MB. */ +#define UNDOSEG_SIZE 128 + +/* Size of an undo log segment file in bytes. */ +#define UndoLogSegmentSize ((size_t) BLCKSZ * UNDOSEG_SIZE) + +/* The width of an undo log number in bits. 
24 allows for 16.7m logs. */ +#define UndoLogNumberBits 24 + +/* The width of an undo log offset in bits. 40 allows for 1TB per log.*/ +#define UndoLogOffsetBits (64 - UndoLogNumberBits) + +/* Special value for undo record pointer which indicates that it is invalid. */ +#define InvalidUndoRecPtr ((UndoRecPtr) 0) + +/* End-of-list value when building linked lists of undo logs. */ +#define InvalidUndoLogNumber -1 + +/* + * The maximum amount of data that can be stored in an undo log. Can be set + * artificially low to test full log behavior. + */ +#define UndoLogMaxSize ((UndoLogOffset) 1 << UndoLogOffsetBits) + +/* Type for numbering undo logs. */ +typedef int UndoLogNumber; + +/* Extract the undo log number from an UndoRecPtr. */ +#define UndoRecPtrGetLogNo(urp) \ + ((urp) >> UndoLogOffsetBits) + +/* Extract the offset from an UndoRecPtr. */ +#define UndoRecPtrGetOffset(urp) \ + ((urp) & ((UINT64CONST(1) << UndoLogOffsetBits) - 1)) + +/* Make an UndoRecPtr from an log number and offset. */ +#define MakeUndoRecPtr(logno, offset) \ + (((uint64) (logno) << UndoLogOffsetBits) | (offset)) + +/* The number of unusable bytes in the header of each block. */ +#define UndoLogBlockHeaderSize SizeOfPageHeaderData + +/* The number of usable bytes we can store per block. */ +#define UndoLogUsableBytesPerPage (BLCKSZ - UndoLogBlockHeaderSize) + +/* The pseudo-database OID used for undo logs. */ +#define UndoLogDatabaseOid 9 + +/* Length of undo checkpoint filename */ +#define UNDO_CHECKPOINT_FILENAME_LENGTH 16 + +/* + * UndoRecPtrIsValid + * True iff undoRecPtr is valid. + */ +#define UndoRecPtrIsValid(undoRecPtr) \ + ((bool) ((UndoRecPtr) (undoRecPtr) != InvalidUndoRecPtr)) + +/* Extract the relnode for an undo log. */ +#define UndoRecPtrGetRelNode(urp) \ + UndoRecPtrGetLogNo(urp) + +/* The only valid fork number for undo log buffers. */ +#define UndoLogForkNum MAIN_FORKNUM + +/* Compute the block number that holds a given UndoRecPtr. */ +#define UndoRecPtrGetBlockNum(urp) \ + (UndoRecPtrGetOffset(urp) / BLCKSZ) + +/* Compute the offset of a given UndoRecPtr in the page that holds it. */ +#define UndoRecPtrGetPageOffset(urp) \ + (UndoRecPtrGetOffset(urp) % BLCKSZ) + +/* Compare two undo checkpoint files to find the oldest file. */ +#define UndoCheckPointFilenamePrecedes(file1, file2) \ + (strcmp(file1, file2) < 0) + +/* What is the offset of the i'th non-header byte? */ +#define UndoLogOffsetFromUsableByteNo(i) \ + (((i) / UndoLogUsableBytesPerPage) * BLCKSZ + \ + UndoLogBlockHeaderSize + \ + ((i) % UndoLogUsableBytesPerPage)) + +/* How many non-header bytes are there before a given offset? */ +#define UndoLogOffsetToUsableByteNo(offset) \ + (((offset) % BLCKSZ - UndoLogBlockHeaderSize) + \ + ((offset) / BLCKSZ) * UndoLogUsableBytesPerPage) + +/* Add 'n' usable bytes to offset stepping over headers to find new offset. */ +#define UndoLogOffsetPlusUsableBytes(offset, n) \ + UndoLogOffsetFromUsableByteNo(UndoLogOffsetToUsableByteNo(offset) + (n)) + +/* Find out which tablespace the given undo log location is backed by. */ +extern Oid UndoRecPtrGetTablespace(UndoRecPtr insertion_point); + +/* Populate a RelFileNode from an UndoRecPtr. */ +#define UndoRecPtrAssignRelFileNode(rfn, urp) \ + do \ + { \ + (rfn).spcNode = UndoRecPtrGetTablespace(urp); \ + (rfn).dbNode = UndoLogDatabaseOid; \ + (rfn).relNode = UndoRecPtrGetRelNode(urp); \ + } while (false); + +/* + * Control metadata for an active undo log. Lives in shared memory inside an + * UndoLogControl object, but also written to disk during checkpoints. 
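+ *
+ * For orientation, the offsets stored below combine with the log number into
+ * the UndoRecPtr values used throughout the system; such a pointer can be
+ * mapped back to a physical location with the macros above, e.g. (sketch):
+ *
+ *     UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+ *     BlockNumber blkno = UndoRecPtrGetBlockNum(urp);
+ *     int pageoff = UndoRecPtrGetPageOffset(urp);
+ *     RelFileNode rnode;
+ *
+ *     UndoRecPtrAssignRelFileNode(rnode, urp);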
+ */ +typedef struct UndoLogMetaData +{ + UndoLogStatus status; + Oid tablespace; + UndoPersistence persistence; /* permanent, unlogged, temp? */ + UndoLogOffset insert; /* next insertion point (head) */ + UndoLogOffset end; /* one past end of highest segment */ + UndoLogOffset discard; /* oldest data needed (tail) */ + UndoLogOffset last_xact_start; /* last transaction's start undo offset */ + + /* + * If the same transaction is split over two undo logs, then this stores + * the previous log number; see the file header comments of undorecord.c + * for its usage. + * + * Fixme: See if we can find another way to handle this instead of keeping + * the previous log number. + */ + UndoLogNumber prevlogno; /* Previous undo log number */ + bool is_first_rec; + + /* + * Length of the last undo record. We need to save this in the undo + * meta-data and WAL-log it so that the value is preserved across a + * restart and the first undo record written after the restart can locate + * its predecessor. It is used to step to a transaction's previous record + * during rollback; if a transaction did some work before a checkpoint and + * the rest after it, we could not roll back properly without the + * pre-checkpoint prevlen. The undo worker also fetches this value when + * rolling back the last transaction in an undo log, to locate that + * transaction's last undo record. + */ + uint16 prevlen; +} UndoLogMetaData; + +/* Record the undo log number used for a transaction. */ +typedef struct xl_undolog_meta +{ + UndoLogMetaData meta; + UndoLogNumber logno; + TransactionId xid; +} xl_undolog_meta; + +#ifndef FRONTEND + +/* + * The in-memory control object for an undo log. In addition to the current + * meta-data for the undo log, we also lazily maintain a snapshot of the + * meta-data as it was at the redo point of a checkpoint that is in progress. + * + * Conceptually the set of UndoLogControl objects is arranged into a very + * large array for access by log number, but because we typically need only a + * smallish number of adjacent undo logs to be active at a time we arrange + * them into smaller fragments called 'banks'. + */ +typedef struct UndoLogControl +{ + UndoLogNumber logno; + UndoLogMetaData meta; /* current meta-data */ + XLogRecPtr lsn; + bool need_attach_wal_record; /* need to WAL-log an attach record? */ + pid_t pid; /* InvalidPid for unattached */ + LWLock mutex; /* protects the above */ + TransactionId xid; + /* State used by undo workers. */ + TransactionId oldest_xid; /* cache of oldest transaction's xid */ + uint32 oldest_xidepoch; + UndoRecPtr oldest_data; + LWLock discard_lock; /* prevents discarding while reading */ + LWLock rewind_lock; /* prevents rewinding while reading */ + + UndoLogNumber next_free; /* protected by UndoLogLock */ +} UndoLogControl; + +#endif + +/* Space management. */ +extern UndoRecPtr UndoLogAllocate(size_t size, + UndoPersistence level); +extern UndoRecPtr UndoLogAllocateInRecovery(TransactionId xid, + size_t size, + UndoPersistence persistence); +extern void UndoLogAdvance(UndoRecPtr insertion_point, + size_t size, + UndoPersistence persistence); +extern void UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid); +extern bool UndoLogIsDiscarded(UndoRecPtr point); + +/* Initialization interfaces.
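+ *
+ * A sketch of the expected startup ordering for the functions declared
+ * below; the exact call sites (shared-memory sizing, creation, and startup)
+ * are an assumption here:
+ *
+ *     size = add_size(size, UndoLogShmemSize());   -- while sizing shared memory
+ *     UndoLogShmemInit();                          -- while creating it
+ *     StartupUndoLogs(checkPointRedo);             -- during startup/recovery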
*/ +extern void StartupUndoLogs(XLogRecPtr checkPointRedo); +extern void UndoLogShmemInit(void); +extern Size UndoLogShmemSize(void); +extern void UndoLogInit(void); +extern void UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace, + char *path); +extern void ResetUndoLogs(UndoPersistence persistence); + +/* Interface use by tablespace.c. */ +extern bool DropUndoLogsInTablespace(Oid tablespace); + +/* GUC interfaces. */ +extern void assign_undo_tablespaces(const char *newval, void *extra); + +/* Checkpointing interfaces. */ +extern void CheckPointUndoLogs(XLogRecPtr checkPointRedo, + XLogRecPtr priorCheckPointRedo); + +#ifndef FRONTEND + +extern UndoLogControl *UndoLogGet(UndoLogNumber logno); +extern UndoLogControl *UndoLogNext(UndoLogControl *log); +extern bool AmAttachedToUndoLog(UndoLogControl *log); + +#endif + +extern void UndoLogSetLastXactStartPoint(UndoRecPtr point); +extern UndoRecPtr UndoLogGetLastXactStartPoint(UndoLogNumber logno); +extern UndoRecPtr UndoLogGetCurrentLocation(UndoPersistence persistence); +extern UndoRecPtr UndoLogGetFirstValidRecord(UndoLogNumber logno); +extern UndoRecPtr UndoLogGetNextInsertPtr(UndoLogNumber logno, + TransactionId xid); +extern void UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen); +extern bool IsTransactionFirstRec(TransactionId xid); +extern void UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen); +extern uint16 UndoLogGetPrevLen(UndoLogNumber logno); +extern bool NeedUndoMetaLog(XLogRecPtr redo_point); +extern void UndoLogSetLSN(XLogRecPtr lsn); +extern void LogUndoMetaData(xl_undolog_meta *xlrec); +void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno); +/* Redo interface. */ +extern void undolog_redo(XLogReaderState *record); +/* Discard the undo logs for temp tables */ +extern void TempUndoDiscard(UndoLogNumber); +extern Oid UndoLogStateGetDatabaseId(void); + +#endif diff --git a/src/include/access/undolog_xlog.h b/src/include/access/undolog_xlog.h new file mode 100644 index 0000000000..c6354bc7b5 --- /dev/null +++ b/src/include/access/undolog_xlog.h @@ -0,0 +1,71 @@ +/*------------------------------------------------------------------------- + * + * undolog_xlog.h + * undo log access XLOG definitions. + * + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/undolog_xlog.h + * + *------------------------------------------------------------------------