From: Tom Lane Date: Sat, 25 Jun 2005 22:47:49 +0000 (+0000) Subject: Force a checkpoint before committing a CREATE DATABASE command. This X-Git-Url: http://git.postgresql.org/gitweb/static/gitweb.js?a=commitdiff_plain;h=7352a30e88361e2b3b83f7bc8032ab8cc27f066b;p=users%2Fbernd%2Fpostgres.git Force a checkpoint before committing a CREATE DATABASE command. This should fix the recent reports of "index is not a btree" failures, as well as preventing a more obscure race condition involving changes to a template database just after copying it with CREATE DATABASE. --- diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index fcd7fe2393..c1365644b4 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -177,7 +177,7 @@ pg_dumpall > outfile The resulting dump can be restored with psql: -psql template1 < infile +psql -f infile template1 (Actually, you can specify any existing database name to start from, but if you are reloading in an empty cluster then template1 @@ -256,7 +256,7 @@ cat filename* | psql -pg_dump -Fc dbname > filename +pg_dump -Fc dbname > filename A custom-format dump is not a script for psql, but @@ -364,15 +364,29 @@ tar -cf backup.tar /usr/local/pgsql/data - If your database is spread across multiple volumes (for example, - data files and WAL log on different disks) there may not be any way - to obtain exactly-simultaneous frozen snapshots of all the volumes. + If your database is spread across multiple file systems, there may not + be any way to obtain exactly-simultaneous frozen snapshots of all + the volumes. For example, if your data files and WAL log are on different + disks, or if tablespaces are on different file systems, it might + not be possible to use snapshot backup because the snapshots must be + simultaneous. Read your file system documentation very carefully before trusting to the consistent-snapshot technique in such situations. 
The safest approach is to shut down the database server for long enough to establish all the frozen snapshots. + + Another option is to use rsync to perform a file + system backup. This is done by first running rsync + while the database server is running, then shutting down the database + server just long enough to do a second rsync. The + second rsync will be much quicker than the first, + because it has relatively little data to transfer, and the end result + will be consistent because the server was down. This method + allows a file system backup to be performed with minimal downtime. + + Note that a file system backup will not necessarily be smaller than an SQL dump. On the contrary, it will most likely be @@ -674,7 +688,13 @@ SELECT pg_start_backup('label'); SELECT pg_stop_backup(); - If this returns successfully, you're done. + This should return successfully. + + + + + Once the WAL segment files used during the backup are archived as part + of normal database activity, you are done. @@ -710,23 +730,34 @@ SELECT pg_stop_backup(); To make use of this backup, you will need to keep around all the WAL - segment files generated at or after the starting time of the backup. + segment files generated during and after the file system backup. To aid you in doing this, the pg_stop_backup function - creates a backup history file that is immediately stored - into the WAL archive area. This file is named after the first WAL - segment file that you need to have to make use of the backup. For - example, if the starting WAL file is 0000000100001234000055CD - the backup history file will be named something like - 0000000100001234000055CD.007C9330.backup. (The second part of - this file name stands for an exact position within the WAL file, and can - ordinarily be ignored.) Once you have safely archived the backup dump - file, you can delete all archived WAL segments with names numerically - preceding this one. The backup history file is just a small text file. 
- It contains the label string you gave to pg_start_backup, as - well as the starting and ending times of the backup. If you used the - label to identify where the associated dump file is kept, then the - archived history file is enough to tell you which dump file to restore, - should you need to do so. + creates a backup history file that is immediately + stored into the WAL archive area. This file is named after the first + WAL segment file that you need to have to make use of the backup. + For example, if the starting WAL file is + 0000000100001234000055CD the backup history file will be + named something like + 0000000100001234000055CD.007C9330.backup. (The second + number in the file name stands for an exact position within the WAL + file, and can ordinarily be ignored.) Once you have safely archived + the file system backup and the WAL segment files used during the + backup (as specified in the backup history file), all archived WAL + segments with names numerically less are no longer needed to recover + the file system backup and may be deleted. However, you should + consider keeping several backup sets to be absolutely certain that + you are can recover your data. Keep in mind that only completed WAL + segment files are archived, so there will be delay between running + pg_stop_backup and the archiving of all WAL segment + files needed to make the file system backup consistent. + + + The backup history file is just a small text file. It contains the + label string you gave to pg_start_backup, as well as + the starting and ending times of the backup. If you used the label + to identify where the associated dump file is kept, then the + archived history file is enough to tell you which dump file to + restore, should you need to do so. @@ -1111,6 +1142,31 @@ restore_command = 'copy /mnt/server/archivedir/%f "%p"' # Windows such index after completing a recovery operation. 
+ + + + If a CREATE DATABASE command is executed while a base + backup is being taken, and then the template database that the + CREATE DATABASE copied is modified while the base backup + is still in progress, it is possible that recovery will cause those + modifications to be propagated into the created database as well. + This is of course undesirable. To avoid this risk, it is best not to + modify any template databases while taking a base backup. + + + + + + CREATE TABLESPACE commands are WAL-logged with the literal + absolute path, and will therefore be replayed as tablespace creations + with the same absolute path. This might be undesirable if the log is + being replayed on a different machine. It can be dangerous even if + the log is being replayed on the same machine, but into a new data + directory: the replay will still overwrite the contents of the original + tablespace. To avoid potential gotchas of this sort, the best practice + is to take a new base backup after creating or dropping tablespaces. + + @@ -1121,7 +1177,9 @@ restore_command = 'copy /mnt/server/archivedir/%f "%p"' # Windows since we may need to fix partially-written disk pages. It is not necessary to store so many page copies for PITR operations, however. An area for future development is to compress archived WAL data by - removing unnecessary page copies. + removing unnecessary page copies. In the meantime, administrators + may wish to reduce the number of page snapshots included in WAL by + increasing the checkpoint interval parameters as much as feasible. @@ -1203,14 +1261,14 @@ pg_dumpall -p 5432 | psql -d template1 -p 6543 version, start the new server, restore the data. 
For example: -pg_dumpall > backup +pg_dumpall > backup pg_ctl stop mv /usr/local/pgsql /usr/local/pgsql.old cd ~/postgresql-&version; gmake install initdb -D /usr/local/pgsql/data postmaster -D /usr/local/pgsql/data -psql template1 < backup +psql -f backup template1 See about ways to start and stop the diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index cd174bb757..a25f474c99 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -418,23 +418,17 @@ createdb(const CreatedbStmt *stmt) /* Record the filesystem change in XLOG */ { xl_dbase_create_rec xlrec; - XLogRecData rdata[3]; + XLogRecData rdata[1]; xlrec.db_id = dboid; + xlrec.tablespace_id = dsttablespace; + xlrec.src_db_id = src_dboid; + xlrec.src_tablespace_id = srctablespace; + rdata[0].buffer = InvalidBuffer; rdata[0].data = (char *) &xlrec; - rdata[0].len = offsetof(xl_dbase_create_rec, src_path); - rdata[0].next = &(rdata[1]); - - rdata[1].buffer = InvalidBuffer; - rdata[1].data = (char *) srcpath; - rdata[1].len = strlen(srcpath) + 1; - rdata[1].next = &(rdata[2]); - - rdata[2].buffer = InvalidBuffer; - rdata[2].data = (char *) dstpath; - rdata[2].len = strlen(dstpath) + 1; - rdata[2].next = NULL; + rdata[0].len = sizeof(xl_dbase_create_rec); + rdata[0].next = NULL; (void) XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE, rdata); } @@ -506,6 +500,36 @@ createdb(const CreatedbStmt *stmt) /* Close pg_database, but keep exclusive lock till commit */ heap_close(pg_database_rel, NoLock); + + /* + * We force a checkpoint before committing. This effectively means + * that committed XLOG_DBASE_CREATE operations will never need to be + * replayed (at least not in ordinary crash recovery; we still have + * to make the XLOG entry for the benefit of PITR operations). 
+ * This avoids two nasty scenarios: + * + * #1: When PITR is off, we don't XLOG the contents of newly created + * indexes; therefore the drop-and-recreate-whole-directory behavior + * of DBASE_CREATE replay would lose such indexes. + * + * #2: Since we have to recopy the source database during DBASE_CREATE + * replay, we run the risk of copying changes in it that were committed + * after the original CREATE DATABASE command but before the system + * crash that led to the replay. This is at least unexpected and at + * worst could lead to inconsistencies, eg duplicate table names. + * + * (Both of these were real bugs in releases 8.0 through 8.0.3.) + * + * In PITR replay, the first of these isn't an issue, and the second + * is only a risk if the CREATE DATABASE and subsequent template + * database change both occur while a base backup is being taken. + * There doesn't seem to be much we can do about that except document + * it as a limitation. + * + * Perhaps if we ever implement CREATE DATABASE in a less cheesy + * way, we can avoid this. 
+ */ + RequestCheckpoint(true); } @@ -717,8 +741,8 @@ RenameDatabase(const char *oldname, const char *newname) aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_DATABASE, oldname); - /* must have createdb */ - if (!have_createdb_privilege()) + /* must have createdb rights */ + if (!superuser() && !have_createdb_privilege()) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("permission denied to rename database"))); @@ -882,8 +906,7 @@ AlterDatabaseOwner(const char *dbname, AclId newOwnerSysId) bool isNull; HeapTuple newtuple; - /* changing owner's database for someone else: must be superuser */ - /* note that the someone else need not have any permissions */ + /* must be superuser to change ownership */ if (!superuser()) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), @@ -1004,24 +1027,22 @@ get_db_info(const char *name, Oid *dbIdP, int4 *ownerIdP, return gottuple; } +/* Check if current user has createdb privileges */ static bool have_createdb_privilege(void) { + bool result = false; HeapTuple utup; - bool retval; utup = SearchSysCache(SHADOWSYSID, Int32GetDatum(GetUserId()), 0, 0, 0); - - if (!HeapTupleIsValid(utup)) - retval = false; - else - retval = ((Form_pg_shadow) GETSTRUCT(utup))->usecreatedb; - - ReleaseSysCache(utup); - - return retval; + if (HeapTupleIsValid(utup)) + { + result = ((Form_pg_shadow) GETSTRUCT(utup))->usecreatedb; + ReleaseSysCache(utup); + } + return result; } /* @@ -1066,18 +1087,15 @@ remove_dbtablespaces(Oid db_id) /* Record the filesystem change in XLOG */ { xl_dbase_drop_rec xlrec; - XLogRecData rdata[2]; + XLogRecData rdata[1]; xlrec.db_id = db_id; + xlrec.tablespace_id = dsttablespace; + rdata[0].buffer = InvalidBuffer; rdata[0].data = (char *) &xlrec; - rdata[0].len = offsetof(xl_dbase_drop_rec, dir_path); - rdata[0].next = &(rdata[1]); - - rdata[1].buffer = InvalidBuffer; - rdata[1].data = (char *) dstpath; - rdata[1].len = strlen(dstpath) + 1; - rdata[1].next = NULL; + rdata[0].len = sizeof(xl_dbase_drop_rec); + 
rdata[0].next = NULL; (void) XLogInsert(RM_DBASE_ID, XLOG_DBASE_DROP, rdata); } @@ -1180,6 +1198,86 @@ dbase_redo(XLogRecPtr lsn, XLogRecord *record) if (info == XLOG_DBASE_CREATE) { xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record); + char *src_path; + char *dst_path; + struct stat st; + +#ifndef WIN32 + char buf[2 * MAXPGPATH + 100]; +#endif + + src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id); + dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id); + + /* + * Our theory for replaying a CREATE is to forcibly drop the + * target subdirectory if present, then re-copy the source data. + * This may be more work than needed, but it is simple to + * implement. + */ + if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode)) + { + if (!rmtree(dst_path, true)) + ereport(WARNING, + (errmsg("could not remove database directory \"%s\"", + dst_path))); + } + + /* + * Force dirty buffers out to disk, to ensure source database is + * up-to-date for the copy. (We really only need to flush buffers for + * the source database, but bufmgr.c provides no API for that.) + */ + BufferSync(-1, -1); + +#ifndef WIN32 + + /* + * Copy this subdirectory to the new location + * + * XXX use of cp really makes this code pretty grotty, particularly + * with respect to lack of ability to report errors well. Someday + * rewrite to do it for ourselves. 
+ */ + + /* We might need to use cp -R one day for portability */ + snprintf(buf, sizeof(buf), "cp -r '%s' '%s'", + src_path, dst_path); + if (system(buf) != 0) + ereport(ERROR, + (errmsg("could not initialize database directory"), + errdetail("Failing system command was: %s", buf), + errhint("Look in the postmaster's stderr log for more information."))); +#else /* WIN32 */ + if (copydir(src_path, dst_path) != 0) + { + /* copydir should already have given details of its troubles */ + ereport(ERROR, + (errmsg("could not initialize database directory"))); + } +#endif /* WIN32 */ + } + else if (info == XLOG_DBASE_DROP) + { + xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record); + char *dst_path; + + dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id); + + /* + * Drop pages for this database that are in the shared buffer + * cache + */ + DropBuffers(xlrec->db_id); + + if (!rmtree(dst_path, true)) + ereport(WARNING, + (errmsg("could not remove database directory \"%s\"", + dst_path))); + } + else if (info == XLOG_DBASE_CREATE_OLD) + { + xl_dbase_create_rec_old *xlrec = (xl_dbase_create_rec_old *) XLogRecGetData(record); char *dst_path = xlrec->src_path + strlen(xlrec->src_path) + 1; struct stat st; @@ -1235,9 +1333,9 @@ dbase_redo(XLogRecPtr lsn, XLogRecord *record) } #endif /* WIN32 */ } - else if (info == XLOG_DBASE_DROP) + else if (info == XLOG_DBASE_DROP_OLD) { - xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record); + xl_dbase_drop_rec_old *xlrec = (xl_dbase_drop_rec_old *) XLogRecGetData(record); /* * Drop pages for this database that are in the shared buffer @@ -1268,14 +1366,29 @@ dbase_desc(char *buf, uint8 xl_info, char *rec) if (info == XLOG_DBASE_CREATE) { xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec; + + sprintf(buf + strlen(buf), "create db: copy dir %u/%u to %u/%u", + xlrec->src_db_id, xlrec->src_tablespace_id, + xlrec->db_id, xlrec->tablespace_id); + } + else if (info == XLOG_DBASE_DROP) + { + 
xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec; + + sprintf(buf + strlen(buf), "drop db: dir %u/%u", + xlrec->db_id, xlrec->tablespace_id); + } + else if (info == XLOG_DBASE_CREATE_OLD) + { + xl_dbase_create_rec_old *xlrec = (xl_dbase_create_rec_old *) rec; char *dst_path = xlrec->src_path + strlen(xlrec->src_path) + 1; sprintf(buf + strlen(buf), "create db: %u copy \"%s\" to \"%s\"", xlrec->db_id, xlrec->src_path, dst_path); } - else if (info == XLOG_DBASE_DROP) + else if (info == XLOG_DBASE_DROP_OLD) { - xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec; + xl_dbase_drop_rec_old *xlrec = (xl_dbase_drop_rec_old *) rec; sprintf(buf + strlen(buf), "drop db: %u directory: \"%s\"", xlrec->db_id, xlrec->dir_path);
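The dbcommands.c hunks above replace the variable-length WAL records (a header followed by NUL-terminated source and destination paths) with fixed-size records that carry only OIDs, and dbase_desc now prints those OIDs instead of the paths. The following is a minimal standalone sketch of that layout and of the two new description formats. It is not the backend code: the `Oid` typedef, the `demo_*` struct names, and the `demo_desc_*` helpers are hypothetical stand-ins introduced here for illustration. Only the field names and the format strings are taken from the patch.

```c
#include <stdio.h>
#include <stddef.h>

/* Assumption: a stand-in for the backend's Oid type. */
typedef unsigned int Oid;

/*
 * Hypothetical mirror of the new fixed-size CREATE record: four OIDs and
 * no embedded path strings, so its length is simply sizeof() the struct.
 */
typedef struct
{
    Oid db_id;
    Oid tablespace_id;
    Oid src_db_id;
    Oid src_tablespace_id;
} demo_dbase_create_rec;

/* Hypothetical mirror of the new fixed-size DROP record. */
typedef struct
{
    Oid db_id;
    Oid tablespace_id;
} demo_dbase_drop_rec;

/* Format a CREATE record using the format string added to dbase_desc. */
static int
demo_desc_create(char *buf, size_t buflen, const demo_dbase_create_rec *rec)
{
    return snprintf(buf, buflen, "create db: copy dir %u/%u to %u/%u",
                    rec->src_db_id, rec->src_tablespace_id,
                    rec->db_id, rec->tablespace_id);
}

/* Format a DROP record using the format string added to dbase_desc. */
static int
demo_desc_drop(char *buf, size_t buflen, const demo_dbase_drop_rec *rec)
{
    return snprintf(buf, buflen, "drop db: dir %u/%u",
                    rec->db_id, rec->tablespace_id);
}

#ifdef DEMO_MAIN
int
main(void)
{
    /* Illustrative OID values only; they are not taken from the patch. */
    demo_dbase_create_rec crec = {16384, 1663, 1, 1663};
    demo_dbase_drop_rec drec = {16384, 1663};
    char buf[128];

    demo_desc_create(buf, sizeof(buf), &crec);
    puts(buf);
    demo_desc_drop(buf, sizeof(buf), &drec);
    puts(buf);
    return 0;
}
#endif
```

Compiling with -DDEMO_MAIN builds a small driver that prints one line per record. Because the records hold only OIDs, replay in the patch recomputes both directories with GetDatabasePath rather than trusting path strings stored in the log.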