PW User Guide
PW User Guide
0)
Contents
1 Introduction 1
1.1 What can PWscf do . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 People . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Terms of use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Compilation 4
3 Using PWscf 6
3.1 Input data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2 Data files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Electronic structure calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.4 Optimization and dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.5 Direct interface with CASINO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4 Performances 12
4.1 Execution time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Memory requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3 File space requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.4 Parallelization issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.5 Understanding the time report . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.5.1 Serial execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.5.2 Parallel execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5 Troubleshooting 17
5.1 Compilation problems with PLUMED . . . . . . . . . . . . . . . . . . . . . . . . . 24
1 Introduction
This guide covers the usage of the PWscf (Plane-Wave Self-Consistent Field) package, a core
component of the Quantum ESPRESSO distribution. Further documentation, beyond what
is provided in this guide, can be found in the directory PW/Doc/, containing a copy of this guide.
1
This guide assumes that you know the physics that PWscf describes and the methods it
implements. It also assumes that you have already installed, or know how to install, Quantum
ESPRESSO. If not, please read the general Users Guide for Quantum ESPRESSO, found
in directory Doc/ two levels above the one containing this guide; or consult the web site:
http://www.quantum-espresso.org.
People who want to modify or contribute to PWscf should read the Developer Manual:
Doc/developer man.pdf.
macroscopic polarization and finite electric fields via the modern theory of polarization
(Berry Phases).
All of the above works for both insulators and metals, in any crystal structure, for many
exchange-correlation (XC) functionals (including spin polarization, DFT+U, nonlocal VdW
functional, hybrid functionals), for norm-conserving (Hamann-Schluter-Chiang) PPs (NCPPs)
in separable form or Ultrasoft (Vanderbilt) PPs (USPPs) or Projector Augmented Waves (PAW)
method. Noncollinear magnetism and spin-orbit interactions are also implemented. An imple-
mentation of finite electric fields with a sawtooth potential in a supercell is also available. Please
note that NEB calculations are no longer performed by pw.x, but are instead carried out by
neb.x (see main user guide), a dedicated code for path optimization which can use PWscf as
computational engine.
1.2 People
The PWscf package (which included PHonon and PostProc in earlier releases) was originally
developed by Stefano Baroni, Stefano de Gironcoli, Andrea Dal Corso (SISSA), Paolo Giannozzi
(Univ. Udine), and many others. We quote in particular:
2
Ralph Gebauer (ICTP, Trieste) and Adriano Mosca Conte (SISSA, Trieste) for non-
collinear magnetism;
Lorenzo Paulatto (Univ.Paris VI) for PAW implementation, built upon previous work by
Guido Fratesi (Univ.Milano Bicocca) and Riccardo Mazzarello (ETHZ-USI Lugano);
Dario Alfe, Mike Towler (University College London), Norbert Nemec (U.Cambridge) for
interface with CASINO;
Gabriele Sclauzero (IRRMA Lausanne) for DFT+U with on-site occupations obtained
from pseudopotential projectors;
Axel Kohlmeyer contributed libraries and utilities to call PWscffrom external codes (see
the COUPLE sub-directory), made the parallelization more modular and usable by external
codes.
Alberto Otero de la Roza (Merced Univ.) contributed XDM (exchange-hole dipole mo-
ment) model of dispersions, PW86 (unrevised) and B86B functionals.
Minoru Otani (AIST), Yoshio Miura (Tohoku U.), Nicephore Bonet (MIT), Nicola Marzari
(Univ. Oxford), Brandon Wood (LLNL), Tadashi Ogitsu (LLNL), contributed Effective
Screening Method (PRB 73, 115407 [2006])
3
Brian Kolb and Timo Thonhauser (Wake Forest University) implemented the vdW-DF
and vdW-DF2 functionals, with support from Riccardo Sabatini and Stefano de Gironcoli
(SISSA and DEMOCRITOS);
The initial BlueGene porting was done by Costas Bekas and Alessandro Curioni (IBM
Zurich).
This guide was mostly written by Paolo Giannozzi. Mike Towler wrote the PWscf to CASINO
subsection.
References for all exchange-correlation functionals can be found in the header of file
Modules/funct.f90.
Note the form Quantum ESPRESSO for textual citations of the code. Pseudopotentials
should be cited as (for instance)
2 Compilation
PWscf is included in the core Quantum ESPRESSO distribution. Instruction on how to in-
stall it can be found in the general documentation (Users Guide) for Quantum ESPRESSO.
Typing make pw from the main Quantum ESPRESSO directory or make from the PW/
subdirectory produces the pw.x executable in PW/src and a link to the bin/ directory. In
addition, several utility programs, and related links in bin/, are produced in PW/tools:
4
PW/tools/dist.x reads input data for PWscf, calculates distances and angles between
atoms in a cell, taking into account periodicity
PW/tools/pwi2xsf.sh, pwo2xsf.sh process respectively input and output files (not data
files!) for pw.xand produce an XSF-formatted file suitable for plotting with XCrySDen:
http://www.xcrysden.org/, powerful crystalline and molecular structure visualization
program. BEWARE: the pwi2xsf.sh shell script requires the pwi2xsf.x executables to
be located somewhere in your PATH.
PW/tools/bs.awk, PW/tools/mv.awk are scripts that process the output of pw.x (not
data files!). Usage:
The files so produced are suitable for use with xbs, a very simple X-windows utility to
display molecules, available at:
http://www.ccl.net/cca/software/X-WINDOW/xbsa/README.shtml
PW/tools/kvecs FS.x, PW/tools/bands FS.x: utilities for Fermi Surface plotting using
XCrySDen (contributed by the late Prof. Eyvaz)
Documentation for the auxiliary codes can be found in the codes themselves, e.g. in the header
of files.
5
3 Using PWscf
Input files for pw.x may be either written by hand or produced via the PWgui graphical interface
by Anton Kokalj, included in the Quantum ESPRESSO distribution. See PWgui-x.y.z/INSTALL
(where x.y.z is the version number) for more info on PWgui, or GUI/README if you are using
SVN sources.
You may take the tests and examples distributed with Quantum ESPRESSO as templates
for writing your own input files. In the following, whenever we mention Example N, we refer
to those. Input files are those in the results/ subdirectories, with names ending with .in
(they will appear after you have run the examples).
ATOMIC SPECIES
ATOMIC POSITIONS
K POINTS
CELL PARAMETERS (optional)
OCCUPATIONS (optional)
6
The keywords may be followed on the same line by an option. Unknown fields are ignored. See
the files mentioned above for details on the available cards.
Comments lines in cards can be introduced by either a ! or a # character in the first
position of a line.
Note about k points: The k-point grid can be either automatically generated or manually
provided as a list of k-points and a weight in the Irreducible Brillouin Zone only of the Bravais
lattice of the crystal. The code will generate (unless instructed not to do so: see variable nosym)
all required k-points and weights if the symmetry of the system is lower than the symmetry of
the Bravais lattice. The automatic generation of k-points follows the convention of Monkhorst
and Pack.
Band structure calculation First perform a SCF calculation as above; then do a non-SCF
calculation with the desired k-point grid and number nbnd of bands. Use calculation=bands
if you are interested in calculating only the Kohn-Sham states for the given set of k-points (e.g.
along symmetry lines: see for instance http://www.cryst.ehu.es/cryst/get kvec.html).
Specify instead calculation=nscf if you are interested in further processing of the results of
non-SCF calculations (for instance, in DOS calculations). In the latter case, you should specify
a uniform grid of points. For DOS calculations you should choose occupations=tetrahedra,
together with an automatically generated uniform k-point grid (card K POINTS with option
automatic). Specify nosym=.true. to avoid generation of additional k-points in low symme-
try cases. Variables prefix and outdir, which determine the names of input or output files,
should be the same in the two runs. See Examples 01, 06, 07,
NOTA BENE: Since v.4.1, both atomic positions and the scf potential are read from the
data file so that consistency is guaranteed.
7
noncolin
lspinorb
starting magnetization (one for each type of atoms)
To make a spin-orbit calculation noncolin must be true. If starting magnetization is set
to zero (or not given) the code makes a spin-orbit calculation without spin magnetization (it
assumes that time reversal symmetry holds and it does not calculate the magnetization). The
states are still two-component spinors but the total magnetization is zero.
If starting magnetization is different from zero, it makes a noncollinear spin polarized
calculation with spin-orbit interaction. The final spin magnetization might be zero or different
from zero depending on the system.
Furthermore to make a spin-orbit calculation you must use fully relativistic pseudopoten-
tials at least for the atoms in which you think that spin-orbit interaction is large. If all the
pseudopotentials are scalar relativistic the calculation becomes equivalent to a noncollinear cal-
culation without spin orbit. (Andrea Dal Corso, 2007-07-27) See Example 06 for noncollinear
magnetism, Example 07 for spin-orbit interactions.
DFT+U DFT+U (formerly known as LDA+U) calculation can be performed within a sim-
plified rotationally invariant form of the U Hubbard correction. Note that for all atoms having
a U value there should be an item in function flib/set hubbard l.f90 and one in subroutine
PW/src/tabd.f90, defining respectively the angular momentum and the occupancy of the or-
bitals with the Hubbard correction. If your Hubbard-corrected atoms are not there, you need
to edit these files and to recompile.
See Example 08 and its README.
Hartree-Fock and Hybrid functionals Since v.5.0, calculations in the Hartree-Fock ap-
proximation, or using hybrid XC functionals that include some Hartree-Fock exchange, no
longer require a special preprocessing before compilation. See EXX example/ and its README
file.
Dispersion interaction with non-local functional (vdwDF) See example vdwDF example
and references quoted in file README therein.
Polarization via Berry Phase See Example 04, its file README, the documentation in
the header of PW/src/bp c phase.f90.
Finite electric fields There are two different implementations of macroscopic electric fields
in pw.x: via an external sawtooth potential (input variable tefield=.true.) and via the
modern theory of polarizability (lelfield=.true.). The former is useful for surfaces, especially
in conjunction with dipolar corrections (dipfield=.true.): see examples/dipole example for
an example of application. Electric fields via modern theory of polarization are documented
in example 10. The exact meaning of the related variables, for both cases, is explained in the
general input documentation.
8
Orbital magnetization Modern theory of orbital magnetization [Phys. Rev. Lett. 95,
137205 (2005)] for insulators. The calculation is performed by setting input variable lorbm=.true.
in nscf run. If finite electric field is present (lelfield=.true.) only Kubo terms are computed
[see New J. Phys. 12, 053032 (2010) for details].
Molecular Dynamics Specify calculation=md, the time step dt, and possibly the num-
ber of MD stops nstep. Use variable ion dynamics in namelist &IONS for a fine-grained
control of the kind of dynamics. Other options for setting the initial temperature and for
thermalization using velocity rescaling are available. Remember: this is MD on the electronic
ground state, not Car-Parrinello MD. See Example 03.
Free-energy surface calculations Once PWscf is patched with the PLUMED plug-in, it is
possible to use most PLUMED functionalities by running PWscf as: ./pw.x -plumed plus the
other usual PWscf arguments. The input file for PLUMED must be found in the specified outdir
with fixed name plumed.dat.
Variable-cell optimization Since v.4.2 the newer BFGS algorithm covers the case of variable-
cell optimization as well. Note however that variable-cell calculations (both optimization and
dynamics) are performed with plane waves and G-vectors calculated for the starting cell. This
means that if you re-run a self-consistent calculation for the final cell and atomic positions
using the same cutoff ecutwfc (and/or ecutrho if applicable), you may not find exactly the
same results, unless your final and initial cells are very similar, or unless your cutoff(s) are very
high. In order to provide a further check, a last step is performed in which a scf calculation is
performed for the converged structure, with plane waves and G-vectors calculated for the final
cell. Small differences between the two last steps are thus to be expected and give an estimate
of the reliability of the variable-cell optimization. If you get a large difference, you are likely
quite far from convergence in the plane-wave basis set and you need to increase the cutoff(s).
Variable-cell molecular dynamics A common mistake many new users make is to set the
time step dt improperly to the same order of magnitude as for CP algorithm, or not setting dt
at all. This will produce a not evolving dynamics. Good values for the original RMW (RM
Wentzcovitch) dynamics are dt = 50 70. The choice of the cell mass is a delicate matter.
An off-optimal mass will make convergence slower. Too small masses, as well as too long time
steps, can make the algorithm unstable. A good cell mass will make the oscillation times for
internal degrees of freedom comparable to cell degrees of freedom in non-damped Variable-Cell
MD. Test calculations are advisable before extensive calculation. I have tested the damping
algorithm that I have developed and it has worked well so far. It allows for a much longer
time step (dt=100 150) than the RMW one and is much more stable with very small cell
9
masses, which is useful when the cell shape, not the internal degrees of freedom, is far out of
equilibrium. It also converges in a smaller number of steps than RMW. (Info from Cesar Da
Silva: the new damping algorithm is the default since v. 3.1).
Practicalities The interface between PWscf and CASINO is provided through a file with a
standard format containing geometry, basis set, and orbital coefficients, which PWscf will pro-
duce on demand. For SCF calculations, the name of this file may be pwfn.data, bwfn.data or
bwfn.data.b1 depending on user requests (see below). If the files are produced from an MD
run, the files have a suffix .0001, .0002, .0003 etc. corresponding to the sequence of timesteps.
CASINO support is implemented by three routines in the PW directory of the espresso distri-
bution:
Relevant behavior of PWscf may be modified through an optional auxiliary input file, named
pw2casino.dat (see below).
Note that in versions prior to 4.3, this functionality was provided through separate post-
processing utilities available in the PP directory: these are no longer supported. For QMC-MD
runs, PWSCF etc previously needed to be patched using the patch script PP/pw2casino-
MDloop.sh - this is no longer necessary.
How to generate xwfn.data files with PWscf Use the -pw2casino option when invoking
pw.x, e.g.:
10
&inputpp
blip_convert=.true.
blip_binary=.true.
blip_single_prec=.false.
blip_multiplicity=1.d0
n_points_for_test=0
/
Some or all of the 5 keywords may be provided, in any order. The default values are as given
above (and these are used if the pw2casino.dat file is not present.
The meanings of the keywords are as follows:
blip convert : reexpand the converged plane-wave orbitals in localized blip functions prior to
writing the CASINO wave function file. This is almost always done, since wave functions
expanded in blips are considerably more efficient in quantum Monte Carlo calculations. If
blip convert=.false. a pwfn.data file is produced (orbitals expanded in plane waves);
if blip convert=.true., either a bwfn.data file or a bwfn.data.b1 file is produced,
depending on the value of blip binary (see below).
blip binary : if true, and if blip convert is also true, write the blip wave function as an un-
formatted binary bwfn.data.b1 file. This is much smaller than the formatted bwfn.data
file, but is not generally portable across all machines.
blip single prec : if .false. the orbital coefficients in bwfn.data(.b1) are written out in
double precision; if the user runs into hardware limits blip single prec can be set to
.true. in which case the coefficients are written in single precision, reducing the memory
and disk requirements at the cost of a small amount of accuracy..
blip multiplicity : the quality of the blip expansion (i.e., the fineness of the blip grid) can be
improved by increasing the grid multiplicity parameter given by this keyword. Increasing
the grid multiplicity results in a greater number of blip coefficients and therefore larger
memory requirements and file size, but the CPU time should be unchanged. For very
accurate work, one may want to experiment with grid multiplicity larger that 1.0. Note,
however, that it might be more efficient to keep the grid multiplicity to 1.0 and increase
the plane wave cutoff instead.
n points for test : if this is set to a positive integer greater than zero, PWscf will sample the
wave function, the Laplacian and the gradient at a large number of random points in the
simulation cell and compute the overlap of the blip orbitals with the original plane-wave
orbitals:
< BW |P W >
= q
< BW |BW >< P W |P W >
The closer is to 1, the better the blip representation. By increasing blip multiplicity,
or by increasing the plane-wave cutoff, one ought to be able to make as close to 1 as
desired. The number of random points used is given by n points for test.
Finally, note that DFT trial wave functions produced by PWSCF must be generated using
the same pseudopotential as in the subsequent QMC calculation. This requires the use of tools
to switch between the different file formats used by the two codes.
11
CASINO uses the CASINO tabulated format, PWSCF officially supports the UPFv2 format
(though it will read other deprecated formats). This can be done through the casino2upf
and upf2casino tools included in the upftools directory (see the upftools/README file for
instructions). An alternative converter casinogon is included in the CASINO distribution which
produces the deprecated GON format but which can be useful when using non-standard grids.
4 Performances
4.1 Execution time
The following is a rough estimate of the complexity of a plain scf calculation with pw.x, for
NCPP. USPP and PAW give raise additional terms to be calculated, that may add from a
few percent up to 30-40% to execution time. For phonon calculations, each of the 3Nat modes
requires a time of the same order of magnitude of self-consistent calculation in the same system
(possibly times a small multiple). For cp.x, each time step takes something in the order of
Th + Torth + Tsub defined below.
The time required for the self-consistent solution at fixed ionic positions, Tscf , is:
where Niter = number of self-consistency iterations (niter), Titer = time for a single iteration,
Tinit = initialization time (usually much smaller than the first term).
The time required for a single self-consistency iteration Titer is:
where Nk = number of k-points, Tdiag = time per Hamiltonian iterative diagonalization, Trho
= time for charge density calculation, Tscf = time for Hartree and XC potential calculation.
The time for a Hamiltonian iterative diagonalization Tdiag is:
Th = a1 M N + a2 M N1 N2 N3 log(N1 N2 N3 ) + a3 M P N.
The first term comes from the kinetic term and is usually much smaller than the others. The
second and third terms come respectively from local and nonlocal potential. a1 , a2 , a3 are
prefactors (i.e. small numbers O(1)), M = number of valence bands (nbnd), N = number of
PW (basis set dimension: npw), N1 , N2 , N3 = dimensions of the FFT grid for wavefunctions
(nr1s, nr2s, nr3s; N1 N2 N3 8N ), P = number of pseudopotential projectors, summed on
all atoms, on all values of the angular momentum l, and m = 1, ..., 2l + 1.
The time Torth required by orthonormalization is
Torth = b1 N Mx2
12
and the time Tsub required by subspace diagonalization is
Tsub = b2 Mx3
where b1 and b2 are prefactors, Mx = number of trial wavefunctions (this will vary between M
and 2 4M , depending on the algorithm).
The time Trho for the calculation of charge density from wavefunctions is
Trho = c1 M Nr1 Nr2 Nr3 log(Nr1 Nr2 Nr3 ) + c2 M Nr1 Nr2 Nr3 + Tus
where c1 , c2 , c3 are prefactors, Nr1 , Nr2 , Nr3 = dimensions of the FFT grid for charge density
(nr1, nr2, nr3; Nr1 Nr2 Nr3 8Ng , where Ng = number of G-vectors for the charge density,
ngm), and Tus = time required by PAW/USPPs contribution (if any). Note that for NCPPs
the FFT grids for charge and wavefunctions are the same.
The time Tscf for calculation of potential from charge density is
Tscf = d2 Nr1 Nr2 Nr3 + d3 Nr1 Nr2 Nr3 log(Nr1 Nr2 Nr3 )
with m, p, q = small factors; all other variables have the same meaning as above. Note that if
the point only (k = 0) is used to sample the Brillouin Zone, the value of N will be cut into
half.
The memory required by the phonon code follows the same patterns, with somewhat larger
factors m, p, q.
13
the size and type of the system under study;
Ideally one would like to have linear scaling, i.e. T T0 /Np for Np processors, where T0 is
the estimated time for serial execution. In addition, one would like to have linear scaling of
the RAM per processor: ON O0 /Np , so that large-memory systems fit into the RAM of each
processor.
Parallelization on k-points:
guarantees (almost) linear scaling if the number of k-points is a multiple of the number
of pools;
does not reduce the required memory per processor (unsuitable for large-memory jobs).
Parallelization on PWs:
yields good to very good scaling, especially if the number of processors in a pool is a
divisor of N3 and Nr3 (the dimensions along the z-axis of the FFT grids, nr3 and nr3s,
which coincide for NCPPs);
yields almost linear reduction of memory per processor with the number of processors in
the pool.
A note on scaling: optimal serial performances are achieved when the data are as much as
possible kept into the cache. As a side effect, PW parallelization may yield superlinear (better
than linear) scaling, thanks to the increase in serial speed coming from the reduction of data
size (making it easier for the machine to keep data in the cache).
VERY IMPORTANT: For each system there is an optimal range of number of processors on
which to run the job. A too large number of processors will yield performance degradation. If
the size of pools is especially delicate: Np should not exceed N3 and Nr3 , and should ideally be
no larger than 1/2 1/4N3 and/or Nr3 . In order to increase scalability, it is often convenient
to further subdivide a pool of processors into task groups. When the number of processors
exceeds the number of FFT planes, data can be redistributed to task groups so that each
group can process several wavefunctions at the same time.
The optimal number of processors for linear-algebra parallelization, taking care of mul-
tiplication and diagonalization of M M matrices, should be determined by observing the
performances of cdiagh/rdiagh (pw.x) or ortho (cp.x) for different numbers of processors in
the linear-algebra group (must be a square integer).
Actual parallel performances will also depend on the available software (MPI libraries) and
on the available communication hardware. For PC clusters, OpenMPI (http://www.openmpi.org/)
seems to yield better performances than other implementations (info by Kostantin Kudin). Note
however that you need a decent communication hardware (at least Gigabit ethernet) in order
14
to have acceptable performances with PW parallelization. Do not expect good scaling with
cheap hardware: PW calculations are by no means an embarrassing parallel problem.
Also note that multiprocessor motherboards for Intel Pentium CPUs typically have just one
memory bus for all processors. This dramatically slows down any code doing massive access to
memory (as most codes in the Quantum ESPRESSO distribution do) that runs on processors
of the same motherboard.
For a typical job with norm-conserving PPs, the total (wall) time is mostly spent in
routine electrons, calculating the self-consistent solution.
Most of the time spent in electrons is used by routine c bands, calculating Kohn-
Sham states. sum band (calculating the charge density), v of rho (calculating the
potential), mix rho (charge density mixing) should take a small fraction of the time.
Most of the time spent in c bands is used by routines cegterg (k-points) or regterg
(Gamma-point only), performing iterative diagonalization of the Kohn-Sham Hamiltonian
in the PW basis set.
Most of the time spent in *egterg is used by routine h psi, calculating H products.
cdiaghg (k-points) or rdiaghg (Gamma-only), performing subspace diagonalization,
should take only a small fraction.
Among the general routines, most of the time is spent in FFT on Kohn-Sham states:
fftw, and to a smaller extent in other FFTs, fft and ffts, and in calbec, calculating
h|i products.
Forces and stresses typically take a fraction of the order of 10 to 20% of the total time.
For PAW and Ultrasoft PP, you will see a larger contribution by sum band and a nonnegligible
newd contribution to the time spent in electrons, but the overall picture is unchanged. You
may drastically reduce the overhead of Ultrasoft PPs by using input option tqr=.true..
Parallelization on FFT speeds up (with varying efficiency) almost all routines, with the
notable exception of cdiaghg and rdiaghg.
15
Parallelization on k-points speeds up (almost linearly) c bands and called routines;
speeds up partially sum band; does not speed up at all v of rho, newd, mix rho.
OpenMP parallelization speeds up fftw, plus selected parts of the calculation, plus
(depending on the availability of OpenMP-aware libraries) some linear algebra operations
and on RAM:
Parallelization on FFT distributes most arrays across processors (i.e. all G-space and R-
spaces arrays) but not all of them (in particular, not subspace Hamiltonian and overlap
matrices)
Time usage is still dominated by the same routines as for the serial run
Routine fft scatter (called by parallel FFT) takes a sizable part of the time spent in
FFTs but does not dominate it.
FFT parallelization shouldnt exceed N3 processors, i.e. if you run with -nk Nk, use
N = Nk N3 MPI processes at most (mpirun -np N ...)
You will need to experiment a bit to find the best compromise. In order to have good load
balancing among MPI processes, the number of k-point pools should be an integer divisor of
Nk ; the number of processors for FFT parallelization should be an integer divisor of N3 .
16
Typical symptoms of bad/inadequate parallelization
a disproportionate time is spent in fft scatter, or in fft scatter the difference between
CPU and wall time is large. Solutions:
if you do not have fast (better than Gigabit ethernet) communication hardware, do
not try FFT parallelization on more than 4 or 8 procs.
use (also) k-point parallelization if possible
the time doesnt scale well or doesnt scale at all by increasing the number of processors
for FFT parallelization. Solutions:
use task groups: try command-line option -ntg 4 or -ntg 8. This may improve
your scaling.
5 Troubleshooting
pw.x says error while loading shared libraries or cannot open shared object file
and does not start Possible reasons:
If you are running on the same machines on which the code was compiled, this is a library
configuration problem. The solution is machine-dependent. On Linux, find the path to
the missing libraries; then either add it to file /etc/ld.so.conf and run ldconfig (must
be done as root), or add it to variable LD LIBRARY PATH and export it. Another
possibility is to load non-shared version of libraries (ending with .a) instead of shared
ones (ending with .so).
If you are not running on the same machines on which the code was compiled: you need
either to have the same shared libraries installed on both machines, or to load statically all
libraries (using appropriate configure or loader options). The same applies to Beowulf-
style parallel machines: the needed shared libraries must be present on all PCs.
errors in examples with parallel execution If you get error messages in the example
scripts i.e. not errors in the codes on a parallel machine, such as e.g.: run example: -n:
command not found you may have forgotten the in the definitions of PARA PREFIX and
PARA POSTFIX.
17
pw.x prints the first few lines and then nothing happens (parallel execution) If
the code looks like it is not reading from input, maybe it isnt: the MPI libraries need to be
properly configured to accept input redirection. Use pw.x -i and the input file name (see
Sec.??), or inquire with your local computer wizard (if any). Since v.4.2, this is for sure the
reason if the code stops at Waiting for input....
pw.x stops with error while reading data There is an error in the input data, typically
a misspelled namelist variable, or an empty input file. Unfortunately with most compilers the
code just reports Error while reading XXX namelist and no further useful information. Here
are some more subtle sources of trouble:
Input data files containing M (Control-M) characters at the end of lines, or non-ASCII
characters (e.g. non-ASCII quotation marks, that at a first glance may look the same
as the ASCII character). Typically, this happens with files coming from Windows or
produced with smart editors.
Both may cause the code to crash with rather mysterious error messages. If none of the above
applies and the code stops at the first namelist (&CONTROL) and you are running in parallel,
see the previous item.
pw.x mumbles something like cannot recover or error reading recover file You are
trying to restart from a previous job that either produced corrupted files, or did not do what
you think it did. No luck: you have to restart from scratch.
pw.x stops with inconsistent DFT error As a rule, the flavor of DFT used in the
calculation should be the same as the one used in the generation of pseudopotentials, which
should all be generated using the same flavor of DFT. This is actually enforced: the type of
DFT is read from pseudopotential files and it is checked that the same DFT is read from all
PPs. If this does not hold, the code stops with the above error message. Use at your own
risk input variable input dft to force the usage of the DFT you like.
pw.x stops with error in cdiaghg or rdiaghg Possible reasons for such behavior are not
always clear, but they typically fall into one of the following cases:
serious error in data, such as bad atomic positions or bad crystal structure/supercell;
18
buggy libraries. Machine-optimized mathematical libraries are very fast but sometimes
not so robust from a numerical point of view. Suspicious behavior: you get an error
that is not reproducible on other architectures or that disappears if the calculation is
repeated with even minimal changes in parameters. Known cases: HP-Compaq alphas
with cxml libraries, Mac OS-X with system BLAS/LAPACK. Try to use compiled BLAS
and LAPACK (or better, ATLAS) instead of machine-optimized libraries.
pw.x crashes with no error message at all This happens quite often in parallel execu-
tion, or under a batch queue, or if you are writing the output to a file. When the program
crashes, part of the output, including the error message, may be lost, or hidden into error files
where nobody looks into. It is the fault of the operating system, not of the code. Try to run
interactively and to write to the screen. If this doesnt help, move to next point.
pw.x crashes with segmentation fault or similarly obscure messages Possible rea-
sons:
too much RAM memory or stack requested (see next item).
if you are using highly optimized mathematical libraries, verify that they are designed for
your hardware.
If you are using aggressive optimization in compilation, verify that you are using the
appropriate options for your machine
The executable was not properly compiled, or was compiled on a different and incompat-
ible environment.
buggy compiler or libraries: this is the default explanation if you have problems with the
provided tests and examples.
pw.x works for simple systems, but not for large systems or whenever more RAM
is needed Possible solutions:
Increase the amount of RAM you are authorized to use (which may be much smaller than
the available RAM). Ask your system administrator if you dont know what to do. In
some cases the stack size can be a source of problems: if so, increase it with command
limits or ulimit).
Reduce nbnd to the strict minimum (for insulators, the default is already the minimum,
though).
Reduce the work space for Davidson diagonalization to the minimum by setting diago david ndim=
also consider using conjugate gradient diagonalization (diagonalization=cg), slow but
very robust, which requires almost no work space.
If the charge density takes a significant amount of RAM, reduce mixing ndim from its
default value (8) to 4 or so.
In parallel execution, use more processors, or use the same number of processors with less
pools. Remember that parallelization with respect to k-points (pools) does not distribute
memory: only parallelization with respect to R- (and G-) space does.
19
If none of the above is sufficient or feasible, you have to either reduce the cutoffs and/or
the cell size, or to use a machine with more RAM.
pw.x crashes with error in davcio davcio is the routine that performs most of the I/O
operations (read from disk and write to disk) in pw.x; error in davcio means a failure of an
I/O operation.
If the error is reproducible and happens at the beginning of a calculation: check if you
have read/write permission to the scratch directory specified in variable outdir. Also:
check if there is enough free space available on the disk you are writing to, and check your
disk quota (if any).
If the error is irreproducible: your might have flaky disks; if you are writing via the
network using NFS (which you shouldnt do anyway), your network connection might be
not so stable, or your NFS implementation is unable to work under heavy load
If it happens while restarting from a previous calculation: you might be restarting from
the wrong place, or from wrong data, or the files might be corrupted.
If you are running two or more instances of pw.x at the same time, check if you are using
the same file names in the same temporary directory. For instance, if you submit a series
of jobs to a batch queue, do not use the same outdir and the same prefix, unless you
are sure that one job doesnt start before a preceding one has finished.
pw.x crashes in parallel execution with an obscure message related to MPI errors
Random crashes due to MPI errors have often been reported, typically in Linux PC clusters.
We cannot rule out the possibility that bugs in Quantum ESPRESSO cause such behavior,
but we are quite confident that the most likely explanation is a hardware problem (defective
RAM for instance) or a software bug (in MPI libraries, compiler, operating system).
Debugging a parallel code may be difficult, but you should at least verify if your problem is
reproducible on different architectures/software configurations/input data sets, and if there is
some particular condition that activates the bug. If this doesnt seem to happen, the odds are
that the problem is not in Quantum ESPRESSO. You may still report your problem, but
consider that reports like it crashes with...(obscure MPI error) contain 0 bits of information
and are likely to get 0 bits of answers.
pw.x stops with error message the system is metallic, specify occupations You did
not specify state occupations, but you need to, since your system appears to have an odd number
of electrons. The variable controlling how metallicity is treated is occupations in namelist
&SYSTEM. The default, occupations=fixed, occupies the lowest (N electrons)/2 states
and works only for insulators with a gap. In all other cases, use smearing (tetrahedra
for DOS calculations). See input reference documentation for more details.
serious error in data, such as bad number of electrons, insufficient number of bands,
absurd value of broadening;
20
the Fermi energy is found by bisection assuming that the integrated DOS N(E ) is an in-
creasing function of the energy. This is not guaranteed for Methfessel-Paxton smearing of
order 1 and can give problems when very few k-points are used. Use some other smearing
function: simple Gaussian broadening or, better, Marzari-Vanderbilt cold smearing.
pw.x yields internal error: cannot bracket Ef message but does not stop This may
happen under special circumstances when you are calculating the band structure for selected
high-symmetry lines. The message signals that occupations and Fermi energy are not correct
(but eigenvalues and eigenvectors are). Remove occupations=tetrahedra in the input data
to get rid of the message.
pw.x yields weird results If results are really weird (as opposed to misinterpreted):
if this happens after a change in the code or in compilation or preprocessing options, try
make clean, recompile. The make command should take care of all dependencies, but do
not rely too heavily on it. If the problem persists, recompile with reduced optimization
level.
maybe your input data are weird.
FFT grid is machine-dependent Yes, they are! The code automatically chooses the small-
est grid that is compatible with the specified cutoff in the specified cell, and is an allowed value
for the FFT library used. Most FFT libraries are implemented, or perform well, only with
dimensions that factors into products of small numbers (2, 3, 5 typically, sometimes 7 and 11).
Different FFT libraries follow different rules and thus different dimensions can result for the
same system on different machines (or even on the same machine, with a different FFT). See
function allowed in Modules/fft scalar.f90.
As a consequence, the energy may be slightly different on different machines. The only
piece that explicitly depends on the grid parameters is the XC part of the energy that is
computed numerically on the grid. The differences should be small, though, especially for LDA
calculations.
Manually setting the FFT grids to a desired value is possible, but slightly tricky, using
input variables nr1, nr2, nr3 and nr1s, nr2s, nr3s. The code will still increase them if not
acceptable. Automatic FFT grid dimensions are slightly overestimated, so one may try very
carefully to reduce them a little bit. The code will stop if too small values are required, it will
waste CPU time and memory for too large values.
Note that in parallel execution, it is very convenient to have FFT grid dimensions along z
that are a multiple of the number of processors.
21
pw.x does not find all the symmetries you expected pw.x determines first the symmetry
operations (rotations) of the Bravais lattice; then checks which of these are symmetry operations
of the system (including if needed fractional translations). This is done by rotating (and
translating if needed) the atoms in the unit cell and verifying if the rotated unit cell coincides
with the original one.
Assuming that your coordinates are correct (please carefully check!), you may not find all
the symmetries you expect because:
the number of significant figures in the atomic positions is not large enough. In file
PW/eqvect.f90, the variable accep is used to decide whether a rotation is a symmetry
operation. Its current value (105 ) is quite strict: a rotated atom must coincide with
another atom to 5 significant digits. You may change the value of accep and recompile.
they are not acceptable symmetry operations of the Bravais lattice. This is the case
for C60 , for instance: the Ih icosahedral group of C60 contains 5-fold rotations that are
incompatible with translation symmetry.
the system is rotated with respect to symmetry axis. For instance: a C60 molecule in the
fcc lattice will have 24 symmetry operations (Th group) only if the double bond is aligned
along one of the crystal axis; if C60 is rotated in some arbitrary way, pw.x may not find
any symmetry, apart from inversion.
they contain a fractional translation that is incompatible with the FFT grid (see next
paragraph). Note that if you change cutoff or unit cell volume, the automatically com-
puted FFT grid changes, and this may explain changes in symmetry (and in the number
of k-points as a consequence) for no apparent good reason (only if you have fractional
translations in the system, though).
Self-consistency is slow or does not converge at all Bad input data will often result in
bad scf convergence. Please carefully check your structure first, e.g. using XCrySDen.
Assuming that your input data is sensible :
22
1. Verify if your system is metallic or is close to a metallic state, especially if you have few
k-points. If the highest occupied and lowest unoccupied state(s) keep exchanging place
during self-consistency, forget about reaching convergence. A typical sign of such behavior
is that the self-consistency error goes down, down, down, than all of a sudden up again,
and so on. Usually one can solve the problem by adding a few empty bands and a small
broadening.
2. Reduce mixing beta to 0.3 0.1 or smaller. Try the mixing mode value that is more
appropriate for your problem. For slab geometries used in surface problems or for elon-
gated cells, mixing mode=local-TF should be the better choice, dampening charge
sloshing. You may also try to increase mixing ndim to more than 8 (default value).
Beware: this will increase the amount of memory you need.
3. Specific to USPP: the presence of negative charge density regions due to either the
pseudization procedure of the augmentation part or to truncation at finite cutoff may
give convergence problems. Raising the ecutrho cutoff for charge density will usually
help.
I do not get the same results in different machines! If the difference is small, do not
panic. It is quite normal for iterative methods to reach convergence through different paths
as soon as anything changes. In particular, between serial and parallel execution there are
operations that are not performed in the same order. As the numerical accuracy of computer
numbers is finite, this can yield slightly different results.
It is also normal that the total energy converges to a better accuracy than its terms, since
only the sum is variational, i.e. has a minimum in correspondence to ground-state charge
density. Thus if the convergence threshold is for instance 108 , you get 8-digit accuracy on
the total energy, but one or two less on other terms (e.g. XC and Hartree energy). It this
is a problem for you, reduce the convergence threshold for instance to 1010 or 1012 . The
differences should go away (but it will probably take a few more iterations to converge).
Execution time is time-dependent! Yes it is! On most machines and on most operating
systems, depending on machine load, on communication load (for parallel machines), on various
other factors (including maybe the phase of the moon), reported execution times may vary quite
a lot for the same job.
Warning : N eigenvectors not converged This is a warning message that can be safely
ignored if it is not present in the last steps of self-consistency. If it is still present in the last
steps of self-consistency, and if the number of unconverged eigenvector is a significant part of
the total, it may signal serious trouble in self-consistency (see next point) or something badly
wrong in input data.
23
real space. This happens only with core corrections and with USPPs. In some cases it may
be a source of trouble (see next point) but it is usually solved by increasing the cutoff for the
charge density.
24
with
Calling C from fortran PLUMED assumes that fortran compilers add a single at the end
of C routines. You may get an error message as :
eliminate the from the definition of init metadyn and meta force calculation, i. e. change at
line 529
void meta_force_calculation_(real *cell, int *istep, real *xxx, real *yyy, real *zzz,
with
void meta_force_calculation(real *cell, int *istep, real *xxx, real *yyy, real *zzz,
25