
Why Is My Linux ECS Not Booting and Going Into Emergency Mode?


Symptom
Your Linux ECS enters the emergency mode during startup, and displays the message "Welcome to
emergency mode", asking you to enter the password of user root for maintenance.
Figure 1 Emergency mode

Possible Causes
Emergency mode allows you to recover the system even when it cannot enter rescue mode. In emergency mode, the system mounts only the root file system, and only for reading data. It does not attempt to mount any other local file systems or activate network interfaces.

The system enters the emergency mode when:

 An error occurred in the /etc/fstab file, causing a file system mount to fail.

 An error occurred in the file system.

Constraints
The operations in this section are applicable to Linux. The operations involve recovering the file
system, which may lead to data loss. Therefore, back up data before recovering the file system.

Solution
1. Enter the password of user root and press Enter to access the maintenance shell.

2. Run the following command to mount the root partition in read-write mode to modify the files in the
root directory:

# mount -o rw,remount /
3. Run the following command to try to mount all unmounted file systems:

# mount -a
o If the message "mount point does not exist" is displayed, the mount point is unavailable. In such a
case, create the mount point.
o If the message "no such device" is displayed, the file system device is unavailable. In such a case,
comment out or delete the mount line.

o If the message "an incorrect mount option was specified" is displayed, the mount parameters have
been incorrectly set. In such a case, correct the parameter setting.

o If no mount error occurs but the message "UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY" is displayed, the file system is faulty. In such a case, go to step 7.
4. Run the following command to open the /etc/fstab file and correct the error:

# vi /etc/fstab

The /etc/fstab file contains the following parameters separated by space:

[file system] [dir] [type] [options] [dump] [fsck]

Table 1 /etc/fstab parameters

Parameter Description

[file system] Specifies the partition or storage device to be mounted.

You are advised to set file system in UUID format. To obtain the UUID of a device file system,
run the blkid command.

Format for reference:

# <device> <dir> <type> <options> <dump> <fsck>

UUID=b411dc99-f0a0-4c87-9e05-184977be8539 /home ext4 defaults 0 2

UUIDs are independent of the disk order. If the sequence of storage devices is changed manually, reshuffled by the BIOS, or the devices are removed and reinstalled, UUIDs still identify the storage devices reliably.

[dir] Specifies the mount point of a file system.

[type] Specifies the type of the file system to which a device or partition is mounted. The following file
systems are supported: ext2, ext3, ext4, reiserfs, xfs, jfs, smbfs, iso9660, vfat, ntfs, swap, and auto.

If type is set to auto, the mount command will speculate on the type of the file system that is
used, which is useful for mobile devices, such as CD-ROM and DVD.

[options] Specifies the parameters used for mounting. Some parameters are available only for specific file systems. For example, defaults indicates that the default mounting parameters of the file system will be used. The default parameters of the ext4 file system are rw, suid, dev, exec, auto, nouser, and async.

For more parameters, run the # man mount command to view the man page.

[dump] Specifies whether file system data will be backed up.

The value can be 0 or 1. 0 indicates that data will not be backed up, and 1 indicates that data will
be backed up. If you have not installed dump, set the parameter to 0.

[fsck] Specifies the sequence of checking file systems.

The parameter value can be 0, 1, or 2. 0 indicates that the file system will not be checked by fsck, 1 indicates the highest priority (used for the root file system), and 2 indicates a lower priority used for other file systems.

5. After the modification, run the following command to check the fstab file:

# mount -a
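The six-field layout described in Table 1 can also be sanity-checked mechanically before you reboot. The following is a minimal sketch; the sample lines are made up for illustration, so point the check at your real /etc/fstab when working on an actual server.

```shell
# Check that every non-comment, non-blank fstab line has exactly six fields.
# The sample below is illustrative; on a real system read /etc/fstab instead.
fstab_sample='UUID=b411dc99-f0a0-4c87-9e05-184977be8539 /home ext4 defaults 0 2
# backup disk, currently detached
/dev/vdb1 /data xfs defaults 0 0'

echo "$fstab_sample" | awk '
  /^[[:space:]]*(#|$)/ { next }                 # skip comments and blank lines
  NF != 6 { printf "line %d has %d fields: %s\n", NR, NF, $0; bad = 1 }
  END { exit bad }
' && echo "fstab layout looks sane"
```

A field-count check will not catch a wrong UUID or mount point, but it does catch the common case of a line accidentally joined or split during editing.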

6. Run the following command to restart the ECS:

# reboot
7. Run the following command to check for file system errors:
# dmesg | egrep "ext[2-4]|xfs" | grep -i error

NOTE:

o If the error message "I/O error... inode" is displayed, the fault is caused by a file system error.

o If no error is found in the logs, the fault is generally caused by the damaged superblock. The
superblock is the header of the file system. It records the status, size, and idle disk blocks of the file
system.

o If the superblock of a file system is damaged, for example, because data was written to it by mistake, the system may fail to identify the file system and enters emergency mode during startup. The ext file systems back up the superblock and store the copies at block-group boundaries on the drive.
8. Run the following command to unmount the directory where the file system error occurred:

# umount <mount point>
9. Recover the damaged file system.

NOTICE:

Recovering the file system may lead to data loss. Back up data before the recovery.

o For the ext file system, run the following command to check whether the file system is faulty:

# fsck -n /dev/vdb1

NOTE:

If the message "The superblock could not be read or does not describe a correct ext2 filesystem" is displayed, go to step 10.
To recover the file system, run the following command:

# fsck /dev/vdb1

o For the xfs file system, run the following command to check whether the file system is faulty:

# xfs_repair -n /dev/vdb1
To recover the file system, run the following command:

# xfs_repair /dev/vdb1

10. (Optional) If the message "The superblock could not be read or does not describe a correct ext2 filesystem" is displayed, the superblock is damaged. In such a case, use a superblock backup for recovery.
Figure 2 Damaged superblock

Run the following command to replace the damaged superblock with the superblock backup:

# e2fsck -b 8193 <device name>

In the preceding command, the device name is the block device holding the damaged file system, for example, /dev/vdb1.

As shown in Figure 3, the damaged superblock has been replaced.


Figure 3 Replacing the damaged superblock

NOTE:

-b 8193 indicates that the backup superblock stored at block 8193 of the file system is used.

The location of the superblock backup varies depending on the block size of the file system. For a
file system with a 1 KB block size, locate the backup at superblock 8193; for a 2 KB block size,
locate the backup at superblock 16384; for a 4 KB block size, locate the backup at superblock
32768.
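Those locations follow from the ext block-group geometry: each block group spans 8 x block-size blocks (one block of bitmap covers that many bits), and with 1 KB blocks the file system starts at block 1, shifting everything by one. A small sketch reproducing the numbers in the note:

```shell
# First backup-superblock location for each common ext block size.
# blocks_per_group = 8 * block_size_in_bytes; the 1 KB case is offset by one
# because the file system's first block is block 1 rather than block 0.
for bs in 1024 2048 4096; do
  bpg=$(( 8 * bs ))
  first=$(( bs == 1024 ? bpg + 1 : bpg ))
  echo "block size ${bs} B -> first backup superblock at block ${first}"
done
```

On a live system, running dumpe2fs <device> | grep -i superblock lists the backup locations actually recorded in that file system, which is safer than computing them by hand.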

11. Run the following command to restart the ECS:

# reboot

The first 5 things to do when your Linux server keels over

Even Linux servers can go haywire some days.

Here are the first steps you should take in troubleshooting and fixing them.
I've seen plenty of Linux servers run day in and day out for years, with
nary a reboot. But any server can suffer from hardware, software, and
connectivity problems. Here's how to find out what's wrong so you can
get them working again.
One pre-troubleshooting issue is the meta-question of whether you
should fix the server at all.
When I started as a Unix system administrator in the 1980s—long
before Linux was a twinkle in Linus Torvalds' eye—if a server went
bad, you had a real problem. There were relatively few debugging
tools, so it could take a long time to get a malfunctioning server back
into production.

Why troubleshooting is different now


It's different today. One sysadmin told me quite seriously that he'd
"blow it away and build another one."
In a world where IT is built around virtual machines (VMs) and
containers, this makes sense. The cloud, after all, depends on being
able to roll out new instances as needed.
Plus, DevOps tools such as Chef and Puppet make it easier to start
over than to fix anything. With higher level DevOps tools such as
Docker Swarm, Mesosphere, and Kubernetes, your servers can go
down and be brought back up before you even know they failed.
This concept has become so widespread that it has a
name: serverless computing, which includes AWS Lambda,  Iron.io,
and Google Cloud Functions. With this technique, the cloud service
handles all the capacity, scaling, patching, and administration of the
server you need to run your program.
While serverless computing makes servers invisible to users and, to
some extent, sysadmins, underneath all those layers of abstraction—
VMs, containers, serverless—you still have physical hardware and the
operating system. And at the end of the day, someone still has to fix
them when things break.
As one system operator told me, "'Just reinstall it' is a terrible practice.
It doesn't tell you anything about why the server broke or how to
prevent it from breaking again. No halfway-decent admin should start
with a reinstall."
I agree. Until you actually work out why a problem happened in the
first place, the issue isn't resolved.
Here's my suggestions on how to start that process.


1. Check the hardware!


First—and I know this is going to sound really stupid, but do it anyway
—check the hardware. In particular, go to the rack in person and make
sure all the cables are plugged in correctly.
I cannot begin to count the number of times a problem could be
tracked back to cables when just a quick glance at the blinkenlights
could have told you the power was off or a network cable had come
unplugged.
Of course, you don't have to look at the hardware. For example, this
shell command tells you if your Ethernet device link is detectable:
$ sudo ethtool eth0
If the answer is yes, you know the port is talking to the network.
Yet it’s a good idea to physically look at the gear to make sure
someone didn't pull the Big Red Switch and turn off the server or
rack's power. Yes, this is simple, but it's amazing how many times you
can thumb-finger a total system outage.
Other common hardware problems can't be spotted by a mark one
eyeball. For example, bad RAM causes all kinds of problems. VMs
and containers can hide these problems, but if you see a pattern of
failures linked to a specific bare-metal server, check its memory.
To see what a server's BIOS/UEFI reports about its hardware,
including memory, use the dmidecode command:
 $ sudo dmidecode --type memory
If this looks right—it may not be, as SMBIOS data isn't always
accurate—and you still suspect a memory problem, it's time to
deploy Memtest86. This is the essential memory checking program,
but it's slow. If you're running it on a server, don't expect to use that
machine for anything else while the checks are running.
If you run into a lot of memory problems—which I've seen in places
with dirty power—you should load the edac_core module. This Linux
kernel module constantly checks for bad memory. To load it, use the
command:
$ sudo modprobe edac_core

Wait for a while and then check to see if anything shows up when you
type in the command:
$ sudo grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

This presents you with a list of the memory controller's row (DIMM)
and error count. Combined with dmidecode data on memory channel,
slot, and part number, this process helps you find the corrupted
memory stick.
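The per-row counts above can also be folded into a single health number. This is a minimal sketch; it prints 0 when the edac sysfs tree is absent, for example inside a VM or before edac_core is loaded:

```shell
# Sum corrected-error counts across all EDAC memory-controller rows.
total=0
for f in /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count; do
  [ -r "$f" ] || continue                 # glob unexpanded or file unreadable
  total=$(( total + $(cat "$f") ))
done
echo "corrected memory errors so far: $total"
```

A nonzero, growing total is the "check this machine's DIMMs" signal, even before anything actually crashes.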

2. Define the exact problem


OK, so your server has gone haywire, but there's no magic smoke
coming out of it. Before you attempt to deal with the end result, you
need to lock down exactly what the problem is. For example, if your
users are complaining about a problem with a server application, first
make sure it's not actually failing at the client.
For instance, a friend told me his users reported that IBM Tivoli
Storage Manager had failed for them. Eventually, he discovered the
problem wasn't on the server side at all. Instead, it was a bad
Windows client patch, 3076895. But the manner in which the security
patch messed up made it look like a problem on the server side.
You should also determine whether the problem is with the server per
se or the server application. For example, a server program can go
awry while the server keeps humming along.
There are numerous ways to check to see if an application is running.
Two of my favorites are:
$ sudo ps -ef | grep apache2

$ sudo netstat -plunt | grep apache2

If it turns out that, say, the Apache web server isn't running, you can
start it with this:
$ sudo service apache2 start

In short, before jumping in to work out what's wrong, make sure you
work out which element is at fault. Only once you're sure you know
what a problem is do you know the right questions to ask or the next
level of troubleshooting to investigate.
I mean, sure, you know your car doesn't run, but first you need to
make sure there's gas in the tank before hauling the car off to the
shop for repairs.
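The process checks above can be combined into one small script. A sketch, with apache2 as a stand-in for whichever daemon your users are reporting; swap in the real service name:

```shell
# Is the suspect daemon actually running? (apache2 is a placeholder name.)
svc=apache2
if pgrep -x "$svc" >/dev/null 2>&1; then
  echo "$svc is running"
else
  echo "$svc is NOT running"   # candidate fix: sudo service "$svc" start
fi
```

pgrep -x matches the exact process name, so it avoids the classic grep trap of matching your own grep command in the ps output.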

3. Top
Another useful debugging step is to run top to check the load average, swap usage, and which processes are consuming resources. Top shows all of a Linux server's currently running processes.
Specifically, top displays:
Line 1:
 The time
 How long the computer has been running
 Number of users
 Load average (the system load time for the last minute, last 5
minutes, and last 15 minutes)
Line 2:
 Total number of tasks
 Number of running tasks
 Number of sleeping tasks
 Number of stopped tasks
 Number of zombie tasks
Line 3:
 CPU usage as a percentage by the user
 CPU usage as a percentage by system
 CPU usage as a percentage by low-priority processes
 CPU usage as a percentage by idle processes
 CPU usage as a percentage by I/O wait
 CPU usage as a percentage by hardware interrupts
 CPU usage as a percentage by software interrupts
 CPU usage as a percentage by steal time
 Total system memory
 Free memory
 Memory used
 Buffer cache
Line 4:
 Total swap available
 Total swap free
 Total swap used
 Available memory
This is followed by a line for each running application. It includes:
 Process ID
 User
 Priority
 Nice level
 Virtual memory used by process
 Resident memory used by process
 Shareable memory
 CPU used by process as a percentage
 Memory used by process as a percentage
 Time process has been running
 Command
That’s a wealth of useful troubleshooting information. Here are some
useful ways to get at it.
To find the process consuming the most memory, sort the process list by pressing the M key. To see which applications are using the most CPU, press P; to sort by running time, press T. To more easily see which column you're using for sorting, press x to highlight it (b toggles how the highlight is drawn).
You can also interactively filter top's results by pressing o or O, which
displays the following prompt:
add filter #1 (ignoring case) as: [!]FLD?VAL

You can then enter a search for a particular process, for example, COMMAND=apache, whereupon top displays only Apache processes.
Another useful top command is to display each process’s full
command path and arguments. To do this, press c.
A related top command is Forest mode, which you activate with V.
This displays the processes in a parent-child hierarchy.
You can also view a specific user's processes with u or U, or get rid of
the idle processes' display with i.
While top has long been the most popular Linux interactive activity
viewer, htop adds even more features and has an easier
graphical Ncurses interface. For example, with htop you can use the
mouse and scroll the process list vertically and horizontally to see all
processes and complete command lines.
I don’t expect top to tell me what the problem is; rather, I use it to find
behavior that makes me say, “That’s funny,” and inspires further
investigation. Based on what top tells me, I know which logs to look at
first. The logs themselves I inspect using combinations of less, grep,
and tail -f.
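That first pass over a log can look like the following sketch. /var/log/syslog is an assumption; Red Hat-family systems use /var/log/messages instead:

```shell
# Skim the tail of the main system log for suspicious lines.
log=/var/log/syslog   # assumption; use /var/log/messages on RHEL-type systems
tail -n 50 "$log" 2>/dev/null | grep -iE 'error|fail|warn' \
  || echo "no recent matches in $log"
```

The case-insensitive pattern deliberately casts a wide net; once something catches your eye, narrow the grep to that service's name.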

4. What's up with disk space?


Even today, when you can carry a terabyte in your pocket, a server
can run out of disk space without anyone noticing. When that
happens, really wonky problems can show up.
To track these down, the good old df command, short for "disk free," is your friend. You use df to view a full summary of available and used disk space.
It's typically used in two ways:
 $ sudo df -h presents data about your hard drives in a human-
readable format. For example, it displays storage as gigabytes (G)
rather than an exact number of bytes.
 $ sudo df -i displays the number of used inodes and their
percentage for the file system.
Another useful flag is T. This displays your storage's file system types.
So, $ sudo df -hT shows both the amount of used space in your
storage and its file system type.
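A df-based check is also easy to script. Here is a minimal sketch that flags any filesystem at or above a usage threshold; the 90% cutoff is an arbitrary choice, not a standard:

```shell
# Warn about any mounted filesystem at or above the threshold.
threshold=90
df -P | awk -v t="$threshold" '
  NR > 1 {
    sub(/%/, "", $5)                    # $5 is the Use% column
    if ($5 + 0 >= t) print $6 " is " $5 "% full"
  }'
echo "disk space check finished"
```

The -P flag forces POSIX one-line-per-filesystem output, which keeps the awk column positions stable even for long device names.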
If something seems off, you can look deeper with the iostat command. It is part of the sysstat collection of advanced system performance monitoring tools. It reports CPU statistics and I/O statistics for block storage devices, partitions, and network file systems.
Perhaps the most useful version of this command is:
$ iostat -xz 1

This displays the delivered reads, writes, read KB, and write KB per
second to the device. It also shows you the average time for the I/O in
milliseconds (await). The bigger the await number, the more likely it is
that the drive is saturated with data requests, or it has a hardware
problem. Which is it? You might use top to see if MySQL (or whatever
DBMS you're using) is keeping your server busy. If there's no
application burning the midnight oil, then chances are your drive is
turning sour.
Another important result is found under %util, which measures device
utilization. This shows how hard the device is doing work. Values
greater than 60% indicate poor storage performance. If the value is
close to 100%, the drive is nearing saturation.
Be careful of what you're looking at. A logical disk device fronting
multiple back-end disks with 100% utilization may just mean that some
I/O is always being processed. What matters is what's happening on
those back-end disks. So, when you're looking at a logical drive, keep
in mind that the disk utilities aren't going to give you useful
information.

5. Check the logs


Last, but never least, check the server logs. These are usually
in /var/log in a subdirectory specific to the service.
For Linux newcomers, log files can be scary. They record in text files
everything Linux or Linux-based applications do. There are two kinds
of log records. One records what happens on a system or in a
program, such as every transaction or data movement. The other
records system or application error messages. Log files may contain
both. They can be enormous files.
Log file data tends to be cryptic, but you still need to learn your way
around them. Digital Ocean's "How to View and Configure Linux Logs
on Ubuntu and Centos" is an excellent introduction.
There are many tools to help you check logs.
One useful troubleshooting tool is dmesg. This displays all the kernel
messages. That's usually way too many, so pipe the output through tail to display the last 10 messages:
$ dmesg | tail

Want to see what's happening as it happens? I know I do when I'm troubleshooting. Then run tail with the -f option:
$ tail -f /var/log/syslog

With the above command, tail keeps an eye on the syslog file and prints each new event as it is recorded.
Another handy simple shell script is:
$ sudo find /var/log -type f -mtime -1 -exec tail -Fn0 {} +

This sweeps through the logs and shows possible problems.
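A gentler variant of the same idea counts recent error lines per file instead of streaming everything; unreadable files are simply skipped:

```shell
# Count "error" lines in each log file modified within the last day.
find /var/log -type f -mtime -1 2>/dev/null | while read -r f; do
  [ -r "$f" ] || continue
  n=$(grep -ci 'error' "$f" 2>/dev/null)
  [ "${n:-0}" -gt 0 ] && echo "$f: $n error line(s)"
done
echo "log sweep complete"
```

Sorting the output by count is a quick way to decide which log deserves the first deep read.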


If your server uses systemd for its system and service management, you need to use its built-in log tool, journalctl. Systemd centralizes log management with the journald daemon. Unlike older Linux logs, journald stores data in a binary rather than a text format.
You can set journald to save logs from one reboot to the other with the
command:
$ sudo mkdir -p /var/log/journal

You need to enable persistent record keeping by editing /etc/systemd/journald.conf to include the lines:

[Journal]
Storage=persistent

The most common way to access this log data is with the command:
$ journalctl -b

This shows you all the journal entries since the most recent reboot. If
your system required a reboot, you can track what happened the last
time by using the command:
$ journalctl -b -1

This looks at the log from the server's last session.


For more of an introduction on how to use journalctl, see "How to
Use Journalctl to View and Manipulate Systemd Logs."
Logs can be huge and difficult to work with. So, while you can work
through them with shell scripts using grep, awk, and other filters, you
may also want to use a log-viewing tool.
A favorite of mine is Graylog, an open source log management system. It collects, indexes, and analyzes structured and unstructured data alike. To do this, it uses MongoDB for data storage and Elasticsearch for log file searches. Graylog makes it easy to track
what's what with your servers. It makes working with logs easier than
with Linux's built-in log tools. It also has the advantage of working with
multiple DevOps programs, such as Chef, Puppet, and Ansible.
Maybe your servers will never set all-time longevity records. But
fixing problems and setting servers to be as stable as possible are
always worthwhile goals. With all these methods, you should be well
on your way to finding and fixing your problem.
