Troubleshoot Hardware Issues
When the kernel starts, it loads the necessary hardware drivers and modules and prints messages that include details of any hardware failures. These messages scroll by too quickly to read, so it is hard to spot potential hardware problems as they appear. However, the messages displayed during the kernel boot process are saved in the kernel ring buffer.
Once the system has booted, use the dmesg command to write these messages to a file (for example, dmesg > /tmp/kernel_msg.txt) and page through it:
ubuntu@ubuntu:~$ less /tmp/kernel_msg.txt
The saved messages can be reviewed later or sent to someone to debug the problem.
Another way to read these messages is to check the /var/log/dmesg or /var/log/messages files if they exist.
Linux systems that use systemd store these messages in the systemd journal. Use the journalctl command with the -k option to show only the kernel messages.
Look for messages that report failed hardware features or drivers that failed to load.
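A saved log can be narrowed down to likely failures with a small filter. The following is a minimal sketch rather than an exhaustive check: the function name filter_hw_errors and its keyword list are assumptions chosen for illustration, and real driver messages vary in wording.

```shell
# Print only the kernel messages that look like hardware or driver failures.
# Reads log text on standard input. The keyword list is an illustrative guess,
# not a complete catalogue of kernel error wording.
filter_hw_errors() {
    grep -i -E 'error|fail|timeout|timed out|unable to'
}
```

It can be fed from any of the sources above, for example dmesg | filter_hw_errors, journalctl -k | filter_hw_errors, or filter_hw_errors < /tmp/kernel_msg.txt.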
GRUB Rescue
GRUB is the bootloader that distributions install to load the kernel and start the OS. Most current Linux distributions use GRUB2. Sometimes, when the BIOS starts GRUB2, it stops with an error such as “no operating file system” or “unknown file system”.
The error means that GRUB can’t find the operating system to load because it is looking for the grub.cfg file in the wrong partition. This typically happens when the user installs Windows after Linux, since Windows writes its own bootloader to the Master Boot Record (MBR), or when the BIOS identifies the disks in the wrong order.
The error appears like this:
grub rescue> _
In this section, we will discuss two ways to recover the distribution from Grub Rescue:
METHOD I
Enter the ls command in the grub rescue terminal to list all the drives and available partitions.
(hd0) (hd0,msdos1) (hd0,msdos2)
Select the partition that contains the installed distribution. Generally, this is the first partition; listing a partition that does not hold the distribution prints an error message instead. Run the following command to look for the GRUB configuration file in the grub2 directory:
grub rescue> ls (hd0,msdos1)/grub2
device.map fonts grub.cfg grub.cfg.1590068449.rpmsave grubenv i386-pc locale
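Type set root=(hd0,msdos1) to select that partition as the root device, then use the set prefix command to define the path to the grub2 directory, for example set prefix=(hd0,msdos1)/grub2. Type insmod normal to load the normal module, followed by normal to bring up the GRUB menu and boot the system. After the system boots, open the terminal to update GRUB.
The last step is to reinstall GRUB on the MBR (Master Boot Record), because Windows wrote its own bootloader there. This step requires mounting the root partition, /dev/sda1 in this example, on the /mnt directory first (sudo mount /dev/sda1 /mnt).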
ubuntu@ubuntu:~$ sudo grub-install --root-directory=/mnt/ /dev/sda
The system may still fail to boot after the insmod normal command, which can happen because of a corrupted file system or a missing grub.cfg file. In that case, the user has to boot into the system through a live USB/CD of the distribution. Let’s discuss this second technique for rescuing GRUB2.
METHOD II
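Boot-Repair is a graphical tool that fixes most common GRUB problems. Boot into the desktop from a live USB/CD, make sure the machine is connected to the internet, and press Ctrl+Alt+T to open the terminal. On Ubuntu, the boot-repair package typically comes from the ppa:yannubuntu/boot-repair repository, so add that PPA first if apt cannot find the package. Now install the Boot-Repair tool: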
ubuntu@ubuntu:~$ sudo apt-get update
ubuntu@ubuntu:~$ sudo apt-get install -y boot-repair && boot-repair
Follow the recommended options to repair the system. Restart your system after Boot-Repair applies all changes. The OS should then boot normally.
Network Troubleshooting
For regular users, network connectivity occurs automatically as soon as the user plugs in the Ethernet cable or provides login credentials for a Wi-Fi network. However, network management and troubleshooting are a crucial set of tasks for any system administrator. Hence, Linux offers command-line tools to deal with management and connectivity issues.
In this section, we discuss outgoing and incoming network connection problems and cover Linux tools to provide solutions to them in a convenient way.
Outgoing Connections
Linux offers the ip command as an all-round network utility for configuring the network and resolving connectivity issues. It manages network objects such as IP addresses, routes, and links.
Before beginning, use the ip command (for example, ip addr show) to view the working network interfaces.
If no interface is available, check whether the hardware is disabled. If the interface is up but still cannot reach the remote host, use the route command (or ip route) to check the routing table.
The default line in the routing table represents the default gateway (router) that the machine reaches through a working interface card. Use the ping utility to test connectivity between your device and the router.
An error such as Destination Host Unreachable suggests that the router is either not physically connected or turned off. If the ping succeeds, try to reach an address beyond the router, for instance the public Google DNS server 8.8.8.8.
If that ping also succeeds but hostnames still cannot be reached, the issue is with hostname-to-address resolution. The DNS server used by the system is set either manually or automatically by the DHCP server when the network interface comes up. Check the details (names and IP addresses) of the DNS server in the /etc/resolv.conf file.
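The gateway check can be scripted so the router address does not have to be copied by hand. This is a minimal sketch under stated assumptions: default_gateway is a hypothetical helper name, and it expects routing-table text in the format printed by ip route, where the default line contains the word via followed by the gateway address.

```shell
# Extract the default gateway address from `ip route` output on standard input.
# Prints the address that follows "via" on the first "default" line, if any.
default_gateway() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "via") { print $(i + 1); exit }
    }'
}
```

A typical use would be gw=$(ip route | default_gateway), followed by ping -c 3 "$gw" to test the router and ping -c 3 8.8.8.8 to test an address beyond it.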
nameserver 192.168.11.253
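The nameserver entries can be pulled out of the file with a short function, which is handy when each server has to be tested in turn. This is only a sketch: list_nameservers is a hypothetical name, and it assumes the standard layout in which each server appears on its own nameserver line.

```shell
# List the DNS server addresses configured in a resolv.conf-style file.
# Defaults to /etc/resolv.conf; comment lines and other keywords are skipped
# because their first field is not "nameserver".
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

Each address it prints can then be checked with ping -c 1 <address> and queried directly with dig @<address> example.com +short, where example.com stands in for any hostname you expect to resolve.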
We can resolve the hostname issues as follows:
It’s possible that the DNS server is down or that the system was assigned the wrong DNS server address. Note the nameserver addresses in the resolv.conf file and check whether each one is reachable with the ping command.
Use the Domain Information Groper (dig) utility to check whether DNS is working, that is, whether the DNS server at 192.168.11.253 resolves a hostname to an IP address.
Correcting the DNS server setting is a bit tricky. If NetworkManager manages connectivity, it overwrites the nameserver entries in the /etc/resolv.conf file. To make the change persistent, add the following lines to the interface's ifcfg file in the /etc/sysconfig/network-scripts directory.
PEERDNS=no
DNS1=<DNS_server_IP_add>
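If the system uses the separate network service instead of NetworkManager, the PEERDNS=no line in the ifcfg file likewise stops DHCP from overwriting the nameserver entries, which can then be edited directly in the resolv.conf file.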
Incoming Connections
For a Linux system configured as an Apache server, clients must be able to reach the web server. If a client can’t reach the server from a web browser, run the ping and dig commands discussed above, or traceroute, from outside the server to track down the issue. Some other ways to troubleshoot incoming connections include:
Use nmap with the server's hostname or IP address to check whether the service is available on open ports.
If the STATE of ports 80/443 is open, network connectivity is fine. If the state is filtered, a firewall is not accepting packets for those ports. If the state is closed rather than filtered, the service isn’t configured correctly or isn’t listening on ports 80/443.
If the system uses ufw with its default firewall policy, every incoming connection is blocked. Allow clients to access TCP ports 80 and 443:
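The three port states can be summarized with a small helper that reads nmap output. This is an illustrative sketch: explain_web_ports is a hypothetical name, the wording of the explanations is my own, and it assumes nmap's usual table layout with the port in the first column and the state in the second.

```shell
# Read nmap output on standard input and explain the state of ports 80 and 443.
# Assumes lines of the form "<port>/tcp <state> <service>".
explain_web_ports() {
    awk '$1 == "80/tcp" || $1 == "443/tcp" {
        if ($2 == "open")          msg = "service reachable"
        else if ($2 == "filtered") msg = "probably blocked by a firewall"
        else if ($2 == "closed")   msg = "no service listening"
        else                       msg = "unrecognized state"
        print $1 ": " $2 " (" msg ")"
    }'
}
```

For example, nmap -p 80,443 <server_address> | explain_web_ports prints one line per port.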
ubuntu@ubuntu:~$ sudo ufw allow 80
ubuntu@ubuntu:~$ sudo ufw allow 443
If the firewall still blocks incoming connections, use the sudo ufw status command to look for denied hosts and allow them with a rule of the form sudo ufw allow from <client_IP> to any port 80 proto tcp.
If access to ports 80/443 is enabled and clients on all networks can reach the server, check the status of the web server itself, for example with systemctl status apache2 (the service is named httpd on Red Hat-based systems).
Lastly, check whether the server is listening on the right interfaces and ports. Services like httpd listen for requests on specific interfaces, so edit the main configuration file to make the service listen on port 80 for a specific address or for all addresses.
Listen 80
Listen 192.168.11.10:80
Troubleshoot System Load
Linux comes with many utilities that monitor system activity and help track down problems with no apparent cause, such as a system that works fine at first but then begins to slow down and crash applications. These utilities help find the processes that consume memory and drain the machine's disk space, processor time, and network bandwidth.
Causes of system instability include limited capacity (low memory, disk space, network capacity, or processing power) and misconfigured applications. The utilities offer ways to manage and fix such issues. Let’s troubleshoot limited memory and excessive CPU consumption.
Memory Usage
Run the top command and press Shift+M to sort processes by memory usage. The output shows general information followed by RAM, swap space, and CPU consumption. If it appears that the system is running out of memory (OOM), look for these things:
- Notice the free space in the Mem line: it will be at or near zero.
- Check the used swap space: it will be non-zero and growing.
- Since the top command refreshes its display every few seconds, look for a process with a memory leak, that is, one whose RES memory continues to grow.
- The kernel starts to kill processes when swap space runs out.
Possible ways to address such issues include the following:
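The first two checks in the list above can also be made without top, since the kernel exposes the same kind of figures in /proc/meminfo. The following is a minimal sketch: mem_summary is a hypothetical helper, and it reports only the MemFree, SwapTotal, and SwapFree fields.

```shell
# Summarize free memory and used swap (in kB) from a /proc/meminfo-style file.
# Defaults to /proc/meminfo; used swap is computed as SwapTotal - SwapFree.
mem_summary() {
    awk '
        $1 == "MemFree:"   { free = $2 }
        $1 == "SwapTotal:" { swap_total = $2 }
        $1 == "SwapFree:"  { swap_free = $2 }
        END { printf "free_kb=%d swap_used_kb=%d\n", free, swap_total - swap_free }
    ' "${1:-/proc/meminfo}"
}
```

Running mem_summary a few times in a row shows the trend: a free_kb value near zero together with a growing swap_used_kb matches the symptoms described in the list.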
Killing the Process
The kill command sends a signal to a process, by default to end it. The signals most commonly used to troubleshoot out-of-memory problems are SIGTERM and SIGKILL. However, different processes respond differently to signals.
For instance, note the PID and use the kill command to send the SIGTERM signal (kill -15 <PID>).
The SIGTERM (-15) signal asks the process to terminate, but a process can handle or ignore it and keep running. In that case, send the SIGKILL (-9) signal, which cannot be caught and ends the process immediately.
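The escalation from SIGTERM to SIGKILL can be wrapped in a small function. This is a sketch, not a standard utility: term_then_kill and its grace-period argument are assumptions introduced here for illustration.

```shell
# Ask a process to exit with SIGTERM, then force it with SIGKILL if it is
# still alive after a grace period (second argument, default 5 seconds).
term_then_kill() {
    pid=$1
    grace=${2:-5}
    kill -TERM "$pid" 2>/dev/null || return 0   # already gone, or not ours to signal
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # exited after SIGTERM
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$pid" 2>/dev/null               # still running: force it
    return 0
}
```

For example, term_then_kill <PID> 10 gives the process ten seconds to shut down cleanly before it is killed. Trying SIGTERM first gives the process a chance to flush files and release resources.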
Drop Page Caches
To free memory temporarily, drop inactive cache pages. When cached pages are dropped, the kernel writes some memory pages to disk, since the system may need to retrieve them later, and discards the rest.
Leave the top command running in one terminal and, as root, write 3 to the /proc/sys/vm/drop_caches file in another terminal (echo 3 > /proc/sys/vm/drop_caches) to watch the Mem line change.
Use Alt+SysRq Keystroke
Memory exhaustion can sometimes make the GUI or shell completely unresponsive. In that case, use an Alt+SysRq keystroke, which the kernel handles directly, ahead of any other process.
Run the cat /proc/sys/kernel/sysrq command to check whether the keystroke is enabled:
076
A value of 0 shows that the keystroke isn’t enabled, 1 enables every SysRq function, and other values are bitmasks that enable only some of them. To enable the keystroke, set kernel.sysrq=1 in the /etc/sysctl.conf file, or set it for the running system with the sysctl -w kernel.sysrq=1 command.
On most keyboards, SysRq shares the PrtSc key.
Press Alt+SysRq+f from the text-based interface to kill the process with the highest OOM score. Repeat the keystroke until the system returns to a usable state.
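Whether Alt+SysRq+f is honoured depends on the kernel.sysrq value. As far as the kernel documentation describes it, a value of 1 allows every function, and otherwise bit 64 controls the signalling functions, which include the OOM kill. The helper below is a sketch built on that assumption; sysrq_allows is a hypothetical name.

```shell
# Report (via exit status) whether a kernel.sysrq value permits a function bit.
# 0 disables SysRq, 1 enables everything, other values are treated as bitmasks.
sysrq_allows() {
    value=$1
    bit=$2
    [ "$value" -eq 1 ] && return 0
    [ $((value & bit)) -ne 0 ]
}
```

For example, sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 64 && echo "OOM kill keystroke permitted" checks the running system.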
CPU Load
The techniques discussed above can also find and stop a process that consumes excessive CPU resources and starves the rest of the system. However, Linux offers another method that limits how much CPU time a process receives.
Renice the process
Use the top command to list process details and note the ID (PID) of the process demanding the most CPU. Use the renice command to set its nice value, which ranges from -20 to 19: the higher the value, the lower the process's priority for CPU time.
Alternatively, check the NI (nice) value of the PID. If the NI value is low, reduce that process's access to the CPU by raising its nice value with the renice command, for example renice -n 10 -p <PID>.
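The renice step can be wrapped in a one-line helper. This is a sketch with assumed names: lower_priority is hypothetical, and the default of 10 is an arbitrary middle value, not a recommendation from the source.

```shell
# Lower the CPU priority of a running process by giving it a higher nice value.
# First argument is the PID, second the nice value (default 10). Ordinary users
# can typically only raise the value; lowering it usually requires root.
lower_priority() {
    renice -n "${2:-10}" -p "$1"
}
```

After running lower_priority <PID> 15, the NI column in top should show the new value for that process.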
Conclusion
This article covers essential Linux utilities that help beginners troubleshoot issues related to system load, hardware, GRUB, and networking.