Telecu Knowledge Base - How to troubleshoot disk I/O performance issues on Linux

Disk I/O (Input/Output) performance issues can significantly affect the performance of a Linux server, especially in resource-heavy applications like databases or file servers. Identifying and troubleshooting disk I/O issues involves using a combination of tools and techniques to pinpoint bottlenecks, misconfigurations, or hardware failures. Below are several steps to help you troubleshoot disk I/O performance problems on your Linux server.

1. Check Disk Usage with `df`

Before diving deeper into I/O-specific troubleshooting, check if the disk is full or nearing full capacity, as this could impact performance.

$ df -h

This command shows disk space usage for all mounted filesystems.

Example output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   30G   18G  63% /
tmpfs            16G  1.6G   15G  10% /dev/shm
/dev/sdb1       100G   25G   70G  27% /data

If usage is high (for example, a "Use%" value above 90%), consider freeing up space or expanding storage.
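The check above can be automated with a small helper. The following is a minimal sketch, not a definitive tool: the function name `flag_full_filesystems` and the 90% default threshold are assumptions made for illustration. It reads `df -P` output on standard input, because the POSIX format keeps each filesystem on a single line.

```shell
# Print mount points whose usage is at or above a threshold (default 90%).
# Hypothetical helper: reads `df -P` output on stdin so the parsing can be
# exercised without touching a real disk.
flag_full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                       # skip the header line
            use = $5
            sub(/%/, "", use)          # "63%" -> "63"
            if (use + 0 >= limit + 0)
                print $6 " " $5        # mount point and usage
        }'
}
```

Typical usage would be `df -P | flag_full_filesystems 80`. Mount points that contain spaces are not handled by this sketch.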

2. Monitor Disk I/O with `iostat`

The iostat command, provided by the sysstat package, reports per-device throughput, request rates, and latency, along with CPU utilization.

$ iostat -x 5

The -x option gives extended statistics, and the 5 specifies a 5-second interval between reports. The first report shows averages since boot; each later report covers the preceding 5 seconds.

Example output:

Linux 5.4.0-74-generic (hostname)    12/26/2024      _x86_64_        (8 CPU)

Device       r/s     w/s   rkB/s   wkB/s  rrqm/s  wrqm/s   %rrqm   %wrqm r_await w_await   svctm   %util
sda         10.1     8.2    1040     750     0.0     0.0     0.1     0.1    10.0    15.0     1.2    25.0
sdb          3.5     2.2     400     250     0.0     0.0     0.1     0.0    30.0    35.0     1.5    10.0

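r/s: Reads per second
w/s: Writes per second
rkB/s: Kilobytes read per second
wkB/s: Kilobytes written per second
r_await, w_await: Average time in milliseconds for read and write requests to complete, including time spent in the queue
%util: Percentage of time the device was busy (a sustained value near 100% suggests the disk is saturated)

Look for %util that stays near 100% or await times well above what the device normally delivers; either indicates a potential bottleneck. In the example above, sdb is only 10% utilized but its requests wait 30 to 35 ms, so latency rather than throughput is the concern there.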
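To spot busy devices without reading every report by eye, the output can be filtered. This is a rough sketch under stated assumptions: the function name `flag_busy_devices` and the default limit of 80 are illustrative choices. The %util column is located by its header name, since its position differs between sysstat versions.

```shell
# Print devices whose %util meets or exceeds a limit (default 80).
# Hypothetical helper: reads `iostat -x` output on stdin.
flag_busy_devices() {
    limit="${1:-80}"
    awk -v limit="$limit" '
        NF == 0 { col = 0; next }      # a blank line ends the device table
        $1 == "Device" || $1 == "Device:" {
            # find the %util column in this header
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }
    '
}
```

For example, `LC_ALL=C iostat -x 5 3 | flag_busy_devices 80` would print each device and its %util whenever the limit is reached. Setting `LC_ALL=C` keeps the decimal separator as a dot so that awk compares the numbers correctly.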

3. Check for Disk Errors Using `dmesg`

Disk errors can significantly degrade performance. Use the dmesg command to check for any system messages related to disk errors or I/O issues.

$ dmesg | grep -iE 'error|fail|timeout|sense'

Disk-related problems such as I/O errors, timeouts, or hardware failures typically appear here. On systems that restrict access to the kernel log, run the command with sudo.

Example output:

[42615.217683] sd 2:0:0:0: [sda] Unhandled sense code
[42615.217707] sd 2:0:0:0: [sda]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[42615.217719] sd 2:0:0:0: [sda]  Sense Key : Hardware Error [current]
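When the kernel log is long, it can help to see which disk the messages refer to. The sketch below is an illustration only: the function name `count_disk_errors` and the keyword list are assumptions, and it recognizes only devices logged in the `[sdX]` form shown above.

```shell
# Count kernel log lines that mention a problem, grouped by sdX device.
# Hypothetical helper: reads log text on stdin (for example from `dmesg`).
count_disk_errors() {
    grep -iE 'error|fail|timeout' |    # keep lines that report a problem
    grep -oE '\[sd[a-z]+\]' |          # extract the device tag, e.g. [sda]
    sort | uniq -c |                   # count lines per device
    awk '{ print $2, $1 }'             # print "device count"
}
```

Usage would look like `dmesg | count_disk_errors`. A device that accumulates errors over time is a good candidate for the health checks later in this article.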

4. Measure Disk Latency with `blktrace`

blktrace is a low-level tool that traces block layer I/O operations. This tool provides insights into how long it takes for the system to read or write data to the disk.

Install blktrace (Debian/Ubuntu shown; use your distribution's package manager on other systems):
```
$ sudo apt-get install blktrace
```
Start tracing the disk (replace /dev/sda with your device):
```
$ sudo blktrace -d /dev/sda -o - | blkparse -i -
```

This prints one line per block-layer event, each with a timestamp, which can be used to work out how long requests take.

Example output (simplified):

    0,0     10.984523  563  I/O  4096  READ
    0,0     10.984731  564  I/O  4096  WRITE
    0,0     10.985035  565  I/O  4096  READ

The second column is the event time in seconds. Compare the time between a request being issued and being completed; large gaps indicate that disk operations are taking longer than usual.
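Latency can be derived from the trace by pairing each request's issue event with its completion event. The following is a sketch under assumptions, not the output of an official tool: the function name `avg_io_latency_ms` is hypothetical, and it assumes the default blkparse line layout, in which field 4 is the timestamp, field 6 is the action (D for issued to the driver, C for completed), and field 8 is the starting sector.

```shell
# Print the average issue-to-completion latency in milliseconds.
# Hypothetical helper: reads default-format blkparse output on stdin.
# Assumed fields: $4 = timestamp (s), $6 = action, $8 = start sector.
avg_io_latency_ms() {
    awk '
        $6 == "D" { issued[$8] = $4 }              # remember when it was issued
        $6 == "C" && ($8 in issued) {
            total += $4 - issued[$8]               # completion minus issue time
            count++
            delete issued[$8]
        }
        END {
            if (count) printf "%.3f\n", total / count * 1000
            else print "no completed requests seen"
        }'
}
```

One way to use it is `sudo blktrace -d /dev/sda -w 10 -o - | blkparse -i - | avg_io_latency_ms`, which traces for 10 seconds and then prints a single average. The blktrace package also ships `btt`, which performs this kind of analysis in more depth.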

5. Analyze Disk Queue Length with `sar`

The sar command, also part of the sysstat package, reports disk activity metrics, including queue length, either live at a given interval or from previously collected data.

$ sar -d 5 5

This will display disk activity every 5 seconds for 5 intervals, including the average queue length.

Example output:

Linux 5.4.0-74-generic (hostname)    12/26/2024      _x86_64_        (8 CPU)

Time           tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
15:30:01      30.0      1024      1024      68.0      0.20      12.0      10.0      25.0
15:30:06      28.0       900      1000      60.0      0.15      11.0       9.0      22.0

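avgrq-sz: Average request size (in sectors)
avgqu-sz: Average queue length of the requests issued to the device
await: Average time per request in milliseconds, including time spent in the queue
%util: Percentage of time the disk is busy

A queue length that stays high, or an await that climbs under load, generally means requests are arriving faster than the disk can serve them.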
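Samples with a deep queue can be picked out of the report automatically. This is a minimal sketch with assumed names: `flag_deep_queues` and its default limit of 1 are illustrative. The queue column is found by header name, which is `avgqu-sz` on older sysstat releases and `aqu-sz` on newer ones.

```shell
# Print samples whose average queue size meets or exceeds a limit (default 1).
# Hypothetical helper: reads `sar -d` output on stdin.
flag_deep_queues() {
    limit="${1:-1}"
    awk -v limit="$limit" '
        {
            # a header line names the queue column; remember its position
            for (i = 1; i <= NF; i++)
                if ($i == "avgqu-sz" || $i == "aqu-sz") { col = i; next }
        }
        # skip the summary rows, whose columns may be shifted
        col && NF >= col && $1 != "Average:" && $col + 0 >= limit + 0 { print $0 }
    '
}
```

For example, `LC_ALL=C sar -d 5 5 | flag_deep_queues 2` would print only the intervals in which the queue averaged two or more requests.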

7. Check Disk Health with `smartctl`

Disk failures or health issues can cause I/O performance degradation. Use smartctl from the smartmontools package to check the health of your disks.

$ sudo smartctl -a /dev/sda

Example output (simplified; real smartctl output is a longer report that includes an attribute table):

SMART Status: OK
Temperature: 38 C (good)
Reallocated_Sector_Ct: 0 (good)
Power_On_Hours: 2400 (good)

A failed overall health status, or a Reallocated_Sector_Ct that is non-zero and rising, usually points to a failing disk.
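The attribute check can be scripted against the full attribute table. The sketch below rests on several assumptions: the function name `flag_smart_attributes` is hypothetical, the watch list (Reallocated_Sector_Ct plus two other commonly watched sector counters) is an illustrative choice, and the parsing expects the ATA attribute table printed by `smartctl -A`, where field 2 is the attribute name and field 10 is the raw value. NVMe drives report health in a different layout and are not covered.

```shell
# Print watched SMART attributes whose raw value is non-zero.
# Hypothetical helper: reads `smartctl -A` output on stdin.
# Assumed ATA table layout: $2 = attribute name, $10 = RAW_VALUE.
flag_smart_attributes() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($10 + 0 > 0) print $2, $10
        }'
}
```

Usage would be `sudo smartctl -A /dev/sda | flag_smart_attributes`. No output means none of the watched counters has moved off zero.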

8. Review Disk Configuration

If you're using software RAID or LVM, ensure that the configuration is optimal. Check for degraded RAID arrays or improperly configured volume groups that might affect performance.

For RAID, use:

$ cat /proc/mdstat

For LVM, use:

$ sudo vgs
$ sudo lvs
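For software RAID, a degraded array shows an underscore in the member status field of /proc/mdstat, for example `[U_]` instead of `[UU]`. The following sketch, with the assumed function name `list_degraded_arrays`, prints the arrays in that state.

```shell
# Print the names of md arrays that have a missing or failed member.
# Hypothetical helper: reads /proc/mdstat-style text on stdin.
list_degraded_arrays() {
    awk '
        /^md/ { name = $1 }                        # start of an array entry
        # a "_" inside the [UU...] field marks an absent member
        name && /\[[U_]*_[U_]*\]/ { print name; name = "" }
    '
}
```

It can be run as `list_degraded_arrays < /proc/mdstat`. An empty result means no array reported a missing member at that moment.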

Conclusion

By using these tools and techniques, you can diagnose and troubleshoot disk I/O performance issues on your Linux server. Start with basic checks like disk space and kernel error messages, then move on to more detailed tools such as iostat, sar, and blktrace, and finish by verifying disk health and RAID or LVM configuration. Identifying the root cause will help you optimize disk performance and prevent future issues.
