Health monitoring of hard drives in Juniper switches and routers

There are two ways to monitor the health of the HDD in Junos:

  • Using the Self-Monitoring, Analysis and Reporting Technology (SMART) system
  • Using iostat

With SMART, HDDs incorporate a suite of diagnostics that monitor the internal operations of the drive and provide early warning of many types of potential problems.

The SMART system monitors the drive for anything that might seem out of the ordinary, documents it and analyzes the data.

SMART uses attributes (parameters of various parts of the HDD) to monitor the condition of the disk and to analyze its reliability.

The set of attributes may differ across vendors. Each attribute takes a value between 1 and 253, with 1 being the worst. The value of an attribute shows the state of the part to which the attribute is assigned.

To increase the longevity of the HDD, it is recommended to have 30 minutes of continuous inactivity every day. To measure how long the HDD has been inactive, the “iostat” command can be used from the shell.

In Junos, smartd is the daemon that interfaces with the hard disk’s SMART system.

smartd runs in the background and continuously monitors the hard drive and reports any errors. Periodic smartd messages (such as those about entering standby) are logged (with no associated facility) to the smartd.trace file. These messages will not be exported to the syslog server. However, errors will be logged to syslog via the daemon facility with a priority of 5 (error).

SMART tests can be executed from the CLI or from the shell.

Because the SMART CLI command is hidden, it is not officially supported, and operators are strongly advised to run it only in a maintenance window.

Let’s first check the HDD that is present on the device, in this case, an EX9200:

 

{master}[edit]
root@EX9208-RE0# run show chassis hardware detail
Hardware inventory:
Item             Version  Part number  Serial number     Description
Chassis                                JN120A6F9RFB      EX9208
Midplane         REV 01   750-049760   ACAX1419          EX9208-BP
FPM Board        REV 01   750-049617   CAAS4218          Front Panel Display
PEM 0            Rev 10   740-029970   QCS1247U0U3       PS 1.4-2.52kW; 90-264V AC in
PEM 1            Rev 10   740-029970   QCS1247U0TL       PS 1.4-2.52kW; 90-264V AC in
PEM 2            Rev 10   740-029970   QCS1247U0S5       PS 1.4-2.52kW; 90-264V AC in
PEM 3            Rev 10   740-029970   QCS1247U0E6       PS 1.4-2.52kW; 90-264V AC in
Routing Engine 0 REV 02   740-049603   9009148893        RE-S-EX9200-1800X4
  ad0    3998 MB  Virtium - TuffDrive VCF P1T0200291740423 99 Compact Flash
  ad1   30533 MB  UGB94BPH32H0S1-KCI   11000087733       Disk 1
  usb0 (addr 1)  product 0x0000 0      vendor 0x0000     uhub0
  usb0 (addr 2)  product 0x0020 32     vendor 0x8087     uhub1
  DIMM 0         VL31B5263F-F8SD DIE REV-0 PCB REV-0     MFR ID-ce80
  DIMM 1         VL31B5263F-F8SD DIE REV-0 PCB REV-0     MFR ID-ce80
  DIMM 2         VL31B5263F-F8SD DIE REV-0 PCB REV-0     MFR ID-ce80
  DIMM 3         VL31B5263F-F8SD DIE REV-0 PCB REV-0     MFR ID-ce80
Routing Engine 1 REV 03   740-049603   9009168430        RE-S-EX9200-1800X4
  ad0    3998 MB  Virtium - TuffDrive VCF P1T0100307670827 95 Compact Flash
  ad1   28843 MB  UGB94BPH32H0S2-KCI   11000143428       Disk 1
  usb0 (addr 1)  product 0x0000 0      vendor 0x0000     uhub0
  usb0 (addr 2)  product 0x0020 32     vendor 0x8087     uhub1
  DIMM 0         SGU04G72H1BD2SA-BB DIE REV-52 PCB REV-54 MFR ID-ce80
  DIMM 1         SGU04G72H1BD2SA-BB DIE REV-52 PCB REV-54 MFR ID-ce80
  DIMM 2         SGU04G72H1BD2SA-BB DIE REV-52 PCB REV-54 MFR ID-ce80
  DIMM 3         SGU04G72H1BD2SA-BB DIE REV-52 PCB REV-54 MFR ID-ce80
CB 0             REV 05   750-049608   CABS0561          EX9200-SCBE

{master}[edit]
root@EX9208-RE0# 

 

You can check the status of the HDD or perform the SMART tests using the hidden command “hard-disk-test”:

 

{master}[edit]
root@EX9208-RE0# run request chassis routing-engine hard-disk-test ?
Possible completions:
  disk                 Name of hard disk
  long                 Run SMART extended self test
  short                Run short test
  show-status          Display status of test
{master}[edit]
root@EX9208-RE0#

 

This is how you can check the status of the HDD:

 

{master}[edit]
root@EX9208-RE0# run request chassis routing-engine hard-disk-test disk /dev/ad1 show-status
Device: UGB94BPH32H0S1-KCI  Supports ATA Version 8, Firmware version 2030
ATA/ATAPI revision    8
device model          UGB94BPH32H0S1-KCI
serial number         11000087733
firmware revision     2030
cylinders             16383
heads                 16
sectors/track         63
lba supported         11952 sectors
lba48 supported         -65609920401744 sectors
dma supported
overlap not supported

Feature                      Support  EnableValue   Vendor
write cache                    yes       yes
read ahead                     yes       yes
dma queued                     no       no      31/1F
SMART                          yes       yes
microcode download             yes       yes
security                       yes       no
power management               yes       yes
advanced power management      no       no      0/00
automatic acoustic management  no       no      0/00 0/00
Drive supports SMART and is enabled
Check SMART Passed

General Smart Values:
Off-line data collection status: (0x00) Offline data collection activity was
                                        never started

Self-test execution status:      ( 249) Self-test routine in progress
                                        90% of test remaining

Total time to complete off-line
data collection:                 (   0) Seconds

Offline data collection
Capabilities:                    (0x1d) SMART EXECUTE OFF-LINE IMMEDIATE
                                        NO Automatic timer ON/OFF support
                                        Abort Offline Collection upon new
                                        command
                                        Offline surface scan supported
                                        Self-test supported

Smart Capablilities:       (0x0003)     Saves SMART data before entering
                                        power-saving mode
                                        Supports SMART auto save timer

Error logging capability:        (0x00) Error logging NOT supported

Short self-test routine
recommended polling time:        (   0) Minutes

Extended self-test routine
recommended polling time:        (   0) Minutes

Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
     Attribute               Flag     Value Worst Threshold Raw Value
(  1)Raw Read Error Rate      0x0000   000   000   000       000000000007
(  9)Power On Hours Count     0x0000   000   000   000       000000003f29
( 12)Power Cycle Count        0x0000   000   000   000       0000000000a5
(184)Initial Bad Block Count  0x0000   000   000   000       000000000001
(195)Program Failure Block Ct 0x0000   000   000   000       000000000000
(196)Erase Failure Block Ct   0x0000   000   000   000       000000000000
(197)Read Failure Block Ct    0x0000   000   000   000       000000000000
(198)Total Read Sectors Ct    0x0000   000   000   000       00001d1c8293
(199)Total Write Sectors Ct   0x0000   000   000   000       0000674c2e83
(200)Total Read Commands Ct   0x0000   000   000   000       00000030905a
(201)Total Write Commands Ct  0x0000   000   000   000       0000019be43f
(202)Total Err Bits Flash Ct  0x0000   000   000   000       0000000372b5
(203)Rd Sect correc Bit err   0x0000   000   000   000       00000003615f
(204)Bad Block Full Flag      0x0000   000   000   000       000000000000
(205)Max PE Count Spec        0x0000   000   000   000       0000000186a0
(206)Erase Count Min          0x0000   000   000   000       000000000698
(207)Erase Count Max          0x0000   000   000   000       000000001462
(208)Erase Count Average      0x0000   000   000   000       000000000bad
(209)Remaining Life (%)       0x0000   000   000   000       000000000062
(211)Vendor Unique            0x0000   000   000   000       000000000000
(212)Vendor Unique            0x0000   000   000   000       000000000000
(213)Vendor Unique            0x0000   000   000   000       000000000000
Device does not support Error Logging
Device does not support Self Test Logging

{master}[edit]
root@EX9208-RE0#

 

As you can see in the above output, a SMART test is currently running and there is still 90% of the test remaining.
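
The Raw Value column in the attribute table is printed in hexadecimal, so counters such as the power-on hours or the remaining life are easier to read after conversion. The following is a minimal Python sketch that runs off the box against saved output; it is not part of Junos. The function name parse_smart_attributes and the regular expression are assumptions based only on the row layout shown above.

```python
import re

# Matches rows such as:
# (  9)Power On Hours Count     0x0000   000   000   000       000000003f29
ATTRIBUTE_ROW = re.compile(
    r"^\(\s*(?P<id>\d+)\)(?P<name>.+?)\s+0x[0-9a-fA-F]+"
    r"\s+\d+\s+\d+\s+\d+\s+(?P<raw>[0-9a-fA-F]+)\s*$"
)


def parse_smart_attributes(text):
    """Return {attribute id: (name, raw value as int)} from show-status output."""
    attributes = {}
    for line in text.splitlines():
        match = ATTRIBUTE_ROW.match(line.strip())
        if match:
            raw = int(match.group("raw"), 16)  # raw values are hexadecimal
            name = match.group("name").strip()
            attributes[int(match.group("id"))] = (name, raw)
    return attributes


# Three rows copied from the output above.
SAMPLE = """\
(  9)Power On Hours Count     0x0000   000   000   000       000000003f29
( 12)Power Cycle Count        0x0000   000   000   000       0000000000a5
(209)Remaining Life (%)       0x0000   000   000   000       000000000062
"""

if __name__ == "__main__":
    for attr_id, (name, raw) in sorted(parse_smart_attributes(SAMPLE).items()):
        print(f"{attr_id:3d} {name:<24} {raw}")
```

For the three sample rows, the raw values convert to 16169 (Power On Hours Count), 165 (Power Cycle Count) and 98 (Remaining Life).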

You can initiate a short or an extended SMART test. This is an example for an extended SMART test:

 

{master}[edit]
root@EX9208-RE0# run request chassis routing-engine hard-disk-test long disk /dev/ad1
Drive Command Successful, Extended Self test has begun
Please wait 0 minutes for test to complete
Use smartd -oA to abort test

{master}[edit]
root@EX9208-RE0#

 

There are corresponding shell commands that achieve the same thing and additionally allow you to stop any running SMART test.

These are the commands:

 

smartd -oA /dev/ad1 - Stops any currently running tests
smartd -oS /dev/ad1 - Runs short self test
smartd -oX /dev/ad1 - Runs extended diagnostic test
smartd -oa /dev/ad1 - Shows results of the test
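
If these commands are driven from a script, the mapping between an action and its flag can be kept in one place. This is only a sketch: the helper name smartd_command and the action labels are hypothetical, and the flags are exactly the four listed above. It builds the argument list and leaves execution (for example over an SSH session to the shell) to the caller.

```python
# Flags copied from the list above; the action names are illustrative labels.
SMARTD_FLAGS = {
    "abort": "-oA",     # stop any currently running test
    "short": "-oS",     # run the short self test
    "extended": "-oX",  # run the extended diagnostic test
    "results": "-oa",   # show the results of the test
}


def smartd_command(action, device="/dev/ad1"):
    """Build the argument list for one smartd action on the given device."""
    try:
        flag = SMARTD_FLAGS[action]
    except KeyError:
        raise ValueError(f"unknown action {action!r}") from None
    return ["smartd", flag, device]


if __name__ == "__main__":
    print(" ".join(smartd_command("short")))
```

Rejecting unknown actions keeps a typo from turning into an unintended disk operation on a production routing engine.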

 

The output when starting a short or an extended test, and the results, are the same as with the CLI commands.

This is how a test can be stopped:

 

root@EX9208-RE0% smartd -oA /dev/ad1
Drive Command Successful, self test aborted
root@EX9208-RE0%

 

Once you do this, the status of the HDD shows that the test was aborted:

 

root@EX9208-RE0% smartd -og /dev/ad1

General Smart Values:
Off-line data collection status: (0x00) Offline data collection activity was
                                        never started

Self-test execution status:      (  25) The self-test routine was aborted by
                                        the host

Total time to complete off-line
data collection:                 (   0) Seconds

Offline data collection
Capabilities:                    (0x1d) SMART EXECUTE OFF-LINE IMMEDIATE
                                        NO Automatic timer ON/OFF support
                                        Abort Offline Collection upon new
                                        command
                                        Offline surface scan supported
                                        Self-test supported

Smart Capablilities:       (0x0003)     Saves SMART data before entering
                                        power-saving mode
                                        Supports SMART auto save timer

Error logging capability:        (0x00) Error logging NOT supported

Short self-test routine
recommended polling time:        (   0) Minutes

Extended self-test routine
recommended polling time:        (   0) Minutes

root@EX9208-RE0%
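
The number in parentheses on the "Self-test execution status" line appears to follow the ATA SMART convention, in which the upper four bits of the status byte carry the state of the test and the lower four bits carry the remaining work in steps of 10%. That reading matches both outputs in this article (249 while the test was running, 25 after the abort), but treat it as an assumption for any particular drive. The helper name decode_self_test_status is hypothetical.

```python
# Upper-nibble meanings of the ATA SMART self-test execution status byte.
# Codes not listed here are reported generically.
STATUS_TEXT = {
    0: "completed without error or never run",
    1: "aborted by the host",
    2: "interrupted by a host reset",
    3: "fatal or unknown error, test could not complete",
    15: "in progress",
}


def decode_self_test_status(value):
    """Split the status byte into (state text, percent of test remaining)."""
    if not 0 <= value <= 255:
        raise ValueError("status must fit in one byte")
    state = value >> 4               # upper four bits: test state
    remaining = (value & 0x0F) * 10  # lower four bits: tenths remaining
    text = STATUS_TEXT.get(state, f"failure or reserved code {state}")
    return text, remaining


if __name__ == "__main__":
    for status in (249, 25):
        print(status, decode_self_test_status(status))
```

Decoding 249 gives "in progress" with 90% remaining, and 25 gives "aborted by the host" with 90% remaining, which agrees with the text printed by smartd in the two outputs.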

 

Another way to monitor the health of the HDD is to use the “iostat” command.

The iostat command monitors system I/O device loading by observing the time the devices are active in relation to their average transfer rates.

The following command prints the read and write statistics every 60 seconds:

 

root@EX9208-RE0% iostat -dsrw 60 ad1
        Time        Time  Write            Read      ad1
                    Diff  KB/t tps  MB/s   KB/t tps  MB/s
Nov 13 19:30:44       0s 30.63   0  0.01  34.44   0  0.00
Nov 13 19:31:44      60s 28.87   0  0.01   0.00   0  0.00
Nov 13 19:32:44      60s 32.76   0  0.01   0.00   0  0.00
^C
root@EX9208-RE0%

 

These are the parameters that “iostat” can take:

-d – Display only device statistics
-w – Pause the given number of seconds between each display
-r – Display read and write data
-s – Skip the lines with no activity and show how much time passed without any activity
-c – Number of measurements to take

The above output shows that between 19:30:44 and 19:31:44 there were 60 seconds of inactivity.
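
To compare a longer capture against the 30-minute recommendation mentioned earlier, the Time Diff column can be extracted from saved iostat output. The sketch below follows this article's reading that Time Diff reports how long the disk went without activity; that interpretation, the function names and the regular expression are assumptions, not documented Junos behavior.

```python
import re

# Matches the timestamp and the "Time Diff" column,
# e.g. "Nov 13 19:31:44      60s".
ROW = re.compile(r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+(?P<diff>\d+)s\b")

RECOMMENDED_IDLE_SECONDS = 30 * 60  # 30 minutes of continuous inactivity


def longest_idle_gap(text):
    """Return the largest Time Diff value (in seconds) found in the output."""
    gaps = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            gaps.append(int(match.group("diff")))
    return max(gaps, default=0)


def meets_recommendation(text):
    """True if at least one gap reaches the recommended idle time."""
    return longest_idle_gap(text) >= RECOMMENDED_IDLE_SECONDS


# Data rows copied from the iostat output above.
SAMPLE = """\
Nov 13 19:30:44       0s 30.63   0  0.01  34.44   0  0.00
Nov 13 19:31:44      60s 28.87   0  0.01   0.00   0  0.00
Nov 13 19:32:44      60s 32.76   0  0.01   0.00   0  0.00
"""

if __name__ == "__main__":
    print(longest_idle_gap(SAMPLE), meets_recommendation(SAMPLE))
```

The longest gap is used rather than the sum because the recommendation is for continuous inactivity. A three-minute capture like the sample can never reach the target, so a real check would need output collected over a full day.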

Another tool for monitoring HDD operation is the smartd.trace file, where smartd writes its log. This is a sample from the file:

 

Oct 24 08:04:25 2016 smartd[2196]: Disk inactivity timer configured to 600 sec.
Oct 24 08:04:25 2016 smartd[2196]: Standby operation set to adaptive periodic
Oct 24 08:04:25 2016 smartd[2196]: Request disabled
Oct 24 08:04:25 2016 smartd[2196]: New config: Periodic power check disabled, Periodic Standby disabled: Interval 0 secs, Summary interval 86400 secs, loglevel 0
Oct 24 08:04:25 2016 smartd[2196]: Starting S.M.A.R.T
Oct 24 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactivity timer set to 600
Oct 25 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Oct 26 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Oct 27 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Oct 28 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Oct 31 08:04:30 2016 smartd[2196]: Disk inactivity timer configured to 600 sec.
Oct 31 08:04:30 2016 smartd[2196]: Standby operation set to adaptive periodic
Oct 31 08:04:30 2016 smartd[2196]: Request disabled
Oct 31 08:04:30 2016 smartd[2196]: New config: Periodic power check disabled, Periodic Standby disabled: Interval 0 secs, Summary interval 86400 secs, loglevel 0
Oct 31 08:04:30 2016 smartd[2196]: Starting S.M.A.R.T
Oct 31 08:04:30 2016 smartd[2196]: Device /dev/ad1: inactivity timer set to 600
Nov  1 08:04:30 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Nov  2 08:04:30 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Nov  3 08:04:30 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Nov  4 08:04:30 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Nov  7 08:04:22 2016 smartd[2196]: Disk inactivity timer configured to 600 sec.
Nov  7 08:04:22 2016 smartd[2196]: Standby operation set to adaptive periodic
Nov  7 08:04:22 2016 smartd[2196]: Request disabled
Nov  7 08:04:22 2016 smartd[2196]: New config: Periodic power check disabled, Periodic Standby disabled: Interval 0 secs, Summary interval 86400 secs, loglevel 0
Nov  7 08:04:22 2016 smartd[2196]: Starting S.M.A.R.T
Nov  7 08:04:22 2016 smartd[2196]: Device /dev/ad1: inactivity timer set to 600
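
A copy of smartd.trace can also be processed off the box. The sketch below parses the timestamp of each line and lists the moments when smartd logged "Starting S.M.A.R.T". The function names are hypothetical, and the meaning of the tuple in the "inactive (0,0,0,60,0)s" lines is not explained in this article, so the code does not try to interpret it.

```python
from datetime import datetime


def parse_trace_line(line):
    """Return (timestamp, message) for one smartd.trace line, or None."""
    head, sep, rest = line.partition(" smartd[")
    if not sep:
        return None
    # Collapse the padded day ("Nov  1") before parsing the timestamp.
    stamp = datetime.strptime(" ".join(head.split()), "%b %d %H:%M:%S %Y")
    _, _, message = rest.partition("]: ")
    return stamp, message


def smartd_start_times(text):
    """Timestamps at which smartd logged that it was starting."""
    starts = []
    for line in text.splitlines():
        parsed = parse_trace_line(line)
        if parsed and parsed[1].startswith("Starting S.M.A.R.T"):
            starts.append(parsed[0])
    return starts


# A few lines copied from the excerpt above.
SAMPLE = """\
Oct 24 08:04:25 2016 smartd[2196]: Starting S.M.A.R.T
Oct 25 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Oct 31 08:04:30 2016 smartd[2196]: Starting S.M.A.R.T
Nov  1 08:04:30 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Nov  7 08:04:22 2016 smartd[2196]: Starting S.M.A.R.T
"""

if __name__ == "__main__":
    for stamp in smartd_start_times(SAMPLE):
        print(stamp.isoformat())
```

In the excerpt, the start messages fall on Oct 24, Oct 31 and Nov 7, roughly seven days apart.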

 

These are the tools that you can use to monitor the health of the HDD in a router or switch.

 


 


Paris ARAU

Paris ARAU is a networking professional with a strong background in routing and switching technologies. He holds the CCIE R&S and dual JNCIE (SP and ENT) certifications. His day-to-day work allows him to dive deeply into networking technologies. As part of his continuous training, he is focusing on software-defined networking and cloud computing.
