Health monitoring of hard drives in Juniper switches and routers
There are two ways to monitor the health of the HDD in Junos:
- Using the Self-Monitoring, Analysis and Reporting Technology (SMART) system
- Using iostat
With SMART, an HDD incorporates a suite of advanced diagnostics that monitor the internal operations of the drive and provide an early warning for many types of potential problems.
The SMART system monitors the drive for anything that might seem out of the ordinary, documents it and analyzes the data.
SMART uses attributes (parameters of various parts of the HDD) to monitor the disk condition and to analyze its reliability.
The set of attributes might differ across vendors. Each attribute takes a value between 1 and 253, with 1 being the worst. The value of each attribute shows the state of the part to which the attribute is assigned.
To extend the life of the HDD, it is recommended to have 30 minutes of continuous inactivity every day. To measure how long the HDD has been inactive, use the “iostat” command from the shell.
In Junos, smartd is the daemon that interfaces with the hard disk’s Self-Monitoring, Analysis and Reporting Technology (SMART) system.
smartd runs in the background and continuously monitors the hard drive and reports any errors. Periodic smartd messages (such as those about entering standby) are logged (with no associated facility) to the smartd.trace file. These messages will not be exported to the syslog server. However, errors will be logged to syslog via the daemon facility with a priority of 5 (error).
SMART tests can be executed from the CLI or from the shell.
Because the SMART CLI command is hidden, it is not officially supported, and the operator is strongly advised to run it only in a maintenance window.
Let’s first check the HDD that is present on the device, in this case, an EX9200:
{master}[edit]
root@EX9208-RE0# run show chassis hardware detail
Hardware inventory:
Item             Version  Part number  Serial number     Description
Chassis                                JN120A6F9RFB      EX9208
Midplane         REV 01   750-049760   ACAX1419          EX9208-BP
FPM Board        REV 01   750-049617   CAAS4218          Front Panel Display
PEM 0            Rev 10   740-029970   QCS1247U0U3       PS 1.4-2.52kW; 90-264V AC in
PEM 1            Rev 10   740-029970   QCS1247U0TL       PS 1.4-2.52kW; 90-264V AC in
PEM 2            Rev 10   740-029970   QCS1247U0S5       PS 1.4-2.52kW; 90-264V AC in
PEM 3            Rev 10   740-029970   QCS1247U0E6       PS 1.4-2.52kW; 90-264V AC in
Routing Engine 0 REV 02   740-049603   9009148893        RE-S-EX9200-1800X4
  ad0    3998 MB  Virtium - TuffDrive VCF P1T0200291740423 99 Compact Flash
  ad1    30533 MB UGB94BPH32H0S1-KCI   11000087733       Disk 1
  usb0 (addr 1)  product 0x0000 0      vendor 0x0000     uhub0
  usb0 (addr 2)  product 0x0020 32     vendor 0x8087     uhub1
  DIMM 0         VL31B5263F-F8SD DIE REV-0 PCB REV-0 MFR ID-ce80
  DIMM 1         VL31B5263F-F8SD DIE REV-0 PCB REV-0 MFR ID-ce80
  DIMM 2         VL31B5263F-F8SD DIE REV-0 PCB REV-0 MFR ID-ce80
  DIMM 3         VL31B5263F-F8SD DIE REV-0 PCB REV-0 MFR ID-ce80
Routing Engine 1 REV 03   740-049603   9009168430        RE-S-EX9200-1800X4
  ad0    3998 MB  Virtium - TuffDrive VCF P1T0100307670827 95 Compact Flash
  ad1    28843 MB UGB94BPH32H0S2-KCI   11000143428       Disk 1
  usb0 (addr 1)  product 0x0000 0      vendor 0x0000     uhub0
  usb0 (addr 2)  product 0x0020 32     vendor 0x8087     uhub1
  DIMM 0         SGU04G72H1BD2SA-BB DIE REV-52 PCB REV-54 MFR ID-ce80
  DIMM 1         SGU04G72H1BD2SA-BB DIE REV-52 PCB REV-54 MFR ID-ce80
  DIMM 2         SGU04G72H1BD2SA-BB DIE REV-52 PCB REV-54 MFR ID-ce80
  DIMM 3         SGU04G72H1BD2SA-BB DIE REV-52 PCB REV-54 MFR ID-ce80
CB 0             REV 05   750-049608   CABS0561          EX9200-SCBE

{master}[edit]
root@EX9208-RE0#
You can check the status of the HDD or perform the SMART tests using the hidden command “hard-disk-test”:
{master}[edit]
root@EX9208-RE0# run request chassis routing-engine hard-disk-test ?
Possible completions:
  disk                 Name of hard disk
  long                 Run SMART extended self test
  short                Run short test
  show-status          Display status of test

{master}[edit]
root@EX9208-RE0#
This is how you can check the status of the HDD:
{master}[edit]
root@EX9208-RE0# run request chassis routing-engine hard-disk-test disk /dev/ad1 show-status
Device: UGB94BPH32H0S1-KCI Supports ATA Version 8, Firmware version 2030
ATA/ATAPI revision    8
device model          UGB94BPH32H0S1-KCI
serial number         11000087733
firmware revision     2030
cylinders             16383
heads                 16
sectors/track         63
lba supported         11952 sectors
lba48 supported       -65609920401744 sectors
dma supported
overlap not supported

Feature                        Support  Enable  Value   Vendor
write cache                    yes      yes
read ahead                     yes      yes
dma queued                     no       no      31/1F
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      0/00
automatic acoustic management  no       no      0/00    0/00

Drive supports SMART and is enabled
Check SMART Passed
General Smart Values:
Off-line data collection status: (0x00) Offline data collection activity was never started
Self-test execution status: ( 249) Self-test routine in progress
                                   90% of test remaining
Total time to complete off-line data collection: ( 0) Seconds
Offline data collection Capabilities: (0x1d) SMART EXECUTE OFF-LINE IMMEDIATE
                                             NO Automatic timer ON/OFF support
                                             Abort Offline Collection upon new command
                                             Offline surface scan supported
                                             Self-test supported
Smart Capablilities: (0x0003) Saves SMART data before entering power-saving mode
                              Supports SMART auto save timer
Error logging capability: (0x00) Error logging NOT supported
Short self-test routine recommended polling time: ( 0) Minutes
Extended self-test routine recommended polling time: ( 0) Minutes

Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
Attribute                     Flag    Value  Worst  Threshold  Raw Value
(  1)Raw Read Error Rate      0x0000  000    000    000        000000000007
(  9)Power On Hours Count     0x0000  000    000    000        000000003f29
( 12)Power Cycle Count        0x0000  000    000    000        0000000000a5
(184)Initial Bad Block Count  0x0000  000    000    000        000000000001
(195)Program Failure Block Ct 0x0000  000    000    000        000000000000
(196)Erase Failure Block Ct   0x0000  000    000    000        000000000000
(197)Read Failure Block Ct    0x0000  000    000    000        000000000000
(198)Total Read Sectors Ct    0x0000  000    000    000        00001d1c8293
(199)Total Write Sectors Ct   0x0000  000    000    000        0000674c2e83
(200)Total Read Commands Ct   0x0000  000    000    000        00000030905a
(201)Total Write Commands Ct  0x0000  000    000    000        0000019be43f
(202)Total Err Bits Flash Ct  0x0000  000    000    000        0000000372b5
(203)Rd Sect correc Bit err   0x0000  000    000    000        00000003615f
(204)Bad Block Full Flag      0x0000  000    000    000        000000000000
(205)Max PE Count Spec        0x0000  000    000    000        0000000186a0
(206)Erase Count Min          0x0000  000    000    000        000000000698
(207)Erase Count Max          0x0000  000    000    000        000000001462
(208)Erase Count Average      0x0000  000    000    000        000000000bad
(209)Remaining Life (%)       0x0000  000    000    000        000000000062
(211)Vendor Unique            0x0000  000    000    000        000000000000
(212)Vendor Unique            0x0000  000    000    000        000000000000
(213)Vendor Unique            0x0000  000    000    000        000000000000
Device does not support Error Logging
Device does not support Self Test Logging

{master}[edit]
root@EX9208-RE0#
As you can see in the above output, a SMART test is currently running and there is still 90% of the test remaining.
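The Raw Value column in the attribute table contains digits such as `3f29` and `bad`, so it appears to be hexadecimal. The following Python sketch, which assumes that reading and uses a hypothetical helper name (`parse_attributes`), converts the raw values of a few rows copied from the output above into decimal. The meaning of each raw value is vendor specific, so treat the decoded numbers as indicative only.

```python
import re

# One attribute row looks like:
#   (  9)Power On Hours Count  0x0000 000 000 000 000000003f29
# Groups: id, name, flag, value, worst, threshold, raw value (12 hex digits).
ROW = re.compile(
    r"\(\s*(\d+)\)\s*(.+?)\s+0x([0-9a-fA-F]{4})"
    r"\s+(\d{3})\s+(\d{3})\s+(\d{3})\s+([0-9a-fA-F]{12})"
)


def parse_attributes(text):
    """Return {attribute_id: (name, raw_value_as_int)} for every row found.

    Assumption: the raw value is printed in hexadecimal.
    """
    attrs = {}
    for m in ROW.finditer(text):
        attrs[int(m.group(1))] = (m.group(2), int(m.group(7), 16))
    return attrs


# Rows copied from the show-status output in this article.
SAMPLE = """
(  9)Power On Hours Count     0x0000 000 000 000 000000003f29
( 12)Power Cycle Count        0x0000 000 000 000 0000000000a5
(209)Remaining Life (%)       0x0000 000 000 000 000000000062
"""

if __name__ == "__main__":
    for attr_id, (name, raw) in sorted(parse_attributes(SAMPLE).items()):
        print(f"{attr_id:>3}  {name:<24} {raw}")
```

Under that assumption, the drive above reports 16169 power-on hours, 165 power cycles and 98% remaining life.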
You can initiate a short or an extended SMART test. This is an example of an extended SMART test:
{master}[edit]
root@EX9208-RE0# run request chassis routing-engine hard-disk-test long disk /dev/ad1
Drive Command Successful, Extended Self test has begun
Please wait 0 minutes for test to complete
Use smartd -oA to abort test

{master}[edit]
root@EX9208-RE0#
There are corresponding shell commands that achieve the same thing and additionally allow you to stop any running SMART test.
These are the commands:
smartd -oA /dev/ad1 - Stops any currently running tests
smartd -oS /dev/ad1 - Runs short self test
smartd -oX /dev/ad1 - Runs extended diagnostic test
smartd -oa /dev/ad1 - Shows results of the test
Starting a short or an extended test from the shell, and displaying its results, produces the same output as the corresponding CLI commands.
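If you script these operations, a small lookup table keeps the option letters in one place. This is only a sketch: the action names (`abort`, `short`, `extended`, `results`) and the function `smartd_argv` are made up for illustration, and the flags are exactly the four listed above. It only builds the argument list; actually running it on the device is left to the caller.

```python
# Map a readable action name (hypothetical) to the smartd option listed above.
SMARTD_ACTIONS = {
    "abort": "-oA",     # stop any currently running test
    "short": "-oS",     # run the short self test
    "extended": "-oX",  # run the extended diagnostic test
    "results": "-oa",   # show the results of the test
}


def smartd_argv(action, device="/dev/ad1"):
    """Build the argument list for one of the smartd operations above."""
    if action not in SMARTD_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return ["smartd", SMARTD_ACTIONS[action], device]


if __name__ == "__main__":
    print(" ".join(smartd_argv("extended")))
```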
This is how a test can be stopped:
root@EX9208-RE0% smartd -oA /dev/ad1
Drive Command Successful, self test aborted
root@EX9208-RE0%
After this, the status of the HDD shows that the test was aborted:
root@EX9208-RE0% smartd -og /dev/ad1
General Smart Values:
Off-line data collection status: (0x00) Offline data collection activity was never started
Self-test execution status: ( 25) The self-test routine was aborted by the host
Total time to complete off-line data collection: ( 0) Seconds
Offline data collection Capabilities: (0x1d) SMART EXECUTE OFF-LINE IMMEDIATE
                                             NO Automatic timer ON/OFF support
                                             Abort Offline Collection upon new command
                                             Offline surface scan supported
                                             Self-test supported
Smart Capablilities: (0x0003) Saves SMART data before entering power-saving mode
                              Supports SMART auto save timer
Error logging capability: (0x00) Error logging NOT supported
Short self-test routine recommended polling time: ( 0) Minutes
Extended self-test routine recommended polling time: ( 0) Minutes
root@EX9208-RE0%
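The two "Self-test execution status" numbers seen in this article, 249 and 25, can be read as a single status byte. In the ATA SMART data layout, the upper four bits carry a status code and the lower four bits carry the remaining part of the test in steps of 10%. The Python sketch below decodes the byte on that basis. It names only the three codes relevant here, and the function name `decode_self_test_status` is an illustrative choice.

```python
# Upper-nibble status codes of the SMART self-test execution status byte.
# Only the codes relevant to this article are named; any other code is
# reported by number.
STATUS_TEXT = {
    0x0: "completed without error, or no self-test has been run",
    0x1: "aborted by the host",
    0xF: "self-test in progress",
}


def decode_self_test_status(value):
    """Split the status byte into (description, percent_remaining)."""
    code = (value >> 4) & 0xF            # upper four bits: status code
    percent_remaining = (value & 0xF) * 10  # lower four bits: tenths remaining
    description = STATUS_TEXT.get(code, f"other status code {code}")
    return description, percent_remaining


if __name__ == "__main__":
    for raw in (249, 25):
        text, remaining = decode_self_test_status(raw)
        print(f"{raw:>3} (0x{raw:02x}): {text}, {remaining}% remaining")
```

249 is 0xF9 (in progress, 90% remaining) and 25 is 0x19 (aborted by the host with 90% still remaining), which matches the text printed next to each value.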
Another way to monitor the health of the HDD is to use the “iostat” command.
The iostat command monitors system I/O device loading by observing the time the devices are active in relation to their average transfer rates.
This is how the read and write statistics are generated every 60 seconds:
root@EX9208-RE0% iostat -dsrw 60 ad1
     Time        Time          Write                 Read
     ad1         Diff    KB/t  tps  MB/s     KB/t  tps  MB/s
Nov 13 19:30:44    0s   30.63    0  0.01    34.44    0  0.00
Nov 13 19:31:44   60s   28.87    0  0.01     0.00    0  0.00
Nov 13 19:32:44   60s   32.76    0  0.01     0.00    0  0.00
^C
root@EX9208-RE0%
These are the parameters that “iostat” can take:
-d – Display only device statistics
-w – Number of seconds to pause between each display
-r – Display read and write data
-s – Skip the lines with no activity and show how much time passed without any activity
-c – Number of measurements to take
The above output shows that between 19:30:44 and 19:31:44 there were 60 seconds of inactivity.
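To compare the measured gaps with the recommended 30 minutes of continuous inactivity, the Time Diff column can be extracted from a saved copy of the iostat output. The Python sketch below assumes the output layout shown above and assumes that Time Diff represents the time elapsed without activity, as described for the -s option. The function names are illustrative.

```python
import re

# A data row starts with a timestamp followed by the Time Diff column,
# for example: "Nov 13 19:31:44 60s 28.87 0 0.01 0.00 0 0.00".
LINE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\d+)s\b")

# 30 minutes of continuous inactivity per day, as recommended in this article.
RECOMMENDED_IDLE_SECONDS = 30 * 60


def time_diffs(output):
    """Return [(timestamp, seconds)] for each data row; headers are skipped."""
    rows = []
    for line in output.splitlines():
        m = LINE.match(line.strip())
        if m:
            rows.append((m.group(1), int(m.group(2))))
    return rows


def longest_gap(output):
    """Longest Time Diff in seconds, or 0 if there are no data rows."""
    return max((secs for _, secs in time_diffs(output)), default=0)


def meets_recommendation(output):
    """True if at least one gap reaches the recommended idle time."""
    return longest_gap(output) >= RECOMMENDED_IDLE_SECONDS


# Data copied from the iostat example in this article.
SAMPLE = """\
Time Time Write Read
ad1 Diff KB/t tps MB/s KB/t tps MB/s
Nov 13 19:30:44 0s 30.63 0 0.01 34.44 0 0.00
Nov 13 19:31:44 60s 28.87 0 0.01 0.00 0 0.00
Nov 13 19:32:44 60s 32.76 0 0.01 0.00 0 0.00
"""

if __name__ == "__main__":
    print("longest gap (s):", longest_gap(SAMPLE))
    print("meets 30-minute recommendation:", meets_recommendation(SAMPLE))
```

A three-minute sample like this one cannot show a 30-minute gap, so a meaningful check needs output collected over a much longer period.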
Another tool for monitoring HDD operation is the smartd.trace file, to which smartd writes its log. Here is some sample output from the file:
Oct 24 08:04:25 2016 smartd[2196]: Disk inactivity timer configured to 600 sec.
Oct 24 08:04:25 2016 smartd[2196]: Standby operation set to adaptive periodic
Oct 24 08:04:25 2016 smartd[2196]: Request disabled
Oct 24 08:04:25 2016 smartd[2196]: New config: Periodic power check disabled, Periodic Standby disabled: Interval 0 secs, Summary interval 86400 secs, loglevel 0
Oct 24 08:04:25 2016 smartd[2196]: Starting S.M.A.R.T
Oct 24 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactivity timer set to 600
Oct 25 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Oct 26 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Oct 27 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Oct 28 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Oct 31 08:04:30 2016 smartd[2196]: Disk inactivity timer configured to 600 sec.
Oct 31 08:04:30 2016 smartd[2196]: Standby operation set to adaptive periodic
Oct 31 08:04:30 2016 smartd[2196]: Request disabled
Oct 31 08:04:30 2016 smartd[2196]: New config: Periodic power check disabled, Periodic Standby disabled: Interval 0 secs, Summary interval 86400 secs, loglevel 0
Oct 31 08:04:30 2016 smartd[2196]: Starting S.M.A.R.T
Oct 31 08:04:30 2016 smartd[2196]: Device /dev/ad1: inactivity timer set to 600
Nov 1 08:04:30 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Nov 2 08:04:30 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Nov 3 08:04:30 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Nov 4 08:04:30 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Nov 7 08:04:22 2016 smartd[2196]: Disk inactivity timer configured to 600 sec.
Nov 7 08:04:22 2016 smartd[2196]: Standby operation set to adaptive periodic
Nov 7 08:04:22 2016 smartd[2196]: Request disabled
Nov 7 08:04:22 2016 smartd[2196]: New config: Periodic power check disabled, Periodic Standby disabled: Interval 0 secs, Summary interval 86400 secs, loglevel 0
Nov 7 08:04:22 2016 smartd[2196]: Starting S.M.A.R.T
Nov 7 08:04:22 2016 smartd[2196]: Device /dev/ad1: inactivity timer set to 600
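When the trace file grows, it helps to split each entry into timestamp, process ID and message, then filter for the lines of interest. The Python sketch below assumes one log entry per line in the format shown above. The function names are illustrative, and the tuple inside the "inactive (...)s" messages is returned as text because its field meanings are not documented here.

```python
import re

# One entry: "<Mon> <day> <hh:mm:ss> <year> smartd[<pid>]: <message>"
ENTRY = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+\d{4})\s+smartd\[(\d+)\]:\s+(.*)$"
)


def parse_trace(text):
    """Return [(timestamp, pid, message)] for every smartd entry in the text."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            entries.append((m.group(1), int(m.group(2)), m.group(3)))
    return entries


def smart_start_events(entries):
    """Timestamps at which smartd logged 'Starting S.M.A.R.T'."""
    return [ts for ts, _pid, msg in entries if msg == "Starting S.M.A.R.T"]


def inactivity_reports(entries, device="/dev/ad1"):
    """(timestamp, raw tuple text) for each 'inactive (...)s' message."""
    prefix = f"Device {device}: inactive "
    return [
        (ts, msg[len(prefix):])
        for ts, _pid, msg in entries
        if msg.startswith(prefix)
    ]


# A few entries copied from the trace output in this article.
SAMPLE = """\
Oct 24 08:04:25 2016 smartd[2196]: Starting S.M.A.R.T
Oct 24 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactivity timer set to 600
Oct 25 08:04:25 2016 smartd[2196]: Device /dev/ad1: inactive (0,0,0,60,0)s
Oct 31 08:04:30 2016 smartd[2196]: Starting S.M.A.R.T
"""

if __name__ == "__main__":
    entries = parse_trace(SAMPLE)
    print("start events:", smart_start_events(entries))
    print("inactivity reports:", inactivity_reports(entries))
```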
These are the tools that you can use to monitor the health of a router’s or switch’s HDD.
Paris ARAU