Temperature Monitoring of 3Ware Controller with smartmontools, Nagios and NagiosGrapher

30.06.2009
by Mario Rasser

We have a 3Ware 9550SX-8LP installed in a Linux server running Ubuntu Server. This HowTo describes how we monitor the hard disk temperatures with Nagios and graph them with NagiosGrapher. It can easily be adapted to other 3Ware controllers.

Temperature graph from a 3Ware 9550SX-8LP with NagiosGrapher

Installation and configuration of needed packages

  • # aptitude install smartmontools snmpd
  • Copy the following wrapper script for smartmontools to /usr/local/bin/get_smart_value.sh; SNMP will use it later:
    #!/bin/bash
    # Extract the temperature from the SMART values reported by smartctl.
    # Attribute 194 holds the HDD temperature; its raw value is in column 10.
    # Arguments: $1 = 3Ware port number, $2 = controller device (e.g. /dev/twa0)
    smartctl -a -d "3ware,${1}" "${2}" | grep '^194' | awk '{print $10}'

    … and do a chmod +x /usr/local/bin/get_smart_value.sh

  • Configure SNMPd to run extend commands by adding the following lines at the end of /etc/snmp/snmpd.conf:
    ...
    extend 3Ware_1_Port0 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 0 /dev/twa0'
    extend 3Ware_1_Port1 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 1 /dev/twa0'
    extend 3Ware_1_Port2 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 2 /dev/twa0'
    extend 3Ware_1_Port3 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 3 /dev/twa0'
    extend 3Ware_1_Port4 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 4 /dev/twa0'
    extend 3Ware_1_Port5 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 5 /dev/twa0'
    # only 6 HDDs are connected in our system, so the unused ports return 0
    extend 3Ware_1_Port6 '/bin/echo 0'
    extend 3Ware_1_Port7 '/bin/echo 0'

    … and restart SNMPd with /etc/init.d/snmpd restart

  • # visudo

    and add the following line

    snmp    ALL = NOPASSWD: /usr/local/bin/get_smart_value.sh

    … to allow the snmp user to run the script as superuser, which is required to read the S.M.A.R.T. values via smartctl
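
Before wiring the wrapper into SNMP, the column extraction can be checked without a controller. The following is only a minimal sketch, not part of the setup itself: the function name extract_temp is made up for illustration, and the sample line is the attribute 194 line from the smartctl output of one of our systems.

```shell
#!/bin/sh
# Print the raw value (10th column) of SMART attribute 194 from smartctl
# output supplied on standard input. extract_temp is a hypothetical name.
extract_temp() {
    awk '$1 == 194 { print $10; exit }'
}

# Attribute line as printed by "smartctl -a" on one of our disks.
sample='194 Temperature_Celsius     0x0022   114   113   000    Old_age   Always       -       38'

printf '%s\n' "$sample" | extract_temp
```

With real hardware the same function can be fed from smartctl directly, for example smartctl -a -d 3ware,0 /dev/twa0 | extract_temp.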

Nagios Temperature Checking

define service{
        use                             generic-service
        host_name                       your-server-with-3ware.name.tld
        service_description             3Ware Temp. 9550SX-8LP
        flap_detection_enabled          1
        max_check_attempts              4
        check_period                    24x7
        notification_period             24x7
        check_command                   check_snmp!$HOSTADDRESS$!yourcommunitystring!2c!.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.48.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.49.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.50.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.51.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.52.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.53.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.54.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.55.1!NET-SNMP-EXTEND-MIB!45,45,45,45,45,45,2,2!50,50,50,50,50,50,4,4!'Port Temperatures'!'C','C','C','C','C','C','C','C'
}
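
The long OIDs in the check_command are not arbitrary. Each one is the nsExtendOutLine column of NET-SNMP-EXTEND-MIB (.1.3.6.1.4.1.8072.1.3.2.4.1.2), followed by the length of the extend name, the ASCII code of every character of that name, and the output line number (1). A small sketch that builds such an OID from an extend name; extend_oid is a hypothetical helper name, and the code assumes plain ASCII names:

```shell
#!/bin/sh
# Build the numeric nsExtendOutLine OID for a given snmpd "extend" name.
# extend_oid is a hypothetical helper; it assumes an ASCII-only name.
extend_oid() {
    name=$1
    line=${2:-1}    # output line to query, 1 = first line
    # od prints every byte of the name as an unsigned decimal number;
    # tr turns the separating blanks and newlines into single dots.
    bytes=$(printf '%s' "$name" | od -An -tu1 | tr -s ' \n' '..')
    bytes=${bytes#.}    # drop the leading dot left by od's padding
    bytes=${bytes%.}    # drop the trailing dot left by the final newline
    printf '.1.3.6.1.4.1.8072.1.3.2.4.1.2.%s.%s.%s\n' "${#name}" "$bytes" "$line"
}

extend_oid 3Ware_1_Port0
```

The printed value equals the first OID in the service definition above. It can be queried by hand with snmpget -v 2c -c yourcommunitystring your-server-with-3ware.name.tld followed by that OID.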

NagiosGrapher

In our config the Nagios service is called “3Ware Temp. 9550SX-8LP”, so we put a NagiosGrapher config file called check_temp_3ware_smart_9550SX-8LP.ncfg into /etc/nagiosgrapher/ngraph.d. It matches the service name and defines the regular expressions that extract the wanted data:

define ngraph{
        graph_legend    Port 0
        graph_legend_eol        none
        graph_perf_regex Port0\.1=([0-9]+)
        graph_units     degree C
        graph_value     Temp1
        hide            no
        rrd_color       ff0000
        rrd_plottype    LINE1
        service_name    ^3Ware Temp. 9550SX-8LP
}
 
define ngraph{
        graph_legend    Port 1
        graph_legend_eol        LEFT
        graph_perf_regex Port1\.1=([0-9]+)
        graph_units     degree C
        graph_value     temp2
        hide    no
        rrd_color       0000ff
        rrd_plottype    LINE1
        service_name    ^3Ware Temp. 9550SX-8LP
}
 
define ngraph{
        graph_legend    Port 2
        graph_legend_eol        LEFT
        graph_perf_regex Port2\.1=([0-9]+)
        graph_units     degree C
        graph_value     temp3
        hide    no
        rrd_color       00ffff
        rrd_plottype    LINE1
        service_name    ^3Ware Temp. 9550SX-8LP
}
 
define ngraph{
        graph_legend    Port 3
        graph_legend_eol        LEFT
        graph_perf_regex Port3\.1=([0-9]+)
        graph_units     degree C
        graph_value     temp4
        hide    no
        rrd_color       00ff00
        rrd_plottype    LINE1
        service_name    ^3Ware Temp. 9550SX-8LP
}
define ngraph{
        graph_legend    Port 4
        graph_legend_eol        LEFT
        graph_perf_regex Port4\.1=([0-9]+)
        graph_units     degree C
        graph_value     temp5
        hide    no
        rrd_color       45ff00
        rrd_plottype    LINE1
        service_name    ^3Ware Temp. 9550SX-8LP
}
define ngraph{
        graph_legend    Port 5
        graph_legend_eol        LEFT
        graph_perf_regex Port5\.1=([0-9]+)
        graph_units     degree C
        graph_value     temp6
        hide    no
        rrd_color       00ff33
        rrd_plottype    LINE1
        service_name    ^3Ware Temp. 9550SX-8LP
}
define ngraph{
        graph_legend    Port 6
        graph_legend_eol        LEFT
        graph_perf_regex Port6\.1=([0-9]+)
        graph_units     degree C
        graph_value     temp7
        hide    no
        rrd_color       DDFFDD
        rrd_plottype    LINE1
        service_name    ^3Ware Temp. 9550SX-8LP
}
define ngraph{
        graph_legend    Port 7
        graph_legend_eol        LEFT
        graph_perf_regex Port7\.1=([0-9]+)
        graph_units     degree C
        graph_value     temp8
        hide    no
        rrd_color       aaffaa
        rrd_plottype    LINE1
        service_name    ^3Ware Temp. 9550SX-8LP
}
 
define ngraph{
        print_description       Latest:
        print_format    %2.2lf
        print_function  LAST
        print_source    Temp1
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Maximum:
        print_format    %2.2lf
        print_function  MAX
        print_source    Temp1
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Average:
        print_eol       left
        print_format    %2.2lf
        print_function  AVERAGE
        print_source    Temp1
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Latest:
        print_format    %2.2lf
        print_function  LAST
        print_source    temp2
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Maximum:
        print_format    %2.2lf
        print_function  MAX
        print_source    temp2
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Average:
        print_eol       left
        print_format    %2.2lf
        print_function  AVERAGE
        print_source    temp2
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Latest:
        print_format    %2.2lf
        print_function  LAST
        print_source    temp3
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Maximum:
        print_format    %2.2lf
        print_function  MAX
        print_source    temp3
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Average:
        print_eol       left
        print_format    %2.2lf
        print_function  AVERAGE
        print_source    temp3
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Latest:
        print_format    %2.2lf
        print_function  LAST
        print_source    temp4
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Maximum:
        print_format    %2.2lf
        print_function  MAX
        print_source    temp4
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Average:
        print_eol       left
        print_format    %2.2lf
        print_function  AVERAGE
        print_source    temp4
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Latest:
        print_format    %2.2lf
        print_function  LAST
        print_source    temp5
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Maximum:
        print_format    %2.2lf
        print_function  MAX
        print_source    temp5
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Average:
        print_eol       left
        print_format    %2.2lf
        print_function  AVERAGE
        print_source    temp5
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
define ngraph{
        print_description       Latest:
        print_format    %2.2lf
        print_function  LAST
        print_source    temp6
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Maximum:
        print_format    %2.2lf
        print_function  MAX
        print_source    temp6
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Average:
        print_eol       left
        print_format    %2.2lf
        print_function  AVERAGE
        print_source    temp6
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
define ngraph{
        print_description       Latest:
        print_format    %2.2lf
        print_function  LAST
        print_source    temp7
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Maximum:
        print_format    %2.2lf
        print_function  MAX
        print_source    temp7
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Average:
        print_eol       left
        print_format    %2.2lf
        print_function  AVERAGE
        print_source    temp7
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
define ngraph{
        print_description       Latest:
        print_format    %2.2lf
        print_function  LAST
        print_source    temp8
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Maximum:
        print_format    %2.2lf
        print_function  MAX
        print_source    temp8
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
 
define ngraph{
        print_description       Average:
        print_eol       left
        print_format    %2.2lf
        print_function  AVERAGE
        print_source    temp8
        service_name    ^3Ware Temp. 9550SX-8LP
        type    GPRINT
}
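
The 32 definitions above differ only in port number, colour, data source name and print function, so they could also be generated instead of written by hand. A rough sketch under these assumptions: the service name and colours are the ones used above, gen_ngraph is a hypothetical name, and unlike the hand-written file it calls the first data source temp1 rather than Temp1.

```shell
#!/bin/sh
# Generate the ngraph definitions for ports 0-7 on standard output.
# gen_ngraph is a hypothetical helper, not part of NagiosGrapher.
SERVICE='^3Ware Temp. 9550SX-8LP'
COLORS='ff0000 0000ff 00ffff 00ff00 45ff00 00ff33 DDFFDD aaffaa'

gen_ngraph() {
    # First the eight graph definitions, one colour per port.
    port=0
    for color in $COLORS; do
        if [ "$port" -eq 0 ]; then eol=none; else eol=LEFT; fi
        cat <<EOF
define ngraph{
        graph_legend    Port $port
        graph_legend_eol        $eol
        graph_perf_regex Port${port}\\.1=([0-9]+)
        graph_units     degree C
        graph_value     temp$((port + 1))
        hide            no
        rrd_color       $color
        rrd_plottype    LINE1
        service_name    $SERVICE
}

EOF
        port=$((port + 1))
    done

    # Then the Latest/Maximum/Average GPRINT definitions per data source.
    for n in 1 2 3 4 5 6 7 8; do
        for pair in Latest:LAST Maximum:MAX Average:AVERAGE; do
            printf 'define ngraph{\n'
            printf '        print_description       %s:\n' "${pair%%:*}"
            if [ "${pair#*:}" = AVERAGE ]; then
                printf '        print_eol       left\n'
            fi
            printf '        print_format    %%2.2lf\n'
            printf '        print_function  %s\n' "${pair#*:}"
            printf '        print_source    temp%s\n' "$n"
            printf '        service_name    %s\n' "$SERVICE"
            printf '        type    GPRINT\n}\n\n'
        done
    done
}

gen_ngraph
```

Redirecting the output to /etc/nagiosgrapher/ngraph.d/check_temp_3ware_smart_9550SX-8LP.ncfg would produce a file equivalent in structure to the one shown above.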

BONUS: Short smartctl/3Ware excursus

  • smartctl, which comes with the smartmontools package, can read the S.M.A.R.T. values from HDDs attached to a 3Ware controller
  • it needs the 3Ware port to be queried and the 3Ware character device. On the newer 3Ware controller series these devices are /dev/twaX, where X is the controller number, starting at 0. So the first controller is /dev/twa0, the second /dev/twa1, and so on.
  • older 3Ware controllers use /dev/tweX instead; check which one your system has and, if necessary, adjust the SNMP extend section to query the correct controller/device
  • Output on one of our systems:
    # smartctl -a -d 3ware,0 /dev/twa0
    smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
    Home page is http://smartmontools.sourceforge.net/
     
    === START OF INFORMATION SECTION ===
    Model Family:     Western Digital Caviar RE Serial ATA series
    Device Model:     WDC WD4000YS-01MPB1
    Serial Number:    WD-WCANU1569566
    Firmware Version: 09.02E09
    User Capacity:    400.088.457.216 bytes
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   7
    ATA Standard is:  Exact ATA specification draft version not indicated
    Local Time is:    Fri Jul 10 10:53:57 2009 CEST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
     
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
     
    General SMART Values:
    Offline data collection status:  (0x84) Offline data collection activity
                                            was suspended by an interrupting command from host.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                 (11880) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        ( 149) minutes.
    Conveyance self-test routine
    recommended polling time:        (   6) minutes.
     
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0003   218   218   021    Pre-fail  Always       -       6100
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
      9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11445
     10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
     11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
    194 Temperature_Celsius     0x0022   114   113   000    Old_age   Always       -       38
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
     
    SMART Error Log Version: 1
    No Errors Logged
     
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short captive       Interrupted (host reset)      90%      4274         -
    # 2  Short offline       Completed without error       00%      4122         -
    # 3  Short offline       Completed without error       00%      3954         -
    # 4  Short offline       Completed without error       00%      3786         -
    # 5  Short offline       Completed without error       00%      3618         -
    # 6  Short offline       Completed without error       00%      3451         -
    # 7  Short offline       Completed without error       00%      3283         -
    # 8  Short offline       Completed without error       00%      3115         -
    # 9  Short offline       Completed without error       00%      2947         -
    #10  Short offline       Completed without error       00%      2779         -
    #11  Short offline       Completed without error       00%      2611         -
    #12  Short offline       Completed without error       00%      2443         -
    #13  Short offline       Completed without error       00%      2275         -
    #14  Short offline       Completed without error       00%      2107         -
    #15  Short offline       Completed without error       00%      1940         -
    #16  Short offline       Completed without error       00%      1772         -
    #17  Short offline       Completed without error       00%      1604         -
    #18  Short offline       Completed without error       00%      1436         -
    #19  Short offline       Completed without error       00%      1268         -
    #20  Short offline       Completed without error       00%      1101         -
    #21  Short offline       Completed without error       00%       933         -
     
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
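
To find out which ports actually have a disk attached before writing the extend lines, the wrapper script can be called for every port in a loop. A minimal sketch, assuming the wrapper from the installation section is installed at /usr/local/bin/get_smart_value.sh; port_temps and the WRAPPER variable are hypothetical names introduced here for illustration:

```shell
#!/bin/sh
# Path of the wrapper script from the installation section; WRAPPER can be
# overridden, for example to try the loop without a controller.
WRAPPER=${WRAPPER:-/usr/local/bin/get_smart_value.sh}

# Print one "device port N: value" line for every port listed after the
# device. Ports without a readable attribute 194 are reported as "no value".
port_temps() {
    dev=$1
    shift
    for port in "$@"; do
        temp=$("$WRAPPER" "$port" "$dev" 2>/dev/null)
        printf '%s port %s: %s\n' "$dev" "$port" "${temp:-no value}"
    done
}

# Example (as root): port_temps /dev/twa0 0 1 2 3 4 5 6 7
```

On an older controller the device argument would be /dev/twe0 instead.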

© 2003-2014 Fa. ipunct - IT-Lösungen auf den Punkt gebracht