Temperature Monitoring of 3Ware Controller with smartmontools, Nagios and NagiosGrapher
by Mario Rasser
We have a 3Ware 9550SX-8LP installed in a Linux server running Ubuntu Server. This how-to describes how we monitor the hard disk temperatures via Nagios and graph them with NagiosGrapher. It can easily be adapted for other 3Ware controllers.

Temperature graph from a 3Ware 9550SX-8LP with NagiosGrapher
Installation and configuration of needed packages
- Install the packages:
# aptitude install smartmontools snmpd
- Copy a wrapper script for smartmontools to /usr/local/bin/get_smart_value.sh; it will be used by SNMP later:
#!/bin/bash
# Extract the temperature from the SMART values reported by smartctl.
# Attribute 194 contains the HDD temperature.
# Usage: get_smart_value.sh <3ware-port> <device>
smartctl -a -d "3ware,${1}" "${2}" | grep ^194 | awk '{print $10}'
… and do a chmod +x /usr/local/bin/get_smart_value.sh
- Configure SNMPd to run extend commands (/etc/snmp/snmpd.conf) by adding the following at the end of the file:
...
extend 3Ware_1_Port0 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 0 /dev/twa0'
extend 3Ware_1_Port1 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 1 /dev/twa0'
extend 3Ware_1_Port2 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 2 /dev/twa0'
extend 3Ware_1_Port3 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 3 /dev/twa0'
extend 3Ware_1_Port4 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 4 /dev/twa0'
extend 3Ware_1_Port5 '/usr/bin/sudo /usr/local/bin/get_smart_value.sh 5 /dev/twa0'
# in our system only 6 HDDs are connected, so we return 0 for the unused ports
extend 3Ware_1_Port6 '/bin/echo 0'
extend 3Ware_1_Port7 '/bin/echo 0'
… and restart SNMPd with /etc/init.d/snmpd restart
- Edit the sudoers file:
# visudo
… and add the following line:
snmp ALL = NOPASSWD: /usr/local/bin/get_smart_value.sh
… to allow the snmp user to run the script as superuser, which is needed to read the S.M.A.R.T. values via smartctl
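Since the extend entries in snmpd.conf differ only in the port number, they could also be generated instead of typed. The following is a minimal sketch, not part of the original setup: the function name gen_extend_lines and its three parameters (device, number of connected disks, total number of ports) are made up for illustration, and it assumes the connected disks occupy the lowest port numbers, as on our system.

```shell
# Print one snmpd.conf "extend" line per controller port.
# $1 = device node, $2 = number of connected disks, $3 = total ports
# (hypothetical helper; assumes disks sit on ports 0 .. $2-1)
gen_extend_lines() {
    device=$1
    connected=$2
    total=$3
    port=0
    while [ "$port" -lt "$total" ]; do
        if [ "$port" -lt "$connected" ]; then
            # Port with a disk: query the temperature through the wrapper script
            printf "extend 3Ware_1_Port%s '/usr/bin/sudo /usr/local/bin/get_smart_value.sh %s %s'\n" \
                "$port" "$port" "$device"
        else
            # Unused port: return a constant 0
            printf "extend 3Ware_1_Port%s '/bin/echo 0'\n" "$port"
        fi
        port=$((port + 1))
    done
}

gen_extend_lines /dev/twa0 6 8
```

The output can be reviewed and then appended to /etc/snmp/snmpd.conf by hand.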
Nagios Temperature Checking
define service{
    use                     generic-service
    host_name               your-server-with-3ware.name.tld
    service_description     3Ware Temp. 9550SX-8LP
    flap_detection_enabled  1
    max_check_attempts      4
    check_period            24x7
    notification_period     24x7
    check_command           check_snmp!$HOSTADDRESS$!yourcommunitystring!2c!.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.48.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.49.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.50.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.51.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.52.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.53.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.54.1,.1.3.6.1.4.1.8072.1.3.2.4.1.2.13.51.87.97.114.101.95.49.95.80.111.114.116.55.1!NET-SNMP-EXTEND-MIB!45,45,45,45,45,45,2,2!50,50,50,50,50,50,4,4!'Port Temperatures'!'C','C','C','C','C','C','C','C'
}
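The long OIDs in the check command are not arbitrary. As far as can be read off the values above, each one consists of the nsExtendOutLine column of NET-SNMP-EXTEND-MIB (.1.3.6.1.4.1.8072.1.3.2.4.1.2), followed by the length of the extend name, the ASCII code of each character of that name, and the output line number 1. A small sketch that builds such an OID for a given extend name; the function name extend_oid is hypothetical:

```shell
# Build the numeric OID of the first output line of an snmpd "extend" entry.
# $1 = extend name as used in snmpd.conf, e.g. 3Ware_1_Port0
extend_oid() {
    name=$1
    # Base OID plus the length of the name
    oid=".1.3.6.1.4.1.8072.1.3.2.4.1.2.${#name}"
    rest=$name
    while [ -n "$rest" ]; do
        char=${rest%"${rest#?}"}   # first character of $rest
        rest=${rest#?}             # remaining characters
        # printf with a leading quote prints the character's numeric code
        oid="$oid.$(printf '%d' "'$char")"
    done
    # Trailing .1 selects the first line of the command output
    printf '%s\n' "$oid.1"
}

extend_oid 3Ware_1_Port0
```

For 3Ware_1_Port0 this yields the first OID of the check command; the other ports differ only in the second-to-last number (48 to 55, the ASCII codes of the digits 0 to 7).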
NagiosGrapher
In our config the Nagios service is called “3Ware Temp. 9550SX-8LP”, so we put a NagiosGrapher config file called check_temp_3ware_smart_9550SX-8LP.ncfg into /etc/nagiosgrapher/ngraph.d. It matches the service name and defines the regular expressions needed to extract the wanted data:
define ngraph{
    graph_legend        Port 0
    graph_legend_eol    none
    graph_perf_regex    Port0\.1=([0-9]+)
    graph_units         degree C
    graph_value         Temp1
    hide                no
    rrd_color           ff0000
    rrd_plottype        LINE1
    service_name        ^3Ware Temp. 9550SX-8LP
}

define ngraph{
    graph_legend        Port 1
    graph_legend_eol    LEFT
    graph_perf_regex    Port1\.1=([0-9]+)
    graph_units         degree C
    graph_value         temp2
    hide                no
    rrd_color           0000ff
    rrd_plottype        LINE1
    service_name        ^3Ware Temp. 9550SX-8LP
}

define ngraph{
    graph_legend        Port 2
    graph_legend_eol    LEFT
    graph_perf_regex    Port2\.1=([0-9]+)
    graph_units         degree C
    graph_value         temp3
    hide                no
    rrd_color           00ffff
    rrd_plottype        LINE1
    service_name        ^3Ware Temp. 9550SX-8LP
}

define ngraph{
    graph_legend        Port 3
    graph_legend_eol    LEFT
    graph_perf_regex    Port3\.1=([0-9]+)
    graph_units         degree C
    graph_value         temp4
    hide                no
    rrd_color           00ff00
    rrd_plottype        LINE1
    service_name        ^3Ware Temp. 9550SX-8LP
}

define ngraph{
    graph_legend        Port 4
    graph_legend_eol    LEFT
    graph_perf_regex    Port4\.1=([0-9]+)
    graph_units         degree C
    graph_value         temp5
    hide                no
    rrd_color           45ff00
    rrd_plottype        LINE1
    service_name        ^3Ware Temp. 9550SX-8LP
}

define ngraph{
    graph_legend        Port 5
    graph_legend_eol    LEFT
    graph_perf_regex    Port5\.1=([0-9]+)
    graph_units         degree C
    graph_value         temp6
    hide                no
    rrd_color           00ff33
    rrd_plottype        LINE1
    service_name        ^3Ware Temp. 9550SX-8LP
}

define ngraph{
    graph_legend        Port 6
    graph_legend_eol    LEFT
    graph_perf_regex    Port6\.1=([0-9]+)
    graph_units         degree C
    graph_value         temp7
    hide                no
    rrd_color           DDFFDD
    rrd_plottype        LINE1
    service_name        ^3Ware Temp. 9550SX-8LP
}

define ngraph{
    graph_legend        Port 7
    graph_legend_eol    LEFT
    graph_perf_regex    Port7\.1=([0-9]+)
    graph_units         degree C
    graph_value         temp8
    hide                no
    rrd_color           aaffaa
    rrd_plottype        LINE1
    service_name        ^3Ware Temp. 9550SX-8LP
}

define ngraph{
    print_description   Latest:
    print_format        %2.2lf
    print_function      LAST
    print_source        Temp1
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Maximum:
    print_format        %2.2lf
    print_function      MAX
    print_source        Temp1
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Average:
    print_eol           left
    print_format        %2.2lf
    print_function      AVERAGE
    print_source        Temp1
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Latest:
    print_format        %2.2lf
    print_function      LAST
    print_source        temp2
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Maximum:
    print_format        %2.2lf
    print_function      MAX
    print_source        temp2
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Average:
    print_eol           left
    print_format        %2.2lf
    print_function      AVERAGE
    print_source        temp2
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Latest:
    print_format        %2.2lf
    print_function      LAST
    print_source        temp3
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Maximum:
    print_format        %2.2lf
    print_function      MAX
    print_source        temp3
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Average:
    print_eol           left
    print_format        %2.2lf
    print_function      AVERAGE
    print_source        temp3
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Latest:
    print_format        %2.2lf
    print_function      LAST
    print_source        temp4
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Maximum:
    print_format        %2.2lf
    print_function      MAX
    print_source        temp4
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Average:
    print_eol           left
    print_format        %2.2lf
    print_function      AVERAGE
    print_source        temp4
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Latest:
    print_format        %2.2lf
    print_function      LAST
    print_source        temp5
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Maximum:
    print_format        %2.2lf
    print_function      MAX
    print_source        temp5
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Average:
    print_eol           left
    print_format        %2.2lf
    print_function      AVERAGE
    print_source        temp5
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Latest:
    print_format        %2.2lf
    print_function      LAST
    print_source        temp6
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Maximum:
    print_format        %2.2lf
    print_function      MAX
    print_source        temp6
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Average:
    print_eol           left
    print_format        %2.2lf
    print_function      AVERAGE
    print_source        temp6
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Latest:
    print_format        %2.2lf
    print_function      LAST
    print_source        temp7
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Maximum:
    print_format        %2.2lf
    print_function      MAX
    print_source        temp7
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Average:
    print_eol           left
    print_format        %2.2lf
    print_function      AVERAGE
    print_source        temp7
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Latest:
    print_format        %2.2lf
    print_function      LAST
    print_source        temp8
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Maximum:
    print_format        %2.2lf
    print_function      MAX
    print_source        temp8
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}

define ngraph{
    print_description   Average:
    print_eol           left
    print_format        %2.2lf
    print_function      AVERAGE
    print_source        temp8
    service_name        ^3Ware Temp. 9550SX-8LP
    type                GPRINT
}
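To see what the graph_perf_regex patterns pick out, the same kind of match can be tried on the command line. This is only a sketch: the perfdata string below is a made-up sample (the exact labels produced by check_snmp on a given system are an assumption), and the helper name extract_port_temp is hypothetical. The sed expression is a basic-regex equivalent of PortN\.1=([0-9]+).

```shell
# Hypothetical perfdata sample; real labels may look different
perfdata='3Ware_1_Port0.1=38 3Ware_1_Port1.1=40'

# Print the number captured for the given port, mirroring
# graph_perf_regex PortN\.1=([0-9]+)
# $1 = port number, $2 = perfdata string
extract_port_temp() {
    printf '%s\n' "$2" | sed -n "s/.*Port$1\.1=\([0-9][0-9]*\).*/\1/p"
}

extract_port_temp 0 "$perfdata"
extract_port_temp 1 "$perfdata"
```

If a port produces no output here, the corresponding graph in NagiosGrapher would most likely stay empty as well, so this is a quick way to check a pattern against the perfdata actually shown by Nagios.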
BONUS: Short smartctl/3Ware excursus
- smartctl, which comes with the smartmontools package, is able to read the S.M.A.R.T. values from HDDs on a 3Ware controller
- it needs the 3Ware port that should be queried and the 3Ware device node. On the newer 3Ware controller series these are /dev/twaX, where X is the controller number, starting at 0. So the first controller is /dev/twa0, the second /dev/twa1, and so on.
- older 3Ware controllers use /dev/tweX instead; check which one your system has and, if necessary, modify the SNMP extend section to query the correct controller/device
- Output on one of our systems:
# smartctl -a -d 3ware,0 /dev/twa0
smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar RE Serial ATA series
Device Model:     WDC WD4000YS-01MPB1
Serial Number:    WD-WCANU1569566
Firmware Version: 09.02E09
User Capacity:    400.088.457.216 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Jul 10 10:53:57 2009 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (11880) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 149) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   218   218   021    Pre-fail  Always       -       6100
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11445
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
194 Temperature_Celsius     0x0022   114   113   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short captive       Interrupted (host reset)      90%      4274         -
# 2  Short offline       Completed without error       00%      4122         -
# 3  Short offline       Completed without error       00%      3954         -
# 4  Short offline       Completed without error       00%      3786         -
# 5  Short offline       Completed without error       00%      3618         -
# 6  Short offline       Completed without error       00%      3451         -
# 7  Short offline       Completed without error       00%      3283         -
# 8  Short offline       Completed without error       00%      3115         -
# 9  Short offline       Completed without error       00%      2947         -
#10  Short offline       Completed without error       00%      2779         -
#11  Short offline       Completed without error       00%      2611         -
#12  Short offline       Completed without error       00%      2443         -
#13  Short offline       Completed without error       00%      2275         -
#14  Short offline       Completed without error       00%      2107         -
#15  Short offline       Completed without error       00%      1940         -
#16  Short offline       Completed without error       00%      1772         -
#17  Short offline       Completed without error       00%      1604         -
#18  Short offline       Completed without error       00%      1436         -
#19  Short offline       Completed without error       00%      1268         -
#20  Short offline       Completed without error       00%      1101         -
#21  Short offline       Completed without error       00%       933         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
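The wrapper script from the installation section picks exactly one number out of this output: the tenth whitespace-separated field (RAW_VALUE) of the line starting with 194. A minimal sketch of that extraction, runnable without a controller by feeding it the attribute line from the output above; the function name extract_temp is made up for illustration:

```shell
# Read smartctl output on stdin and print the raw value of attribute 194.
extract_temp() {
    grep '^194' | awk '{print $10}'
}

# Attribute line taken from the smartctl output above
sample='194 Temperature_Celsius     0x0022   114   113   000    Old_age   Always       -       38'

printf '%s\n' "$sample" | extract_temp   # prints 38
```

On a drive that reports its temperature under a different attribute ID, the grep pattern would have to be adjusted accordingly.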
