CPU Temperature
Oct 1, 2008 - Is there a command I can issue using PuTTY, logged in as root or admin, to see what temperature my server CPU is running at?
Is there a command to check the CPU temperature? Is the following the right way?
cat /proc/acpi/thermal_zone/THRM/temperature always gives 30 C.
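A reading that never moves usually means the ACPI zone is reporting a board sensor rather than the CPU die. It may be worth cross-checking the other interfaces the kernel exposes; a rough sketch, assuming the paths below exist on your kernel (zone names and locations vary by board and kernel version, and the sensors command needs the lm_sensors package):
====================================
# list every ACPI thermal zone, not just THRM
grep . /proc/acpi/thermal_zone/*/temperature
# on 2.6.26 and later kernels the generic sysfs layer is used instead
grep . /sys/class/thermal/thermal_zone*/temp   # values are in millidegrees Celsius
# lm_sensors reads the hardware monitoring chips directly, per core where supported
sensors
====================================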
I recently got an Intel quad-core with 8 GB RAM. When the load nears 1.00, the kernel flashes the messages below. It is always CPU1 and CPU2, while CPU3 and CPU0 are reported as normal.
====================================================
Sep 22 00:07:47 server2 kernel: CPU2: Temperature above threshold, cpu clock throttled
Sep 22 00:07:47 server2 kernel: CPU3: Temperature/speed normal
Sep 22 00:07:49 server2 kernel: CPU1: Temperature above threshold, cpu clock throttled
Sep 22 00:07:49 server2 kernel: CPU0: Temperature/speed normal
=====================================================
and /proc/acpi/thermal_zone/THRM/* always gives the following
====================================
<setting not supported>
cooling mode: critical
<polling disabled>
state: ok
temperature: 30 C
critical (S5): 110 C
====================================
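The per-core throttle messages come from the processor's own thermal monitor, so a static ACPI zone reading doesn't rule out genuine overheating. A rough sketch of two cross-checks, assuming lm_sensors is installed and a supported sensor driver (such as coretemp) is available, which may not be the case on older kernels:
====================================
# how often the kernel has logged throttling events
grep -c 'Temperature above threshold' /var/log/messages
# per-core die temperatures, if a supported sensor driver is loaded
sensors | grep -i core
====================================
If the counts keep climbing under load, the heatsink seating and case airflow around CPU1/CPU2 are the first things to check.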
I have been loosely monitoring the system temperature on my co-located 1U server and have noticed fluctuations of up to 9 degrees Celsius (around 18 degrees Fahrenheit) depending on the time of day and the current weather in the city where the data center is located.
In the dead of night the system usually reads around 28C, but in mid-afternoon it will get up to 34-38C. That's not terribly hot, but the effect of the constantly changing temperatures on the hard drives has me concerned. Server load doesn't seem to be a big contributor to the increase, since its peak load times are usually from late evening until early morning, so I'm guessing this is the data center heating up and cooling down with the outside weather.
Do any of you see temperature swings like this on your servers, and how much would be normal?
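One way to put numbers on the swing is to log drive temperatures on a schedule and compare day against night. A minimal sketch, assuming smartmontools is installed and the drive is /dev/sda (the attribute name and its column can differ between drive models):
====================================
# append a timestamped SMART temperature reading; run from cron every 15 minutes or so
echo "$(date '+%F %T') $(smartctl -A /dev/sda | awk '/Temperature_Celsius/ {print $10}')" >> /var/log/sda-temp.log
====================================
hddtemp /dev/sda is an alternative where that package is available.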
How do I read the CPU temperature on CentOS 4.6 with kernel 2.6.9 (the stock CentOS kernel from yum)?
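A 2.6.9 kernel predates the generic sysfs thermal layer, so lm_sensors is the usual route on CentOS 4. A rough sketch, assuming the lm_sensors package from the base repository and a sensor chip that sensors-detect can identify:
====================================
yum install lm_sensors        # package in the CentOS base repository
sensors-detect                # answer the prompts; it records which kernel modules to load
service lm_sensors start      # the bundled init script loads the detected modules
sensors                       # prints voltages, fan speeds and temperatures
====================================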
What are the 'standards' for server temperatures?
We are testing some new Dell servers and we're hitting 65-70 degrees Celsius; I was wondering if anyone else sees these temperatures.
We have a very small server/network/telecommunications room with one server rack housing two racked Dell servers, two 3Com routers, one switch, two UPSes and two tower servers.
In addition, our phone system is housed in this room.
The temperature is normally about 77 degrees Fahrenheit. It is a VERY small room and central air does not reach it. There is only a portable A/C (I guess it's fairly powerful) that we leave on day and night at its maximum. Even so, the temperature stays at a constant 77 degrees.
I read in some articles that the temp should be about 58 degrees Fahrenheit. Is that true?
Is our equipment being damaged by the temperature in the room?
Is this behavior normal when running a utility such as bonnie++?
I'm running bonnie++ to check the performance of my drive. When it gets to the 'Writing with putc()...' part, syslog starts popping this message on the screen:
Message from syslogd@machine at Wed Jun 20 18:06:41 2007 ...
machine kernel: CPU0: Temperature/speed normal
I'm using the following OS: CentOS 5
This is the uname information:
Linux machine.domain.com 2.6.18-8.el5 #1 SMP Thu Mar 15 19:46:53 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
This is the output of bonnie++
[root@machine ~]# bonnie++ -x 3 -u 0 -n1
Using uid:0, gid:0.
name,file_size,putc,putc_cpu,put_block,put_block_cpu,rewrite,rewrite_cpu,getc,getc_cpu,get_block,get_block_cpu,seeks,seeks_cpu,num_files,seq_create,seq_create_cpu,seq_stat,seq_stat_cpu,seq_del,seq_del_cpu,ran_create,ran_create_cpu,ran_stat,ran_stat_cpu,ran_del,ran_del_cpu
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
bigblue.diversityjobs.com,8G,63756,90,96753,25,43654,9,66384,94,104946,10,292.7,0,1,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
Writing with putc()...
Message from syslogd@bigblue at Wed Jun 20 18:06:41 2007 ...
bigblue kernel: CPU0: Temperature/speed normal
done
Message from syslogd@machine at Wed Jun 20 18:06:43 2007 ...
bigblue kernel: CPU1: Temperature above threshold, cpu clock throttled
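The throttled/normal pairs suggest the cores briefly cross their thermal threshold during the CPU-heavy putc() phase and recover right away. One way to confirm is to watch the readings while the benchmark runs; a rough sketch, reusing the ACPI path mentioned earlier in the thread (your zone name may differ, and the sensors command from lm_sensors works as well):
====================================
# terminal 1: the benchmark
bonnie++ -x 3 -u 0 -n1
# terminal 2: poll the temperature every 2 seconds while it runs
watch -n 2 cat /proc/acpi/thermal_zone/THRM/temperature
====================================
If the reading climbs during putc() and falls back during the I/O-bound phases, the messages are just the kernel's thermal monitor doing its job; persistent throttling would point at cooling instead.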
Mountain View (CA) - As a company with one of the world's largest IT infrastructures, Google has an opportunity to do more than just search the Internet. From time to time, the company publishes the results of internal research. The most recent project is sure to spark interest: it explores how, and under what circumstances, hard drives fail.
There is a rule of thumb for replacing hard drives which tells customers to move data from one drive to another at least every five years. But the mechanical nature of hard drives makes these mass storage devices prone to error, and some drives fail and die long before that five-year mark is reached. Traditionally, extreme environmental conditions are cited as the main reasons for hard drive failure, with extreme temperatures and excessive activity being the most prominent.
A Google study presented at the Conference on File and Storage Technologies questions these traditional failure explanations and concludes that there are many more factors impacting the life expectancy of a hard drive, and that failure predictions are much more complex than previously thought. What makes the study interesting is that Google's server infrastructure is estimated to exceed 450,000 fairly mainstream systems, a large number of which use consumer-grade drives with capacities ranging from 80 to 400 GB. According to the company, the project covered "more than 100,000" drives that were put into production in or after 2001. The drives ran at platter rotation speeds of 5400 and 7200 rpm and came from "many of the largest disk drive manufacturers and from at least nine different models."
Google said that it collects "vital information" about all of its systems every few minutes and stores the data for further analysis. This information includes, for example, environmental factors (such as temperature), activity levels and SMART (Self-Monitoring, Analysis and Reporting Technology) parameters, which are commonly considered good indicators of the health of disk drives.
In general, Google's hard drive population saw a failure rate that was increasing with the age of the drive. Within the group of hard drives up to one year old, 1.7% of the devices had to be replaced due to failure. The rate jumps to 8% in year 2 and 8.6% in year 3. The failure rate levels out thereafter, but Google believes that the reliability of drives older than 4 years is influenced more by "the particular models in that vintage than by disk drive aging effects."
Breaking out different levels of utilization, the Google study shows an interesting result. Only drives with an age of six months or younger show a decidedly higher probability of failure when put into a high activity environment. Once the drive survives its first months, the probability of failure due to high usage decreases in year 1, 2, 3 and 4 - and increases significantly in year 5. Google's temperature research found an equally surprising result: "Failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend," the authors of the study found.
In contrast, the company discovered that certain SMART parameters apparently do have an effect on drive failures. For example, drives typically scan the disk surface in the background and report errors as they discover them. Significant scan errors can hint at surface defects, and Google reports that fewer than 2% of its drives show scan errors. However, drives with scan errors turned out to be ten times more likely to fail than drives without them. About 70% of Google's drives with scan errors survived the first eight months after the first scan error was reported.
Similarly, reallocation counts, the number that results from remapping faulty sectors to new physical sectors, can have a dramatic impact on a hard drive's life: Google said that drives with one or more reallocations fail more often than those with none. The observed impact on the average failure rate came in at a factor of 3 to 6, while about 85% of the drives survived past eight months after the first reallocation.
Google discovered similar effects for other SMART categories, but the bottom line was that 56% of all failed drives had no count in any of these categories - which means that more than half of all failed drives were put out of operation by factors other than scan errors, reallocation counts, offline reallocations and probational counts.
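The counters the study singles out map onto standard SMART attributes, which can be read on most drives with smartmontools. A minimal sketch, assuming smartmontools is installed and the drive is /dev/sda (attribute names vary slightly between vendors):
====================================
# print the SMART attribute table; reallocations and pending sectors are in the RAW_VALUE column
smartctl -A /dev/sda | egrep 'Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable'
# kick off the drive's own long surface scan, then check the outcome later
smartctl -t long /dev/sda
smartctl -l selftest /dev/sda
====================================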
In the end, Google's research does not solve the problem of predicting when hard drives are likely to fail. However, it shows that temperature and high usage alone are not responsible for failures by default. The researchers also pointed to a trend they call the "infant mortality phase" - a time frame early in a hard drive's life that shows an increased probability of failure under certain circumstances. The report lacks a clear-cut conclusion, but the authors indicate that there is no promising approach at this time that can predict hard drive failures: "Powerful predictive models need to make use of signals beyond those provided by SMART."