Official GIGABYTE Forum
Questions about GIGABYTE products => Motherboards with AMD processors => Topic started by: j5 on April 04, 2011, 06:39:10 pm
-
System:
GA-880GMA-UD2H Rev: 2.0
AMD Phenom II x2 555 BE
G.Skill Ripjaws F3-12800CL9D-4GBRL kit loaded into the "blue" slots.
2x WD7501AALS in RAID 0
1x WD7501AALS standalone
BIOS set to optimum defaults.
Here is what happens: after a few days or weeks of operation, the system freezes with corrupted video. After the initial freeze, it takes a good bit of resetting and restarting before the system will boot and remain stable. This points to something overheating, since a hard power-off for a few hours is normally required to get the unit back up and stable. The unit ran rock solid for 6 months until a RAM stick went bad and I had to RMA it.
Any ideas what this could be or how to test?
Could the memory controller have gotten whacked?
Maybe the onboard heatsink (TURBO 3D) came loose when I was in there pulling sticks?
-
Hi there,
this is one of those issues that might take a little time to resolve, especially as it seems to be intermittent!
My first suggestion would be to check that the Northbridge heatsink hasn't come loose. They do seem to be a bit wobbly (that's normal), but just double-check that it is seated properly. It may be worthwhile removing the NB heatsink and replacing the thermal paste, as this can sometimes help.
Why am I suggesting this? Well, as the problem seems to be video-linked, the Northbridge would be the logical place to start looking, as the NB chip handles the throughput of data for video.
Another thing to check is your BIOS. Is it the latest version, F5? If not, you can download it from here: http://www.gigabyte.com/products/product-page.aspx?pid=3424&dl=1#bios and then use QFlash to update rather than the @BIOS utility. Also, it is worth checking that your video drivers are up to date.
Also, are you using any software to monitor your system temps? If so, what? A utility such as HW Monitor can help if the problem is temperature related as it will show you if anything is running very hot. But, a note of caution, it isn't perfect and sometimes the sensors on the motherboard do not work as they should so it is best used as a guide rather than as gospel.
If none of these steps resolves the problem, post back and we will look at other things that may be causing the issue. To put your mind at rest, I would be very surprised if the memory controller has been damaged unless you have been running your RAM at 1600 MHz for any period of time. Even a faulty RAM stick shouldn't impact the memory controller.
-
I do notice the NB heatsink getting pretty hot to the touch at times, not sure if that's a good sign of thermal contact or a bad sign of overheating components.
I was worried I'd void the warranty if I replaced the thermal paste and reseated the heatsink, so I haven't messed with it. Plus I have to take the board out, which I'm not excited about.
What I'm going to do is rotate my CPU cooler (Cooler Master GeminII S) to cover the NB heatsink instead of the RAM, and see if that helps, though I should probably be concerned that the radiant cooling isn't enough.
(BIOS is F5)
-
If the NB is getting hot to touch that means it is transferring the heat from the chip through to the heatsink so you probably won't benefit from removing it and replacing the thermal paste.
You may gain by reorienting your cooler, as you have suggested. Another thing to check is the amount of cool air passing through your PC Case, is there enough flow to actually cool the motherboard?
Adding a small 80mm fan to direct some cool air onto the Northbridge area of the motherboard is also a relatively easy and very effective mod to help bring temps down.
-
I've been running the fans on "auto" for both case and CPU since I'm not doing any overclocking.
Since they aren't much louder in full on mode, I'll run them that way for a bit and see how things improve (or otherwise)
thanks for the tips, stay tuned.
-
Made the above mentioned changes, but I still can't get this system stable.
It will either freeze or bluescreen (the last one being bad_pool_header) within 24 hours of use.
-
Have you tried running Memtest on your RAM? If not it might be worth doing that just to make sure you haven't got a dodgy module before we go much further down the overheating route.
As I said initially, the kind of issues you are facing can be the hardest to solve and it may take a little while before you can actually isolate the exact cause and solve the problem. :'(
-
I did run a memtest (Windows memory diagnostics, running extended mode, cache at default, unlimited passes) for 9 passes with no errors.
I have it running now, seeing if it can go 24 hours with no errors.
After the BSOD, Windows crashed on boot with a video driver complaint, but I restarted again after a cold boot with no other changes and it came up fine.
??? ???
-
What make and model of power supply are you using?
-
CORSAIR CMPSU-450VX 450W ATX12V V2.2 80 PLUS Certified Active PFC
-
And what GPU are you using?
-
On-board
-
Ok. I was considering your power supply requirements, as the PSU is below the recommended 550W, especially with all your drives. But maybe you can get away with it if you are using the onboard GPU.
-
Interesting point. I haven't really done a load analysis on the system....any software that can help or do I just pull out the calculator and start adding?
-
There are plenty of load calculators on various PSU manufacturers' websites, but I have often found that they are not accurate when it comes to real-world scenarios. I would have looked at a PSU of at least 550W personally, but as yours is a good make and new, it should be ok for now.
-
You haven't mentioned your temperatures but I'd try an external GPU just in case.
DM, his PSU is much more powerful than needed. He's using only a dual-core with on-board graphics. He's probably using about 150-200W under maximum load on everything.
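For anyone who wants to "pull out the calculator and start adding", here is a minimal sketch of that sum in Python. Every wattage in it is an assumed ballpark figure, not a measurement from this system; only the 450 W rating comes from the thread. The names `ASSUMED_PEAK_WATTS` and `power_budget` are made up for this example.

```python
# Rough peak-power budget. All wattages are ASSUMED ballpark values for
# illustration; they were not measured on the system in this thread.
ASSUMED_PEAK_WATTS = {
    "CPU (Phenom II X2 555 BE, assumed ~80 W)": 80,
    "Motherboard + onboard graphics (assumed)": 35,
    "RAM, 2 DDR3 modules (assumed)": 6,
    "Hard drives, 3 x ~10 W (assumed)": 30,
    "Fans and USB devices (assumed)": 15,
}

PSU_RATED_WATTS = 450  # Corsair CMPSU-450VX, as stated in the thread


def power_budget(components, psu_watts):
    """Return (total estimated draw in watts, fraction of PSU rating used)."""
    total = sum(components.values())
    return total, total / psu_watts


total, fraction = power_budget(ASSUMED_PEAK_WATTS, PSU_RATED_WATTS)
print(f"Estimated peak draw: {total} W "
      f"({fraction:.0%} of the {PSU_RATED_WATTS} W rating)")
```

With these assumed numbers the total lands in the same 150-200W range quoted above. Drive spin-up at power-on can briefly draw more than the steady-state figure, so treat the result as a rough guide only.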
-
bythway_r, I know you are right if you add up the figures, but I have often found that, especially on startup when the load is greater, a significantly larger PSU is sensible. :-\
-
I can't get a temp reading off the onboard GPU. I'm not sure if there's no sensor or if the software (e.g. HW Monitor) won't recognize it.
Using Sandra, board temps are reading about 40C and CPU temps are reading in the 30s.
-
Try GPU-Z. That should give you the temperatures of the GPU.
-
Well, the extended extended memtest came up snake eyes.
Now to determine if it's the stick or the controller.
blast!
ETA: When the first stick went bad, I ran extensive memory tests on the single stick with no issues. Either the stick deteriorated, or....something else.
-
When you have a problem with RAM, it's always difficult to know exactly what is best. If you bought the original RAM modules as a kit and one stick becomes faulty, it is always better to RMA all of the sticks, not just one of them. Whilst in the vast majority of cases you can get away with mixing and matching RAM, in reality it is not the best way.
The fact that you have now run an extended memory test and errors have come up would suggest to me that you have been unlucky with your RAM. Although the memory controller on the CPU doesn't like running RAM above 1333MHz, the truth is it is pretty uncommon for it to fail.
The options you now face though are:
1: Replacing the RAM (again). This is the option I would go for, as it is more likely to be faulty RAM than the CPU.
2: Contacting AMD and seeking an RMA on the processor. If you choose to RMA the CPU, it will be tested by AMD and returned if it is not faulty, but if there is a problem it will be replaced. Either way you will know that your CPU is OK.
As I said right at the beginning, these kinds of problems are often the most frustrating and difficult to cure because there is a 50/50 chance that you will replace one thing only to find that it is something else causing the issue. :-\
-
It sounds from your post like you returned just the one faulty module for RMA when it went. If so, you are now trying to run an unmatched pair of modules, which is likely to give problems.
-
Yeah, I asked G.Skill if I should send both back, they didn't indicate it was needed. So I didn't.
I'm going to be systematic here with the following:
Run the new stick on memtest in channel 0
Run the new stick on memtest in channel 1
Run the old stick on memtest
Relax the timings back to 9-9-9-24 and run both sticks on memtest again.
Whatever the results, I'll send both sticks back this time.
I hope the CPU is good since that is the most invasive replacement.
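A plan like the one above boils down to a small pass/fail table of (stick, channel) runs. As a rough sketch of how such a table could be read, the Python below flags a channel when a stick that passes elsewhere fails in it, and flags a stick when it fails in a channel that works for another stick. The function name `diagnose` and the example results are hypothetical and only illustrate the logic; they are not results from this system.

```python
def diagnose(results):
    """Interpret memtest runs.

    results maps (stick, channel) -> True if that run passed.
    Returns (suspect_sticks, suspect_channels) as sorted lists.
    """
    passed = {key for key, ok in results.items() if ok}
    failed = {key for key, ok in results.items() if not ok}

    good_sticks = {stick for stick, _ in passed}
    good_channels = {channel for _, channel in passed}

    # A stick that passes somewhere but fails in a channel that never
    # passes points at the channel (slot, board trace, or controller).
    suspect_channels = sorted(
        {ch for st, ch in failed if st in good_sticks and ch not in good_channels}
    )
    # A stick that never passes, failing in a channel that works for
    # another stick, points at the stick itself.
    suspect_sticks = sorted(
        {st for st, ch in failed if ch in good_channels and st not in good_sticks}
    )
    return suspect_sticks, suspect_channels


# Hypothetical outcome: both sticks pass in channel 1 and fail in channel 0.
example_results = {
    ("new", 0): False,
    ("new", 1): True,
    ("old", 0): False,
    ("old", 1): True,
}
print(diagnose(example_results))
```

A flagged channel does not say whether the fault is in the DIMM slots, the board, or the memory controller on the CPU; that still needs a swap of one of those parts to settle.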
-
Yes, as they are a matched set (or were, anyway), they should always stay together. If you return any, you return the whole set.
Try loosening the timings to 9-9-9-27
-
My battery of tests is complete.
I have a bad channel 0.
After both single sticks completed memtest successfully in channel 1, I started fresh with channel 0 this morning.
When I returned from work, there was no video.
Hard power down, moved the stick to the back bank of channel 0, restarted, and the screen went corrupt as Windows started to load.
After that, the box wouldn't even POST.
Moved the stick back to channel 1, all good.
Back to channel 0, no POST.
Now I have both sticks in channel 1 with no issues so far.
So...CPU or MB?
-
Most likely a failed controller on the CPU I reckon. RMA it. ;)
-
I think I either have multiple issues or the second memory controller is going bad also.
Running all memory in Channel 1, the unit locked up (on the media center 7 screen saver) after about 2 days of continuous operation.
I was also running a disk synch on my RAID 1 partition at the time, not sure if that makes a difference.
???