[Yaffs] Re: [YAFFS1] Some bits are changed - systematically

Martin Egholm Nielsen martin at egholm-nielsen.dk
Fri Dec 16 08:05:34 GMT 2005


>> > Can you reproduce the problem?  Does the corruption hit the same
>> > file?  Is it similar in other files?  Do you know it's not a NAND
>> > or MTD problem -- i.e a corrupted write or a bad device.  Have
>> > you seen this problem on other instances of the h/w. etc.
>>
>>That's the only device I've seen it with - out of 20-30 pieces having
>>had the same "treatment" :-)
>>And no I haven't tried that device any more - I didn't want to ruin the
>>possibility to analyse what has happened...
>>
>>And I don't know if it's a NAND or MTD problem - I was hoping that some
>>could guide me...
>>
>>Can this occur, say, with a bad NAND? Would YAFFS/MTD puke up with a lot
>>of checksum errors?
> 
> 
> A few things that I can think of:
> 
> 1) A gross NAND failure. YAFFS/mtd are not magic and need reasonably reliable 
> media to do anything. ECC can fix single-bit errors, but nothing more. It 
> can't fix gross NAND errors any more than ReiserFS can work with a disk with 
> a 6 inch nail through it.
> 
> 2) Iffy timing. Check your NAND access timing. Marginal timing has a habit of 
> making some parts work OK and others not.
> 
> 3) Check that the ECC code is actually working OK. A poor ECC implementation 
> could cause more damage than it fixes.
> 
> 4) Bad block handling. If a bad block is not being flagged correctly then you 
> could end up retrying it on every mount. That would be a problem.
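
To illustrate points 1) and 3), here is a minimal sketch of a single-bit-correcting code over a 256-byte block. It is NOT the YAFFS or MTD ECC code; the names compute_ecc and check are made up for illustration, and the scheme (XOR of the indices of all set bits, plus XOR of the inverted indices) is a simplified stand-in for the idea. It does not handle errors in the stored ECC itself.

```python
INDEX_BITS = 11                    # 256 bytes * 8 bits = 2048 bit positions
MASK = (1 << INDEX_BITS) - 1


def compute_ecc(block):
    """Return (idx, inv): XOR of the indices, and of the inverted indices,
    of every set bit in a 256-byte block. Hypothetical helper."""
    assert len(block) == 256
    idx = inv = 0
    for i in range(len(block) * 8):
        if block[i >> 3] & (1 << (i & 7)):
            idx ^= i
            inv ^= ~i & MASK
    return idx, inv


def check(block, stored):
    """Compare a bytearray against its stored ECC.

    Returns "ok", "corrected" (block is fixed in place) or "uncorrectable".
    """
    idx, inv = compute_ecc(block)
    d_idx = idx ^ stored[0]
    d_inv = inv ^ stored[1]
    if d_idx == 0 and d_inv == 0:
        return "ok"
    # A single flipped bit at position p gives d_idx == p and d_inv == ~p,
    # so the two syndromes are exact complements of each other.
    if d_idx ^ d_inv == MASK:
        block[d_idx >> 3] ^= 1 << (d_idx & 7)
        return "corrected"
    # Two flipped bits give d_idx == d_inv, which lands here.
    return "uncorrectable"
```

In this sketch one flipped bit is repaired and two flipped bits are reported as uncorrectable. Three flipped bits look exactly like a single-bit error at some other position, so the "correction" flips a fourth bit. That is one way an ECC can cause more damage than it fixes when the media is worse than the code was designed for.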

I haven't had time to dig further into this - we've been 
struggling with other critical issues - namely bad power-up and, most 
noticeably of all, memory failures! Some of our boards crash, and in 
"lightweight" situations the memory is just modified slightly. So for 
now I'm putting all my faith in this being the reason for the systematic 
bit-changing...
But I guess, in order for this to be The Plausible Real Explanation 
(TM), the bits would have had to be modified while writing the file. 
However, the error only showed up after several reboots and additional 
writes to the NAND. But perhaps the additional writing could have 
exercised instructions/code in the altered file (libc.so) that had not 
been run before?!
Does this sound likely?
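
One way to check whether the changes really are systematic, assuming a known-good copy of the file is available, is to compare the two copies bit by bit and tally where the flips land. The sketch below is only an illustration: bit_flips and summarize are made-up names, and the 512-byte page size is an assumption that should be set to match the actual NAND part.

```python
from collections import Counter


def bit_flips(good, bad):
    """Yield (byte_offset, bit_number, direction) for every differing bit.

    Only the common length of the two inputs is compared.
    """
    for offset, (g, b) in enumerate(zip(good, bad)):
        diff = g ^ b
        for bit in range(8):
            if diff & (1 << bit):
                direction = "1->0" if g & (1 << bit) else "0->1"
                yield offset, bit, direction


def summarize(good, bad, page_size=512):
    """Tally flips by bit number, direction and page (page_size is assumed)."""
    flips = list(bit_flips(good, bad))
    return {
        "total": len(flips),
        "by_bit": Counter(bit for _, bit, _ in flips),
        "by_direction": Counter(d for _, _, d in flips),
        "by_page": Counter(off // page_size for off, _, _ in flips),
    }
```

Feeding it the contents of the good and the corrupted libc.so (both read in binary mode) would show whether the flips cluster on one data bit, go in one direction only, or sit in a few pages. A single bit number dominating might point more towards a data line or memory problem, while flips confined to a few pages might point more towards the NAND itself, though neither pattern would be conclusive on its own.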

BR,
  Martin Egholm



