[Yaffs] Re: [YAFFS1] Some bits are changed - systematically

Charles Manning manningc2 at actrix.gen.nz
Tue Nov 29 21:51:38 GMT 2005


On Wednesday 30 November 2005 03:28, Martin Egholm Nielsen wrote:
> Hi Ian,
>
> I take to the list - who knows if somebody might read this one day...
>
> Ian McDonnell wrote:
>  >>>>I'm sure Charles will ask for this when he wakes up in New
>  >>>>Zealand...
>  >>
>  >>I guess not :-)
>  >
>  > Yes, he is very quiet isn't he.

Yup, I do take a break now and then... and there are around 200 other people 
on the list who would hopefully have some opinions too.

> :
> :-)
> :
>  >>Can I enable some flags in the YAFFS core enabling some debug
>  >>information that'll help me investigating this problem?
>  >>I guess what I want is to have a list of nodes stating what
>  >>nodes belong to what file and when the node was written...

Fiddle with the yaffs_traceFlags trace mask. Set it according to the flags 
defined in yportenv.h.

This really should be something you can set on the fly through procfs.
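
For anyone who has not used a trace mask before, here is a rough sketch of the 
idea. It is not the YAFFS code: the EXAMPLE_ flag names and values and the 
example_trace_flags variable are made-up stand-ins for yaffs_traceFlags and the 
flags in yportenv.h, so check your own yportenv.h for the real names and values.

```c
#include <stdio.h>

/* Hypothetical flag values for illustration only. The real flags live in
 * yportenv.h and may have different names and values in your YAFFS version. */
#define EXAMPLE_TRACE_BAD_BLOCKS 0x00000010u
#define EXAMPLE_TRACE_ERASE      0x00000020u
#define EXAMPLE_TRACE_WRITE      0x00000080u

/* Stand-in for the yaffs_traceFlags mask: one bit per trace category. */
static unsigned example_trace_flags = 0;

/* Turn on one or more trace categories. */
static void example_trace_enable(unsigned flags)
{
    example_trace_flags |= flags;
}

/* Turn off one or more trace categories. */
static void example_trace_disable(unsigned flags)
{
    example_trace_flags &= ~flags;
}

/* Return nonzero if the given category is currently enabled. */
static int example_trace_on(unsigned flag)
{
    return (example_trace_flags & flag) != 0;
}

/* Print a message only when its category is enabled in the mask. */
static void example_trace(unsigned flag, const char *msg)
{
    if (example_trace_on(flag))
        printf("trace[0x%08x]: %s\n", flag, msg);
}
```

With this sketch, calling example_trace_enable(EXAMPLE_TRACE_WRITE | 
EXAMPLE_TRACE_BAD_BLOCKS) would let write and bad block messages through 
example_trace() while erase messages stay silent.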

>  >
>  > Can you reproduce the problem?  Does the corruption hit the same
>  > file?  Is it similar in other files?  Do you know it's not a NAND
>  > or MTD problem -- i.e a corrupted write or a bad device.  Have
>  > you seen this problem on other instances of the h/w. etc.
>
> That's the only device I've seen it with - out of 20-30 pieces having
> had the same "treatment" :-)
> And no I haven't tried that device any more - I didn't want to ruin the
> possibility to analyse what has happened...
>
> And I don't know if it's a NAND or MTD problem - I was hoping that some
> could guide me...
>
> Can this occur, say, with a bad NAND? Would YAFFS/MTD puke up with a lot
> of checksum errors?

A few things that I can think of:

1) A gross NAND failure. YAFFS/mtd are not magic and need reasonably reliable 
media to do anything. ECC can fix single bit errors, but nothing more. It 
can't fix gross NAND errors any more than ReiserFS can work with a disk with 
a 6 inch nail through it.

2) Iffy timing. Check your NAND access timing. Marginal timing has a habit of 
making some parts work OK and others not.

3) Check that the ECC code is actually working OK. A poor ECC implementation 
could cause more damage than it fixes.

4) Bad block handling. If a bad block is not being flagged correctly then you 
could end up retrying it on every mount. That would be a problem.
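
One way to approach point 3 is a small host-side harness that flips bits in a 
known buffer and checks what the ECC code does. The sketch below is only an 
illustration: the function pointer types, the 256 byte chunk size and the 
return convention (0 = clean, 1 = corrected, -1 = uncorrectable) are 
assumptions, and toy_ecc_calc/toy_ecc_correct are a simplified stand-in, not 
the YAFFS or MTD ECC. Adapt the signatures to whatever your ECC routines use.

```c
#include <stdint.h>
#include <string.h>

#define CHUNK_BYTES 256u
#define CHUNK_BITS  (CHUNK_BYTES * 8u)
#define ECC_BYTES   3u

/* Hypothetical interface: calc fills ecc[]; correct repairs data in place and
 * returns 0 (no error), 1 (single bit corrected) or -1 (uncorrectable). */
typedef void (*ecc_calc_fn)(const uint8_t *data, uint8_t *ecc);
typedef int (*ecc_correct_fn)(uint8_t *data, const uint8_t *stored_ecc);

/* Toy code: XOR of the positions of all set bits, plus overall parity. */
static void toy_ecc_calc(const uint8_t *data, uint8_t *ecc)
{
    unsigned pos_xor = 0, parity = 0, bit;

    for (bit = 0; bit < CHUNK_BITS; bit++) {
        if (data[bit / 8] & (1u << (bit % 8))) {
            pos_xor ^= bit;
            parity ^= 1u;
        }
    }
    ecc[0] = (uint8_t)(pos_xor & 0xffu);
    ecc[1] = (uint8_t)((pos_xor >> 8) & 0x07u);
    ecc[2] = (uint8_t)parity;
}

static int toy_ecc_correct(uint8_t *data, const uint8_t *stored_ecc)
{
    uint8_t now[ECC_BYTES];
    unsigned pos, parity;

    toy_ecc_calc(data, now);
    pos = (unsigned)(stored_ecc[0] ^ now[0]) |
          ((unsigned)(stored_ecc[1] ^ now[1]) << 8);
    parity = (unsigned)(stored_ecc[2] ^ now[2]);

    if (pos == 0 && parity == 0)
        return 0;                       /* data matches the stored ECC */
    if (parity == 1) {                  /* odd number of flips: assume one */
        data[pos / 8] ^= (uint8_t)(1u << (pos % 8));
        return 1;
    }
    return -1;                          /* even number of flips: give up */
}

/* Fill a buffer with a fixed, non-trivial pattern. */
static void fill_pattern(uint8_t *buf)
{
    unsigned i;

    for (i = 0; i < CHUNK_BYTES; i++)
        buf[i] = (uint8_t)(i * 37u + 11u);
}

/* Flip every bit in turn; count cases that are not repaired exactly. */
static int check_single_bit_recovery(ecc_calc_fn calc, ecc_correct_fn correct)
{
    uint8_t good[CHUNK_BYTES], work[CHUNK_BYTES], ecc[ECC_BYTES];
    unsigned bit;
    int failures = 0;

    fill_pattern(good);
    calc(good, ecc);
    for (bit = 0; bit < CHUNK_BITS; bit++) {
        memcpy(work, good, CHUNK_BYTES);
        work[bit / 8] ^= (uint8_t)(1u << (bit % 8));
        if (correct(work, ecc) != 1 || memcmp(work, good, CHUNK_BYTES) != 0)
            failures++;
    }
    return failures;
}

/* Flip pairs of bits; count cases not reported as uncorrectable. */
static int check_double_bit_detection(ecc_calc_fn calc, ecc_correct_fn correct)
{
    uint8_t good[CHUNK_BYTES], work[CHUNK_BYTES], ecc[ECC_BYTES];
    unsigned bit, other;
    int failures = 0;

    fill_pattern(good);
    calc(good, ecc);
    for (bit = 0; bit < CHUNK_BITS; bit++) {
        other = (bit + 13u) % CHUNK_BITS;   /* always a different bit */
        memcpy(work, good, CHUNK_BYTES);
        work[bit / 8] ^= (uint8_t)(1u << (bit % 8));
        work[other / 8] ^= (uint8_t)(1u << (other % 8));
        if (correct(work, ecc) != -1)
            failures++;
    }
    return failures;
}
```

Both checks return a failure count, so a result of zero from 
check_single_bit_recovery(toy_ecc_calc, toy_ecc_correct) means every single 
bit flip was repaired. Pointing the same checks at thin wrappers around your 
real ECC routines is where this would earn its keep: an implementation that 
"corrects" the wrong bit shows up as a nonzero count.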

-- Charles



