Re: [Yaffs] Can removing chunkErrorStrikes check cause yaffs…

Author: CHEN XUEQIN
Date:  
To: Peter Barada
CC: yaffs@lists.aleph1.co.uk
Subject: Re: [Yaffs] Can removing chunkErrorStrikes check cause yaffs2 too many Block struck out ?
Hi Peter:
Thank you for your tip.

On 2012-02-15 00:56, Peter Barada wrote:

> On 02/14/2012 11:47 AM, CHEN XUEQIN wrote:
>> Hi Peter:
>>
>> On 2012-02-13 23:09, Peter Barada wrote:
>>
>>>>>          Here are my questions:
>>>>>              1. Is my patch wrong?
>>>>>              2. Why does the official yaffs2 code require 3 chunkErrorStrikes to
>>>>>                 retire a block? Would reducing it to 1 chunkErrorStrike wrongly
>>>>>                 mark good blocks bad?
>>>>>              3. Should I remove the patch?

>>>>>
>>>>>          Thanks a lot for your advice.
>>> Yes, your patch is wrong as any read error will retire the block.

>>>
>>> If you see bit-flips from data read out of MTD, then your NAND driver
>>> isn't properly using ECC to correct the data. If MTD used ECC to
>>> correct the data you would see a -EUCLEAN return from MTD on read which
>>> will percolate through yaffs_HandleChunkError() - and increment the
>>> strike count.
>>
>>      Thanks for your reply. Now I know the patch is wrong. I've read the Samsung
>> NAND chip datasheet and analysed the kernel log. I think most of the struck-out
>> blocks were caused by errors in write operations, but it's very strange that
>> those blocks went into the program error state. According to the chip datasheet,
>> if a program operation results in an error, the block containing the failed page
>> should be mapped out and the target data copied to another block. So it's
>> reasonable for yaffs to retire the block in yaffs_HandleWriteChunkError even if
>> the chunk error strike count is only one. But why are there so many program
>> errors? Any ideas?

>>
>>      In addition, I use hardware ECC in the MTD driver; the error-correcting code
>> is a Hamming code. The NAND chip is MLC, so the hardware ECC can't correct
>> multi-bit errors and MTD returns a read error to yaffs, which may increase the
>> number of blocks struck out. I wonder how yaffs handles uncorrectable bit errors
>> so as to keep the filesystem data reliable and intact. If key yaffs2 data
>> read from NAND has errors in some bits, how can yaffs2 keep working without crashing?

>>
> From all appearances your MTD driver is not properly handling ECC,
> either on the write or the read. I assume that on reads if you see a
> single bit-flip and there's no error from MTD, then MTD is *not*
> applying ECC on the read to correct any flipped bits. It's the job of
> the MTD driver to properly compute and write the ECC, and then apply the
> ECC on the read to correct the possible flipped bits - this is why ECC
> is used in NAND, to improve the reliability of the data and make sure
> that the UBER (uncorrectable bit error rate) is low (somewhere around
> 10E-15). Without proper ECC, NAND can easily show a UBER of 10E-8 or
> higher, which is what I think you are seeing.
>


From the kernel log, my MTD driver reported multi-bit flip errors that it could
not correct. The NAND controller only supports single-bit
flip correction, but the UBER is too high on my devices. My
devices had only worked for about half a year before many errors appeared.
Could I try a software ECC such as a BCH code to replace the hardware ECC? I
wonder how much CPU time software ECC would use.
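
To illustrate the single-bit limit described above, here is a toy sketch in C.
It uses a plain Hamming(7,4) code, which is an assumption chosen for brevity and
is not the code my NAND controller or the kernel actually implements (real NAND
ECC covers a whole 256 or 512 byte block and usually reports a 2-bit error as
uncorrectable instead of miscorrecting it). All function names are hypothetical.
It shows that one flipped bit is repaired, while two flipped bits leave wrong data.

```c
#include <stdint.h>

/* XOR of the 1-based positions of all set bits; 0 for a valid codeword. */
static unsigned hamming74_syndrome(uint8_t cw)
{
    unsigned s = 0;
    for (unsigned pos = 1; pos <= 7; pos++)
        if ((cw >> (pos - 1)) & 1u)
            s ^= pos;
    return s;
}

/* Encode 4 data bits into a 7-bit codeword (data at positions 3, 5, 6, 7). */
static uint8_t hamming74_encode(uint8_t data)
{
    uint8_t cw = 0;
    cw |= (uint8_t)(((data >> 0) & 1u) << 2); /* position 3 */
    cw |= (uint8_t)(((data >> 1) & 1u) << 4); /* position 5 */
    cw |= (uint8_t)(((data >> 2) & 1u) << 5); /* position 6 */
    cw |= (uint8_t)(((data >> 3) & 1u) << 6); /* position 7 */

    /* Parity bits live at positions 1, 2 and 4; setting them to the
     * syndrome of the data bits makes the total syndrome zero. */
    unsigned s = hamming74_syndrome(cw);
    cw |= (uint8_t)(((s >> 0) & 1u) << 0);    /* position 1 */
    cw |= (uint8_t)(((s >> 1) & 1u) << 1);    /* position 2 */
    cw |= (uint8_t)(((s >> 2) & 1u) << 3);    /* position 4 */
    return cw;
}

/* Decode a codeword, flipping the one bit the syndrome points at (if any). */
static uint8_t hamming74_decode(uint8_t cw)
{
    unsigned s = hamming74_syndrome(cw);
    if (s != 0)
        cw ^= (uint8_t)(1u << (s - 1));
    return (uint8_t)(((cw >> 2) & 1u)
                   | (((cw >> 4) & 1u) << 1)
                   | (((cw >> 5) & 1u) << 2)
                   | (((cw >> 6) & 1u) << 3));
}
```

With this sketch, decoding a codeword with one flipped bit returns the original
4 data bits, but decoding a codeword with two flipped bits returns different
data, because the syndrome then points at a third, innocent bit.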

> If YAFFS sees errors on reads it increments the strike count and if it
> hits the limit then it will mark the block bad. This may be what you're
> seeing. You need to test your MTD driver implementation *independent*
> of YAFFS to make sure it is operating as expected. Once you *know* your
> MTD driver works correctly then YAFFS should work fine...
>
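
The strike counting described above can be modelled with a small C sketch. This
is not the yaffs2 source: the struct, function and macro names are hypothetical,
the limit of 3 is the value quoted earlier in this thread, and whether the real
code compares with >= or > is an assumption not checked here.

```c
#include <stdbool.h>

/* Hypothetical limit; the value 3 is the one mentioned in this thread. */
#define MAX_CHUNK_ERROR_STRIKES 3

/* Hypothetical per-block bookkeeping, reduced to what the model needs. */
struct block_info {
    unsigned chunk_error_strikes; /* read errors seen on this block so far */
    bool needs_retiring;          /* set once the block should be marked bad */
};

/* Record one chunk read error and flag the block once the limit is hit. */
static void note_chunk_error(struct block_info *bi)
{
    bi->chunk_error_strikes++;
    if (bi->chunk_error_strikes >= MAX_CHUNK_ERROR_STRIKES)
        bi->needs_retiring = true;
}
```

Lowering the limit to 1 in this model retires a block on its first read error,
which is the behaviour my patch introduced.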


Yes, I should test the MTD driver implementation. I wrote some code to
fill a NAND block, read the block back, and erase the block. Maybe the code was
too simple to find the problem. Is there any open source MTD test program available?
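
For reference, the comparison step of such a fill/read-back test can be sketched
in C as below. This is only the verification core under stated assumptions: the
function names are hypothetical, the pattern generator is a simple xorshift
chosen for illustration, and the actual erase, write and read calls against the
MTD device are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill buf with a reproducible pseudo-random pattern (32-bit xorshift). */
static void fill_pattern(uint8_t *buf, size_t len, uint32_t seed)
{
    uint32_t x = seed ? seed : 1u; /* xorshift state must not be zero */
    for (size_t i = 0; i < len; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        buf[i] = (uint8_t)x;
    }
}

/* Count the bits that differ between the written and the read-back data. */
static size_t count_bitflips(const uint8_t *expected, const uint8_t *actual,
                             size_t len)
{
    size_t flips = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t diff = expected[i] ^ actual[i];
        while (diff) {
            diff &= (uint8_t)(diff - 1); /* clear the lowest set bit */
            flips++;
        }
    }
    return flips;
}
```

Regenerating the pattern from the same seed after the read avoids keeping a copy
of every written page, and logging the flip count per page (rather than only
pass/fail) shows whether errors are single-bit or multi-bit.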


Regards,
Xueqin Chen