Re: [Yaffs] Can removing chunkErrorStrikes check cause yaffs2 too many Block struck out ?


Author: Peter Barada
Date:
To: CHEN XUEQIN
CC: yaffs@lists.aleph1.co.uk
Subject: Re: [Yaffs] Can removing chunkErrorStrikes check cause yaffs2 too many Block struck out ?

On 02/14/2012 09:14 PM, CHEN XUEQIN wrote:
> Hi Peter:
> Thank you for your tip.

>
> On 2012-02-15 00:56, Peter Barada wrote:
>
>> On 02/14/2012 11:47 AM, CHEN XUEQIN wrote:
>>> Hi Peter:
>>>
>>> On 2012-02-13 23:09, Peter Barada wrote:
>>>
>>>>>> Here are my questions:
>>>>>> 1. Is my patch wrong?
>>>>>> 2. Why does the official yaffs2 code require 3 chunkErrorStrikes to
>>>>>>    retire a block? Will reducing it to 1 chunkErrorStrike wrongly
>>>>>>    mark good blocks bad?
>>>>>> 3. Should I remove the patch?

>>>>>>
>>>>>> Thanks a lot for your advice.
>>>> Yes, your patch is wrong, as any read error will retire the block.

>>>>
>>>> If you see bit-flips in data read out of MTD, then your NAND driver
>>>> isn't properly using ECC to correct the data. If MTD used ECC to
>>>> correct the data, you would see an -EUCLEAN return from MTD on read,
>>>> which would percolate through yaffs_HandleChunkError() and increment
>>>> the strike count.
>>> Thanks for your reply. Now I know the patch is wrong. I've read the
>>> Samsung NAND chip datasheet and analysed the kernel log. I think many
>>> of the struck-out blocks are caused by errors in write operations, but
>>> it's very strange that those blocks went into a program-error state.
>>> According to the chip datasheet, if a program operation results in an
>>> error, the block containing the failing page should be mapped out and
>>> the target data copied to another block. So it's reasonable for yaffs
>>> to retire the block in yaffs_HandleWriteChunkError() even if the chunk
>>> error strike count is only one. But why are there so many program
>>> errors? Any ideas?
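A simplified model of the retirement policy discussed above may help show why the patch hurts: read-side ECC events accumulate strikes, while a program (write) failure retires the block at once. This is a hypothetical C sketch, not the yaffs2 code; the names `blk_state`, `on_read_ecc_event`, `on_program_failure` and `STRIKE_LIMIT` are illustrative, and whether yaffs2 retires on the third or the fourth strike should be checked against the real yaffs_guts.c source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block bookkeeping; NOT the real yaffs block-info layout. */
struct blk_state {
    unsigned strikes;      /* chunk error strikes seen so far */
    bool needs_retiring;   /* block will be retired (marked bad) */
};

/* Threshold as discussed in this thread (assumption: retire once the count
 * reaches this value). The rejected patch effectively made it 1. */
#define STRIKE_LIMIT 3u

/* Called when a read reported an ECC event (corrected or uncorrectable)
 * on a chunk in this block. */
void on_read_ecc_event(struct blk_state *b, unsigned limit)
{
    assert(b != NULL);
    b->strikes++;
    if (b->strikes >= limit)
        b->needs_retiring = true;
}

/* Called when a program (write) operation failed: following the datasheet
 * advice quoted above, the block is retired immediately. */
void on_program_failure(struct blk_state *b)
{
    assert(b != NULL);
    b->strikes++;
    b->needs_retiring = true;
}
```

With `limit` set to 1, a single ECC event on any read retires the block, which is why the patch turns ordinary correctable bit-flips into a stream of struck-out blocks.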

>>>
>>> In addition, I use hardware ECC in the MTD driver; the error-correcting
>>> code is a Hamming code. The NAND chip is MLC, so the hardware ECC can't
>>> correct multi-bit errors and MTD returns a read error to yaffs, which may
>>> increase the number of blocks struck out. I wonder how yaffs handles
>>> uncorrectable bit errors to keep filesystem data reliable and intact.
>>> If yaffs2 key data read from NAND has bit errors, how can yaffs2 keep
>>> working without crashing?
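To illustrate why a single-error-correcting Hamming code cannot cope with multi-bit flips, here is a toy SEC-DED (single-error-correct, double-error-detect) Hamming code over one data byte. It shows the principle only: NAND Hamming ECC is commonly computed over 256- or 512-byte sectors with a different bit layout, and the function names and codeword layout below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 13-bit codeword: bit 0 holds overall parity; Hamming positions 1..12,
 * with check bits at 1, 2, 4, 8 and data bits at the remaining positions. */
static const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};

enum ecc_result { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE };

uint16_t secded_encode(uint8_t data)
{
    uint16_t cw = 0;
    int syndrome = 0;

    /* Place data bits and accumulate the XOR of their positions. */
    for (int i = 0; i < 8; i++) {
        if ((data >> i) & 1) {
            cw |= (uint16_t)(1u << data_pos[i]);
            syndrome ^= data_pos[i];
        }
    }
    /* Set check bits so the XOR of all set positions becomes zero. */
    for (int p = 1; p <= 8; p <<= 1)
        if (syndrome & p)
            cw |= (uint16_t)(1u << p);

    /* Overall parity bit makes the total number of ones even. */
    int ones = 0;
    for (int i = 1; i <= 12; i++)
        ones += (cw >> i) & 1;
    if (ones & 1)
        cw |= 1u;
    return cw;
}

enum ecc_result secded_decode(uint16_t cw, uint8_t *out)
{
    assert(out != NULL);
    int syndrome = 0, parity = 0;
    for (int i = 0; i <= 12; i++) {
        if ((cw >> i) & 1) {
            parity ^= 1;
            syndrome ^= i;   /* position 0 contributes nothing */
        }
    }

    enum ecc_result r = ECC_OK;
    if (parity) {
        /* Odd number of flips: assume exactly one and fix it. Three or more
         * flips can also land here and may be silently miscorrected. */
        if (syndrome > 12)
            return ECC_UNCORRECTABLE;
        cw ^= (uint16_t)(1u << syndrome);  /* syndrome 0 = the parity bit */
        r = ECC_CORRECTED;
    } else if (syndrome) {
        /* Even number of flips, nonzero syndrome: detected, not fixable. */
        return ECC_UNCORRECTABLE;
    }

    uint8_t d = 0;
    for (int i = 0; i < 8; i++)
        if ((cw >> data_pos[i]) & 1)
            d |= (uint8_t)(1u << i);
    *out = d;
    return r;
}
```

Any single flipped bit is corrected, but two flips in the same codeword can only be reported as uncorrectable, which corresponds to the read errors that MLC parts with single-bit hardware ECC pass up to yaffs.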

>>>
>> From all appearances your MTD driver is not properly handling ECC,
>> either on the write or on the read. I assume that if on reads you see a
>> single bit-flip and there's no error from MTD, then MTD is *not*
>> applying ECC on the read to correct flipped bits. It's the job of
>> the MTD driver to properly compute and write the ECC, and then apply the
>> ECC on the read to correct any flipped bits. This is why ECC is used
>> in NAND: to improve the reliability of the data and make sure that the
>> UBER (uncorrectable bit error rate) is low (somewhere around
>> 10^-15). Without proper ECC, NAND can easily show a UBER of 10^-8 or
>> higher, which is what I think you are seeing.
>>
> From the kernel log, my MTD driver reported multi-bit flip errors that it
> could not correct. The NAND controller only supports single-bit flip
> correction, and the UBER on my devices is too high: they only worked for
> about half a year before many errors appeared.
> Could I try a software ECC such as a BCH code instead of the hardware
> ECC? I wonder what the CPU usage of software ECC would be.

To find out the CPU usage of software ECC you'll have to configure/code
it into your kernel, boot it and then measure it...
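Separately from CPU cost, one way to reason about whether a stronger code such as BCH helps is to model bit flips as independent with a raw bit error rate p, and compute the probability that one ECC sector holds more flips than the code can correct. The independence assumption, the 4096-bit sector and the example p in the note are simplifications chosen for illustration, not measurements from these devices; real MLC errors grow with wear and are not independent.

```c
#include <assert.h>

/* Probability that an ECC sector of n_bits contains more than t_correctable
 * flipped bits, assuming each bit flips independently with probability p.
 * Binomial tail computed term by term; accurate while n_bits * p is modest. */
double p_uncorrectable(unsigned n_bits, unsigned t_correctable, double p)
{
    assert(p >= 0.0 && p < 1.0);

    /* term = C(n,k) * p^k * (1-p)^(n-k), starting at k = 0 */
    double term = 1.0;
    for (unsigned i = 0; i < n_bits; i++)
        term *= 1.0 - p;

    double tail = 0.0;
    for (unsigned k = 0; k < n_bits; k++) {
        /* advance term from k flips to k + 1 flips */
        term *= (double)(n_bits - k) / (double)(k + 1) * p / (1.0 - p);
        if (k + 1 > t_correctable)
            tail += term;
    }
    return tail;
}
```

With p = 1e-5 and a 4096-bit sector, a single-bit-correcting code (t = 1) fails on roughly 8e-4 of sector reads, while a 4-bit-correcting code (t = 4) is better by about six orders of magnitude under the same assumptions.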

>> If YAFFS sees errors on reads it increments the strike count and if it
>> hits the limit then it will mark the block bad. This may be what you're
>> seeing. You need to test your MTD driver implementation *independent*
>> of YAFFS to make sure it is operating as expected. Once you *know* your
>> MTD driver works correctly then YAFFS should work fine...
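The core of a standalone test is to write a known pattern, read it back, and count flipped bits. The helpers below cover only the in-memory part; the function names and the xorshift pattern are illustrative assumptions, and actually erasing, writing and reading the device (for example through /dev/mtdN) is not shown. Seeding the pattern from a hypothetical block/page index keeps stale data from an earlier pass from being mistaken for fresh data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill buf with a pseudo-random pattern derived from `seed` (for example a
 * hypothetical block * pages_per_block + page index), via 32-bit xorshift. */
void fill_pattern(uint8_t *buf, size_t len, uint32_t seed)
{
    uint32_t x = seed ? seed : 0x9e3779b9u;  /* xorshift state must be nonzero */
    for (size_t i = 0; i < len; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        buf[i] = (uint8_t)x;
    }
}

/* Count the bits that differ between the expected and the read-back data. */
size_t count_bitflips(const uint8_t *expected, const uint8_t *actual, size_t len)
{
    assert(expected != NULL && actual != NULL);
    size_t flips = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t diff = (uint8_t)(expected[i] ^ actual[i]);
        while (diff) {                 /* clear lowest set bit each pass */
            diff &= (uint8_t)(diff - 1);
            flips++;
        }
    }
    return flips;
}
```

If the driver's ECC works, data handed back by MTD should compare with zero flips; a nonzero count means flipped bits are getting past ECC, which is the symptom described earlier in the thread.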
>>
> Yes, I should test the MTD driver implementation. I wrote some code to
> fill a NAND block, read it back, and erase it. Maybe the code was too
> simple to find the problem. Is there any open-source MTD test program
> available?
The MTD drivers in the kernel include test modules; look at
http://www.linux-mtd.infradead.org/doc/general.html#L_mtd_tests for more
information.
>
> Regards,
> Xueqin Chen

--
Peter Barada
peter.barada@logicpd.com
