[Yaffs] Bad block management

Jacob Dall jacob.dall@operamail.com
Fri, 21 Jan 2005 14:17:53 +0100


Hello Charles,

Thank you very much for replying - I really appreciate it.

> On Thursday 20 January 2005 23:02, Jacob Dall wrote:
> > Hello yaffers,
> >
> > I've a few questions regarding why yaffs' bad block management is designed
> > the way it is.
> >
> > According to Toshiba, NAND failures can be distinguished as "permanent
> > failures" or "soft errors"
> >
> > 1) Permanent failures: this error occurs when programming or erasing, and
> > can be detected by reading the status register after operation.
> >
> > 2) Soft errors: this error occurs during a program, but can only be
> > detected by reads. The error is cleared by a block erase.
> >
> > Now, upon read, if yaffs detects an unfixable ECC error in a page, the
> > block holding that page is marked as bad. According to 2) it would be ok to
> > just mark the page as discarded and let the garbage collector do its job -
> > or have I missed something?
>
> This mechanism was designed before Toshiba shared their wonderful document
> with the world. I have considered changing this, but it has never been a very
> high priority and it does put data at risk.
>
> The "soft errors" are typically write disturb failures that can (hopefully)
> be fixed by ECC. My concern is that if a block displays write disturb
> problems then perhaps it is "going bad". ECC can only fix single bit errors.
> I don't want to wait until it has "gone bad" and lost data before I retire
> it. I'd prefer to retire dodgy looking blocks earlier.

Actually, having looked at the yaffs1 internals, I think it has already been changed - RetireBlock() is only called from yaffs_BlockBecameDirty().
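
One way to read that observation is as deferred retirement: a read error only flags the block, and the block is retired later, at the point where it no longer holds live data. The sketch below illustrates that idea only. The struct, enum and function names are hypothetical and are not the actual yaffs data structures or code paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block bookkeeping; not the real yaffs structures. */
enum block_state { BLOCK_EMPTY, BLOCK_IN_USE, BLOCK_DEAD };

struct block_info {
    enum block_state state;
    int  pages_in_use;     /* pages still holding live data */
    bool needs_retiring;   /* set when a read hit an unfixable ECC error */
};

/* Read path: only flag the block, so its remaining live pages stay
 * readable until they have been deleted or copied elsewhere. */
void note_unfixable_ecc(struct block_info *b)
{
    assert(b != NULL);
    b->needs_retiring = true;
}

/* A page was deleted or relocated by garbage collection.
 * Returns true when that was the last live page in the block. */
bool release_page(struct block_info *b)
{
    assert(b != NULL && b->pages_in_use > 0);
    b->pages_in_use--;
    return b->pages_in_use == 0;
}

/* The block holds no live pages any more: retire it if it was flagged,
 * otherwise recycle it. */
void block_became_dirty(struct block_info *b)
{
    assert(b != NULL && b->pages_in_use == 0);
    if (b->needs_retiring)
        b->state = BLOCK_DEAD;    /* a real driver would write the marker here */
    else
        b->state = BLOCK_EMPTY;   /* a real driver would erase the block here */
}
```

With this shape, a flagged block is still marked bad eventually, but no live data is discarded at the moment the ECC error is seen.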

>
> >
> > In yaffs, a block is marked bad by writing 0 to byte 517 in page 0 / 1 in
> > the block. Why wasn't it decided to use another value (for instance, like
> > SmartMedia's 0xF0)? Then it would have been possible to distinguish initial
> > bad blocks from operational bad blocks.
>
> This was considered. However I decided to use 0x00 because this would have
> the best chance of programming a block where the bits don't "stick" well.
> A sparse bit pattern is less likely to program than all 0s.
>
> This could be changed quite easily.
>
> Generally the factory marked bad blocks are not just marked with this byte.
> Mostly the whole OOB area or even the whole block is marked zero. This
> generally makes it easy enough to distinguish factory marked from YAFFS-marked
> bad blocks.
>
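
The heuristic described above can be sketched as a small classifier over the spare (OOB) bytes of a block's first page. This is only an illustration under stated assumptions: a 528-byte page laid out as 512 data bytes followed by 16 spare bytes, so that byte 517 is spare byte 5; a factory-marked block whose whole spare area reads as zero; and a block status of exactly 0xFF meaning "good". The names classify_block and bad_block_origin are hypothetical and not part of yaffs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OOB_SIZE          16  /* spare bytes per 512-byte page (assumed) */
#define BLOCK_STATUS_OFS   5  /* byte 517 of a 528-byte page = spare byte 5 */

enum bad_block_origin { BLOCK_GOOD, BAD_FACTORY, BAD_RUNTIME };

/* Guess where a bad-block mark came from, given the spare bytes of the
 * block's first page. Simplification: a status byte of exactly 0xFF is
 * treated as good; a real reader may tolerate a flipped bit. */
enum bad_block_origin classify_block(const uint8_t oob[OOB_SIZE])
{
    size_t i;

    assert(oob != NULL);

    if (oob[BLOCK_STATUS_OFS] == 0xFF)
        return BLOCK_GOOD;

    /* Whole spare area zero: looks like a factory mark. */
    for (i = 0; i < OOB_SIZE; i++) {
        if (oob[i] != 0x00)
            return BAD_RUNTIME;   /* only part of the spare area was written */
    }
    return BAD_FACTORY;
}
```

A bench tool could run this over every block to count how many bad blocks were added after the device was first taken into use. Since factory marking is described only as "mostly" zeroing the whole OOB area, the result is a guess, not a guarantee.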
> >
> > I've an issue with some of my devices - the number of bad blocks is
> > increasing very rapidly. Beyond the fact that it's due to ECC read errors,
> > I have yet to discover the root of the problem.
>
> I've done extensive lifetime testing on some devices. One test I did wrote
> approx. 130GB of data, then read and verified it with not one ECC failure or
> bit getting munged.
>
> Some other people doing lifetime testing have expressed concern because they
> lose 1-2% of flash during the lifetime of a device.
>
> What do you mean by rapidly? I assume it is far worse than either of these!

Yes, it's far worse. Imagine having a system that, when looked at, has 2 bad blocks. One hour later it has over 500!!
And this is in a system that writes approx. 10KB of data every 15 seconds.

>
> If you're using Linux, then the most likely cause of the problem is a
> mismatch between the ECC strategy you're using in YAFFS and what you have
> configured in mtd.

I'm using yaffs1/direct

>
> >
> > I'm not blaming yaffs - I'm sure the problem is to be found elsewhere, but
> > I'm thinking really hard of making those changes to yaffs, making me able
> > to get back to the state when the NAND was first taken into use.
> >
> > Please let me know your reasons / thoughts...
>
> Being able to change the bad block marker would help you with bench testing
> until you have fixed the real problem.
>
> There are two things you could try:
> 1) In yaffs_RetireBlock, change the blockstatus to some easy to detect value
> that has at least two zero bits (e.g. 0xFC).
> 2) Or even turn off the writing of bad block markers completely. This would
> cause problems in the file system state, but that probably does not matter
> for you at the moment.
>
> Of course I'm assuming you just want to do these changes while you find and
> fix the real problem. I would not suggest shipping product with either of
> these changes.
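
Suggestion 1) can be sketched as follows, for bench testing only. The sketch assumes the block status lives at spare byte 5 (byte 517 of a 528-byte page) and that whatever reads the status treats two or more zero bits as "bad", which is how the "at least two zero bits" advice is interpreted here. All names are hypothetical; this is not the body of yaffs_RetireBlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_STATUS_OFS        5     /* spare byte holding the block status (assumed) */
#define BENCH_BAD_BLOCK_MARKER  0xFC  /* value suggested in the message */

/* Count the zero bits in a status byte. */
static int count_zero_bits(uint8_t v)
{
    int zeros = 0;
    int bit;

    for (bit = 0; bit < 8; bit++) {
        if (((v >> bit) & 1u) == 0)
            zeros++;
    }
    return zeros;
}

/* A bench marker is usable if it has at least two zero bits and is not
 * 0x00, the stock value it needs to be told apart from. */
bool marker_is_usable(uint8_t marker)
{
    return marker != 0x00 && count_zero_bits(marker) >= 2;
}

/* Stand-in for the marker write done when a block is retired. */
void write_bench_marker(uint8_t *oob)
{
    assert(oob != NULL);
    oob[BLOCK_STATUS_OFS] = BENCH_BAD_BLOCK_MARKER;
}

/* True if a status byte carries the bench marker, not the stock 0x00. */
bool reads_as_bench_marker(uint8_t status)
{
    return status == BENCH_BAD_BLOCK_MARKER;
}
```

A recovery tool could then erase only the blocks whose status reads as the bench marker, leaving anything that looks factory-marked alone. As stated above, this is for fault finding and not for shipping product.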
>
> >
> >
> > Thanks and regards,
> > Jacob Dall
> >
> > FYI: the 'According to Toshiba' stuff was taken from a document named 'NAND
> > Flash Application Design Guide'
>
> Great doc. Should be required reading for anyone working with NAND.