Hi Ian,

Thanks for taking the time to discuss this - I appreciate it!

>> [yaffs] retires the block at erase time
>I don't know about the real marking of bad blocks.  We have 
>actually disabled this in some versions of products where we
>were bitten by transient write errors causing large number of 
>blocks to be persistently marked bad (OOB) and taken out of 
>service.
>

... Meaning that in your experience it's OK to just defer to the "write fail" mechanism: if a write to page <n> fails this time, then after erasure either the write to page <n> will fail again, or, if it happens to succeed, the data in page <n> is reliably OK.  Right? (asked with some trepidation)

>> Is this right?  If so, it seems OK as long as bad pages within
>> an eraseblock does not imply unreliability of other pages
>> within the same eraseblock.
>
>The logic around declaring a block truly bad and broken is 
>lacking (both Yaffs and MTD).  IIRC, NAND vendors recommend that 
>blocks should be erased when there are write/read errors, and 
>only marked bad if the erase fails, and then perhaps only after 
>several attempts.  Neither Yaffs nor MTD do this.
>
[and from a later message]
>We'd need the NAND vendors to reveal that, but I think it 
>reasonable to suspect that if a block is improperly erased that 
>any data subsequently written to that block is liable to 
>failure.  But if an individual page is bad because of, say, 
>power loss at the time of the write, that the other pages within 
>that block would be solid.  But this is JUST A GUESS.

OK, the whole concept is a bit scary.  But I guess that in a questionable eraseblock an erase failure is more probable than the alternative: a write to a member page fails before erasure, then succeeds after erasure but leaves data that is unreliable.

If this is the case, then what's left is a discussion of how aggressive we should be about permanently retiring blocks, which is really just a discussion of how quickly the flash "wears out".  That's not a big issue, but the possibility of writing data which later proves to be unreliable is.
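
To make sure I'm reading the vendor recommendation the same way you are, here is a rough sketch of the policy as I understand it: on a write/read error, erase the block, and only retire it if the erase itself fails several times in a row.  This is not Yaffs or MTD code.  All the names (struct block, sim_erase, handle_io_error) and the threshold of 3 attempts are made up for illustration, and the erase is simulated rather than a real driver call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed threshold: how many erase attempts before giving up.
 * This is the "how aggressive" knob, not a vendor figure. */
#define MAX_ERASE_ATTEMPTS 3

enum block_state { BLOCK_OK, BLOCK_RETIRED };

struct block {
    enum block_state state;
    int erase_failures_left;  /* simulation only: erases that will still fail */
};

/* Hypothetical stand-in for a driver erase call; returns true on success. */
bool sim_erase(struct block *b)
{
    if (b->erase_failures_left > 0) {
        b->erase_failures_left--;
        return false;
    }
    return true;
}

/* Called after a write or read error somewhere in block b.
 * Erase and put the block back in service; retire it only if
 * every one of MAX_ERASE_ATTEMPTS erase attempts fails. */
enum block_state handle_io_error(struct block *b)
{
    assert(b != NULL);

    for (int attempt = 0; attempt < MAX_ERASE_ATTEMPTS; attempt++) {
        if (sim_erase(b)) {
            b->state = BLOCK_OK;   /* no bad-block mark is written */
            return b->state;
        }
    }

    /* A real implementation would write the OOB bad-block marker here. */
    b->state = BLOCK_RETIRED;
    return b->state;
}
```

With this, a block whose erase fails once and then succeeds stays in service, while a block whose erase fails MAX_ERASE_ATTEMPTS times in a row is retired.  Note the sketch does nothing about the case I'm worried about, where the erase succeeds but a later write to the same page succeeds unreliably.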

-Scott