Killian De Volder (Qantourisc) and I ended up having a long discussion about TCQ vs. write caching, stable pages, and other random stuff about the Linux IO stack - none of it bcachefs specific, but we thought it might be useful and/or interesting enough to be worth dumping into the wiki and possibly cleaning up later. The IRC conversation is reproduced below:

> http://brad.livejournal.com/2116715.html
> drkshadow: sidenote: I think I will write a custom tool to test this, that is less complicated to run
> as one can use math to verify the consistency of a file
> heh
> that stuff is a pain
> py1hon: what stuff ?
> making sure syncs are passed down correctly
> py1hon: the writing of the tool, or dealing with that shitstorm of cache-lies ?
> i spotted a bug (hope it's fixed now...) in the DIO code - with O_DIRECT|O_SYNC aio the sync just wasn't happening
> py1hon: i'd like to know what setups / disks work ... I run everything with write-cache off, it's painful
> i mean finding/testing every possible way code can fuck up
> you read enough code you see scary shit
> py1hon: ... still ?
> i haven't checked
> reminds me of 2006 and lvm
> really don't want to :p
> "oh, now we handle syncs"
> i have enough bugs i'm responsible for thank you very much
> i was like ... whaaaat ?
> hah
> yeaaaa
> so I run all my servers in direct IO :/
> i think that guy's TCQ thing is a red herring though
> TCQ is just another form of cache handling
> the trouble with the TCQ/barrier model is that it's utterly insane trying to preserve ordering all the way up and down the io stack
> thus
> py1hon: i'm just VERY confused why handling sync in the kernel is so hard
> software ends up being coded to just use explicit flushes
> Qantourisc: i think mostly it's because a lot of storage developers rode in on the short bus
> "short bus" ?
> yes
> what's that ?
> the bus you ride to school if you have "special needs"
> ugh
> imo, each bio request should return AFTER the write is DONE
> OR
> return a promise
> in an ideal world yea
> the trouble is
> write caching is really useful
> how anything else is allowed, I do not understand, me eyeballs linus
> the reason is
> until the write completes
> that's what the promises are for
> that memory buffer - you can't touch it
> and when you request a sync, you wait until they are all completed (imo)
> but yes it's some work
> hold on i'm explaining :P
> so this comes up most notably with the page cache
> also "stable pages" was still a clusterfuck last i checked
> but it'd be even worse without write caching
> so
> if you're a filesystem that cares about data and checksums crap
> you checksum the data you're writing, and then the data you're writing better goddamn not change until the write completes - say userspace is changing it, because that's what they do
> or if it does, your checksum is invalid
> which means:
> to write some data that's cached in the page cache, and mapped into potentially multiple userspace processes
> Qantourisc makes a note: if userspace writes, just update the checksum - or are they writing in the buffer you are about to write ?
> you have to mark those page table entries RO, flush TLBs etc. etc. so that userspace can't modify that crap
> if userspace modifies the buffer while the write is in flight
> you don't know if the old or the new version was the one that got written
> it's a race
> you're fucked
> ...
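
A toy illustration of the race just described, as a userspace C sketch (this is not bcachefs code: `page`, `device`, `checksum()` and `scribbler()` are all made up for the demo, and a plain memcpy() stands in for the actual device write). One thread keeps scribbling over the buffer while the "write" is in flight, so the checksum computed at submit time usually won't match the data that actually reached the "device":

```c
/*
 * Toy demo of the checksum-vs-in-flight-write race.  Build with:
 *   cc -pthread race.c
 * The outcome is timing dependent, but you'll usually see a mismatch.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

static unsigned char page[PAGE_SIZE];	/* the buffer "mapped into userspace" */
static unsigned char device[PAGE_SIZE];	/* stand-in for the disk */

static uint32_t checksum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/* "userspace": keeps modifying the buffer, because that's what it does */
static void *scribbler(void *arg)
{
	for (int i = 0; i < 1000000; i++)
		page[i % PAGE_SIZE]++;
	return NULL;
}

int main(void)
{
	pthread_t t;

	memset(page, 0xaa, sizeof(page));
	pthread_create(&t, NULL, scribbler, NULL);

	/* "filesystem": checksum the data, then start the write */
	uint32_t csum = checksum(page, PAGE_SIZE);
	memcpy(device, page, PAGE_SIZE);	/* the write, racing with the scribbler */

	pthread_join(t, NULL);

	/* does the checksum we pushed down match what hit the device? */
	uint32_t ondisk = checksum(device, PAGE_SIZE);
	printf("checksum at submit: %u, data on device: %u (%s)\n",
	       (unsigned) csum, (unsigned) ondisk,
	       csum == ondisk ? "ok" : "MISMATCH");
	return 0;
}
```
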
> imo once you submit to the kernel what you will write: it's hands off
> guess they wanted to prevent an extra page copy
> ignore direct io for now (that has its own issues though)
> yes, that extra copy is bad
> py1hon: so is screwing up your data :)
> not _that_ bad, that's actually what bcachefs does
> but it's a shitty situation
> also having to allocate memory to copy data just so you can do io, that's also not a great situation
> screwing up data < extra memory copy
> there are lots of good reasons why you don't want to bounce data you're writing if you don't have to
> and mostly, you don't have to
> anyways
> Also what's wrong with telling userspace "fu" ?
> don't do it wrong ?
> or is writing to it "allowed" ?
> like i said bcachefs is just copying the data
> which i'm sure someone is going to complain about eventually
> sigh
> so getting back to the tcq vs write caching thing
> fundamental issue is, while a write is outstanding the device owns that buffer, you can't touch it or reuse it for anything else
> py1hon: then you reply: "I am sorry, the current design of what is allowed in the kernel API is too liberal, I can't write out data you constantly modify while I'm writing it. Complain to Linus to fix this braindead design."
> if doing a write is fast, because writes are cached, this isn't a huge deal
> you can just wait until the write completes, and (potentially) avoid having to bounce
> py1hon: quick break - "bounce" ?
> bounce = allocate a bounce buffer, copy data into bounce buffer, write using bounce buffer instead of original buffer
> ah ok
> "copy page" in a sense
> if doing a write is slow, because it's always waiting for the device to physically make it persistent
> then, you're probably going to end up bouncing all the writes on the host - introducing an extra copy
> but
> this is stupid
> because the writes are gonna get copied to the device's buffer regardless, so it can do the actual io
> so if you have to copy the data to the device's buffer anyways
> sounds logical: everyone who promises to write something out later should copy the data
> just do that INSTEAD of bouncing
> boom, done
> except not really, there's shitty tradeoffs all around
> why are they favoring speed over safety ?
> anyways, there's really no good answers to the bouncing and tcq vs write buffering stuff
> well they're not these days
> excepting bugs of course
> so they bounce more, then ?
> no you don't want to know about the bouncing situation
> pretend i didn't bring that up because then i'd have to talk about stable pages and like i said that's another bag of shit
> anyways
> userspace has no concept of tcq
> or when a write hits the device
> or what
> all userspace has is fsync and O_SYNC
> and that's _fine_
> those interfaces are completely adequate in practice
> the kernel just has to make sure it actually honors them, which (again, excluding bugs), it does these days
> so who could write to the page being written, then (in the earlier example with the checksum) ?
> whether it honors fsync and O_SYNC by using flushes or by using TCQ doesn't matter one damn to userspace
> ok, so you mmap a file
> agreed, fsync/O_SYNC is enough
> how's that work?
> or say multiple processes mmap the same file
> MAP_SHARED
> py1hon: the only way I see this working: cow
> no
> that would be MAP_PRIVATE
> say they're all writing to the file
> so all their changes, via the mmap() mapping, have to be written out to disk (and also seen by each other blah blah)
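
Before the conversation moves on to mmap, here is roughly what the "bounce" just defined looks like, again as a userspace sketch rather than real kernel code (`write_bounced()`, the toy checksum and the file name are invented for the example). Once the copy into the bounce buffer is done, the caller is free to keep modifying the original data; the checksum and the I/O both see the private copy, so they can never disagree:

```c
/* Userspace sketch of a bounced write: copy first, then checksum and write the copy. */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* toy checksum, stand-in for whatever the filesystem really uses */
static uint32_t checksum(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint32_t sum = 0;

	while (len--)
		sum += *p++;
	return sum;
}

static int write_bounced(int fd, const void *data, size_t len, off_t offset,
			 uint32_t *csum_out)
{
	void *bounce = malloc(len);

	if (!bounce)
		return -1;

	memcpy(bounce, data, len);		/* the extra copy everyone hates */
	*csum_out = checksum(bounce, len);	/* checksum the stable copy */

	ssize_t ret = pwrite(fd, bounce, len, offset);

	free(bounce);
	return ret == (ssize_t) len ? 0 : -1;
}

int main(void)
{
	char data[4096] = "data userspace might keep changing";
	uint32_t csum;
	int fd = open("bounce-demo.dat", O_WRONLY | O_CREAT, 0644);

	if (fd < 0 || write_bounced(fd, data, sizeof(data), 0, &csum))
		return 1;
	/* from the moment the memcpy finished, `data` was free to change again */
	close(fd);
	return 0;
}
```
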
> sec, first: mmap works by pages, correct ?
> and those pages are backed by the FS ?
> yes
> and yes
> so that file is cached in the page cache
> hold on, constructing mental model
> it's cached just once
> then
> every time a process calls mmap()
> py1hon: does the kernel know when one has written to a page ?
> the kernel maps those same physical pages to locations in that process's address space by setting up page table entries and mappings and all that crap
> yes
> with help from the CPU
> a nice way :)
> page starts out clean, right?
> yap
> so when the kernel maps that clean page into the userspace's address space, it maps it read only
> REGARDLESS of whether userspace wants read-write access or not
> then, when userspace writes to it, it traps
> SIGBUS, except not, because the kernel sees what's going on
> kernel switches it to read-write, notes the page is now dirty, and continues on like nothing ever happened
> but how does it detect a second write ?
> doesn't need to
> all it needs to know is that the page is now dirty
> and, at some point in the future, has to be written
> oh right, if you want to write it, lock it again first ?
> make it read only, write it, mark it clean
> userspace writes again, cycle repeats
> clean == RO map ?
> yea
> or a bitmap somewhere ?
> bit in struct page
> however: that was all a lie
> that's how it works conceptually
> ok who has cut corners, and why :(
> but, dirtying pages and crap like that is important enough that CPUs actually track this stuff for you without you having to map pages read only and having userspace go to the trouble of faulting
> the end result is, the kernel can just check the page table entries that it sets up for the CPU to see if userspace has written to them
> ok we can detect this differently, sounds nice, and it blows up in our face how ?
> (this actually isn't 100% my area, i would be reading docs to check the details on this stuff if i cared)
> ok so, nice thing about this is
> no more traps :)
> pages are never read only! userspace can always write to them!
> yes, no more traps!
> (aka, minor faults)
> annoying side effect:
> if pages are never read only...
> userspace can KEEP WRITING TO THEM WHILE THEY'RE BEING WRITTEN
> remember
> if we're not bouncing
> and we don't want to bounce if we don't have to, so usually we don't
> .... so why are they not marked RO ?
> the buffer we're writing out is literally the same buffer that is mapped into userspace
> because if they were marked RO, userspace would trap and things would be slow
> now
> py1hon: i'd argue it would NOT be slow
> oh i'll get to that
> either things are going bad: we're racing on the write
> this is where it starts to get hilarious
> no not yet
> and slow is "ok"
> this is how it worked for many years
> and it worked fine
> reason is
> or things are not racing and it should be fine
> if there's a write in flight
> and userspace is scribbling over that buffer with new data
> who cares? we're going to overwrite that write with the new version later
> it got marked dirty again
> there is of course one aspect of mmap'd files: write order is not guaranteed
> if userspace cares about _which specific version_ gets written, they need to stop scribbling over stuff and call fsync()
> if one looks at this api
> no really, trust me, this actually does work completely fine
> no data integrity is broken
> py1hon: with fsync it works too yes
> but if the app refuses to wait: not a clue what version you will get
> yes but that's fine!
> if the app isn't doing an fsync, it CANNOT care
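
Seen from userspace, the mmap picture above boils down to something like this minimal MAP_SHARED example ("testfile" is just a scratch file invented for the demo): the kernel notices which pages get dirtied through the mapping, but no particular version is guaranteed to be on disk until msync()/fsync() is called:

```c
/* Minimal MAP_SHARED example: dirty a page through a mapping, then msync(). */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT, 0644);

	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;

	char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	/* Dirty the page: the first store may fault so the kernel can note the
	 * page is dirty (or the CPU tracks it); later stores just go through. */
	strcpy(map, "version 1");
	strcpy(map, "version 2");

	/* Until this point, either version (or neither) may be on disk.
	 * Only once msync() returns do we know "version 2" is persistent.    */
	if (msync(map, 4096, MS_SYNC) < 0)
		return 1;

	munmap(map, 4096);
	close(fd);
	return 0;
}
```
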
> PS: mind if I publish this conversation ?
> :)
> go for it
> very informative
> Might rewrite it later as a doc :p
> i want to emphasize that this approach REALLY IS COMPLETELY OK
> that is
> here's the hilarious part
> UNTIL THE FILESYSTEM STARTS CHECKSUMMING DATA
> Qantourisc ponders
> that is literally the ONLY THING throwing a giant monkey wrench into this approach
> and it's fucking stupid, but it's where we are
> py1hon: ... there is a fix I think
> remember: if the filesystem is checksumming the data, then the FILESYSTEM damn well cares about which version gets written because it has to push down the correct checksum
> but I don't know if it's allowed
> but if userspace is scribbling over the data underneath everyone... oh fuck
> don't write until you get an fsync
> no that doesn't work, for a variety of reasons
> performance will however be ... meh
> there really isn't a good solution
> so
> bounce is also a way
> yes, but
> if your solution is bouncing, then you have to preemptively bounce _everything_
> which is stupid
> py1hon: just the dirty pages, no ?
> i mean every single write you do from the page cache you have to bounce
> well you need a copy you can trust
> well
> end of story :/
> there's an alternative
> I missed one ?
> we talked about it earlier
> Qantourisc feels inferior
> you flip the pages to RO in userspace's mapping
> just until the write completes
> but ?
> (other than extra work)
> yeah
> should be fine, right? i mean, we're not writing that many pages at once at any given time
> writes are fast because most devices cache writes
> what could go wrong?
> write order, disks lie
> power loss
> no
> we're only talking about flipping pages to RO in userspace for the duration of the write to avoid having to bounce them
> nothing else changes
> if app does fsync, we still issue a cache flush
> looks fine, did I miss anything ?
> devices USUALLY haven't lied about cache flushes in the past 10-20 years because people get very angry if they do
> py1hon: PRO users get angry
> desktop users ... maybe
> i know but enterprise storage people are fucking morons
> trust me
> you don't even want to know
> py1hon: i've seen my own morons
> i know
> they had 2 SANs
> and they're morons too
> but dear god fuck enterprise storage
> they were syncing 2 RAIDs between 2 SANs
> one was a RAID5 and the other a RAID10
> so, stable pages:
> and I was like ... guys ... this stack probably doesn't report writes as complete until all the RAIDs have hit the platter
> ... downloaded the SAN's docs, and it was true
> a year or so back, they did that for stable pages: flipping them RO in the userspace mappings
> SAN admin: ooow ... euh ... dang
> btrfs needs stable pages for checksums
> other stuff in the io stack would like stable pages
> it was regarded as generally a good idea
> stable page == page you can trust userspace will not modify (without you knowing)
> so this was pushed out
> yes
> trouble is, after it was released
> someone came up with a benchmark that got like 200x slower
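
That kind of regression can be observed from userspace with a sketch along these lines ("testfile" is made up; sync_file_range() just asks the kernel to start writeback without waiting for it). On a filesystem that enforces stable pages, some of the stores land while the page is under writeback and block until the I/O completes; elsewhere they stay at memory speed. The exact numbers obviously depend on the filesystem and the device:

```c
/* Time stores to an mmap'ed page while writeback of that page is in flight. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t) ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT, 0644);

	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;

	volatile char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				  MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	uint64_t worst = 0;

	for (int i = 0; i < 100000; i++) {
		/* periodically kick off writeback of the page, without waiting */
		if (i % 1000 == 0)
			sync_file_range(fd, 0, 4096, SYNC_FILE_RANGE_WRITE);

		uint64_t t = now_ns();
		map[0] = (char) i;	/* "just changing something in memory" */
		uint64_t dt = now_ns() - t;

		if (dt > worst)
			worst = dt;
	}
	printf("worst single store: %llu ns\n", (unsigned long long) worst);
	return 0;
}
```
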
> py1hon: writing to the same page ?
> yeah, just because of userspace having to wait if they tried to write to a page that was being written out
> The correct reply would then be "We are sorry, we cannot guarantee this without messing up your data, or we bounce"
> and if you think about it
> userspace having to block on IO
> when all it's trying to do is change something in memory
> that is kinda stupid
> and adding latency like that actually is a shitty thing to do because then someone is going to have to debug a horrific latency bug years later and be really pissed when they figure out what it was
> Well, there are 2 options here
> wait, no option
> bouncing
> today, it's pretty much just bouncing
> now, what we COULD do, in THEORY
> why is bouncing sooo bad ?
> it's not that bouncing is bad, exactly
> it's that bouncing EVERY SINGLE FUCKING WRITE when 99% of them won't be modified is retarded
> py1hon: you could RO-lock, and when an unlock is requested, bounce
> i was actually about to bring that up
> that is what you'd like to be able to do
> however
> if you tell the VM developers that this is what you want to do
> the guys who work on the page cache code and all that crap
> Qantourisc says fuck's sake, another but ?
> i'm pretty sure they just run away screaming
> py1hon: why ?
> apparently (and this part of the code I know fuck all about) swapping out the page that's mapped into userspace like this would be a giant pain in the ass
> because it's not easy ?
> yeah
> like i said, i don't know that code
> i would actually imagine with all the stuff that's been going on for page migration it ought to be doable these days
> but
> i am not a VM developer
> py1hon: btw how much room do you need to bounce ?
> as in MBs
> it's not the amount of memory you need, you only need it for the writes that are in flight
> it's the overhead of all the memcpys and the additional cache pressure that sucks
> yea
> that or you disable checksums :D
> yep
> maybe this should be an ioctl option one day ?
> preferably yesterday :D
> eventually
> my priority is getting all the bugs fixed
> so if the program doesn't care about checksums, and wants its 200x speed back at zero bounce cost
> it can have it
> py1hon: this would be a general kernel feature :)
> which ... right, you would need to add :p
> and realistically it's not THAT big of a performance impact
> py1hon: btw i'm still kinda set on writing code to stress-test the lot :p
> I really don't trust IO stacks
> many disks in the past have lied through their teeth
> and so has the kernel
> xfstests actually does have quite a few good tests for torture testing fsync
> sweet
> but i'm talking while yanking power :D
> and nothing fundamental has changed w.r.t. fsync since the early days of bcache
> so that stuff has all been tested for a looong time
> and bcache ain't perfect if you really hammer on it, but i know about those bugs and they're fixed in bcachefs :p
> And it's not just about bcache kernel code
> it's also about disks
> yeah i don't want to know about whatever you find there :P
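
For the curious, the "yank the power cord" fsync tester mentioned at the start and the end of the conversation could look something like the sketch below (just the idea, not an existing tool; the file name, record layout and toy checksum are all made up):

```c
/* Append checksummed records, fsync each one, and only then report it durable. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct record {
	uint64_t seq;
	uint64_t csum;		/* toy checksum derived from seq */
	char	 payload[496];	/* pad the record to 512 bytes */
};

int main(void)
{
	int fd = open("fsync-test.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd < 0)
		return 1;

	/* runs until the power gets yanked */
	for (uint64_t seq = 0; ; seq++) {
		struct record r = {
			.seq  = seq,
			.csum = seq * 0x9e3779b97f4a7c15ull,
		};
		memset(r.payload, (int) (seq & 0xff), sizeof(r.payload));

		if (write(fd, &r, sizeof(r)) != (ssize_t) sizeof(r))
			return 1;
		if (fsync(fd))
			return 1;

		/* only claim durability after fsync() has returned */
		printf("%llu\n", (unsigned long long) seq);
		fflush(stdout);
	}
}
```

A second program, run after the power cut, would read the file back and check that every sequence number printed before the cut is present with a matching checksum; if one is missing or mangled, something in the stack lied about the flush.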