Matt Dillon had what he described as a “brainfart for threaded VFS and data passing between threads” based on Alan Cox’s FreeBSD 5 PIPE work that he has been importing, leading to a new concept he calls “XIO”. It’s a long ramble, so I’m reprinting it wholesale:
“The recent PIPE work adapted from Alan Cox’s work in FreeBSD-5 has really lit a fire under my seat. It’s amazing how such a simple concept can change the world as we know it :-)
Originally the writer side of the PIPE code was mapping the supplied user data into KVM and then signalling the reader side. The reader side would then copy the data out of KVM.
The concept Alan codified is quite different: Instead of having the originator map the data into KVM, simply supply an array of vm_page_t’s to the target and let the target map the data into KVM. In the case of the PIPE code, Alan used the SF_BUF API (which was originally developed by David Greenman for the sendfile() implementation) on the target side to handle the KVA mappings.
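(Editor’s aside: to make the target-side mapping concrete, here is a rough sketch of what the consumer end of such a pipe transfer could look like. It is illustrative only, not the actual pipe code, and it assumes the FreeBSD-5-era sf_buf_alloc()/sf_buf_kva()/sf_buf_free() interface, whose exact signatures and headers vary between versions.)

```c
/*
 * Illustrative sketch, not the actual pipe commit: the writer hands over
 * an array of vm_page_t's and only the reader (the target) maps them into
 * KVM, one page at a time, through the sf_buf layer.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/sf_buf.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

static void
pipe_target_copyout(vm_page_t *pages, int npages, char *dst, size_t len)
{
	size_t resid = len;
	int i;

	for (i = 0; i < npages && resid > 0; ++i) {
		struct sf_buf *sf;
		size_t chunk;

		/* Map on the consumer side; the mapping may already be cached. */
		sf = sf_buf_alloc(pages[i], 0);
		chunk = (resid < PAGE_SIZE) ? resid : PAGE_SIZE;
		bcopy((void *)sf_buf_kva(sf), dst, chunk);
		sf_buf_free(sf);
		dst += chunk;
		resid -= chunk;
	}
}
```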
Seems simple, eh? But Alan got an unexpectedly huge boost in performance on IA32 when he did this. The performance boost turned out to be due to a few factors:
- Avoiding the KVM mappings and the related kernel_object manipulations required for those mappings saves a lot of cpu cycles when all you want is a quick mapping into KVM.
- On SMP, KVM mappings generated IPIs to all cpus in order to invalidate the TLB. By avoiding KVM mappings all of those IPIs go away.
- When the target maps the page, it can often get away with doing a simple localized cpu_invlpg(). Most targets will NEVER HAVE TO SEND IPIs TO OTHER CPUS. The current SF_BUF implementation still does send IPIs in the uncached case, but I had an idea to fix that and Alan agrees that it is sound… and that is to store a cpumask in the sf_buf so a user of the sf_buf only invalidates the cached KVM mapping if it has not yet been accessed on that particular cpu. [A sketch of this idea follows this list.]
- For PIPEs, the fact that SF_BUF’s cached their KVM mappings reduced the mapping overhead almost to zero.
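(Editor’s aside: the cpumask trick is still only a proposal, but a hypothetical sketch might look like the following. Apart from cpu_invlpg() and the usual DragonFly per-cpu accessors, the struct layout and helper name are invented for illustration, and locking details are glossed over.)

```c
/*
 * Hypothetical sketch of the cpumask idea above -- a proposal, not the
 * current sf_buf code.  Each sf_buf remembers which cpus have already
 * seen its cached KVM mapping; the mapping is invalidated locally the
 * first time a given cpu uses it, and no IPIs are ever broadcast.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/globaldata.h>
#include <machine/atomic.h>
#include <machine/cpufunc.h>

struct sf_buf_sketch {
	vm_offset_t	kva;		/* cached kernel virtual address */
	struct vm_page	*m;		/* page backing the mapping */
	cpumask_t	cpumask;	/* cpus whose TLB already has the entry */
	/* ... free-list linkage, refcount, hash chain ... */
};

static void
sf_buf_use_mapping(struct sf_buf_sketch *sf)
{
	cpumask_t mask = (cpumask_t)1 << mycpu->gd_cpuid;

	if ((sf->cpumask & mask) == 0) {
		cpu_invlpg((void *)sf->kva);	/* local invalidation only */
		atomic_set_int(&sf->cpumask, mask);
	}
	/* other cpus invalidate lazily, on their own first use */
}
```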
Now when I heard about this huge performance increase I of course immediately decided that DragonFly needed this feature too, and so we now have it for DFly pipes.
Light Bulb goes off in head
But it also got me thinking about a number of other sticky issues we face, especially our desire to thread major subsystems (such as Jeff’s threading of the network stack and my desire to thread VFS), how to efficiently pass data between threads, and how to efficiently pass data down through the I/O subsystem.
Until now, I think everyone here and in FreeBSD land was stuck on the concept of the originator mapping the data into KVM instead of the target for most things. But Alan’s work has changed all that.
This idea of using SF_BUF’s and making the target responsible for mapping the data has changed everything. Consider what this can be used for:
- For threaded VFS we can change the UIO API to a new API (I’ll call it XIO) which passes an array of vm_page_t’s instead of a user process pointer and userspace buffer pointer.
So ‘XIO’ would basically be our implementation of target-side mappings with SF_BUF capabilities. [A rough sketch of what such an API might look like appears at the end of this post.]
- We can do away with KVM mappings in the buffer cache for the most prevalent buffers we cache… those representing file data blocks. We still need them for meta-data, and a few other circumstances, but the KVM load on the system from the buffer cache would drop by something like 90%.
- We can use the new XIO interface for all block data references from userland and get rid of the whole UIO_USERSPACE / UIO_SYSSPACE mess. (I’m gunning to get rid of UIO entirely, in fact.)
- We can use the new XIO interface for the entire I/O path all the way down to busdma, yet still retain the option to map the data if/when we need to. I never liked the BIO code in FreeBSD-5; this new XIO concept is far superior and will solve the problem neatly in DragonFly.
- We can eventually use XIO and SF_BUF’s to codify copy-on-write at the vm_page_t level and no longer stall memory modifications to I/O buffers during I/O writes.
- I will be able to use XIO for our message passing IPC (our CAPS code), making it much, much faster than it currently is. I may do that as a second step to prove out the first step (which is for me to create the XIO API).
- Once we have vm_page_t copy-on-write we can recode zero-copy TCP to use XIO, and it won’t be a hack any more.
- XIO fits perfectly into the eventual pie-in-the-sky goal of implementing SSI/Clustering, because it means we can pass data references (vm_page_t equivalents) between machines instead of passing the data itself, and only actually copy the data across on the final target. For example, if on an SSI system you were to do ‘cp file1 file2’, and both file1 and file2 are on the same filesystem, the actual *data* transfer might only occur on the machine housing the physical filesystem and not on the machine doing the ‘cp’. Not one byte. Can you imagine how fast that would be?
And many other things. XIO is the nutcracker, and the nut is virtually all the remaining big-ticket items we need to cover for DragonFly.
This is very exciting to me.”
Heheh.
So this is kinda like if the Wright brothers woke up one morning with a vague idea of how to build a Saturn V?
Cool, at any rate.
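For the curious, here is a rough guess at what an XIO descriptor along the lines Matt describes could look like: little more than a page array plus offset/length bookkeeping, with the target doing any KVM mapping it needs through sf_buf, exactly as in the pipe sketch above. Everything below is speculative; the names are illustrative, not an actual DragonFly interface (yet), and the same sf_buf caveats apply.

```c
/*
 * Speculative sketch of an "XIO" descriptor and its target-side copy
 * routine.  The originator wires the user pages and fills in the array;
 * no KVM mapping is made until the target actually needs one.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/sf_buf.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

#define XIO_INTERNAL_PAGES	8	/* covers the common small-I/O case */

struct xio {
	vm_page_t	xio_pages[XIO_INTERNAL_PAGES];
	int		xio_npages;	/* valid entries in xio_pages[] */
	int		xio_offset;	/* byte offset into the first page */
	int		xio_bytes;	/* total bytes described */
	int		xio_error;
};

/*
 * Copy data out of an xio into a kernel buffer.  Only here, on the
 * target side, are pages mapped -- one at a time, through sf_buf.
 */
static int
xio_copy_xtok(struct xio *xio, void *kptr, size_t bytes)
{
	char *dst = kptr;
	size_t offset = xio->xio_offset;
	size_t resid = bytes;
	int i;

	if (bytes > (size_t)xio->xio_bytes)
		return (EINVAL);
	for (i = 0; i < xio->xio_npages && resid > 0; ++i) {
		struct sf_buf *sf = sf_buf_alloc(xio->xio_pages[i], 0);
		size_t chunk = PAGE_SIZE - offset;

		if (chunk > resid)
			chunk = resid;
		bcopy((char *)sf_buf_kva(sf) + offset, dst, chunk);
		sf_buf_free(sf);
		dst += chunk;
		resid -= chunk;
		offset = 0;	/* only the first page is partially skipped */
	}
	return (0);
}
```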