[mips-tls] A couple of potential changes to the MIPS TLS ABI
Daniel Jacobowitz
dan at codesourcery.com
Wed Feb 9 16:21:26 UTC 2005
On Wed, Feb 09, 2005 at 01:11:50PM +0000, Nigel Stephens wrote:
> Daniel Jacobowitz wrote:
> > Note that I haven't been doing thread benchmarking, but the
> > performance overhead from emulating rdhwr has not been significant
> > in casual testing. I'm not weeping for the lost speed either way.
>
> One reason why MIPS is concerned about using rdhwr is that it may
> condemn the whole MIPS architecture to poor multi-threading
> performance relative to other architectures for some number of
> years. If the architecture gains a poor reputation in this area, then
> the mud could stick.
>
> As part of our own experiments Maciej implemented a "fast path" rdhwr
> emulation, which he promises he will post to this list today. It has a
> typical emulation time of between 30 and 60 cycles, depending on the
> CPU, and assuming a fixed destination register for rdhwr (e.g. only
> rdhwr $2,$5). Not too bad, but not brilliant either if thread pointer
> access time turns out to be critical to the performance of some
> threaded applications.
Can you compare this to the normal cost of an emulated instruction?
I'm not sure if I've posted the rdhwr emulation patch anywhere; I know
Ralf has a copy. I'm not thrilled about hardcoding the target
register but if that's what ya gotta do...
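
To illustrate why a hardcoded target register makes the fast path
cheap, here is a minimal Python model of the decode step. It is a
sketch, not the patch discussed in this thread. It assumes the MIPS32
Release 2 RDHWR encoding (SPECIAL3 major opcode, function field 0x3b);
the function names are hypothetical, and the registers in FIXED_WORD
are simply the "rdhwr $2,$5" example quoted above.

```python
# Model of decoding an rdhwr instruction word in an emulation handler.
# Assumption: MIPS32r2 layout, opcode in bits 31..26, rt in bits 20..16,
# rd in bits 15..11, function in bits 5..0. All names are illustrative.

SPECIAL3 = 0x1F    # major opcode
RDHWR_FUNC = 0x3B  # function field


def encode_rdhwr(rt, rd):
    """Build the instruction word for `rdhwr $rt, $rd`."""
    return (SPECIAL3 << 26) | (rt << 16) | (rd << 11) | RDHWR_FUNC


def decode_general(word):
    """General path: check the opcode, then extract (rt, rd)."""
    if (word >> 26) != SPECIAL3 or (word & 0x3F) != RDHWR_FUNC:
        return None
    # Bits 25..21 and 10..6 are treated as must-be-zero in this model.
    if (word >> 21) & 0x1F or (word >> 6) & 0x1F:
        return None
    return (word >> 16) & 0x1F, (word >> 11) & 0x1F


# The one encoding the fast path accepts (registers from the example).
FIXED_WORD = encode_rdhwr(2, 5)


def is_fast_path(word):
    """Fast path: a single compare against one hardcoded word."""
    return word == FIXED_WORD
```

With a fixed destination the handler needs one compare and no field
extraction; any other register pair falls through to the general path.
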
> But I don't yet understand how important the thread pointer access
> time is for a typical threaded program. Can we get a better idea of
> the dynamic frequency of thread register loads? Does anyone know some
> suitable application programs or benchmarks which exercise the TLS
> mechanism, and from which we could extract some statistics?
TLS is a user level feature, so it's very difficult to predict access
patterns. Glibc uses it for:
- errno
- locale data
- pthread_self
There are roughly twenty TLS traps during the startup of a typical
single-threaded application. I do not know anything about thread
benchmarking, so I can't give you much there.
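
As a loose analogy for what per-thread data such as errno means in
practice, the sketch below uses Python's threading.local. This is an
assumption-laden stand-in: it does not go through glibc's TLS
machinery or the thread pointer, and the names (tls, worker, run) are
hypothetical. It only shows the observable property that each thread
reads back its own value.

```python
import threading

# Analogy only: threading.local stands in for a TLS variable like errno.
tls = threading.local()


def worker(value, results, index, barrier):
    tls.errno = value           # each thread writes its own copy
    barrier.wait()              # wait until every thread has written
    results[index] = tls.errno  # read back: still this thread's value


def run(values):
    """Start one thread per value and collect what each thread reads."""
    results = [None] * len(values)
    barrier = threading.Barrier(len(values))
    threads = [
        threading.Thread(target=worker, args=(v, results, i, barrier))
        for i, v in enumerate(values)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The barrier ensures all writes happen before any read, so a shared
variable would return the same value to every thread; the thread-local
one does not.
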
> With regard to TLS on other architectures, people might like to read
> this email conversation
> http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-January/026468.html
> and continued here
> http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-February/thread.html#26620.
> Ah, I see that since I last looked, you've been contributing to that
> discussion already Daniel.
I wrote most of that except for Nico's reimplementation.
> > Want to talk to me more about using a parked TLB entry? I spoke with
> > someone (Ralf or Jun, probably) about the idea originally; I was told
> > it wasn't possible on MIPS SMP implementations to make this work, or
> > that there was some other reason why it was undesirable. If that's not
> > accurate then we could use a reserved memory location. However, that
> > makes the TLS model dependent on details of Linux's memory mapping -
> > not good for a hopefully generally useful ABI.
>
> While it's true that all threads within the same address space must
> share a single page table, which would prevent such a trick being
> implemented using the normal TLB refill mechanism, I think that it
> might indeed be possible to implement this by using a per-thread wired
> (parked) TLB entry, updated on a context switch. This would map a
> magic page containing a copy of the thread pointer to a fixed virtual
> address: if in the bottom 32K then it would need just one instruction
> (e.g. lw $v0,0x1000($0)).
>
> A wired TLB entry like this should be fairly straightforward to
> implement on even a simple RTOS, since it doesn't require a full-blown
> VM system. I don't see why it wouldn't work on an SMP system too,
> since each CPU has its own TLB, and therefore a unique wired TLB
> entry per active thread.
>
> To maintain single-threaded performance we'd perhaps want to increment
> the Wired count only when running real TLS code, so as not to reduce
> the number of TLB entries available for random replacement for the
> majority single-threaded applications.
>
> I've spoken to Ralf about this idea, and though he's not thrilled by
> it, he hasn't (yet) said that it couldn't work.
You can't do this for "only single-threaded code" - see above about
errno and locales. TLS is either used by glibc or not. When built for
NPTL, it's used.
I'm pretty dubious about the tradeoff of wiring a TLB entry for this.
Benchmarks will not accurately show the cost of having fewer TLB
entries available for everything else.
> But a notable downside of this technique is that it won't work on the
> new generation of explicitly multi-threaded CPUs, which could be
> executing instructions from many threads within the same address space
> "simultaneously", using a single, shared TLB. On such CPUs the
> memory-mapped thread pointer would require that the kernel not map the
> wired page, and then emulate a faulting thread pointer load in the
> page fault handler - rather defeating the point of having a CPU
> designed to accelerate multi-threading.
Well, that's a bit of a full stop! Now we've got _three_ TLS
mechanisms to contend with. That's too many.
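
The difference between the SMP case and the shared-TLB case can be
sketched with a toy Python model. Everything here is an assumption
made for illustration: the Tlb class, the function names, and the
thread pointer values are hypothetical, and only the fixed address
0x1000 comes from the quoted example. A real kernel would rewrite a
wired TLB entry, not a dictionary.

```python
MAGIC_VADDR = 0x1000  # fixed virtual address from the quoted example


class Tlb:
    """Toy TLB holding only the wired entry for the magic page."""

    def __init__(self):
        self.wired = {}  # vaddr -> thread pointer stored in mapped page

    def context_switch(self, thread_pointer):
        # The kernel repoints the wired entry at the new thread's page.
        self.wired[MAGIC_VADDR] = thread_pointer

    def load(self, vaddr):
        # Models `lw $v0,0x1000($0)` as seen through this TLB.
        return self.wired[vaddr]


def read_thread_pointers(tlb_for_context, running):
    """Switch each context to its thread, then read the magic address.

    tlb_for_context: {context id: Tlb}, running: {context id: pointer}.
    """
    for ctx, tp in running.items():
        tlb_for_context[ctx].context_switch(tp)
    return {ctx: tlb_for_context[ctx].load(MAGIC_VADDR) for ctx in running}
```

With one Tlb per CPU every context reads its own pointer. If two
hardware threads share a single Tlb object, the second context switch
overwrites the first and both read the same value, which is the
failure described in the quoted paragraph.
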
> How do people feel about supporting two different TLS implementations
> on top of the existing MIPS ABIs - one optimised for legacy MIPS CPUs,
> and another for multi-threaded CPUs - similar to what ARM Linux seems
> to be considering for SMP?
I think it's a horrible, horrible idea. Complexity has a cost too.
> > Your point about the performance overhead of reading and decoding an
> > instruction is worth keeping in mind, but it hasn't been a big
> > slowdown.
>
> Measured how?
As I said, only casually.
From this discussion I am still inclined to stay with rdhwr. 30-60
cycles is not very long.
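
For a rough sense of scale, the figures quoted in this message can be
combined in a few lines of Python. This is back-of-the-envelope
arithmetic, not a measurement: it assumes the roughly twenty startup
traps would each cost the 30-60 cycles quoted for the fast path, and
the 400 MHz clock in the usage note is a hypothetical value chosen
only to convert cycles to time.

```python
def emulation_overhead(traps, cycles_low=30, cycles_high=60):
    """Return (best, worst) total cycles for `traps` emulated rdhwr."""
    return traps * cycles_low, traps * cycles_high


def cycles_to_microseconds(cycles, clock_hz):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_hz * 1e6
```

Under those assumptions, emulation_overhead(20) gives 600 to 1200
cycles for startup, and cycles_to_microseconds(1200, 400_000_000)
puts the worst case at about 3 microseconds.
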
--
Daniel Jacobowitz
CodeSourcery, LLC