[mips-tls] A couple of potential changes to the MIPS TLS ABI

Wed Feb 9 16:21:26 UTC 2005

On Wed, Feb 09, 2005 at 01:11:50PM +0000, Nigel Stephens wrote:
> Daniel Jacobowitz wrote:
> > Note that I haven't been doing thread benchmarking, but the
> > performance overhead from emulating rdhwr has not been significant
> > in casual testing.  I'm not weeping for the lost speed either way.
> 
> One reason why MIPS is concerned about using rdhwr is that it may
> condemn the whole MIPS architecture to poor multi-threading
> performance relative to other architectures for some number of
> years. If the architecture gains a poor reputation in this area, then
> the mud could stick.
> 
> As part of our own experiments Maciej implemented a "fast path" rdhwr
> emulation, which he promises he will post to this list today. It has a
> typical emulation time of between 30 and 60 cycles, depending on the
> CPU, and assuming a fixed destination register for rdhwr (e.g. only
> rdhwr $2,$5).  Not too bad, but not brilliant either if thread pointer
> access time turns out to be critical to the performance of some
> threaded applications.

Can you compare this to the normal cost of an emulated instruction? 
I'm not sure if I've posted the rdhwr emulation patch anywhere; I know
Ralf has a copy.  I'm not thrilled about hardcoding the target
register but if that's what ya gotta do...
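
To make the cost of not hardcoding the target register a bit more
concrete, here is a minimal C sketch of the two decode strategies. The
field layout follows the MIPS32 Release 2 RDHWR encoding; the helper
names are made up for illustration, this is not Maciej's patch or the
kernel's code, and rdhwr $2,$5 is treated as the fixed form only
because it is the example quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MIPS32r2 RDHWR: SPECIAL3 opcode (0x1f), rs = 0, rt, rd, sa = 0, funct 0x3b. */
#define RDHWR_BASE 0x7c00003bu
#define RDHWR_MASK 0xffe007ffu  /* opcode, rs, sa and funct fields */

/* Build the instruction word for "rdhwr $rt,$rd". */
static uint32_t rdhwr_encode(unsigned rt, unsigned rd)
{
    assert(rt < 32 && rd < 32);
    return RDHWR_BASE | ((uint32_t)rt << 16) | ((uint32_t)rd << 11);
}

/* General path: check the fixed fields, then extract both register numbers.
 * A real emulator would still have to dispatch on rt to store the result. */
static bool rdhwr_decode(uint32_t insn, unsigned *rt, unsigned *rd)
{
    if ((insn & RDHWR_MASK) != RDHWR_BASE)
        return false;
    *rt = (insn >> 16) & 0x1f;
    *rd = (insn >> 11) & 0x1f;
    return true;
}

/* Fast path: a single compare against the one agreed encoding.
 * ASSUMPTION: the fixed form is "rdhwr $2,$5", as in the quoted example. */
static bool rdhwr_is_fixed_form(uint32_t insn)
{
    return insn == 0x7c02283bu;
}
```

With a fixed form, the trap handler only needs rdhwr_is_fixed_form()
and can write the result straight into $2; anything else falls back to
the slower general path.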

> But I don't yet understand how important the thread pointer access
> time is for a typical threaded program. Can we get a better idea of
> the dynamic frequency of thread register loads? Does anyone know some
> suitable application programs or benchmarks which exercise the TLS
> mechanism, and from which we could extract some statistics?

TLS is a user level feature, so it's very difficult to predict access
patterns.  Glibc uses it for:
  - errno
  - locale data
  - pthread_self
There are roughly twenty TLS traps in the startup of a typical
single-threaded application.  I do not know anything about thread
benchmarking so I can't give you much there.

> With regard to TLS on other architectures, people might like to read
> this email conversation
> http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-January/026468.html
> and continued here
> http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-February/thread.html#26620.
> Ah, I see that since I last looked, you've been contributing to that
> discussion already Daniel.

I wrote most of that except for Nico's reimplementation.

> > Want to talk to me more about using a parked TLB entry?  I spoke with
> > someone (Ralf or Jun, probably) about the idea originally; I was told
> > it wasn't possible on MIPS SMP implementations to make this work, or
> > that there was some other reason why it was undesirable.  If that's not
> > accurate then we could use a reserved memory location.  However, that
> > makes the TLS model dependent on details of Linux's memory mapping -
> > not good for a hopefully generally useful ABI.
> 
> While it's true that all threads within the same address space must
> share a single page table, which would prevent such a trick being
> implemented using the normal TLB refill mechanism, I think that it
> might indeed be possible to implement this by using a per-thread wired
> (parked) TLB entry, updated on a context switch. This would map a
> magic page containing a copy of the thread pointer to a fixed virtual
> address: if in the bottom 32K then it would need just one instruction
> (e.g. lw $v0,0x1000($0)).
> 
> A wired TLB entry like this should be fairly straightforward to
> implement on even a simple RTOS, since it doesn't require a full-blown
> VM system.  I don't see why it wouldn't work on an SMP system too,
> since each CPU has its own TLB, and therefore a unique wired TLB
> entry per active thread. 
> 
> To maintain single-threaded performance we'd perhaps want to increment
> the Wired count only when running real TLS code, so as not to reduce
> the number of TLB entries available for random replacement for the
> majority single-threaded applications.
> 
> I've spoken to Ralf about this idea, and though he's not thrilled by
> it, he hasn't (yet) said that it couldn't work.

You can't exempt single-threaded applications this way - see above about
errno and locales.  TLS is either used by glibc or not.  When built for
NPTL, it's used.

I'm pretty dubious about the tradeoff of wiring a TLB entry for this.
Benchmarks will not accurately show the cost of having fewer entries available.
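
For reference, the one-instruction claim quoted above rests on lw
taking a signed 16-bit offset: with $0 as the base register,
non-negative offsets reach addresses 0 through 0x7fff and nothing
higher in user space. A small C sketch of that check and of the
resulting instruction word follows; the function names are
hypothetical, and 0x1000 is simply the address from Nigel's example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is reachable as a non-negative 16-bit offset from $0. */
static bool fits_zero_based_load(uint32_t addr)
{
    return addr <= 0x7fffu;
}

/* Encode "lw $rt, offset($base)": opcode 0x23, base, rt, 16-bit immediate. */
static uint32_t lw_encode(unsigned rt, unsigned base, int16_t offset)
{
    assert(rt < 32 && base < 32);
    return (0x23u << 26) | ((uint32_t)base << 21) | ((uint32_t)rt << 16)
           | (uint16_t)offset;
}
```

Under these assumptions lw_encode(2, 0, 0x1000) gives the word for
lw $v0,0x1000($0), while a magic page at 0x8000 or above would need a
lui first and so a second instruction.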

> But a notable downside of this technique is that it won't work on the
> new generation of explicitly multi-threaded CPUs, which could be
> executing instructions from many threads within the same address space
> "simultaneously", using a single, shared TLB.  On such CPUs the
> memory-mapped thread pointer would require that the kernel not map the
> wired page, and then emulate a faulting thread pointer load in the
> page fault handler - rather defeating the point of having a CPU
> designed to accelerate multi-threading.

Well, that's a bit of a full stop!  Now we've got _three_ TLS
mechanisms to contend with.  That's too many.

> How do people feel about supporting two different TLS implementations
> on top of the existing MIPS ABIs - one optimised for legacy MIPS CPUs,
> and another for multi-threaded CPUs - similar to what ARM Linux seems
> to be considering for SMP?

I think it's a horrible, horrible idea.  Complexity has a cost too.

> > Your point about the performance overhead of reading and decoding an
> > instruction is worth keeping in mind, but hasn't been a big
> > slowdown.
> 
> Measured how?

As I said, only casually.

From this discussion I am still inclined to stay with rdhwr.  30-60
cycles is not very long.

-- 
Daniel Jacobowitz
CodeSourcery, LLC