[mips-tls] A couple of potential changes to the MIPS TLS ABI

Wed Feb 9 13:11:50 UTC 2005

Daniel Jacobowitz wrote:
> Note that I haven't been doing thread benchamarking, but the
> performance overhead from emulating rdhwr has not been significant
> in casual testing.  I'm not weeping for the lost speed either way.

One reason why MIPS is concerned about using rdhwr is that it may
condemn the whole MIPS architecture to poor multi-threading
performance relative to other architectures for some number of
years. If the architecture gains a poor reputation in this area, then
the mud could stick.

As part of our own experiments Maciej implemented a "fast path" rdhwr
emulation, which he promises he will post to this list today. It has a
typical emulation time of between 30 and 60 cycles, depending on the
CPU, and assuming a fixed destination register for rdhwr (e.g. only
rdhwr $2,$5).  Not too bad, but not brilliant either if thread pointer
access time turns out to be critical to the performance of some
threaded applications.

But I don't yet understand how important the thread pointer access
time is for a typical threaded program. Can we get a better idea of
the dynamic frequency of thread register loads? Does anyone know some
suitable application programs or benchmarks which exercise the TLS
mechanism, and from which we could extract some statistics?

With regard to TLS on other architectures, people might like to read
this email conversation
http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-January/026468.html
and continued here
http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-February/thread.html#26620.
Ah, I see that since I last looked, you've been contributing to that
discussion already Daniel.

For other readers, here's a summary. ARM have developed a new ABI
(EABI) which adds a user-accessible coprocessor thread register, but
that register is only available on ARM architecture v6 and above
(shades of rdhwr!), and requires trap-based kernel emulation on older
CPUs.

Since the majority of ARM CPUs are pre-v6, the ARM Linux developers
aren't happy about the cost of this emulation and are discussing
alternative mechanisms which would work better on old ARM CPUs using
the old ABI. The favorite seems to be mapping a read-only page which
holds a copy of the thread pointer at a fixed virtual address. Since
the ARM MMU architecture apparently cannot handle this on SMP
configurations it looks like ARM Linux is considering splitting into
two TLS "universes", depending on the compiler/linker options used to
build the application and libraries:

1) Old ABI and memory-mapped thread pointer for pre ARM v6
   architectures, which can't run SMP anyway. For compatibility, old
   memory-mapped code running on ARM v6 SMP systems will generate a
   page fault when accessing the magic page, and the thread pointer
   load will be emulated by the kernel.

2) New EABI only for SMP ARM v6 and above, using the new user
   coprocessor reg. The new EABI can't run well on old CPUs anyway,
   because it uses other new instructions which aren't available on
   pre-ARM v6 (atomic instructions). Again, for compatibility
   purposes, access to the new thread coprocessor register can be
   trapped and emulated by the kernel.

So ARM Linux may before too long have well-performing TLS
implementations for all of its Linux-capable CPUs.

> Want to talk to me more about using a parked TLB entry?  I spoke with
> someone (Ralf or Jun, probably) about the idea originally; I was told
> it wasn't possible on MIPS SMP implementations to make this work, or
> that there was some other reason why it was undesirable.  If that's not
> accurate then we could use a reserved memory location.  However, that
> makes the TLS model dependent on details of Linux's memory mapping -
> not good for a hopefully generally useful ABI.

While it's true that all threads within the same address space must
share a single page table, which would prevent such a trick being
implemented using the normal TLB refill mechanism, I think that it
might indeed be possible to implement this by using a per-thread wired
(parked) TLB entry, updated on a context switch. This would map a
magic page containing a copy of the thread pointer to a fixed virtual
address: if in the bottom 32K then it would need just one instruction
(e.g. lw $v0,0x1000($0)).

A wired TLB entry like this should be fairly straightforward to
implement on even a simple RTOS, since it doesn't require a full-blown
VM system.  I don't see why it wouldn't work on an SMP system too,
since each CPU has its own TLB, and therefore a unique wired TLB
entry per active thread. 

To maintain single-threaded performance we'd perhaps want to increment
the Wired count only when running real TLS code, so as not to reduce
the number of TLB entries available for random replacement for the
majority single-threaded applications.

I've spoken to Ralf about this idea, and though he's not thrilled by
it, he hasn't (yet) said that it couldn't work.

But a notable downside of this technique is that it won't work on the
new generation of explicitly multi-threaded CPUs, which could be
executing instructions from many threads within the same address space
"simultaneously", using a single, shared TLB.  On such CPUs the
memory-mapped thread pointer would require that the kernel not map the
wired page, and then emulate a faulting thread pointer load in the
page fault handler - rather defeating the point of having a CPU
designed to accelerate multi-threading.

For multi-threaded CPUs the ultimate solution is the new ABI which
reserves a GPR for the thread register. But in the short-term, before
that new ABI is fully supported, a rdhwr-based implementation could be
argued for.

How do people feel about supporting two different TLS implementations
on top of the existing MIPS ABIs - one optimised for legacy MIPS CPUs,
and another for multi-threaded CPUs - similar to what ARM Linux seems
to be considering for SMP?

In each case the kernel could provide binary compatibility, at reduced
performance, for the incompatible thread access mechanism
i.e. emulating the a memory-mapped load, or the rdhwr, as
appropriate.  It might also be sensible to provide two variants of
libc.so (per ABI), compiled for the different TLS mechanisms, much as
the x86 Linux dynamic linker automatically loads either
/usr/lib/i486/libc.so or /usr/lib/i686/libc.so depending on the CPU
type.

> All reads from userspace are always protected in Linux; 

But not if you are trying to accelerate the emulation code by handling
it at exception level, so as to avoid a full register save/restore and
kernel context setup.  But Maciej's patch does include a fix to the
nested TLBL exception handler to make this work.

> Your point about the performance overhead of reading and decoding an
> instruction is worth keeping in mind, but hasn't been a big
> slowdown.

Measured how?

Nigel