From uhler at mips.com  Tue Feb  1 10:10:50 2005
From: uhler at mips.com (Michael Uhler)
Date: Tue, 1 Feb 2005 02:10:50 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050131205639.GK30888@nevyn.them.org>
Message-ID: <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM>

The one area that I'm concerned about is the use of rdhwr to return the
pointer.  There are several reasons why I'm not sure this is the right thing
to do:

- rdhwr is a MIPS32/64 Release 2 instruction.  No existing MIPS I-IV
implementation has this instruction and probably never will.  Even existing
MIPS32/64 Release 2 implementations don't have the internal register to hold
the data.  This means that it will be years before any hardware will support
the feature, and that support depends on an architecture decision (see next
item)

- We have some concerns at the architecture level about using rdhwr for this
purpose rather than using a GPR under the umbrella of an ABI re-work that
some of you are involved in.

- Some preliminary work at MIPS suggests that a tuned handler for syscall is
faster than one for handling rdhwr at the reserved instruction handler.
This means that we're betting on having actual hardware implementations of
rdhwr out there in sufficient volume to make up for the fact that we're
penalizing everybody else by using an RI trap handler vs. a syscall trap
handler.

To me, all of these suggest that we may want to use syscall rather than
rdhwr to get the pointer, at least until we can decide whether to dedicate a
GPR for this purpose.

By the way, sorry for the late response to the original posting.  There was
some confusion within MIPS as to who was going to respond.  I've asked our
UK team to run point on the interaction on TLS from now on, so we'll be more
responsive.

/gmu


---
Michael Uhler, Chief Technology Officer
MIPS Technologies, Inc.   Email: uhler at mips.com
1225 Charleston Road      Voice:  (650)567-5025   FAX:   (650)567-5225
Mountain View, CA 94043   Mobile: (650)868-6870   Admin: (650)567-5085


-----Original Message-----
From: Daniel Jacobowitz [mailto:dan at codesourcery.com] 
Sent: Monday, January 31, 2005 12:57 PM
To: mips-tls at codesourcery.com
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI


Hi folks,

We have the 32-bit port of NPTL basically working now, and I have been
preparing the first patches for submission - which means binutils.  So I
have been going over the ABI in some detail looking for things that we want
to finalize before the patches are integrated.  I found three points...


First, a minor correction: LE was using "ori" where the equivalent LD
sequence used "addiu".  I've updated this on the Wiki.  It was a leftover
from an earlier draft.


Second, the %tpoff operator is currently ambiguous.  When we say %dtpoff, we
are always talking about the offset from the base of this module's DTV entry
to the location of the variable; currently this is always used with %hi and
%lo.  However, %tpoff can be used with %hi and %lo (Local Exec model, refers
to variable offset from thread pointer) or without (Initial Exec model,
refers to GOT slot holding the variable offset).

How about this instead?
  R_MIPS_TLS_TPOFF -> R_MIPS_TLS_GOTTPOFF
  %tpoff(x)($28) -> %gottpoff(x)($28)

This also frees up %tpoff in case we want to use that the way PowerPC uses
@tprel.  foo@tprel@l is the low 16 bits of the offset; foo@tprel is the low
16 bits also, but signals an error if the offset does not fit entirely in 16
bits.  The alternate sequence III of Local Exec could take advantage of
this.
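
As a rough illustration of the %hi/%lo split and of the stricter 16-bit
check described above, here is a minimal C sketch.  The function names
(hi16, lo16, combine_hi_lo, fits_simm16) are made up for this example and
are not part of the ABI or of any tool; the arithmetic is the usual MIPS
convention where the low half is sign-extended when it is added back.

```c
#include <assert.h>
#include <stdint.h>

/* %hi(x): upper 16 bits, rounded so that adding the sign-extended
   low half gives x back. */
uint32_t hi16(int32_t x)
{
    return (((uint32_t)x + 0x8000u) >> 16) & 0xffffu;
}

/* %lo(x): low 16 bits, read as a signed 16-bit immediate. */
int32_t lo16(int32_t x)
{
    int32_t lo = (int32_t)((uint32_t)x & 0xffffu);
    return lo >= 0x8000 ? lo - 0x10000 : lo;
}

/* What a lui/addiu pair computes from the two halves. */
int32_t combine_hi_lo(uint32_t hi, int32_t lo)
{
    assert(hi <= 0xffffu && lo >= -32768 && lo <= 32767);
    return (int32_t)((hi << 16) + (uint32_t)lo);
}

/* The check a strict operator would apply: does the whole offset
   fit in one signed 16-bit immediate? */
int fits_simm16(int32_t x)
{
    return x >= -32768 && x <= 32767;
}
```

An offset for which fits_simm16() is true could be reached with a single
instruction; anything else needs the two-instruction %hi/%lo form.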


Third, the Design Choices section of the specification has this to say:

     *  The compiler is not allowed to schedule the sequences below. 

  The sequences below must appear exactly as written in the code
  generated by the compiler. This restriction is present because we have
  not yet determined what linker optimizations may be possible. In order
  to facilitate adding linker optimizations in the future, without
  recompiling current code, the compiler is restricted from scheduling
  these sequences. 

I'd like to settle this one way or the other before finalizing the spec.

For reference, the possible linker optimizations are:
 General Dynamic -> Initial Exec (whenever linking an exec)
 Local Dynamic -> Local Exec (ditto)
 Initial Exec -> Local Exec (when the symbol is known to live in the
			     executable; can be applied starting at GD too)

The major advantage here is replacing a call to __tls_get_addr with a rdhwr
instruction.  These are, in theory, doable for MIPS.  Here's an example o32
GD -> IE, probably the most important one:

   0x00 lw $25, %call16(__tls_get_addr)($28)   R_MIPS_CALL16   g
   0x04 jalr $25
   0x08 addiu $4, $28, %tlsgd(x)               R_MIPS_TLS_GD   x
   0x0c $gp restore (not mentioned in the TLS ABI)

   0x00 rdhwr $4, $5
   0x04 lw $5, %tpoff(x1)($28)         R_MIPS_TLS_TPOFF        x1
   0x08 addu $4, $4, $5
   0x0c the $gp restore can be nop'd out
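
To make the equivalence of the two sequences concrete, here is a small C
model of what each one computes.  It is only a sketch: the structures
(thread_model, tls_index_model) and the four-entry DTV are invented for
illustration, and the real layout is whatever the ABI document and the
C library define.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread state: a thread pointer plus a DTV holding
   one TLS block address per module. */
struct thread_model {
    uintptr_t tp;
    uintptr_t dtv[4];
};

/* Hypothetical (module, offset) pair that the GD sequence passes to
   __tls_get_addr. */
struct tls_index_model {
    size_t module;
    uintptr_t offset;
};

/* General Dynamic: look up the module's block, then add the offset. */
uintptr_t gd_address(const struct thread_model *t,
                     const struct tls_index_model *ti)
{
    assert(ti->module < 4);
    return t->dtv[ti->module] + ti->offset;
}

/* Initial Exec: thread pointer plus an offset loaded from the GOT. */
uintptr_t ie_address(const struct thread_model *t, intptr_t got_tpoff)
{
    return t->tp + (uintptr_t)got_tpoff;
}
```

When the module's block sits at a fixed distance from the thread pointer,
both functions return the same address, which is what lets the linker
swap the call for the shorter sequence.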

There are a couple of other quirks for this; the only one I can think of
offhand is MIPS-I load delay slots, which would mean neither sequence could
be used as-is.  The immediate disadvantage is that the compiler can not
schedule the sequences.  I don't know what all the tradeoffs are here, I
just know that the compiler implementation would be simpler if we did not
make the sequences fixed and unschedulable. So I'd like to ditch that unless
folks think that
  (A) the linker optimizations are useful
  (B) the linker optimizations are feasible
  (C) someone is likely to implement the linker optimizations

Any opinions?  I see that Alpha does implement the TLS linker relaxations;
on the other hand, Alpha already had a linker relaxation mechanism in place,
and the GNU tools for MIPS don't.

-- 
Daniel Jacobowitz


From dom at mips.com  Tue Feb  1 13:28:12 2005
From: dom at mips.com (Dominic Sweetman)
Date: Tue, 1 Feb 2005 13:28:12 +0000
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050131205639.GK30888@nevyn.them.org>
References: <20050131205639.GK30888@nevyn.them.org>
Message-ID: <16895.33772.807508.339803@doms-laptop.algor.co.uk>


Daniel,

> We have the 32-bit port of NPTL basically working now, and I have been
> preparing the first patches for submission - which means binutils.  So
> I have been going over the ABI in some detail looking for things that
> we want to finalize before the patches are integrated.  I found three
> points...

Thanks, ace work.  I hope we'll be in a position to pick this stuff up
fairly soon and beat on it.

> Second, the %tpoff operator is currently ambiguous.  When we say
> %dtpoff, we are always talking about the offset from the base of this
> module's DTV entry to the location of the variable; currently this is
> always used with %hi and %lo.  However, %tpoff can be used with %hi and
> %lo (Local Exec model, refers to variable offset from thread pointer)
> or without (Initial Exec model, refers to GOT slot holding the
> variable offset).
> 
> How about this instead?
>   R_MIPS_TLS_TPOFF -> R_MIPS_TLS_GOTTPOFF
>   %tpoff(x)($28) -> %gottpoff(x)($28)

Sounds better.

> Third, the Design Choices section of the specification has this to say:
> 
>      *  The compiler is not allowed to schedule the sequences below. 
> 
>   The sequences below must appear exactly as written in the code
>   generated by the compiler. This restriction is present because we have
>   not yet determined what linker optimizations may be possible. In order
>   to facilitate adding linker optimizations in the future, without
>   recompiling current code, the compiler is restricted from scheduling
>   these sequences. 
> 
> I'd like to settle this one way or the other before finalizing the
> spec.

Why not sit on the fence?  See below.

> The major advantage here is replacing a call to __tls_get_addr with a
> rdhwr instruction.  These are, in theory, doable for MIPS.

Mike Uhler responded separately about MIPS Technologies' position on
using rdhwr for this purpose.

For NUBI we plan to reserve a general purpose register for a thread
pointer, making the optimization a bigger win.

> There are a couple of other quirks for this; the only one I can think
> of offhand is MIPS-I load delay slots, which would mean neither
> sequence could be used as-is.

There are relatively few MIPS-I machines out there.  Would it be
unacceptable if the standard NPTL system failed to work on them?

> The immediate disadvantage is that the compiler can not schedule the
> sequences.  I don't know what all the tradeoffs are here, I just
> know that the compiler implementation would be simpler if we did not
> make the sequences fixed and unschedulable.
> 
> So I'd like to ditch that unless folks think that
>   (A) the linker optimizations are useful
>   (B) the linker optimizations are feasible
>   (C) someone is likely to implement the linker optimizations
> 
> Any opinions?  I see that Alpha does implement the TLS linker
> relaxations; on the other hand, Alpha already had a linker relaxation
> mechanism in place, and the GNU tools for MIPS don't.

Why not re-write the spec to say "unless you generate this sequence
exactly like this, you'll probably prevent any future linker
optimization from working" - and then leave it to the compiler
toolchain. 

My opinion is that linker optimizations could be very valuable, given
a cheap way of accessing a thread pointer.  But MIPS Technologies are
very unlikely to do heroic work on the linker for the o32 ABI - but we
do intend to at least try that out for our NUBI ABI.

-- 
Dominic Sweetman, 
MIPS Technologies (UK)
The Fruit Farm, Ely Road, Chittering, CAMBS CB5 9PH, ENGLAND
phone: +44 1223 706205 / fax: +44 1223 706250 / swbrd: +44 1223 706200
http://www.mips.com


From dan at codesourcery.com  Tue Feb  1 19:43:42 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Tue, 1 Feb 2005 14:43:42 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <16895.33772.807508.339803@doms-laptop.algor.co.uk>
References: <20050131205639.GK30888@nevyn.them.org> <16895.33772.807508.339803@doms-laptop.algor.co.uk>
Message-ID: <20050201194340.GV30888@nevyn.them.org>

On Tue, Feb 01, 2005 at 01:28:12PM +0000, Dominic Sweetman wrote:
> > There are a couple of other quirks for this; the only one I can think
> > of offhand is MIPS-I load delay slots, which would mean neither
> > sequence could be used as-is.
> 
> There are relatively few MIPS-I machines out there.  Would it be
> unacceptable if the standard NPTL system failed to work on them?

There are still plenty of non-MIPS-I configurations compiled for
MIPS-I.  I doubt that will change any time soon, so we have to be able
to cope with the load delay slots.

> > The immediate disadvantage is that the compiler can not schedule the
> > sequences.  I don't know what all the tradeoffs are here, I just
> > know that the compiler implementation would be simpler if we did not
> > make the sequences fixed and unschedulable.
> > 
> > So I'd like to ditch that unless folks think that
> >   (A) the linker optimizations are useful
> >   (B) the linker optimizations are feasible
> >   (C) someone is likely to implement the linker optimizations
> > 
> > Any opinions?  I see that Alpha does implement the TLS linker
> > relaxations; on the other hand, Alpha already had a linker relaxation
> > mechanism in place, and the GNU tools for MIPS don't.
> 
> Why not re-write the spec to say "unless you generate this sequence
> exactly like this, you'll probably prevent any future linker
> optimization from working" - and then leave it to the compiler
> toolchain. 
> 
> My opinion is that linker optimizations could be very valuable, given
> a cheap way of accessing a thread pointer.  But MIPS Technologies are
> very unlikely to do heroic work on the linker for the o32 ABI - but we
> do intend to at least try that out for our NUBI ABI.

Fine by me; if no one else responds, I'll do this.

Note that when I make this change, I'm also going to let the compiler
schedule the sequences; whoever implements the linker optimizations can
go back and undo that.

-- 
Daniel Jacobowitz


From dan at codesourcery.com  Tue Feb  1 19:49:19 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Tue, 1 Feb 2005 14:49:19 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM>
References: <20050131205639.GK30888@nevyn.them.org> <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM>
Message-ID: <20050201194918.GW30888@nevyn.them.org>

On Tue, Feb 01, 2005 at 02:10:50AM -0800, Michael Uhler wrote:
> The one area that I'm concerned about is the use of rdhwr to return the
> pointer.  There are several reasons why I'm not sure this is the right thing
> to do:

I'm getting a lot of conflicting feedback about this.

> - rdhwr is a MIPS32/64 Release 2 instruction.  No existing MIPS I-IV
> implementation has this instruction and probably never will.  Even existing
> MIPS32/64 Release 2 implementations don't have the internal register to hold
> the data.  This means that it will be years before any hardware will support
> the feature, and that support depends on an architecture decision (see next
> item)

Compare this to a syscall.  There is no existing implementation which
will implement the syscall efficiently, and _never_ will be.

> - We have some concerns at the architecture level about using rdhwr for this
> purpose rather than using a GPR under the umbrella of an ABI re-work that
> some of you are involved in.

This objection is way too vague for me to respond to.  Also, using
rdhwr does not prevent future use of a GPR, in an ABI that doesn't
exist yet.  Nice thing about read-only state; you can keep it in
multiple places easily.

> - Some preliminary work at MIPS suggests that a tuned handler for syscall is
> faster than one for handling rdhwr at the reserved instruction handler.
> This means that we're betting on having actual hardware implementations of
> rdhwr out there in sufficient volume to make up for the fact that we're
> penalizing everybody else by using an RI trap handler vs. a syscall trap
> handler.

So is this one.  Can you be more specific?  The only substantial
difference in overhead that I am familiar with is the additional
register save/restores; note that this is a substantial _advantage_ for
userland which would otherwise have to save and restore additional
registers.  Keeping the save/restores in the kernel is a win for code
size and complexity.

> To me, all of these suggest that we may want to use syscall rather than
> rdhwr to get the pointer, at least until we can decide whether to dedicate a
> GPR for this purpose.

In any case, I am putting Ralf on the hot seat here.  I'm going to do
whatever he likes, anyway, since it's no good to me if the kernel
doesn't support it :-)

-- 
Daniel Jacobowitz


From mark at codesourcery.com  Tue Feb  1 20:07:45 2005
From: mark at codesourcery.com (Mark Mitchell)
Date: Tue, 01 Feb 2005 12:07:45 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050201194918.GW30888@nevyn.them.org>
References: <20050131205639.GK30888@nevyn.them.org> <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM> <20050201194918.GW30888@nevyn.them.org>
Message-ID: <41FFE191.8090603@codesourcery.com>


Daniel Jacobowitz wrote:
> On Tue, Feb 01, 2005 at 02:10:50AM -0800, Michael Uhler wrote:
> 
>>The one area that I'm concerned about is the use of rdhwr to return the
>>pointer.  There are several reasons why I'm not sure this is the right thing
>>to do:
> 
> 
> I'm getting a lot of conflicting feedback about this.

 From our point of view, we've already got a validated implementation 
using rdhwr.  We'd like to avoid having to rework our code and then 
revalidate.

Realistically, if rdhwr isn't officially blessed, some vendors might 
still use our implementation.  Or, things might just languish.  In other 
words, I'm somewhat afraid that we've missed the technical window to 
debate this particular technical point.

As Dan says, the new MIPS ABI can do better in this regard, as in 
others.  Furthermore, if the kernel adds a syscall that can be used by 
the o32 ABI, then the tools can be updated to work with that too.  I 
think the only immutable aspect of this existing design is that if/once 
our implementation escapes into the wild, then kernels forevermore may 
have to support the rdhwr solution, even if most programs no longer use 
it.  I think that's a relatively small price to pay to get NPTL working 
on MIPS.

-- 
Mark Mitchell
CodeSourcery, LLC
mark at codesourcery.com
(916) 791-8304


From ica2_ts at csv.ica.uni-stuttgart.de  Tue Feb  1 21:58:59 2005
From: ica2_ts at csv.ica.uni-stuttgart.de (Thiemo Seufer)
Date: Tue, 1 Feb 2005 22:58:59 +0100
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM>
References: <20050131205639.GK30888@nevyn.them.org> <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM>
Message-ID: <20050201215859.GM15265@rembrandt.csv.ica.uni-stuttgart.de>

Michael Uhler wrote:
> The one area that I'm concerned about is the use of rdhwr to return the
> pointer.  There are several reasons why I'm not sure this is the right thing
> to do:
> 
> - rdhwr is a MIPS32/64 Release 2 instruction.  No existing MIPS I-IV
> implementation has this instruction and probably never will.  Even existing
> MIPS32/64 Release 2 implementations don't have the internal register to hold
> the data.  This means that it will be years before any hardware will support
> the feature, and that support depends on an architecture decision (see next
> item)

Yes. This means we will have a TLS register which is a bit slower than
a regular GPR for MIPS{32,64}R2 and a relatively slow emulated register
for older implementations. If we use a pseudo-syscall instead, we'll
have only the second variant with less performance potential.

> - We have some concerns at the architecture level about using rdhwr for this
> purpose rather than using a GPR under the umbrella of an ABI re-work that
> some of you are involved in.

The ABI re-work surely isn't mutually exclusive to o32 TLS. The main
reason for o32 TLS is to get rid of unmaintained linuxthreads while
maintaining source-level compatibility. It will also be available soon
(the same rationale applies to n32/n64 TLS, of course). The ABI re-work
is much more ambitious.

> - Some preliminary work at MIPS suggests that a tuned handler for syscall is
> faster than one for handling rdhwr at the reserved instruction handler.
> This means that we're betting on having actual hardware implementations of
> rdhwr out there in sufficient volume to make up for the fact that we're
> penalizing everybody else by using an RI trap handler vs. a syscall trap
> handler.

That's surprising, at least for the current Linux implementation. The
basic exception handler is the same, and the syscall path is already
time-critical and loaded with ABI dispatch etc. Adding another path
to it will penalize syscalls further. RI has no critical path yet,
adding the rdhwr emulation should be fast and relatively
straightforward. Extracting the instruction from mapped space could
get slow if it interferes with TLB handling, but I don't think that's
the common case.
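
For a sense of how small the emulation can be, here is a minimal C sketch
of the decode step.  The rdhwr encoding used (SPECIAL3 opcode 0x1f,
function field 0x3b, rt in bits 20..16, rd in bits 15..11) is the
MIPS32/64 Release 2 one; everything else is an assumption for this
example: the function name emulate_rdhwr, passing the TLS hardware
register number as a parameter (no number has been settled here), and the
flat gpr[] array standing in for the saved user registers.

```c
#include <assert.h>
#include <stdint.h>

#define OPCODE_SPECIAL3 0x1fu   /* bits 31..26 */
#define FUNCT_RDHWR     0x3bu   /* bits 5..0 */

/* Returns 1 and writes tls_value to the destination GPR if insn is
   "rdhwr rt, tls_hwreg"; returns 0 if it is anything else, in which
   case the caller carries on with normal reserved-instruction handling. */
int emulate_rdhwr(uint32_t insn, unsigned tls_hwreg, uint32_t tls_value,
                  uint32_t gpr[32])
{
    unsigned opcode = insn >> 26;
    unsigned rs     = (insn >> 21) & 0x1fu;  /* must be zero */
    unsigned rt     = (insn >> 16) & 0x1fu;  /* destination GPR */
    unsigned rd     = (insn >> 11) & 0x1fu;  /* hardware register number */
    unsigned sa     = (insn >> 6) & 0x1fu;   /* must be zero */
    unsigned funct  = insn & 0x3fu;

    assert(tls_hwreg < 32);
    if (opcode != OPCODE_SPECIAL3 || funct != FUNCT_RDHWR)
        return 0;
    if (rs != 0 || sa != 0 || rd != tls_hwreg)
        return 0;
    if (rt != 0)            /* $0 stays hard-wired to zero */
        gpr[rt] = tls_value;
    return 1;
}
```

On success the handler would still have to step the saved EPC past the
emulated instruction before returning to user mode.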

> To me, all of these suggest that we may want to use syscall rather than
> rdhwr to get the pointer,

In any case it should be easy to try both and see what works better,
once the rest of the Userland implementation is working.

> at least until we can decide whether to dedicate a GPR for this purpose.

Is there some data available on how much pressure is to be expected for
the TLS register vs. a normal GPR? I would guess it is significantly
lower, since there are several Linux implementations which use
emulation successfully. In that case, rdhwr might even be beneficial for
the ABI re-work, and free up a GPR which can be used for better things.


Thiemo


From dan at codesourcery.com  Tue Feb  1 22:01:57 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Tue, 1 Feb 2005 17:01:57 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050201215859.GM15265@rembrandt.csv.ica.uni-stuttgart.de>
References: <20050131205639.GK30888@nevyn.them.org> <005b01c50846$4ef8ac60$1bc0a8c0@MIPS.COM> <20050201215859.GM15265@rembrandt.csv.ica.uni-stuttgart.de>
Message-ID: <20050201220156.GX30888@nevyn.them.org>

On Tue, Feb 01, 2005 at 10:58:59PM +0100, Thiemo Seufer wrote:
> In any case it should be easy to try both and see what works better,
> once the rest of the Userland implementation is working.

FYI, the userland implementation is complete.  I hope to start by
posting the binutils patches this week, followed by GCC bits to queue
for 4.1.

-- 
Daniel Jacobowitz


From ica2_ts at csv.ica.uni-stuttgart.de  Tue Feb  1 22:02:11 2005
From: ica2_ts at csv.ica.uni-stuttgart.de (Thiemo Seufer)
Date: Tue, 1 Feb 2005 23:02:11 +0100
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <16895.33772.807508.339803@doms-laptop.algor.co.uk>
References: <20050131205639.GK30888@nevyn.them.org> <16895.33772.807508.339803@doms-laptop.algor.co.uk>
Message-ID: <20050201220211.GN15265@rembrandt.csv.ica.uni-stuttgart.de>

Dominic Sweetman wrote:
[snip]
> > There are a couple of other quirks for this; the only one I can think
> > of offhand is MIPS-I load delay slots, which would mean neither
> > sequence could be used as-is.
> 
> There are relatively few MIPS-I machines out there.  Would it be
> unacceptable if the standard NPTL system failed to work on them?

But there are some, and a generic userland should IMHO support them.
It's a small price to keep o32 working everywhere.


Thiemo


From dan at codesourcery.com  Wed Feb  2 17:59:36 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Wed, 2 Feb 2005 12:59:36 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050201194340.GV30888@nevyn.them.org>
References: <20050131205639.GK30888@nevyn.them.org> <16895.33772.807508.339803@doms-laptop.algor.co.uk> <20050201194340.GV30888@nevyn.them.org>
Message-ID: <20050202175934.GD30888@nevyn.them.org>

On Tue, Feb 01, 2005 at 02:43:42PM -0500, Daniel Jacobowitz wrote:
> On Tue, Feb 01, 2005 at 01:28:12PM +0000, Dominic Sweetman wrote:
> > Why not re-write the spec to say "unless you generate this sequence
> > exactly like this, you'll probably prevent any future linker
> > optimization from working" - and then leave it to the compiler
> > toolchain. 
> > 
> > My opinion is that linker optimizations could be very valuable, given
> > a cheap way of accessing a thread pointer.  But MIPS Technologies are
> > very unlikely to do heroic work on the linker for the o32 ABI - but we
> > do intend to at least try that out for our NUBI ABI.
> 
> Fine by me; if no one else responds, I'll do this.
> 
> Note that when I make this change, I'm also going to let the compiler
> schedule the sequences; whoever implements the linker optimizations can
> go back and undo that.

Unfortunately, I've realized that we can't sit on this fence.  Here's a
somewhat contrived example.

int __thread a;
int *bar (void);
int *foo (int use_tls)
{
  if (use_tls)
    return &a;
  else
    return bar ();
}

foo:
	...
	.set	noreorder
	bnez	$4, .Lfunccall
	 lw	$25, %call16(bar)($28)
	.set	reorder
	lw	$25, %call16(__tls_get_addr)($28)
	addiu	$4, $28, %tlsgd(a)
.Lfunccall:
	jal	$25
	...

i.e. the two calls have been tail merged.  A perfectly valid
optimization, and one that GCC theoretically could perform, though I do
not know offhand if it does.  Note that the valid TLS GD sequence,
exactly as defined by the ABI, occurs here.  Yet we can't remove the
call instruction without breaking the function.
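
The same control flow can be written out in C to show the shared call
site.  This is only a model: tls_path() stands in for __tls_get_addr
applied to a, fallback stands in for whatever bar() returns, and the
function pointer plays the role of $25.

```c
#include <assert.h>
#include <stddef.h>

int __thread a;
int fallback;                     /* hypothetical result of bar() */

typedef int *(*callee_fn)(void);

int *tls_path(void) { return &a; }        /* models __tls_get_addr */
int *bar(void)      { return &fallback; }

int *foo(int use_tls)
{
    /* Each arm only picks a target, as the two "lw $25" loads do. */
    callee_fn target = use_tls ? tls_path : bar;

    /* One call serves both arms, like the merged "jal $25".  Deleting
       it to optimize the TLS arm would also delete the call to bar(). */
    int *result = target();
    assert(result != NULL);
    return result;
}
```

A linker that sees only the instruction pattern cannot tell that this
call also belongs to the non-TLS path.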

The only way to make this work is either to mandate that an
ABI-conforming compiler can not optimize the TLS access sequences, or
to define additional relaxation marker relocations to mark valid
sequences.  Some platforms do one, some do the other.

My preference is to do the latter, which we can postpone until someone
is ready to implement them.

-- 
Daniel Jacobowitz


From uhler at mips.com  Thu Feb  3 01:00:03 2005
From: uhler at mips.com (Michael Uhler)
Date: Wed, 2 Feb 2005 17:00:03 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050201215859.GM15265@rembrandt.csv.ica.uni-stuttgart.de>
Message-ID: <00d301c5098b$b758ee50$1bc0a8c0@MIPS.COM>


Rather than respond to each email individually, I'm including one global
response to everybody's feedback.


>> The one area that I'm concerned about is the use of rdhwr to return 
>> the pointer.  There are several reasons why I'm not sure this is the 
>> right thing to do:

Dan> I'm getting a lot of conflicting feedback about this.

Mark> From our point of view, we've already got a validated implementation
Mark> using rdhwr.  We'd like to avoid having to rework our code and then
Mark> revalidate.
Mark>
Mark> Realistically, if rdhwr isn't officially blessed, some vendors might
Mark> still use our implementation.  Or, things might just languish.  In
Mark> other words, I'm somewhat afraid that we've missed the technical
Mark> window to debate this particular technical point.
Mark>
Mark> As Dan says, the new MIPS ABI can do better in this regard, as in
Mark> others.  Furthermore, if the kernel adds a syscall that can be used by
Mark> the o32 ABI, then the tools can be updated to work with that too.  I
Mark> think the only immutable aspect of this existing design is that
Mark> if/once our implementation escapes into the wild, then kernels
Mark> forevermore may have to support the rdhwr solution, even if most
Mark> programs no longer use it.  I think that's a relatively small price to
Mark> pay to get NPTL working on MIPS.

In terms of the ship leaving the dock, is the issue one of specifically
rdhwr, or could we use another instruction which also traps as an RI (or
something else that isn't a syscall)? I'll talk more about rdhwr below, but
it's important for me to understand whether it's the instruction, or the
mechanism that makes you believe that the technical window has passed.


>> - rdhwr is a MIPS32/64 Release 2 instruction.  No existing MIPS I-IV 
>> implementation has this instruction and probably never will.  Even 
>> existing MIPS32/64 Release 2 implementations don't have the internal 
>> register to hold the data.  This means that it will be years before 
>> any hardware will support the feature, and that support depends on an 
>> architecture decision (see next
>> item)

Dan> Compare this to a syscall.  There is no existing implementation which
will implement the syscall efficiently, and _never_ will be.

Thiemo> Yes. This means we will have a TLS register which is a bit slower
than a regular GPR for MIPS{32,64}R2 and a relatively slow emulated register
for older implementations. If we use a pseudo-syscall instead, we'll have
only the second variant with less performance potential.

I take your point on syscall vs. something else that traps as an RI.  So let
me try to explain my concern about the use of rdhwr specifically.

Compliance with the MIPS32/MIPS64 architectures (which is what's required
for implementations by both MIPS Technologies and MIPS architecture
licensees) requires passing a set of tests.  These tests check the corner
cases of the architecture at each revision.  We do this to prevent
fragmentation of the architecture and make your (you == the community of
people writing software for implementations of the architecture) life
easier.

In the particular case of rdhwr, we explicitly check that this instruction
generates a reserved instruction on implementations of Release 1 of the
architecture, and that all reserved encodings of rdhwr registers (which is
what you're proposing to use) cause a reserved instruction exception on
implementations of Release 2.  This means that there will never be a real
implementation of rdhwr on Release 1 implementations.  With the current
architecture spec, Release 2 implementations will be non-compliant with the
architecture unless we make an architecture change.  Changes to existing
architecture can certainly be done, but we don't take them lightly because
we need to get comment from those people who thought they had a stable
architecture from which to implement.  The fact that it's rdhwr makes it
somewhat simpler because we would make the TLS register optional, and
optional registers would cause a reserved instruction anyway.

But the point is, the decision to use a particular instruction for the TLS
pointer means that the architecture has to change.  To do that is going to
require some time while we consult with all of the architecture licensees.
Once that happens, somebody would have to actually implement the register on
an implementation of Release 2 of the architecture.  It will be years
(probably at least 2-3) before the first implementation appears with the TLS
register implemented via rdhwr, and the total population of those
implementations is going to be small.  The vast majority of MIPS
implementations will continue to trap with a reserved instruction, which
will fundamentally limit the performance of NPTL on MIPS.

The alternatives seem to be to use a GPR (but this requires an ABI change)
or to park the TLS pointer someplace in the address space. I wondered to
Mark at one point whether we could put it at the base of the stack, then
down-align sp to access it.  We played with this a bit, but couldn't come up
with anything that was relatively clean.
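
For reference, the down-align idea amounts to the following C sketch.
It assumes every thread stack has the same power-of-two size and is
aligned to that size, which is exactly the kind of constraint that made
the scheme unattractive; the function name and the sizes used with it
are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Round sp down to the start of its stack.  Only valid if stacks are
   stack_size bytes long and aligned to stack_size. */
uintptr_t stack_base_from_sp(uintptr_t sp, uintptr_t stack_size)
{
    /* The mask trick needs a power-of-two size. */
    assert(stack_size != 0 && (stack_size & (stack_size - 1)) == 0);
    return sp & ~(stack_size - 1);
}
```

The TLS pointer would then live in a word at the returned address, so
every access costs a mask and a load, and any stack that breaks the
size or alignment rule silently yields the wrong slot.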

So my feedback on the use of rdhwr (or any other instruction that traps) is
that as long as this is a short-term solution and/or we understand the
performance implications of how often that trap happens, it's OK.  Depending
on rdhwr to appear in a real implementation any time in the next 2-3 years
simply isn't going to happen.

If we do decide to use rdhwr (as opposed to another trapping instruction -
see further comments below), we're probably going to have to change whatever
RDHWR register number that you're using now.  You can't just pick one at
random as that will conflict with the architecture as we add new registers.

>> - We have some concerns at the architecture level about using rdhwr 
>> for this purpose rather than using a GPR under the umbrella of an ABI 
>> re-work that some of you are involved in.

Dan> This objection is way too vague for me to respond to.  Also, using
rdhwr does not prevent future use of a GPR, in an ABI that doesn't exist
yet.  Nice thing about read-only state; you can keep it in multiple places
easily.

Thiemo> The ABI re-work surely isn't mutually exclusive to o32 TLS. The main
reason for o32 TLS is to get rid of unmaintained linuxthreads while
maintaining source-level compatibility. It will also be available soon (the
same rationale applies to n32/n64 TLS, of course). The ABI re-work is much
more ambitious.

I personally believe that the ABI change is the long-term solution, but I
take your point about the needs for o32.

I've talked about my concerns about the use of rdhwr above.  My general
concern is about the widespread use of any instruction whose emulation
requires reading the instruction from memory (which would be pretty much
anything but syscall, which has a dedicated exception vector and passes
arguments via register).  We had occasion to have to debug a problem with
another operating system and a MIPS core from a different manufacturer.  We
discovered that this particular implementation did not guarantee that a load
done off the EPC value would always hit in the TLB.  In fact, it missed and
the kernel didn't use a guarded load, so it took a nested exception and
crashed.

You could say that this is a bug in the implementation, but we started to
look more broadly and concluded that it is possible for implementations,
particularly those that implement a virtual instruction cache, to wind up at
the reserved instruction handler and not have the instruction page mapped.

The advantage of syscall is that the argument is in a register, and no
instruction read is required to interpret the instruction.  One can
certainly use another instruction (e.g., rdhwr) whose emulation requires
reading the instruction, but the read needs to be guarded.  If this is true
of the Linux RI handler, we're all set.  If not, this needs to be considered
in the selection of the instruction that's going to trap.  While a TLB miss
isn't going to happen very often (maybe never on some processors), the code
has to deal with the case to ensure correctness.  When thinking about the
choice of rdhwr or something else that traps, we should consider this
situation.
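
To make the guarded-read requirement concrete, here is a minimal C sketch
of one RI emulation step.  It is not the Linux handler: guarded_fetch,
fake_user_mem, emulate_ri and the result codes are hypothetical names, user
memory is modelled as a plain array so the sketch runs in user space, and
hardware register number 5 is only the provisional choice discussed in this
thread.  The one point it illustrates is that a failed instruction fetch is
reported to the caller instead of faulting inside the handler.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Outcome of trying to emulate the instruction that raised the RI trap. */
enum ri_result { RI_EMULATED, RI_RETRY, RI_NOT_RDHWR };

/* Hypothetical stand-in for user text: an array of instruction words. */
struct fake_user_mem { const uint32_t *words; size_t count; };

/* Hypothetical guarded fetch: returns false instead of faulting when the
 * word at the trapping address cannot be read. */
static bool guarded_fetch(const struct fake_user_mem *m, size_t index,
                          uint32_t *out)
{
    if (index >= m->count)
        return false;           /* models a TLB miss or unmapped page */
    *out = m->words[index];
    return true;
}

/* RDHWR rt, rd: SPECIAL3 opcode (0x1f), function field 0x3b, with the
 * remaining non-register fields zero. */
static bool is_rdhwr(uint32_t insn, unsigned *rt, unsigned *rd)
{
    if ((insn & 0xffe007ffu) != 0x7c00003bu)
        return false;
    *rt = (insn >> 16) & 0x1f;
    *rd = (insn >> 11) & 0x1f;
    return true;
}

static enum ri_result emulate_ri(const struct fake_user_mem *m,
                                 size_t epc_index, uint32_t tls_pointer,
                                 uint32_t gpr[32])
{
    uint32_t insn;
    unsigned rt, rd;

    if (!guarded_fetch(m, epc_index, &insn))
        return RI_RETRY;        /* do not crash; let the caller recover */
    if (!is_rdhwr(insn, &rt, &rd) || rd != 5)
        return RI_NOT_RDHWR;    /* some other reserved instruction */
    if (rt != 0)
        gpr[rt] = tls_pointer;  /* $0 stays hard-wired to zero */
    return RI_EMULATED;
}
```

One possible recovery for the retry case is simply to return to user mode:
the instruction re-executes, takes an ordinary TLB refill, and traps again
with its page mapped.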


>> - Some preliminary work at MIPS suggests that a tuned handler for 
>> syscall is faster than one for handling rdhwr at the reserved 
>> instruction handler. This means that we're betting on having actual 
>> hardware implementations of rdhwr out there in sufficient volume to 
>> make up for the fact that we're penalizing everybody else by using an 
>> RI trap handler vs. a syscall trap handler.

Dan> So is this one.  Can you be more specific?  The only substantial
difference in overhead that I am familiar with is the additional register
save/restores; note that this is a substantial _advantage_ for userland
which would otherwise have to save and restore additional registers.
Keeping the save/restores in the kernel is a win for code size and
complexity.

Thiemo> That's surprising, at least for the current Linux implementation.
The basic exception handler is the same, and the syscall path is already
time-critical and loaded with ABI dispatch etc. Adding another path to it
will penalize syscalls further. RI has no critical path yet; adding the
rdhwr emulation should be fast and relatively straightforward. Extracting the
instruction from mapped space could get slow if it interferes with TLB
handling, but I don't think that's the common case.

In thinking this over, I realized that I was combining performance with the
correctness issue that I mentioned above.  If one ignores the need to do a
guarded load, the performance delta between syscall and ri was very small.
I think that Macro did the original testing, and we can show you the code.
But I acknowledge that there isn't much difference in the two implementations.

It's been a while since I looked at the code, but I thought we could hide the
additional instructions required to do this with syscall under the current
code for almost all implementations of the architecture.  That is, knowing
that all implementations are pipelined and that certain things create holes
in the pipeline, I seem to recall thinking that it would add no more cycles
(as opposed to instructions) to the syscall flow.  But as I said, it's been
a while.


>> To me, all of these suggest that we may want to use syscall rather 
>> than rdhwr to get the pointer, at least until we can decide whether to 
>> dedicate a GPR for this purpose.

Dan> In any case, I am putting Ralf on the hot seat here.  I'm going to do
whatever he likes, anyway, since it's no good to me if the kernel doesn't
support it :-)

Thiemo> Is there some data available on how much pressure has to be expected
for the TLS register vs. normal GPR? I would guess it is significantly
lower, since there are several Linux implementations which use emulation
successfully. In that case, rdhwr might even be beneficial for the ABI
re-work, and free up a GPR which can be used for better things.

We have started some experiments on register usage and pressure as part of
the ABI work.  We'll let you know as we get more data.

/gmu

---
Michael Uhler, Chief Technology Officer
MIPS Technologies, Inc.   Email: uhler at mips.com
1225 Charleston Road      Voice:  (650)567-5025   FAX:   (650)567-5225
Mountain View, CA 94043   Mobile: (650)868-6870   Admin: (650)567-5085


From mark at codesourcery.com  Fri Feb  4 05:56:03 2005
From: mark at codesourcery.com (Mark Mitchell)
Date: Thu, 03 Feb 2005 21:56:03 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <00d301c5098b$b758ee50$1bc0a8c0@MIPS.COM>
References: <00d301c5098b$b758ee50$1bc0a8c0@MIPS.COM>
Message-ID: <42030E73.7050603@codesourcery.com>

Michael Uhler wrote:

> In terms of the ship leaving the dock, is the issue one of specifically
> rdhwr, or could we use another instruction which also traps as an RI (or
> something else that isn't a syscall)? I'll talk more about rdhwr below, but
> it's important for me to understand whether it's the instruction, or the
> mechanism that makes you believe that the technical window has passed.

I think it's the mechanism, but Daniel could answer more definitively. 
I know that there's a kernel patch out there to interpret the rdhwr 
instruction -- but from the toolchain point of view I can't see any 
reason to think that any single instruction couldn't be used just as well.

> The alternatives seem to be to use a GPR (but this requires an ABI change)
> or to park the TLB  pointer someplace in the address space. I wondered to
> Mark at one point whether we could put it at the base of the stack, then
> down-align sp to access it.  We played with this a bit, but couldn't come up
> with anything that was relatively clean.

I don't remember quite what happened to the idea of putting the value
at some known location in memory.  I think that Dan shot this down 
relatively effectively, but I can't remember on quite what basis.  One 
downside is that it means that all implementations will always be 
somewhat inefficient; you're going to take a memory access hit, no 
matter what improvements are made to the architecture.

> So my feedback on the use of rdhwr (or any other instruction that traps) is
> that as long as this is a short-term solution and/or we understand the
> performance implications of how often that trap happens, it's OK.  Depending
> on rdhwr to appear in a real implementation any time in the next 2-3 years
> simply isn't going to happen.

I think that matches our expectations.  My understanding is that we'll 
have a new MIPS ABI, probably with a GPR for the thread pointer, by
then.  So, I think we should view this as a short-term hack to o32.

-- 
Mark Mitchell
CodeSourcery, LLC
mark at codesourcery.com
(916) 791-8304


From dom at mips.com  Fri Feb  4 10:13:44 2005
From: dom at mips.com (Dominic Sweetman)
Date: Fri, 4 Feb 2005 10:13:44 +0000
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050202175934.GD30888@nevyn.them.org>
References: <20050131205639.GK30888@nevyn.them.org>
	<16895.33772.807508.339803@doms-laptop.algor.co.uk>
	<20050201194340.GV30888@nevyn.them.org>
	<20050202175934.GD30888@nevyn.them.org>
Message-ID: <16899.19160.280516.549900@gargle.gargle.HOWL>


Daniel Jacobowitz (dan at codesourcery.com) writes:

> > > Why not re-write the spec to say "unless you generate this sequence
> > > exactly like this, you'll probably prevent any future linker
> > > optimization from working" - and then leave it to the compiler
> > > toolchain. 
> > > 
> > > My opinion is that linker optimizations could be very valuable, given
> > > a cheap way of accessing a thread pointer.  But MIPS Technologies are
> > > very unlikely to do heroic work on the linker for the o32 ABI - but we
> > > do intend to at least try that out for our NUBI ABI.
> > 
> > Fine by me; if no one else responds, I'll do this.
> > 
> > Note that when I make this change, I'm also going to let the compiler
> > schedule the sequences; whoever implements the linker optimizations can
> > go back and undo that.
> 
> Unfortunately, I've realized that we can't sit on this fence.  Here's a
> somewhat contrived example...

A very devilishly cunning example, too.

> The only way to make this work is either to mandate that an
> ABI-conforming compiler can not optimize the TLS access sequences, or
> to define additional relaxation marker relocations to mark valid
> sequences.  Some platforms do one, some do the other.
> 
> My preference is to do the latter, which we can postpone until someone
> is ready to implement them.

I agree.

-- 
Dominic Sweetman
MIPS Technologies
The Fruit Farm, Ely Road, Chittering, CAMBS CB5 9PH, ENGLAND
phone +44 1223 706205/fax +44 1223 706250/swbrd +44 1223 706200
http://www.mips.com


From dan at codesourcery.com  Fri Feb  4 14:12:54 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Fri, 4 Feb 2005 09:12:54 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <16899.19160.280516.549900@gargle.gargle.HOWL>
References: <20050131205639.GK30888@nevyn.them.org> <16895.33772.807508.339803@doms-laptop.algor.co.uk> <20050201194340.GV30888@nevyn.them.org> <20050202175934.GD30888@nevyn.them.org> <16899.19160.280516.549900@gargle.gargle.HOWL>
Message-ID: <20050204141250.GA3829@nevyn.them.org>

On Fri, Feb 04, 2005 at 10:13:44AM +0000, Dominic Sweetman wrote:
> 
> Daniel Jacobowitz (dan at codesourcery.com) writes:
> > Unfortunately, I've realized that we can't sit on this fence.  Here's a
> > somewhat contrived example...
> 
> A very devilishly cunning example, too.

Why thank you! :-)

> > The only way to make this work is either to mandate that an
> > ABI-conforming compiler can not optimize the TLS access sequences, or
> > to define additional relaxation marker relocations to mark valid
> > sequences.  Some platforms do one, some do the other.
> > 
> > My preference is to do the latter, which we can postpone until someone
> > is ready to implement them.
> 
> I agree.

Great.  This gives me room to move forward with the GCC patches,
pending the continuing discussion about rdhwr.

-- 
Daniel Jacobowitz


From dan at codesourcery.com  Mon Feb  7 20:49:02 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Mon, 7 Feb 2005 15:49:02 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <00d301c5098b$b758ee50$1bc0a8c0@MIPS.COM>
References: <20050201215859.GM15265@rembrandt.csv.ica.uni-stuttgart.de> <00d301c5098b$b758ee50$1bc0a8c0@MIPS.COM>
Message-ID: <20050207204900.GH3829@nevyn.them.org>

On Wed, Feb 02, 2005 at 05:00:03PM -0800, Michael Uhler wrote:
> In terms of the ship leaving the dock, is the issue one of specifically
> rdhwr, or could we use another instruction which also traps as an RI (or
> something else that isn't a syscall)? I'll talk more about rdhwr below, but
> it's important for me to understand whether it's the instruction, or the
> mechanism that makes you believe that the technical window has passed.

I don't care what the trapping instruction is; I would prefer not to
move away from an RI.  As Mark wrote, that code is tested and working
(although not quite finalized).

> >> - rdhwr is a MIPS32/64 Release 2 instruction.  No existing MIPS I-IV 
> >> implementation has this instruction and probably never will.  Even 
> >> existing MIPS32/64 Release 2 implementations don't have the internal 
> >> register to hold the data.  This means that it will be years before 
> >> any hardware will support the feature, and that support depends on an 
> >> architecture decision (see next
> >> item)
> 
> Dan> Compare this to a syscall.  There is no existing implementation which
> will implement the syscall efficiently, and there _never_ will be.
> 
> Thiemo> Yes. This means we will have a TLS register which is a bit slower
> than a regular GPR for MIPS{32,64}R2 and a relatively slow emulated register
> for older implementations. If we use a pseudo-syscall instead, we'll have
> only the second variant with less performance potential.

Thiemo, you may already be clear on this point, but I'm going to
highlight it for the discussion anyway: the rdhwr solution does not use
a real register on MIPS32r2.  It will trap on every existing CPU.

> I take your point on syscall vs. something else that traps as an ri.  So let
> me try to explain my concern about the use of rdhwr specifically.
> 
> Compliance with the MIPS32/MIPS64 architectures (which is what's required
> for implementations by both MIPS Technologies and MIPS architecture
> licensees) requires passing a set of tests.  These tests check the corner
> cases of the architecture at each revision.  We do this to prevent
> fragmentation of the architecture and make your (you == the community of
> people writing software for implementations of the architecture) life
> easier.
> 
> In the particular case of rdhwr, we explicitly check that this instruction
> generates a reserved instruction on implementations of Release 1 of the
> architecture, and that all reserved encodings of rdhwr registers (which is
> what you're proposing to use) cause a reserved instruction exception on
> implementations of Release 2.  This means that there will never be a real
> implementation of rdhwr on Release 1 implementations.  With the current
> architecture spec, Release 2 implementations will be non-compliant with the
> architecture unless we make an architecture change.  Changes to existing
> architecture can certainly be done, but we don't take them lightly because
> we need to get comment from those people who thought they had a stable
> architecture from which to implement.  The fact that it's rdhwr makes it
> somewhat simpler because we would make the TLS register optional, and
> optional registers would cause a reserved instruction anyway.

No, I don't think you are looking at this from the right side.  The
decision to use a reserved rdhwr encoding for the thread pointer, AT
SOME FUTURE TIME, does not mean that Release 2 has any need to change.
The RI can be trapped on current processors, and it can be added to a
future architecture revision.  On the other hand, if the performance
benefits are compelling enough, that leaves you room to change the
architecture.  The phrase "Release 2 implementations will be
non-compliant" only applies to "Release 2 implementations with this
hypothetical register, of which I expect there to be none".

> But the point is, the decision to use a particular instruction for the TLS
> pointer means that the architecture has to change.  To do that is going to
> require some time while we consult with all of the architecture licensees.
> Once that happens, somebody would have to actually implement the register on
> an implementation of Release 2 of the architecture.  It will be years
> (probably at least 2-3) before the first implementation appears with the TLS
> register implemented via rdhwr, and the total population of those
> implementations is going to be small.  The vast majority of MIPS
> implementations will continue to trap with a reserved instruction, which
> will fundamentally limit the performance of NPTL on MIPS.
> 
> The alternatives seem to be to use a GPR (but this requires an ABI change)

As many people have pointed out, waiting for the ABI change isn't
practical.  In a sense, that would also fundamentally limit the
performance of NPTL on MIPS :-)

> or to park the TLB  pointer someplace in the address space. I wondered to
> Mark at one point whether we could put it at the base of the stack, then
> down-align sp to access it.  We played with this a bit, but couldn't come up
> with anything that was relatively clean.

You can't do it that way.  This is what LinuxThreads used to do and it
imposes impossible limits on your stack alignment and sizing.

Want to talk to me more about using a parked TLB entry?  I spoke with
someone (Ralf or Jun, probably) about the idea originally; I was told
it wasn't possible on MIPS SMP implementations to make this work, or
that there was some other reason why it was undesirable.  If that's not
accurate then we could use a reserved memory location.  However, that
makes the TLS model dependent on details of Linux's memory mapping -
not good for a hopefully generally useful ABI.

Note that I haven't been doing thread benchmarking, but the
performance overhead from emulating rdhwr has not been significant
in casual testing.  I'm not weeping for the lost speed either way.

> So my feedback on the use of rdhwr (or any other instruction that traps) is
> that as long as this is a short-term solution and/or we understand the
> performance implications of how often that trap happens, it's OK.  Depending
> on rdhwr to appear in a real implementation any time in the next 2-3 years
> simply isn't going to happen.

I understand that.

> If we do decide to use rdhwr (as opposed to another trapping instruction -
> see further comments below), we're probably going to have to change whatever
> RDHWR register number that you're using now.  You can't just pick one at
> random as that will conflict with the architecture as we add new registers.

Hint: that's why I asked MIPS for feedback, so that we could get a
non-conflicting register number assigned.  The only reason I picked $5
was because it was unassigned in the MIPS32r2 spec and I couldn't find
any reference to plans for it.

> I've talked about my concerns about the use of rdhwr above.  My general
> concern is about the widespread use of any instruction whose emulation
> requires reading the instruction from memory (which would be pretty much
> anything but syscall, which has a dedicated exception vector and passes
> arguments via register).  We had occasion to have to debug a problem with
> another operating system and a MIPS core from a different manufacturer.  We
> discovered that this particular implementation did not guarantee that a load
> done off the EPC value would always hit in the TLB.  In fact, it missed and
> the kernel didn't use a guarded load, so it took a nested exception and
> crashed.
> 
> You could say that this is a bug in the implementation, but we started to
> look more broadly and concluded that it is possible for implementations,
> particularly those that implement a virtual instruction cache, to wind up at
> the reserved instruction handler and not have the instruction page mapped.
> 
> The advantage of syscall is that the argument is in a register, and no
> instruction read is required to interpret the instruction.  One can
> certainly use another instruction (e.g., rdhwr) whose emulation requires
> reading the instruction, but the read needs to be guarded.  If this is true
> of the Linux RI handler, we're all set.  If not, this needs to be considered
> in the selection of the instruction that's going to trap.  While a TLB miss
> isn't going to happen very often (maybe never on some processors), the code
> has to deal with the case to ensure correctness.  When thinking about the
> choice of rdhwr or something else that traps, we should consider this
> situation.

All reads from userspace are always protected in Linux; anything else
is a bug, plain and simple.  This is a non-issue.  Your point about the
performance overhead of reading and decoding an instruction is worth
keeping in mind, but hasn't been a big slowdown.

> It's been a while since I looked at the code, but I thought we could hide the
> additional instructions required to do this with syscall under the current
> code for almost all implementations of the architecture.  That is, knowing
> that all implementations are pipelined and that certain things create holes
> in the pipeline, I seem to recall thinking that it would add no more cycles
> (as opposed to instructions) to the syscall flow.  But as I said, it's been
> a while.

I've seen pretty strong negative push for adding more complexity to the
already extremely complex syscall path.  A syscall which didn't trash
the same set of registers would be a lot of complexity.

In any case, I'll note that the binutils and GCC portions of TLS
support are ready for submission to the FSF and I'm moving on to glibc.
The GCC bits include the instruction under debate, whatever it turns out
to be.  I don't want to bog this down in discussion longer than
necessary, so I hope we can come to an agreement in the next few days.

-- 
Daniel Jacobowitz


From nigel at mips.com  Wed Feb  9 13:11:50 2005
From: nigel at mips.com (Nigel Stephens)
Date: Wed, 9 Feb 2005 13:11:50 +0000
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
Message-ID: <16906.3094.245677.488728@mips.com>

Daniel Jacobowitz wrote:
> Note that I haven't been doing thread benchmarking, but the
> performance overhead from emulating rdhwr has not been significant
> in casual testing.  I'm not weeping for the lost speed either way.

One reason why MIPS is concerned about using rdhwr is that it may
condemn the whole MIPS architecture to poor multi-threading
performance relative to other architectures for some number of
years. If the architecture gains a poor reputation in this area, then
the mud could stick.

As part of our own experiments Maciej implemented a "fast path" rdhwr
emulation, which he promises he will post to this list today. It has a
typical emulation time of between 30 and 60 cycles, depending on the
CPU, and assuming a fixed destination register for rdhwr (e.g. only
rdhwr $2,$5).  Not too bad, but not brilliant either if thread pointer
access time turns out to be critical to the performance of some
threaded applications.
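
One plausible reading of why a fixed destination register helps (an
assumption for illustration, not a description of Maciej's patch) is that
the handler can then compare the trapping instruction word against a
single constant and write one known register, instead of decoding fields
and indexing the saved register file.  A minimal C sketch, where
encode_rdhwr and fast_path_hit are hypothetical names and register number
5 is the provisional choice from this thread:

```c
#include <stdbool.h>
#include <stdint.h>

/* Encode "rdhwr rt, rd": SPECIAL3 opcode 0x1f in bits 31..26, rt in bits
 * 20..16, rd in bits 15..11, function field 0x3b in bits 5..0. */
static uint32_t encode_rdhwr(unsigned rt, unsigned rd)
{
    return (0x1fu << 26) | ((rt & 0x1fu) << 16) | ((rd & 0x1fu) << 11) | 0x3bu;
}

/* Fast path: accept exactly "rdhwr $2,$5" and nothing else.  Any other
 * instruction word would fall back to a slower, general handler. */
static bool fast_path_hit(uint32_t insn)
{
    return insn == encode_rdhwr(2, 5);
}
```

With these field positions, "rdhwr $2,$5" assembles to the single word
0x7c02283b, so the fast-path test is one 32-bit comparison.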

But I don't yet understand how important the thread pointer access
time is for a typical threaded program. Can we get a better idea of
the dynamic frequency of thread register loads? Does anyone know some
suitable application programs or benchmarks which exercise the TLS
mechanism, and from which we could extract some statistics?

With regard to TLS on other architectures, people might like to read
this email conversation
http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-January/026468.html
and continued here
http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-February/thread.html#26620.
Ah, I see that since I last looked, you've been contributing to that
discussion already Daniel.

For other readers, here's a summary. ARM have developed a new ABI
(EABI) which adds a user-accessible coprocessor thread register, but
that register is only available on ARM architecture v6 and above
(shades of rdhwr!), and requires trap-based kernel emulation on older
CPUs.

Since the majority of ARM CPUs are pre-v6, the ARM Linux developers
aren't happy about the cost of this emulation and are discussing
alternative mechanisms which would work better on old ARM CPUs using
the old ABI. The favorite seems to be mapping a read-only page which
holds a copy of the thread pointer at a fixed virtual address. Since
the ARM MMU architecture apparently cannot handle this on SMP
configurations it looks like ARM Linux is considering splitting into
two TLS "universes", depending on the compiler/linker options used to
build the application and libraries:

1) Old ABI and memory-mapped thread pointer for pre ARM v6
   architectures, which can't run SMP anyway. For compatibility, old
   memory-mapped code running on ARM v6 SMP systems will generate a
   page fault when accessing the magic page, and the thread pointer
   load will be emulated by the kernel.

2) New EABI only for SMP ARM v6 and above, using the new user
   coprocessor reg. The new EABI can't run well on old CPUs anyway,
   because it uses other new instructions which aren't available on
   pre-ARM v6 (atomic instructions). Again, for compatibility
   purposes, access to the new thread coprocessor register can be
   trapped and emulated by the kernel.

So ARM Linux may before too long have well-performing TLS
implementations for all of its Linux-capable CPUs.

> Want to talk to me more about using a parked TLB entry?  I spoke with
> someone (Ralf or Jun, probably) about the idea originally; I was told
> it wasn't possible on MIPS SMP implementations to make this work, or
> that there was some other reason why it was undesirable.  If that's not
> accurate then we could use a reserved memory location.  However, that
> makes the TLS model dependent on details of Linux's memory mapping -
> not good for a hopefully generally useful ABI.

While it's true that all threads within the same address space must
share a single page table, which would prevent such a trick being
implemented using the normal TLB refill mechanism, I think that it
might indeed be possible to implement this by using a per-thread wired
(parked) TLB entry, updated on a context switch. This would map a
magic page containing a copy of the thread pointer to a fixed virtual
address: if in the bottom 32K then it would need just one instruction
(e.g. lw $v0,0x1000($0)).
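
The "bottom 32K" limit follows from the load encoding: lw takes a 16-bit
signed offset from a base register, and with $0 as the base only offsets
0x0000 to 0x7fff land at the bottom of the address space (negative offsets
wrap to the very top instead).  A small C sketch of that check and of the
resulting instruction word; reachable_from_zero and encode_lw are
hypothetical helper names, and 0x1000 is just the example address used
above:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr can be loaded as "lw rt, addr($0)" with a non-negative
 * 16-bit signed offset, i.e. it lies in the bottom 32K. */
static bool reachable_from_zero(uint32_t addr)
{
    return addr <= 0x7fffu;
}

/* Encode "lw rt, offset(base)": opcode 0x23 in bits 31..26, base in bits
 * 25..21, rt in bits 20..16, offset in bits 15..0. */
static uint32_t encode_lw(unsigned rt, unsigned base, int16_t offset)
{
    return (0x23u << 26) | ((base & 0x1fu) << 21) | ((rt & 0x1fu) << 16)
           | (uint16_t)offset;
}
```

For example, lw $v0,0x1000($0) (with $v0 being register 2) encodes as
0x8c021000, a single instruction with no separate address setup.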

A wired TLB entry like this should be fairly straightforward to
implement on even a simple RTOS, since it doesn't require a full-blown
VM system.  I don't see why it wouldn't work on an SMP system too,
since each CPU has its own TLB, and therefore a unique wired TLB
entry per active thread. 

To maintain single-threaded performance we'd perhaps want to increment
the Wired count only when running real TLS code, so as not to reduce
the number of TLB entries available for random replacement for the
majority single-threaded applications.

I've spoken to Ralf about this idea, and though he's not thrilled by
it, he hasn't (yet) said that it couldn't work.

But a notable downside of this technique is that it won't work on the
new generation of explicitly multi-threaded CPUs, which could be
executing instructions from many threads within the same address space
"simultaneously", using a single, shared TLB.  On such CPUs the
memory-mapped thread pointer would require that the kernel not map the
wired page, and then emulate a faulting thread pointer load in the
page fault handler - rather defeating the point of having a CPU
designed to accelerate multi-threading.

For multi-threaded CPUs the ultimate solution is the new ABI which
reserves a GPR for the thread register. But in the short-term, before
that new ABI is fully supported, a rdhwr-based implementation could be
argued for.

How do people feel about supporting two different TLS implementations
on top of the existing MIPS ABIs - one optimised for legacy MIPS CPUs,
and another for multi-threaded CPUs - similar to what ARM Linux seems
to be considering for SMP?

In each case the kernel could provide binary compatibility, at reduced
performance, for the incompatible thread access mechanism
i.e. emulating a memory-mapped load, or the rdhwr, as
appropriate.  It might also be sensible to provide two variants of
libc.so (per ABI), compiled for the different TLS mechanisms, much as
the x86 Linux dynamic linker automatically loads either
/usr/lib/i486/libc.so or /usr/lib/i686/libc.so depending on the CPU
type.

> All reads from userspace are always protected in Linux; 

But not if you are trying to accelerate the emulation code by handling
it at exception level, so as to avoid a full register save/restore and
kernel context setup.  But Maciej's patch does include a fix to the
nested TLBL exception handler to make this work.

> Your point about the performance overhead of reading and decoding an
> instruction is worth keeping in mind, but hasn't been a big
> slowdown.

Measured how?

Nigel


From dan at codesourcery.com  Wed Feb  9 16:21:26 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Wed, 9 Feb 2005 11:21:26 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <16906.3094.245677.488728@mips.com>
References: <16906.3094.245677.488728@mips.com>
Message-ID: <20050209162119.GA8011@nevyn.them.org>

On Wed, Feb 09, 2005 at 01:11:50PM +0000, Nigel Stephens wrote:
> Daniel Jacobowitz wrote:
> > Note that I haven't been doing thread benchmarking, but the
> > performance overhead from emulating rdhwr has not been significant
> > in casual testing.  I'm not weeping for the lost speed either way.
> 
> One reason why MIPS is concerned about using rdhwr is that it may
> condemn the whole MIPS architecture to poor multi-threading
> performance relative to other architectures for some number of
> years. If the architecture gains a poor reputation in this area, then
> the mud could stick.
> 
> As part of our own experiments Maciej implemented a "fast path" rdhwr
> emulation, which he promises he will post to this list today. It has a
> typical emulation time of between 30 and 60 cycles, depending on the
> CPU, and assuming a fixed destination register for rdhwr (e.g. only
> rdhwr $2,$5).  Not too bad, but not brilliant either if thread pointer
> access time turns out to be critical to the performance of some
> threaded applications.

Can you compare this to the normal cost of an emulated instruction? 
I'm not sure if I've posted the rdhwr emulation patch anywhere; I know
Ralf has a copy.  I'm not thrilled about hardcoding the target
register but if that's what ya gotta do...

> But I don't yet understand how important the thread pointer access
> time is for a typical threaded program. Can we get a better idea of
> the dynamic frequency of thread register loads? Does anyone know some
> suitable application programs or benchmarks which exercise the TLS
> mechanism, and from which we could extract some statistics?

TLS is a user level feature, so it's very difficult to predict access
patterns.  Glibc uses it for:
  - errno
  - locale data
  - pthread_self
There are, roughly, twentyish TLS traps in the startup of a typical
single-threaded application.  I do not know anything about thread
benchmarking so I can't give you much there.
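
For a rough sense of scale only, the two figures quoted in this thread can
be combined: about twenty traps at startup, and 30 to 60 cycles per trap
for the fast-path emulation Nigel describes.  The C sketch below just
multiplies them out; trap_overhead and cycle_range are hypothetical names,
and applying the fast-path cost to the startup traps is an assumption, not
a measurement.

```c
#include <stdint.h>

/* Lower and upper bound on total emulation cost, in cycles. */
struct cycle_range { uint64_t low, high; };

/* Total cost of `traps` emulated TLS reads, given a per-trap cost range. */
static struct cycle_range trap_overhead(uint64_t traps, uint64_t min_cycles,
                                        uint64_t max_cycles)
{
    struct cycle_range r = { traps * min_cycles, traps * max_cycles };
    return r;
}
```

With 20 traps at 30 to 60 cycles each, this gives 600 to 1200 cycles for
startup; what it does not answer is the open question of how often a
running threaded program reads the thread pointer.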

> With regard to TLS on other architectures, people might like to read
> this email conversation
> http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-January/026468.html
> and continued here
> http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2005-February/thread.html#26620.
> Ah, I see that since I last looked, you've been contributing to that
> discussion already Daniel.

I wrote most of that except for Nico's reimplementation.

> > Want to talk to me more about using a parked TLB entry?  I spoke with
> > someone (Ralf or Jun, probably) about the idea originally; I was told
> > it wasn't possible on MIPS SMP implementations to make this work, or
> > that there was some other reason why it was undesirable.  If that's not
> > accurate then we could use a reserved memory location.  However, that
> > makes the TLS model dependent on details of Linux's memory mapping -
> > not good for a hopefully generally useful ABI.
> 
> While it's true that all threads within the same address space must
> share a single page table, which would prevent such a trick being
> implemented using the normal TLB refill mechanism, I think that it
> might indeed be possible to implement this by using a per-thread wired
> (parked) TLB entry, updated on a context switch. This would map a
> magic page containing a copy of the thread pointer to a fixed virtual
> address: if in the bottom 32K then it would need just one instruction
> (e.g. lw $v0,0x1000($0)).
> 
> A wired TLB entry like this should be fairly straightforward to
> implement on even a simple RTOS, since it doesn't require a full-blown
> VM system.  I don't see why it wouldn't work on an SMP system too,
> since each CPU has its own TLB, and therefore a unique wired TLB
> entry per active thread. 
> 
> To maintain single-threaded performance we'd perhaps want to increment
> the Wired count only when running real TLS code, so as not to reduce
> the number of TLB entries available for random replacement for the
> majority single-threaded applications.
> 
> I've spoken to Ralf about this idea, and though he's not thrilled by
> it, he hasn't (yet) said that it couldn't work.

You can't do this for "only single-threaded code" - see above about
errno and locales.  TLS is either used by glibc or not.  When built for
NPTL, it's used.

I'm pretty dubious about the tradeoff of wiring a TLB entry for this.
Benchmarks will not accurately show the cost of having fewer available.

> But a notable downside of this technique is that it won't work on the
> new generation of explicitly multi-threaded CPUs, which could be
> executing instructions from many threads within the same address space
> "simultaneously", using a single, shared TLB.  On such CPUs the
> memory-mapped thread pointer would require that the kernel not map the
> wired page, and then emulate a faulting thread pointer load in the
> page fault handler - rather defeating the point of having a CPU
> designed to accelerate multi-threading.

Well, that's a bit of a full stop!  Now we've got _three_ TLS
mechanisms to contend with.  That's too many.

> How do people feel about supporting two different TLS implementations
> on top of the existing MIPS ABIs - one optimised for legacy MIPS CPUs,
> and another for multi-threaded CPUs - similar to what ARM Linux seems
> to be considering for SMP?

I think it's a horrible, horrible idea.  Complexity has a cost too.

> > Your point about the performance overhead of reading and decoding an
> > instruction is worth keeping in mind, but hasn't been a big
> > slowdown.
> 
> Measured how?

As I said, only casually.

From this discussion I am still inclined to stay with rdhwr.  30-60
cycles is not very long.

-- 
Daniel Jacobowitz
CodeSourcery, LLC


From macro at mips.com  Wed Feb  9 18:32:36 2005
From: macro at mips.com (Maciej W. Rozycki)
Date: Wed, 9 Feb 2005 18:32:36 +0000 (GMT)
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050209162119.GA8011@nevyn.them.org>
References: <16906.3094.245677.488728@mips.com> <20050209162119.GA8011@nevyn.them.org>
Message-ID: <Pine.LNX.4.61.0502091746500.25773@perivale.mips.com>

On Wed, 9 Feb 2005, Daniel Jacobowitz wrote:

> > As part of our own experiments Maciej implemented a "fast path" rdhwr
> > emulation, which he promises he will post to this list today. It has a
> > typical emulation time of between 30 and 60 cycles, depending on the
> > CPU, and assuming a fixed destination register for rdhwr (e.g. only
> > rdhwr $2,$5).  Not too bad, but not brilliant either if thread pointer
> > access time turns out to be critical to the performance of some
> > threaded applications.
> 
> Can you compare this to the normal cost of an emulated instruction? 

 For the 24Kf processor the cost of doing a normal emulation is about 550% 
of that of my fast path.  For the 4Kc one it's 1975%...

> I'm not sure if I've posted the rdhwr emulation patch anywhere; I know
> Ralf has a copy.  I'm not thrilled about hardcoding the target
> register but if that's what ya gotta do...

 You can have a fast path for the dedicated target register and normal 
emulation for the others to keep the semantics consistent.  The cost rise 
from doing a computed goto to emulate a write to an arbitrary target 
register is about 25%, i.e. the total cost is about 125% of the original.
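
The split between a fast path and the general emulation can be modelled
in C roughly as below.  This is only a sketch of the idea, not the
handler from the patch (the real thing lives in the exception vector
and is written in assembly).  The names emulate_rdhwr, struct regs and
FAST_INSN are made up for illustration, and hardware register $5 is
just the example number used earlier in this thread, not an allocated
one.

```c
#include <assert.h>
#include <stdint.h>

#define RDHWR_MASK   0xffe007ffu  /* opcode, rs, shamt and funct fields */
#define RDHWR_MATCH  0x7c00003bu  /* SPECIAL3 opcode with RDHWR function */
#define FAST_INSN    0x7c02283bu  /* "rdhwr $2,$5" (assumed register pair) */

enum path { NOT_RDHWR, FAST_PATH, GENERAL_PATH };

/* Toy register file standing in for the saved user context. */
struct regs { uint32_t gpr[32]; };

static enum path emulate_rdhwr(uint32_t insn, struct regs *r, uint32_t tp)
{
    assert(r);
    if (insn == FAST_INSN) {            /* one compare, fixed destination */
        r->gpr[2] = tp;
        return FAST_PATH;
    }
    if ((insn & RDHWR_MASK) != RDHWR_MATCH)
        return NOT_RDHWR;               /* leave it to the normal RI handler */
    if (((insn >> 11) & 31) != 5)       /* only hw register $5 in this sketch */
        return NOT_RDHWR;
    unsigned rt = (insn >> 16) & 31;    /* arbitrary destination register */
    if (rt != 0)                        /* $0 stays hardwired to zero */
        r->gpr[rt] = tp;
    return GENERAL_PATH;
}
```

The extra decode and indexed store in the general branch stand in for
the computed goto mentioned above; the fast path avoids both by
comparing the whole instruction word.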

  Maciej


From dan at codesourcery.com  Wed Feb  9 18:52:35 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Wed, 9 Feb 2005 13:52:35 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <Pine.LNX.4.61.0502091746500.25773@perivale.mips.com>
References: <16906.3094.245677.488728@mips.com> <20050209162119.GA8011@nevyn.them.org> <Pine.LNX.4.61.0502091746500.25773@perivale.mips.com>
Message-ID: <20050209185227.GB8011@nevyn.them.org>

On Wed, Feb 09, 2005 at 06:32:36PM +0000, Maciej W. Rozycki wrote:
> On Wed, 9 Feb 2005, Daniel Jacobowitz wrote:
> 
> > > As part of our own experiments Maciej implemented a "fast path" rdhwr
> > > emulation, which he promises he will post to this list today. It has a
> > > typical emulation time of between 30 and 60 cycles, depending on the
> > > CPU, and assuming a fixed destination register for rdhwr (e.g. only
> > > rdhwr $2,$5).  Not too bad, but not brilliant either if thread pointer
> > > access time turns out to be critical to the performance of some
> > > threaded applications.
> > 
> > Can you compare this to the normal cost of an emulated instruction? 
> 
>  For the 24Kf processor the cost of doing a normal emulation is about 550% 
> of that of my fast path.  For the 4Kc one it's 1975%...
> 
> > I'm not sure if I've posted the rdhwr emulation patch anywhere; I know
> > Ralf has a copy.  I'm not thrilled about hardcoding the target
> > register but if that's what ya gotta do...
> 
>  You can have a fast path for the dedicated target register and normal 
> emulation for the others to keep the semantics consistent.  The cost rise 
> from doing a computed goto to emulate a write to an arbitrary target 
> register is about 25%, i.e. the total cost is about 125% of the original.

For GCC and ABI purposes, this means we might as well define in the TLS
ABI which register has to be used, and we can open it up when we look
back in ten years and everyone has the register :-)

Thanks for the numbers.  I think that working with the fast-path
emulation and rdhwr is our best bet at this time.  It also has a
substantial locality (i.e. all the code in one place) benefit over
playing with the TLB...

-- 
Daniel Jacobowitz
CodeSourcery, LLC


From macro at mips.com  Wed Feb  9 19:40:59 2005
From: macro at mips.com (Maciej W. Rozycki)
Date: Wed, 9 Feb 2005 19:40:59 +0000 (GMT)
Subject: Linux TLS pointer access reference emulation
Message-ID: <Pine.LNX.4.61.0502090500020.22211@perivale.mips.com>

Hello,

 I have published a Linux patch and a small test user program that I've 
used for performance evaluation of a few possible TLS pointer access 
methods.  The software is available at: 
"ftp://ftp.linux-mips.org/pub/linux/mips/people/macro/tls.tar.bz2". The 
patch implements an emulation of "rdhwr $2, $4" and syscall #0x10000, both 
retrieving a member of "struct thread_info" associated with the current 
process (the patch uses an arbitrary one; of course for the TLS pointer 
that should be replaced with a meaningful struct field).

 The patch applies to Linux 2.6.9-rc1, specifically to the malta CVS 
repository at linux-mips.org as of Oct 20th, 2004.  It should work with 
the corresponding version from the main repository as well, but using it 
with the current revision requires adjusting it to these synthesized TLB 
handlers.

 The patch has its shortcomings, most notably it's been written for the 
32-bit kernel only.  For 64-bit ones it needs to be aware of the XTLB 
refill handler.  It may actually be done quite nicely with these 
synthesized handlers; also avoiding the need to fetch the shadow of the 
EBase cp0 register.

 The userland software consists of a small program that benchmarks the 
available methods in tight loops.  With caches kept warm, this should provide 
a reasonably optimistic execution time estimate.  The program expects two 
arguments, a CPU frequency (which you can obtain from `dmesg' or failing 
this -- from your system's specs) and a number of loops to execute.  It 
provides a number of outputs which are essentially raw as it's not really 
been meant for general use, but most importantly you are after "cycle 
count" reports for "scall" and "rdhwr" (these should be self-explanatory); 
perhaps "instr" as well (which counts cycles used for a single instruction 
and may not be accurate depending on the implementation of your 
processor's execution pipeline(s)).
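
 For reference, the conversion from a timed loop to a per-iteration cycle 
count is presumably along the following lines.  This is a hedged sketch 
rather than code from the tarball; the helper name and its parameters are 
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn the wall-clock time of a tight loop into
 * cycles per iteration, given the CPU frequency passed on the command
 * line and the number of loops executed.
 */
static double cycles_per_iteration(double elapsed_s, double cpu_hz,
                                   uint64_t loops)
{
    assert(loops > 0);
    return elapsed_s * cpu_hz / (double)loops;
}
```

 For example, 2000000 loops taking 0.5 s on a 200 MHz core would come out 
at 50 cycles per iteration, which includes the loop overhead itself.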

 There are actually two programs included -- "time-0" and "time-1"; the 
former is what I've used for benchmarking and the latter is mainly for 
verification of proper operation with VIVT I-caches (but its output is 
meaningful, too).

 With Linux from the malta repository the programs can be trivially 
modified to benchmark a full instruction emulation.  This version of Linux 
already emulates "rdhwr" for other registers, so all that has to be done 
is to replace "rdhwr $2, $4" with e.g. "rdhwr $2, $2" in rdhwr.h; that's 
how I did these additional benchmarks for Daniel.

 Alongside the software there are a few reports provided that I've 
obtained with a Malta board for CPUs I've had immediately available on 
core cards.

 Please let me know if you have any further questions regarding this 
package.

  Maciej


From ralf at linux-mips.org  Wed Feb  9 19:50:13 2005
From: ralf at linux-mips.org (Ralf Baechle)
Date: Wed, 9 Feb 2005 20:50:13 +0100
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <16906.3094.245677.488728@mips.com>
References: <16906.3094.245677.488728@mips.com>
Message-ID: <20050209195013.GD5740@linux-mips.org>

On Wed, Feb 09, 2005 at 01:11:50PM +0000, Nigel Stephens wrote:

> One reason why MIPS is concerned about using rdhwr is that it may
> condemn the whole MIPS architecture to poor multi-threading
> performance relative to other architectures for some number of
> years. If the architecture gains a poor reputation in this area, then
> the mud could stick.

I'm not too concerned here because currently there seems to be very little
code available that exploits TLS.  As time progresses more such code
will be written and that I hope would be sufficient time for the hardware
folks, if we decide to take that route.

> But I don't yet understand how important the thread pointer access
> time is for a typical threaded program. Can we get a better idea of
> the dynamic frequency of thread register loads? Does anyone know some
> suitable application programs or benchmarks which exercise the TLS
> mechanism, and from which we could extract some statistics?

That's not a very quantitative statement, but the Alpha people are
apparently very satisfied with their PAL code exception based solution.
At least so said rth when I last spoke to him.

> To maintain single-threaded performance we'd perhaps want to increment
> the Wired count only when running real TLS code, so as not to reduce
> the number of TLB entries available for random replacement for the
> majority single-threaded applications.
> 
> I've spoken to Ralf about this idea, and though he's not thrilled by
> it, he hasn't (yet) said that it couldn't work.

My general feeling is that it's the kind of tradeoff that is sort of the
equivalent of juggling razor blades with closed eyes ;-)  Or
translated into plain English, it's one of those optimizations which I
think have a strong potential to backfire and actually result in a loss.
It's hard to say without an actual application and without knowing the
size of the workload in advance.

> For multi-threaded CPUs the ultimate solution is the new ABI which
> reserves a GPR for the thread register. But in the short-term, before
> that new ABI is fully supported, a rdhwr-based implementation could be
> argued for.

A new ABI is a lot of work, will take time and, not least, convincing.  We
want something sooner than that could happen.

> How do people feel about supporting two different TLS implementations
> on top of the existing MIPS ABIs - one optimised for legacy MIPS CPUs,
> and another for multi-threaded CPUs - similar to what ARM Linux seems
> to be considering for SMP?

It's a bit like hardware and software floating point, ll/sc and non-ll/sc
binaries which are flavours which we already have.  One can only hope
that the number of binary variants won't reach the actual worst case
number.

  Ralf


From uhler at mips.com  Wed Feb  9 20:41:49 2005
From: uhler at mips.com (Michael Uhler)
Date: Wed, 9 Feb 2005 12:41:49 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050209162119.GA8011@nevyn.them.org>
Message-ID: <003601c50ee7$c763d510$0202a8c0@MIPS.COM>


> From this discussion I am still inclined to stay with rdhwr.  
> 30-60 cycles is not very long.

The problem that I'm having with this email thread is that I no longer
understand what problem is being solved.  You're getting push-back from the
MIPS folks, including me, about a solution that uses emulation of an
instruction which is unlikely to be actually implemented in hardware any
time soon, if ever.  It sounds like the ARM people are having the same
heartburn for the same reason.  There appears to be no appreciable
benchmarking that suggests that a trap-and-emulate approach will be fine, or
not so fine in terms of performance of the solution.

If the problem we're trying to solve is to get NPTL to work on MIPS, period
the end, then we can stop the email thread now.  That certainly seems to be
the path we're taking as of now.

If we're trying to find a solution to the problem such that MIPS isn't at a
competitive disadvantage, perhaps we should start looking for such a
solution, rather than rejecting suggestions because of the changes to the
current software implementation.

So can you please crisply state what problem we're trying to solve here, and
the bounds on an acceptable solution from your point of view.  I'm really
confused.

/gmu

---
Michael Uhler, Chief Technology Officer
MIPS Technologies, Inc.   Email: uhler at mips.com
1225 Charleston Road      Voice:  (650)567-5025   FAX:   (650)567-5225
Mountain View, CA 94043   Mobile: (650)868-6870   Admin: (650)567-5085


From dan at codesourcery.com  Wed Feb  9 21:13:41 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Wed, 9 Feb 2005 16:13:41 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <003601c50ee7$c763d510$0202a8c0@MIPS.COM>
References: <20050209162119.GA8011@nevyn.them.org> <003601c50ee7$c763d510$0202a8c0@MIPS.COM>
Message-ID: <20050209211337.GC8011@nevyn.them.org>

On Wed, Feb 09, 2005 at 12:41:49PM -0800, Michael Uhler wrote:
> 
> > From this discussion I am still inclined to stay with rdhwr.  
> > 30-60 cycles is not very long.
> 
> The problem that I'm having with this email thread is that I no longer
> understand what problem is being solved.  You're getting push-back from the
> MIPS folks, including me, about a solution that uses emulation of an
> instruction which is unlikely to be actually implemented in hardware any
> time soon, if ever.  It sounds like the ARM people are having the same
> heartburn for the same reason.  There appears to be no appreciable
> benchmarking that suggests that a trap-and-emulate approach will be fine, or
> not so fine in terms of performance of the solution.
> 
> If the problem we're trying to solve is to get NPTL to work on MIPS, period
> the end, then we can stop the email thread now.  That certainly seems to be
> the path we're taking as of now.
> 
> If we're trying to find a solution to the problem such that MIPS isn't at a
> competitive disadvantage, perhaps we should start looking for such a
> solution, rather than rejecting suggestions because of the changes to the
> current software implementation.
> 
> So can you please crisply state what problem we're trying to solve here, and
> the bounds on an acceptable solution from your point of view.  I'm really
> confused.

My point is for NPTL to work.  Specifically:

  - Tying it to a new ABI is unacceptable in the short term, and
    possibly in the longer term, because of community pushback against
    additional ABIs.

  - Using a wired TLB entry gives kernel developers the shakes, because
    it restricts the available TLB slots, which can have complex
    effects on the performance of existing applications.

That only leaves methods which enter the kernel.  Our choices are via a
load from an unmapped page, via a syscall, or via an RI exception. 
Only one of those models is compatible with acceleration via future
hardware.

Using rdhwr does not provide cripplingly slow - or even perceptibly
slow - performance, and I'm using a much more expensive emulation layer
than Maciej's.  If the MIPS folks can clearly justify their concerns
about this solution, and offer some relevant ideas to benchmark the
problem, then maybe we can make forward progress.

As was just pointed out to me, Maciej's numbers are on the same order
of magnitude as a load miss.  Note that the rdhwr is never needed more
than once per function with an optimizing compiler.  I think that at
the cost of a load miss, you're getting a bargain.
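
To illustrate the once-per-function point, here is a small C model.  It
is only a sketch: read_tp() stands in for the trapping rdhwr, tls_words
for one thread's TLS block, and the two offsets are hypothetical
local-exec offsets, not anything from the real port.

```c
#include <assert.h>
#include <stdint.h>

static unsigned tp_reads;          /* counts simulated rdhwr traps */
static int32_t tls_words[16];      /* stand-in for one thread's TLS block */

/* Models the (expensive) thread pointer read. */
static int32_t *read_tp(void)
{
    tp_reads++;
    return tls_words;
}

enum { OFF_ERR = 0, OFF_COUNT = 1 };   /* hypothetical word offsets */

/* Touches two TLS variables but pays for the thread pointer only once. */
static int bump_and_check(void)
{
    int32_t *tp = read_tp();           /* single read, reused below */
    tp[OFF_COUNT] += 1;
    tp[OFF_ERR] = (tp[OFF_COUNT] > 3);
    return tp[OFF_ERR];
}
```

Both accesses hang off the same base, so the trap cost is amortized
over every TLS variable the function uses.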

-- 
Daniel Jacobowitz
CodeSourcery, LLC


From mark at codesourcery.com  Wed Feb  9 21:19:25 2005
From: mark at codesourcery.com (Mark Mitchell)
Date: Wed, 09 Feb 2005 13:19:25 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <003601c50ee7$c763d510$0202a8c0@MIPS.COM>
References: <003601c50ee7$c763d510$0202a8c0@MIPS.COM>
Message-ID: <420A7E5D.3050008@codesourcery.com>

Michael Uhler wrote:
>>From this discussion I am still inclined to stay with rdhwr.  
>>30-60 cycles is not very long.
> 
> The problem that I'm having with this email thread is that I no longer
> understand what problem is being solved.

I'll try to summarize what I understand the situation to be.

CodeSourcery has done an implementation for one of its customers using 
rdhwr.  We'd like to get that implementation, or a variant of it, out 
to the broader community.  From our point of view, it's just as easy to 
use any other single instruction that loads the thread pointer, instead 
of "rdhwr"; that could be "lw" or some other trapping load.  There's no 
technical problem making that change.

What is important is that we get buy-in from the rest of the Linux MIPS 
community.  We really want to avoid multiple different versions of this 
stuff out there.  Concern about that issue is right now preventing 
Daniel from posting his patches; he doesn't want to see things fragment.

Daniel feels rdhwr is the best choice, as it would seem to avoid 
complexity down the road; MIPS seems to feel that the number of cycles 
required to access the thread pointer on current hardware is more 
important.  I don't think I have an opinion, but I would point out that 
(a) optimizations for TLS models mean that you can often make multiple 
accesses to TLS with a single thread pointer load, and (b) a lot of 
threaded programs make very little use of TLS, but one still needs 
*some* implementation of TLS in order to get NPTL off the ground.  Both 
of these points suggest that cycles-per-thread-pointer-load may not be 
that important a metric.  A compromise solution would be to say that the 
ABI requires "rdhwr", but mark the instruction with a dynamic relocation 
that the dynamic loader could modify into a different instruction if 
that makes sense on a particular system.
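
As a rough illustration of that compromise, a loader pass might look 
something like the C sketch below.  Everything here is hypothetical: no 
such relocation type exists today, and struct tp_reloc, patch_tp_sites 
and the two instruction words (an assumed "rdhwr $2,$5" and the "lw $2, 
0x1000($0)" alternative) are invented purely for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical relocation record: word index of a marked instruction. */
struct tp_reloc { size_t word; };

/*
 * Rewrite each marked site from `expected` to `replacement`.  A site is
 * skipped if it is out of range or does not hold the expected word.
 * Returns the number of sites actually patched.
 */
static size_t patch_tp_sites(uint32_t *text, size_t nwords,
                             const struct tp_reloc *rel, size_t nrel,
                             uint32_t expected, uint32_t replacement)
{
    size_t patched = 0;
    assert(text && rel);
    for (size_t i = 0; i < nrel; i++) {
        size_t w = rel[i].word;
        if (w < nwords && text[w] == expected) {
            text[w] = replacement;
            patched++;
        }
    }
    return patched;
}
```

Note that this only covers replacements that fit in the same single 
instruction slot.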

Ultimately, I think what would be most helpful is for the Linux MIPS 
community to make a decision of what single instruction needs to go in 
that slot, and then let us know.  If, in the end, it's not rdhwr, that's 
OK.  However, Daniel's validated his changes using the rdhwr solution, 
and we don't have the resources to re-validate with another solution, 
under our existing contracts.  So, we would provide our patches to MIPS, 
and let MIPS handle the upstream submission.

-- 
Mark Mitchell
CodeSourcery, LLC
mark at codesourcery.com
(916) 791-8304


From ica2_ts at csv.ica.uni-stuttgart.de  Wed Feb  9 23:26:41 2005
From: ica2_ts at csv.ica.uni-stuttgart.de (Thiemo Seufer)
Date: Thu, 10 Feb 2005 00:26:41 +0100
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <003601c50ee7$c763d510$0202a8c0@MIPS.COM>
References: <20050209162119.GA8011@nevyn.them.org> <003601c50ee7$c763d510$0202a8c0@MIPS.COM>
Message-ID: <20050209232641.GA11812@rembrandt.csv.ica.uni-stuttgart.de>

Michael Uhler wrote:
> 
> > From this discussion I am still inclined to stay with rdhwr.  
> > 30-60 cycles is not very long.
> 
> The problem that I'm having with this email thread is that I no longer
> understand what problem is being solved.

The short term goal is still to add TLS to the existing ABIs without
breaking source compatibility.

> You're getting push-back from the
> MIPS folks, including me, about a solution that uses emulation of an
> instruction which is unlikely to be actually implemented in hardware any
> time soon, if ever.  It sounds like the ARM people are having the same
> heartburn for the same reason.

Yes, and nobody backs up this concern with hard data. Maciej's numbers
are good to know, but they tell nothing about how heavily the TLS
implementation will be used in real-world applications.

> There appears to be no appreciable
> benchmarking that suggests that a trap-and-emulate approach will be fine, or
> not so fine in terms of performance of the solution.

I looked for good NPTL benchmarks and decent performance comparisons
in the meanwhile. I found none at all for emulated TLS vs. TLS register.
Numbers were cited by

- mail(s) from Ulrich Drepper to the Linux Kernel Mailing List, which
  compares against linuxthreads, and uses some rather unrealistic loads
  like 100000 threads. The NPTL source has some highly synthetic
  benchmarks which might have been used to get those numbers.

- Some mails on developer mailing lists which cite a 1-2% performance
  improvement over linuxthreads for larger java applications. For that
  case the difference between emulated and hardware TLS appears to be
  negligible.

> If the problem we're trying to solve is to get NPTL to work on MIPS, period
> the end, then we can stop the email thread now.  That certainly seems to be
> the path we're taking as of now.
> 
> If we're trying to find a solution to the problem such that MIPS isn't at a
> competitive disadvantage, perhaps we should start looking for such a
> solution, rather than rejecting suggestions because of the changes to the
> current software implementation.
> 
> So can you please crisply state what problem we're trying to solve here, and
> the bounds on an acceptable solution from your point of view.  I'm really
> confused.

The assumption of "competitive disadvantage" is AFAICS unproven, and
the schemes suggested to improve TLS performance may well turn out as
over-engineering. Using e.g. a wired TLB is known to have a performance
impact for all applications, the same is (probably to a lesser extent)
true for reserving a GPR.


Thiemo


From uhler at mips.com  Wed Feb  9 23:46:33 2005
From: uhler at mips.com (Michael Uhler)
Date: Wed, 9 Feb 2005 15:46:33 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050209232641.GA11812@rembrandt.csv.ica.uni-stuttgart.de>
Message-ID: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM>


> Yes, and nobody backs up this concern with hard data. 
> Maciej's numbers are good to know, but they tell nothing 
> about how heavily the TLS implementation will be used in 
> real-world applications.

> The assumption of "competitive disadvantage" is AFAICS 
> unproven, and the schemes suggested to improve TLS 
> performance may well turn out as over-engineering. Using e.g. 
> a wired TLB is known to have a performance impact for all 
> applications, the same is (probably to a lesser extent) true 
> for reserving a GPR.

Both points are valid.  But they assume that if we DO have a performance
problem, we'll be able to go back and fix that problem with an alternative
method (something other than a new ABI).  It was my impression that we were
discussing something that was not going to be easy to change once defined.

So if we are proposing something that uses trap-and-emulate of rdhwr (whose
register number will still have to change - I'll figure out what that is) as
an initial proposal, and we're prepared to change it to address any
performance problem that we find, I'm OK with that.

If we're not prepared to change things if we find a performance problem, I
wonder why the burden of proof is on us to prove that we're not at a
competitive disadvantage (something the ARM folks are obviously concerned
about also) vs. the burden of proof being on having sufficient performance
analysis to know that it will be fine.

To me, it all comes down to whether this is a final, unchangeable, solution
or not.

/gmu
---
Michael Uhler, Chief Technology Officer
MIPS Technologies, Inc.   Email: uhler at mips.com
1225 Charleston Road      Voice:  (650)567-5025   FAX:   (650)567-5225
Mountain View, CA 94043   Mobile: (650)868-6870   Admin: (650)567-5085


From mark at codesourcery.com  Thu Feb 10 00:04:51 2005
From: mark at codesourcery.com (Mark Mitchell)
Date: Wed, 09 Feb 2005 16:04:51 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM>
References: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM>
Message-ID: <420AA523.4060307@codesourcery.com>

Michael Uhler wrote:
> Both points are valid.  But they assume that if we DO have a performance
> problem, we'll be able to go back and fix that problem with an alternative
> method (something other than a new ABI).  It was my impression that we were
> discussing something that was not going to be easy to change once defined.

That's why I suggested, as a possible compromise, that we require that 
compilers/linkers mark the rdhwr instruction with a relocation.  That 
would allow dynamic linkers to make appropriate changes to the code, if 
appropriate.

To me, this seems like a very practical way of moving forward with our 
current implementation, while hedging our bets; what do you and others 
think?

> If we're not prepared to change things if we find a performance problem, I
> wonder why the burden of proof is on us to prove that we're not at a
> competitive disadvantage (something the ARM folks are obviously concerned
> about also) vs. the burden of proof being on having sufficient performance
> analysis to know that it will be fine.

I desperately want to avoid getting into an ARM/MIPS controversy.

(CodeSourcery is not an advocate for one architecture over the other; we 
are pleased to work with many major semiconductor vendors, and while 
loyal to each of our customers, neutral overall.)

However, I will say that the ARM GNU/Linux community and ARM, Ltd., are 
not necessarily of one mind on this topic.  I believe (though I 
certainly cannot speak for them) that ARM, Ltd., is OK with the 
requirement that, to get maximum NPTL/TLS performance, you use an ARM V6 
chip and the new ARM ABI, including a coprocessor-read instruction to 
access the thread pointer.  The ARM GNU/Linux community seems to have a 
greater attachment to the old ABI and a greater desire to support hardware 
without the appropriate coprocessors.

-- 
Mark Mitchell
CodeSourcery, LLC
mark at codesourcery.com
(916) 791-8304


From dan at codesourcery.com  Thu Feb 10 00:10:10 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Wed, 9 Feb 2005 19:10:10 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM>
References: <20050209232641.GA11812@rembrandt.csv.ica.uni-stuttgart.de> <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM>
Message-ID: <20050210001007.GE8011@nevyn.them.org>

On Wed, Feb 09, 2005 at 03:46:33PM -0800, Michael Uhler wrote:
> 
> > Yes, and nobody backs up this concern with hard data. 
> > Maciej's numbers are good to know, but they tell nothing 
> > about how heavily the TLS implementation will be used in 
> > real-world applications.
> 
> > The assumption of "competitive disadvantage" is AFAICS 
> > unproven, and the schemes suggested to improve TLS 
> > performance may well turn out as over-engineering. Using e.g. 
> > a wired TLB is known to have a performance impact for all 
> > applications, the same is (probably to a lesser extent) true 
> > for reserving a GPR.
> 
> Both points are valid.  But they assume that if we DO have a performance
> problem, we'll be able to go back and fix that problem with an alternative
> method (something other than a new ABI).  It was my impression that we were
> discussing something that was not going to be easy to change once defined.
> 
> So if we are proposing something that uses trap-and-emulate of rdhwr (whose
> register number will still have to change - I'll figure out what that is) as
> an initial proposal, and we're prepared to change it to address any
> performance problem that we find, I'm OK with that.
> 
> If we're not prepared to change things if we find a performance problem, I
> wonder why the burden of proof is on us to prove that we're not at a
> competitive disadvantage (something the ARM folks are obviously concerned
> about also) vs. the burden of proof being on having sufficient performance
> analysis to know that it will be fine.
> 
> To me, it all comes down to whether this is a final, unchangeable, solution
> or not.

There's a couple of different possibilities covered by "final" and
"unchangeable".  The one thing I'm most desperate to avoid is something
that becomes impossible or extremely difficult to support later.  Once
we publish this as part of an ABI, there will soon be deployed systems
using it; I don't want to have to force them to transition to something
else.  The TP access instruction ends up in both system libraries and
user applications, so there's real legacy impact.

Rdhwr could end up in this state, if MIPS determines that there is not
an available register encoding or that devoting one to this usage is a
bad idea.  If that's the case, let us know - we'll have to go back to
the drawing board.

[FYI: ARM actually uses a function call for application TP access.  I
think this is a bad design decision, which is why I haven't proposed it
for MIPS.  Having to make a function call tacks even more overhead onto
IE/LE access, particularly userspace register pressure.]

However, if we're willing to support whatever we choose here for the
unspecified future, I don't see a big barrier to choosing a new,
superior approach later.  Probably none of the toolchain components
would be affected; they could seamlessly transition to a new model. The
kernel would have to continue to support the existing model and the new
model; I'll leave that answer to the kernel developers reading this,
but I believe that Maciej's patches would not be infeasible to
maintain.

If this is OK with you, I'd appreciate it if you could get back to us
about the choice of register number.  I've got at least a couple more
days worth of work before the port will be ready.  It won't cripple me
to sit on it for a couple weeks after that, but I'd love to have it
submitted - I've already started to get requests for the code.

-- 
Daniel Jacobowitz
CodeSourcery, LLC


From dan at codesourcery.com  Thu Feb 10 00:18:41 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Wed, 9 Feb 2005 19:18:41 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <420AA523.4060307@codesourcery.com>
References: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> <420AA523.4060307@codesourcery.com>
Message-ID: <20050210001837.GF8011@nevyn.them.org>

On Wed, Feb 09, 2005 at 04:04:51PM -0800, Mark Mitchell wrote:
> Michael Uhler wrote:
> >Both points are valid.  But they assume that if we DO have a performance
> >problem, we'll be able to go back and fix that problem with an alternative
> >method (something other than a new ABI).  It was my impression that we were
> >discussing something that was not going to be easy to change once defined.
> 
> That's why I suggested, as a possible compromise, that we require that 
> compilers/linkers mark the rdhwr instruction with a relocation.  That 
> would allow dynamic linkers to make appropriate changes to the code, if 
> appropriate.
> 
> To me, this seems like a very practical way of moving forward with our 
> current implementation, while hedging our bets; what do you and others 
> think?

I don't think it's worthwhile, since it doesn't hedge bets very well. 
An alternative sequence could easily turn out to be more than one
instruction, but still faster than a trap.

-- 
Daniel Jacobowitz
CodeSourcery, LLC


From mark at codesourcery.com  Thu Feb 10 00:22:27 2005
From: mark at codesourcery.com (Mark Mitchell)
Date: Wed, 09 Feb 2005 16:22:27 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050210001837.GF8011@nevyn.them.org>
References: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> <420AA523.4060307@codesourcery.com> <20050210001837.GF8011@nevyn.them.org>
Message-ID: <420AA943.2040901@codesourcery.com>

Daniel Jacobowitz wrote:

> I don't think it's worthwhile, since it doesn't hedge bets very well. 
> An alternative sequence could easily turn out to be more than one
> instruction, but still faster than a trap.

(I was thinking that thus far all the sequences seem to have been 
single-instruction; the options thus far seem to have been "rdhwr", "lw 
$x, 0x1000($0)", and some other reserved-instruction sequence.)

Anyhow, there goes my attempt at cutting through this particular Gordian 
knot. :-)

Thanks,

-- 
Mark Mitchell
CodeSourcery, LLC
mark at codesourcery.com
(916) 791-8304


From ica2_ts at csv.ica.uni-stuttgart.de  Thu Feb 10 00:42:48 2005
From: ica2_ts at csv.ica.uni-stuttgart.de (Thiemo Seufer)
Date: Thu, 10 Feb 2005 01:42:48 +0100
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM>
References: <20050209232641.GA11812@rembrandt.csv.ica.uni-stuttgart.de> <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM>
Message-ID: <20050210004248.GB11812@rembrandt.csv.ica.uni-stuttgart.de>

Michael Uhler wrote:
> 
> > Yes, and nobody backs up this concern with hard data. 
> > Maciej's numbers are good to know, but they tell nothing 
> > about how heavily the TLS implementation will be used in 
> > real-world applications.
> 
> > The assumption of "competitive disadvantage" is AFAICS 
> > unproven, and the schemes suggested to improve TLS 
> > performance may well turn out as over-engineering. Using e.g. 
> > a wired TLB is known to have a performance impact for all 
> > applications, the same is (probably to a lesser extent) true 
> > for reserving a GPR.
> 
> Both points are valid.  But they assume that if we DO have a performance
> problem, we'll be able to go back and fix that problem with an alternative
> method (something other than a new ABI).  It was my impression that we were
> discussing something that was not going to be easy to change once defined.

From a technical POV it's not that hard to change the method. But
for adoption of NPTL it would be a disaster to create incompatible
variants after the initial deployment.

> So if we are proposing something that uses trap-and-emulate of rdhwr (whose
> register number will still have to change - I'll figure out what that is) as
> an initial proposal, and we're prepared to change it to address any
> performance problem that we find, I'm OK with that.

The current state is already well beyond the proposal phase. A potential
change of the current implementation would need to happen soon, and
would also need a sound reasoning. (Mark's idea of attaching a marker
relocation to the instruction is probably the best to keep an emergency
exit open. It would trade startup time for a TLS access change.)
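
A minimal sketch of what such a load-time rewrite could look like,
assuming a hypothetical list of marked instruction offsets (no such
relocation type is defined anywhere in this thread) and a single-word
replacement supplied by the caller.  The name patch_marked_sites is
made up for illustration.  The startup cost is the loop over every
marked site:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An instruction word is an RDHWR if it matches once the rt and rd
   register fields (bits 20..16 and 15..11) are masked out. */
#define RDHWR_MASK  0xffe007ffu
#define RDHWR_MATCH 0x7c00003bu

/* Hypothetical helper: overwrite every marked RDHWR in `text` with
   `replacement`.  `sites` holds word indexes taken from the assumed
   marker relocations.  Returns the number of words rewritten. */
static size_t patch_marked_sites(uint32_t *text, size_t n_words,
                                 const size_t *sites, size_t n_sites,
                                 uint32_t replacement)
{
    size_t patched = 0;

    assert(text != NULL && sites != NULL);
    for (size_t i = 0; i < n_sites; i++) {
        size_t idx = sites[i];

        if (idx >= n_words)
            continue;           /* marker points outside the section */
        if ((text[idx] & RDHWR_MASK) != RDHWR_MATCH)
            continue;           /* not an rdhwr: leave it alone */
        text[idx] = replacement;
        patched++;
    }
    return patched;
}
```

Because each site is exactly one word, this only works when the
replacement is itself a single instruction, for example 0x8c031000,
which encodes "lw $3, 0x1000($0)".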

> If we're not prepared to change things if we find a performance problem, I
> wonder why the burden of proof is on us to prove that we're not at a
> competitive disadvantage (something the ARM folks are obviously concerned
> about also) vs. the burden of proof being on having sufficient performance
> analysis to know that it will be fine.

The burden of proof is always on the side who wants to change a
working solution. :-)


Thiemo


From uhler at mips.com  Thu Feb 10 00:57:01 2005
From: uhler at mips.com (Michael Uhler)
Date: Wed, 9 Feb 2005 16:57:01 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050210004248.GB11812@rembrandt.csv.ica.uni-stuttgart.de>
Message-ID: <003601c50f0b$6e54df40$cb14a8c0@MIPS.COM>


> The burden of proof is always on the side who wants to change 
> a working solution. :-)

Oh, please.  Tell that to any open source maintainer who is looking for the
right solution, not just a working solution.

/gmu

---
Michael Uhler, Chief Technology Officer
MIPS Technologies, Inc.   Email: uhler at mips.com
1225 Charleston Road      Voice:  (650)567-5025   FAX:   (650)567-5225
Mountain View, CA 94043   Mobile: (650)868-6870   Admin: (650)567-5085


From ralf at linux-mips.org  Thu Feb 10 00:58:01 2005
From: ralf at linux-mips.org (Ralf Baechle)
Date: Thu, 10 Feb 2005 01:58:01 +0100
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <420AA523.4060307@codesourcery.com>
References: <002001c50f01$95c7cb50$cb14a8c0@MIPS.COM> <420AA523.4060307@codesourcery.com>
Message-ID: <20050210005801.GA10366@linux-mips.org>

On Wed, Feb 09, 2005 at 04:04:51PM -0800, Mark Mitchell wrote:

> Michael Uhler wrote:
> >Both points are valid.  But they assume that if we DO have a performance
> >problem, we'll be able to go back and fix that problem with an alternative
> >method (something other than a new ABI).  It was my impression that we were
> >discussing something that was not going to be easy to change once defined.
> 
> That's why I suggested, as a possible compromise, that we require that 
> compilers/linkers mark the rdhwr instruction with a relocation.  That 
> would allow dynamic linkers to make appropriate changes to the code, if 
> appropriate.
> 
> To me, this seems like a very practical way of moving forward with our 
> current implementation, while hedging our bets; what do you and others 
> think?

So we're now close to a consensus.  Given that, I'd now accept a kernel patch
that does the right thing.  Which probably means taking Maciej's patch and
polishing it to work with the latest kernel.

  Ralf


From ica2_ts at csv.ica.uni-stuttgart.de  Thu Feb 10 01:32:13 2005
From: ica2_ts at csv.ica.uni-stuttgart.de (Thiemo Seufer)
Date: Thu, 10 Feb 2005 02:32:13 +0100
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <003601c50f0b$6e54df40$cb14a8c0@MIPS.COM>
References: <20050210004248.GB11812@rembrandt.csv.ica.uni-stuttgart.de> <003601c50f0b$6e54df40$cb14a8c0@MIPS.COM>
Message-ID: <20050210013213.GD11812@rembrandt.csv.ica.uni-stuttgart.de>

Michael Uhler wrote:
> 
> > The burden of proof is always on the side who wants to change 
> > a working solution. :-)
> 
> Oh, please.  Tell that to any open source maintainer who is looking for the
> right solution, not just a working solution.

Well, so far the rdhwr emulation looks like it is both. A different
solution would IMHO need to show an improvement beyond theoretical
considerations.


Thiemo


From uhler at mips.com  Thu Feb 10 21:25:13 2005
From: uhler at mips.com (Michael Uhler)
Date: Thu, 10 Feb 2005 13:25:13 -0800
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <20050210001007.GE8011@nevyn.them.org>
Message-ID: <005b01c50fb7$0285c480$0c02a8c0@MIPS.COM>


> If this is OK with you, I'd appreciate it if you could get 
> back to us about the choice of register number.  I've got at 
> least a couple more days worth of work before the port will 
> be ready.  It won't cripple me to sit on it for a couple 
> weeks after that, but I'd love to have it submitted - I've 
> already started to get requests for the code.

I wouldn't exactly say that it's OK with us.  The impression that I get is
that it's too late to change, and even if it weren't we'd have to prove that
the trap-and-emulate approach had performance problems.  We have to trade
that off with our own desire to get NPTL supported, even if we have a
feeling that the implementation may cause problems in the future.

So, I have allocated RDHWR register 29 (decimal) for use as the pseudo-TLS
pointer.  What this means is that we have changed the architecture documents
to indicate that this register is used for an ABI-related activity such that
it will never be re-allocated for another purpose.  At this point, we do not
intend to implement this as a hardware register, nor will other MIPS
implementations be doing so.  We'll revisit this (as an architecture change)
once we measure the performance impact of the proposal and compare that with
other potential changes to the ABI.
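
For reference, a small sketch of how the instruction word for a read
of hardware register 29 is assembled.  The field layout is the
MIPS32 Release 2 RDHWR encoding; the helper name encode_rdhwr is
hypothetical, and $3 as destination is only an example:

```c
#include <assert.h>
#include <stdint.h>

/* RDHWR encoding: SPECIAL3 major opcode in bits 31..26, destination
   GPR (rt) in bits 20..16, hardware register number (rd) in bits
   15..11, function code 0x3b in bits 5..0, all other bits zero. */
#define OPC_SPECIAL3 0x1fu
#define FUNC_RDHWR   0x3bu
#define HWR_TLS      29u    /* register number allocated above */

/* Hypothetical helper: build the word for "rdhwr $gpr, $hwr". */
static uint32_t encode_rdhwr(unsigned gpr, unsigned hwr)
{
    assert(gpr < 32 && hwr < 32);
    return (OPC_SPECIAL3 << 26) | ((uint32_t)gpr << 16) |
           ((uint32_t)hwr << 11) | FUNC_RDHWR;
}
```

With these fields, encode_rdhwr(3, HWR_TLS) yields 0x7c03e83b, the
word a trap handler would see for "rdhwr $3, $29".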

Based on the email thread, I'm not sure if Mark's suggestion of a compromise
by marking the RDHWR with a relocation has benefit or not.  If it does, it
would be nice to have some hedge in the future.

/gmu

---
Michael Uhler, Chief Technology Officer
MIPS Technologies, Inc.   Email: uhler at mips.com
1225 Charleston Road      Voice:  (650)567-5025   FAX:   (650)567-5225
Mountain View, CA 94043   Mobile: (650)868-6870   Admin: (650)567-5085


From dan at codesourcery.com  Thu Feb 10 21:58:24 2005
From: dan at codesourcery.com (Daniel Jacobowitz)
Date: Thu, 10 Feb 2005 16:58:24 -0500
Subject: [mips-tls] A couple of potential changes to the MIPS TLS ABI
In-Reply-To: <005b01c50fb7$0285c480$0c02a8c0@MIPS.COM>
References: <20050210001007.GE8011@nevyn.them.org> <005b01c50fb7$0285c480$0c02a8c0@MIPS.COM>
Message-ID: <20050210215822.GC12253@nevyn.them.org>

On Thu, Feb 10, 2005 at 01:25:13PM -0800, Michael Uhler wrote:
> 
> > If this is OK with you, I'd appreciate it if you could get 
> > back to us about the choice of register number.  I've got at 
> > least a couple more days worth of work before the port will 
> > be ready.  It won't cripple me to sit on it for a couple 
> > weeks after that, but I'd love to have it submitted - I've 
> > already started to get requests for the code.
> 
> I wouldn't exactly say that it's OK with us.  The impression that I get is
> that it's too late to change, and even if it weren't we'd have to prove that
> the trap-and-emulate approach had performance problems.  We have to trade
> that off with our own desire to get NPTL supported, even if we have a
> feeling that the implementation may cause problems in the future.

Let me quote a message you didn't directly answer:

   My point is for NPTL to work.  Specifically:
   
     - Tying it to a new ABI is unacceptable in the short term, and
       possibly in the longer term, because of community pushback against
       additional ABIs.
   
     - Using a wired TLB entry gives kernel developers the shakes, because
       it restricts the available TLB slots, which can have complex
       effects on the performance of existing applications.
   
   That only leaves methods which enter the kernel.  Our choices are via a
   load from an unmapped page, via a syscall, or via an RI exception.
   Only one of those models is compatible with acceleration via future
   hardware.
   
   Using rdhwr does not result in cripplingly slow - or even perceptibly
   slow - performance, and I'm using a much more expensive emulation layer
   than Maciej's.  If the MIPS folks can clearly justify their concerns
   about this solution, and suggest some relevant ideas to benchmark the
   problem, then maybe we can make forward progress.
   
   As was just pointed out to me, Maciej's numbers are on the same order
   of magnitude as a load miss.  Note that the rdhwr is never needed more
   than once per function with an optimizing compiler.  I think that at
   the cost of a load miss, you're getting a bargain.


I'm not asking for "proof" that there are performance problems.  I'm
asking for something that I can understand as more than a "feeling",
since I "feel" that there aren't.  I'm trying to be responsive to your
concerns, but I'm still having trouble getting a handle on why
(whether?) you think any of the other ideas presented are better than
emulating rdhwr.

Please, if you have a better proposal...

> So, I have allocated RDHWR register 29 (decimal) for use as the pseudo-TLS
> pointer.  What this means is that we have changed the architecture documents
> to indicate that this register is used for an ABI-related activity such that
> it will never be re-allocated for another purpose.  At this point, we do not
> intend to implement this as a hardware register, nor will other MIPS
> implementations be doing so.  We'll revisit this (as an architecture change)
> once we measure the performance impact of the proposal and compare that with
> other potential changes to the ABI.

Thank you.  We'll need to choose a preferred GPR for the fast-path
emulation also; Thiemo suggested $3, which sounds reasonable to me.  The
choice does not make a great deal of difference.
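
A rough sketch of how a preferred GPR helps the handler, assuming $3
and hardware register 29.  This is a user-space simulation, not
kernel code: struct fake_regs and emulate_rdhwr are hypothetical
names, and branch delay slots are ignored.  The preferred form can be
recognised with a single compare; any other destination register
needs the fields decoded:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RDHWR_V1_TLS 0x7c03e83bu  /* encoding of "rdhwr $3, $29" */
#define RDHWR_MASK   0xffe007ffu  /* all bits except the rt/rd fields */
#define RDHWR_MATCH  0x7c00003bu  /* SPECIAL3 opcode + RDHWR function */
#define HWR_TLS      29u

/* Hypothetical stand-in for the trapping thread's saved context. */
struct fake_regs {
    uint32_t gpr[32];
    uint32_t epc;
    uint32_t thread_pointer;
};

/* Returns true if insn was a read of hardware register 29 and has
   been emulated; false means a genuine reserved instruction. */
static bool emulate_rdhwr(struct fake_regs *regs, uint32_t insn)
{
    assert(regs != 0);
    if (insn == RDHWR_V1_TLS) {
        /* Fast path: the preferred form, no field decoding needed. */
        regs->gpr[3] = regs->thread_pointer;
    } else if ((insn & RDHWR_MASK) == RDHWR_MATCH &&
               ((insn >> 11) & 31u) == HWR_TLS) {
        /* Slow path: some other destination register. */
        unsigned rt = (insn >> 16) & 31u;

        if (rt != 0)            /* $0 is hardwired to zero */
            regs->gpr[rt] = regs->thread_pointer;
    } else {
        return false;
    }
    regs->epc += 4;             /* step past the emulated instruction */
    return true;
}
```

Which GPR is preferred only changes the constant in the fast-path
compare, which fits the point that the choice makes little difference.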

> Based on the email thread, I'm not sure if Mark's suggestion of a compromise
> by marking the RDHWR with a relocation has benefit or not.  If it does, it
> would be nice to have some hedge in the future.

I don't think it adds any additional value.  It would only be useful if
we wanted to replace the one instruction with any single other
instruction in legacy code; most things will be rebuildable and I do
not see the application startup time overhead as preferable to the
kernel emulation for legacy code.  Does anyone else see a scenario in
which this would be good to have?

-- 
Daniel Jacobowitz
CodeSourcery, LLC