path: root/lib/Target/ARM/README.txt
author     Evan Cheng <evan.cheng@apple.com>   2007-01-19 07:51:42 +0000
committer  Evan Cheng <evan.cheng@apple.com>   2007-01-19 07:51:42 +0000
commit     a8e2989ece6dc46df59b0768184028257f913843 (patch)
tree       c0e782730e267b35f0d15668d0689e6c322fe246 /lib/Target/ARM/README.txt
parent     bd92d81d22c90433e968077aa0a4157d631d6365 (diff)
ARM backend contribution from Apple.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@33353 91177308-0d34-0410-b5e6-96231b3b80d8
Diffstat (limited to 'lib/Target/ARM/README.txt')
-rw-r--r--   lib/Target/ARM/README.txt   443
1 file changed, 406 insertions, 37 deletions
diff --git a/lib/Target/ARM/README.txt b/lib/Target/ARM/README.txt
index 0f79322b63..000e8e6450 100644
--- a/lib/Target/ARM/README.txt
+++ b/lib/Target/ARM/README.txt
@@ -2,69 +2,438 @@
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//
-Consider implementing a select with two conditional moves:
+Reimplement 'select' in terms of 'SEL'.
-cmp x, y
-moveq dst, a
-movne dst, b
+* We would really like to support UXTAB16, but we need to prove that the
+ add cannot carry from the low 16-bit chunk into the high one (see the
+ sketch after this list).
-----------------------------------------------------------
+* Implement predication support.
+* Implement pre/post increment support (e.g. PR935).
+* Coalesce stack slots!
+* Implement smarter constant generation for binops with large immediates.
+* Consider materializing FP constants like 0.0f and 1.0f using integer
+ immediate instructions and then copying them to the FPU. Would that be
+ slower than loading them into the FPU directly?
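+
+A minimal C sketch of the operation the UXTAB16 item above refers to (the
+helper name is made up, and the rotation of the second operand is ignored):
+UXTAB16 adds two zero-extended bytes of one operand to the two halfwords of
+the other, with no carry between the halfword lanes, so matching it against a
+plain 32-bit add requires proving the low lane cannot carry into bit 16.
+
+unsigned uxtab16_like(unsigned a, unsigned b) {
+  unsigned lo = ((a & 0xffffu) + (b & 0xffu))         & 0xffffu;
+  unsigned hi = ((a >> 16)     + ((b >> 16) & 0xffu)) & 0xffffu;
+  /* equals a + (b & 0x00ff00ffu) only when the low add does not carry */
+  return (hi << 16) | lo;
+}
+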
-%tmp1 = shl int %b, ubyte %c
-%tmp4 = add int %a, %tmp1
+//===---------------------------------------------------------------------===//
+
+The constant island pass is extremely naive. If a constant pool entry is
+out of range, it *always* splits a block and inserts a copy of the cp
+entry inline. It should:
+
+1. Check to see if there is already a copy of this constant nearby. If so,
+ reuse it.
+2. Instead of always splitting blocks to insert the constant, insert it in
+ nearby 'water'.
+3. Constant island references should be ref counted. If a constant reference
+ is out-of-range, and the last reference to a constant is relocated, the
+ dead constant should be removed.
+
+This pass has all the framework needed to implement this, but it hasn't
+been done.
+
+//===---------------------------------------------------------------------===//
+
+We need to start generating predicated instructions. The .td files have a way
+to express this now (see the PPC conditional return instruction), but the
+branch folding pass (or a new if-cvt pass) should start producing these, at
+least in the trivial case.
+
+Among the obvious wins, doing so can eliminate the need to custom expand
+copysign (i.e. we won't need to custom expand it to get the conditional
+negate).
+
+//===---------------------------------------------------------------------===//
+
+Implement long long "X-3" with instructions that fold the immediate in. These
+were disabled due to badness with the ARM carry flag on subtracts.
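+
+For reference, a C sketch of the case in question. The sequence in the comment
+is the kind of code we would like to emit (register assignment is illustrative
+only, assuming the low word is in r0 and the high word in r1):
+
+long long sub3(long long x) {
+  /* ideally:  subs r0, r0, #3   @ low word, sets the carry (no-borrow) flag
+               sbc  r1, r1, #0   @ high word, consumes the borrow */
+  return x - 3;
+}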
+
+//===---------------------------------------------------------------------===//
+
+We currently compile abs:
+int foo(int p) { return p < 0 ? -p : p; }
+
+into:
+
+_foo:
+ rsb r1, r0, #0
+ cmn r0, #1
+ movgt r1, r0
+ mov r0, r1
+ bx lr
+
+This is very, uh, literal. This could be a 3-operation sequence:
+ t = (p sra 31);
+ res = (p xor t)-t
+
+That would be better. This pattern occurs in png decode.
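+
+A minimal C sketch of the branchless form (assuming a 32-bit int and an
+arithmetic right shift for signed values, which ARM provides via asr):
+
+int abs_branchless(int p) {
+  int t = p >> 31;      /* 0 if p >= 0, all ones (-1) if p < 0 */
+  return (p ^ t) - t;   /* identity when t == 0, negation when t == -1 */
+}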
+
+//===---------------------------------------------------------------------===//
+
+More load / store optimizations:
+1) Look past instructions without memory side effects (i.e. not loads, stores,
+ branches, etc.) when forming the list of loads / stores to optimize.
+
+2) Smarter register allocation?
+We are probably missing some opportunities to use ldm / stm. Consider:
+
+ldr r5, [r0]
+ldr r4, [r0, #4]
+
+This cannot be merged into an ldm. Perhaps we will need to do the transformation
+before register allocation. Then teach the register allocator to allocate a
+chunk of consecutive registers.
+
+3) Better representation for block transfer? This is from Olden/power:
-compiles to
+ fldd d0, [r4]
+ fstd d0, [r4, #+32]
+ fldd d0, [r4, #+8]
+ fstd d0, [r4, #+40]
+ fldd d0, [r4, #+16]
+ fstd d0, [r4, #+48]
+ fldd d0, [r4, #+24]
+ fstd d0, [r4, #+56]
-add r0, r0, r1, lsl r2
+If we can spare the registers, it would be better to use fldm and fstm here.
+That needs a major register allocator enhancement, though.
-but
+4) Can we recognize the relative position of constantpool entries? i.e. treat
-%tmp1 = shl int %b, ubyte %c
-%tmp4 = add int %tmp1, %a
+ ldr r0, LCPI17_3
+ ldr r1, LCPI17_4
+ ldr r2, LCPI17_5
-compiles to
-mov r1, r1, lsl r2
-add r0, r1, r0
+ as
+ ldr r0, LCPI17
+ ldr r1, LCPI17+4
+ ldr r2, LCPI17+8
----------------------------------------------------------
-%tmp1 = shl int %b, ubyte 4
-%tmp2 = add int %a, %tmp1
+ Then the ldr's can be combined into a single ldm. See Olden/power.
-compiles to
+Note that for ARM v4, gcc uses ldmia to load a pair of 32-bit values that
+represent a 64-bit double FP constant:
-mov r2, #4
-add r0, r0, r1, lsl r2
+ adr r0, L6
+ ldmia r0, {r0-r1}
-should be
+ .align 2
+L6:
+ .long -858993459
+ .long 1074318540
-add r0, r0, r1, lsl #4
+5) Can we make use of ldrd and strd? Instead of generating ldm / stm, use
+ldrd/strd if there are only two destination registers and they form an
+odd/even pair. However, we would probably pay a penalty if the address is not
+aligned on an 8-byte boundary. This requires more information on load / store
+nodes (and MI's?) than we currently carry. See the sketch after this list.
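+
+A C sketch of the case item 5 is after (the function is made up; the ldrd /
+strd in the comments assume the pointers are 8-byte aligned and that the two
+values land in a consecutive register pair):
+
+void copy_pair(unsigned *dst, const unsigned *src) {
+  unsigned lo = src[0];   /* ldr r2, [r1]      -->  ldrd r2, r3, [r1] */
+  unsigned hi = src[1];   /* ldr r3, [r1, #4]                         */
+  dst[0] = lo;            /* str r2, [r0]      -->  strd r2, r3, [r0] */
+  dst[1] = hi;            /* str r3, [r0, #4]                         */
+}
+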
-----------------------------------------------------------
+//===---------------------------------------------------------------------===//
+
+* Consider this silly example:
+
+double bar(double x) {
+ double r = foo(3.1);
+ return x+r;
+}
+
+_bar:
+ sub sp, sp, #16
+ str r4, [sp, #+12]
+ str r5, [sp, #+8]
+ str lr, [sp, #+4]
+ mov r4, r0
+ mov r5, r1
+ ldr r0, LCPI2_0
+ bl _foo
+ fmsr f0, r0
+ fcvtsd d0, f0
+ fmdrr d1, r4, r5
+ faddd d0, d0, d1
+ fmrrd r0, r1, d0
+ ldr lr, [sp, #+4]
+ ldr r5, [sp, #+8]
+ ldr r4, [sp, #+12]
+ add sp, sp, #16
+ bx lr
+
+Ignore the prologue and epilogue stuff for a second. Note the copies to
+callee-saved registers:
+ mov r4, r0
+ mov r5, r1
+and the fact that they are only used by the fmdrr instruction. It would have
+been better to schedule the fmdrr before the call and place the result in a
+callee-saved DPR register; the two mov ops would then not be necessary.
+
+//===---------------------------------------------------------------------===//
+
+Calling convention related stuff:
+
+* gcc's parameter passing implementation is terrible and we suffer as a result:
+
+e.g.
+struct s {
+ double d1;
+ int s1;
+};
+
+void foo(struct s S) {
+ printf("%g, %d\n", S.d1, S.s1);
+}
-add an offset to FLDS/FLDD/FSTD/FSTS addressing mode
+'S' is passed via registers r0, r1, and r2, but gcc stores them to the stack and
+then reloads them into r1, r2, and r3 before issuing the call (r0 contains the
+address of the format string):
-----------------------------------------------------------
+ stmfd sp!, {r7, lr}
+ add r7, sp, #0
+ sub sp, sp, #12
+ stmia sp, {r0, r1, r2}
+ ldmia sp, {r1-r2}
+ ldr r0, L5
+ ldr r3, [sp, #8]
+L2:
+ add r0, pc, r0
+ bl L_printf$stub
-the function
+Instead of an stmia, an ldmia, and an ldr, wouldn't it be better to do three moves?
-void %f() {
+* Returning an aggregate type is even worse:
+
+e.g.
+struct s foo(void) {
+ struct s S = {1.1, 2};
+ return S;
+}
+
+ mov ip, r0
+ ldr r0, L5
+ sub sp, sp, #12
+L2:
+ add r0, pc, r0
+ @ lr needed for prologue
+ ldmia r0, {r0, r1, r2}
+ stmia sp, {r0, r1, r2}
+ stmia ip, {r0, r1, r2}
+ mov r0, ip
+ add sp, sp, #12
+ bx lr
+
+r0 (and later ip) is the hidden parameter from the caller, pointing to where
+the return value should be stored. The first ldmia loads the constants into
+r0, r1, and r2. The last stmia stores r0, r1, and r2 to the address passed in.
+However, there is one additional stmia that stores r0, r1, and r2 to some
+stack location; that store is dead.
+
+The llvm-gcc generated code looks like this:
+
+csretcc void %foo(%struct.s* %agg.result) {
entry:
- call void %g( int 1, int 2, int 3, int 4, int 5 )
+ %S = alloca %struct.s, align 4 ; <%struct.s*> [#uses=1]
+ %memtmp = alloca %struct.s ; <%struct.s*> [#uses=1]
+ cast %struct.s* %S to sbyte* ; <sbyte*>:0 [#uses=2]
+ call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
+ cast %struct.s* %agg.result to sbyte* ; <sbyte*>:1 [#uses=2]
+ call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
+ cast %struct.s* %memtmp to sbyte* ; <sbyte*>:2 [#uses=1]
+ call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
ret void
}
-declare void %g(int, int, int, int, int)
+llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
+constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
+into a number of loads and stores, or 2) custom lower memcpy (of small size) to
+be ldmia / stmia. I think option 2 is better, but the current register
+allocator cannot allocate a chunk of registers at a time.
-Only needs 8 bytes of stack space. We currently allocate 16.
+A feasible temporary solution is to use specific physical registers at lowering
+time for small (<= 4 words?) transfer sizes.
-----------------------------------------------------------
+* ARM CSRet calling convention requires the hidden argument to be returned by
+the callee.
-32 x 32 -> 64 multiplications currently uses two instructions. We
-should try to declare smull and umull as returning two values.
+//===---------------------------------------------------------------------===//
+
+We can definitely do a better job on BB placement to eliminate some branches.
+It's very common to see llvm-generated assembly code that looks like this:
+
+LBB3:
+ ...
+LBB4:
+...
+ beq LBB3
+ b LBB2
-----------------------------------------------------------
+If LBB4 is the only predecessor of LBB3, then we can emit LBB3 after LBB4. We
+can then eliminate the beq and turn the unconditional branch to LBB2 into a bne.
+
+See McCat/18-imp/ComputeBoundingBoxes for an example.
+
+//===---------------------------------------------------------------------===//
-Implement addressing modes 2 (ldrb) and 3 (ldrsb)
+We need register scavenging. Currently, the 'ip' register is reserved in case
+frame indexes are too big. This means that we generate extra code for stuff
+like this:
-----------------------------------------------------------
+void foo(unsigned x, unsigned y, unsigned z, unsigned *a, unsigned *b, unsigned *c) {
+ short Rconst = (short) (16384.0f * 1.40200 + 0.5 );
+ *a = x * Rconst;
+ *b = y * Rconst;
+ *c = z * Rconst;
+}
+
+we compile it to:
+
+_foo:
+*** stmfd sp!, {r4, r7}
+*** add r7, sp, #4
+ mov r4, #186
+ orr r4, r4, #89, 24 @ 22784
+ mul r0, r0, r4
+ str r0, [r3]
+ mul r0, r1, r4
+ ldr r1, [sp, #+8]
+ str r0, [r1]
+ mul r0, r2, r4
+ ldr r1, [sp, #+12]
+ str r0, [r1]
+*** sub sp, r7, #4
+*** ldmfd sp!, {r4, r7}
+ bx lr
+
+GCC produces:
+
+_foo:
+ ldr ip, L4
+ mul r0, ip, r0
+ mul r1, ip, r1
+ str r0, [r3, #0]
+ ldr r3, [sp, #0]
+ mul r2, ip, r2
+ str r1, [r3, #0]
+ ldr r3, [sp, #4]
+ str r2, [r3, #0]
+ bx lr
+L4:
+ .long 22970
+
+This is apparently all because we couldn't use ip here.
+
+//===---------------------------------------------------------------------===//
+
+Pre-/post- indexed load / stores:
+
+1) We should not make the pre/post-indexed load/store transform if the base ptr
+is guaranteed to be live beyond the load/store. This can happen if the base
+ptr is live out of the block in which we are performing the optimization, e.g.:
+
+mov r1, r2
+ldr r3, [r1], #4
+...
+
+vs.
+
+ldr r3, [r2]
+add r1, r2, #4
+...
+
+In most cases, this is just a wasted optimization. However, sometimes it can
+negatively impact performance because two-address code is more restrictive
+when it comes to scheduling.
+
+Unfortunately, live-out information is currently unavailable at DAG combine
+time.
+
+2) Consider splitting an indexed load / store into a pair of add/sub + load/store
+ to solve #1 (in TwoAddressInstructionPass.cpp).
+
+3) Enhance LSR to generate more opportunities for indexed ops (see the C sketch
+ after this list).
+
+4) Once we have added support for multiple result patterns, write indexed load
+ patterns instead of C++ instruction selection code.
+
+5) Use FLDM / FSTM to emulate indexed FP load / store.
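+
+A C sketch of the kind of loop items 1 and 3 are about (the function is made
+up; the post-indexed form in the comment assumes the pointer is not live out
+of the loop):
+
+int sum_words(const int *p, int n) {
+  int s = 0;
+  while (n-- > 0)
+    s += *p++;          /* ideally: ldr rS, [rP], #4  (post-indexed) */
+  return s;
+}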
+
+//===---------------------------------------------------------------------===//
+
+We should add i64 support to take advantage of 64-bit loads / stores.
+We can add a pseudo i64 register class containing pseudo registers that are
+register pairs. All other ops (e.g. add, sub) would be expanded as usual.
+
+We need to add pseudo instructions (i.e. gethi / getlo) to extract i32 registers
+from the i64 register. These are single moves which can be eliminated if the
+destination register is a sub-register of the source. We should implement proper
+subreg support in the register allocator to coalesce these away.
+
+There are other minor issues such as multiple instructions for a spill / restore
+/ move.
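+
+A conceptual C sketch of what the proposed gethi / getlo pseudo instructions
+would extract (the inline helpers are only an analogy; each is a single move
+that subreg-aware coalescing should be able to fold away):
+
+static inline unsigned getlo(unsigned long long v) { return (unsigned) v;        }
+static inline unsigned gethi(unsigned long long v) { return (unsigned)(v >> 32); }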
+
+//===---------------------------------------------------------------------===//
+
+Implement support for some more tricky ways to materialize immediates. For
+example, to get 0xffff8000, we can use:
+
+mov r9, #&3f8000
+sub r9, r9, #&400000
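+
+A quick C check of that arithmetic (both #&3f8000 and #&400000 are encodable
+as rotated 8-bit ARM immediates):
+
+#include <stdio.h>
+
+int main(void) {
+  unsigned r9 = 0x3f8000u;   /* mov r9, #&3f8000 */
+  r9 -= 0x400000u;           /* sub r9, r9, #&400000 */
+  printf("0x%08x\n", r9);    /* prints 0xffff8000 (32-bit unsigned) */
+  return 0;
+}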
+
+//===---------------------------------------------------------------------===//
+
+We sometimes generate multiple add / sub instructions to update sp in prologue
+and epilogue if the inc / dec value is too large to fit in a single immediate
+operand. In some cases, it might be better to load the value from a
+constantpool instead.
+
+//===---------------------------------------------------------------------===//
+
+GCC generates significantly better code for this function.
+
+int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
+ int i = 0;
+
+ if (StackPtr != 0) {
+ while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768)))
+ Line[i++] = Stack[--StackPtr];
+ if (LineLen > 32768)
+ {
+ while (StackPtr != 0 && i < LineLen)
+ {
+ i++;
+ --StackPtr;
+ }
+ }
+ }
+ return StackPtr;
+}
+
+//===---------------------------------------------------------------------===//
+
+This should compile to the mlas instruction:
+int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }
+
+//===---------------------------------------------------------------------===//
+
+At some point, we should triage these to see if they still apply to us:
+
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016
+
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982
+
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663
+
+http://www.inf.u-szeged.hu/gcc-arm/
+http://citeseer.ist.psu.edu/debus04linktime.html
+
+//===---------------------------------------------------------------------===//