Looking through archive.org I found an old web page of mine. It was archived in February 1999, but if I recall correctly it was written three or four years before that.
And, hey, look! I used to be an engineer!
HTML generated semi-automatically from an old email to the hotwired mailing list that was written very late at night. Don’t treat it as gospel….
x_{i} is bit i of the number x, in standard two's complement form.
We’re adding two n-bit numbers a and b, and a carry-in c_{0}, to give an n-bit result z and a carry-out c_{n}
z = a plus b plus c_{0}
The operands and result are of the form
a = { a_{n-1}, a_{n-2}, … a_{1}, a_{0} }
and so on.
Logically addition is this:
For all 0 <= i < n
z_{i} = a_{i} ⊕ b_{i} ⊕ c_{i}
c_{i+1} = a_{i} · b_{i} + c_{i} · (a_{i} + b_{i}) [1]
This is often reformulated in terms of propagate (p) and generate (g) signals:
g_{i} = a_{i} · b_{i} [2]
p_{i} = a_{i} ⊕ b_{i} [3]
(The alternative p_{i} = a_{i} + b_{i} is sometimes used)
(Often a kill signal (k) is used too: k_{i} = not(a_{i} + b_{i}) )
Then the carry signals can be given as:
c_{i+1} = g_{i} + c_{i} · p_{i} (substituting [2] & [3] into [1]) [4]
and
z_{i} = p_{i} ⊕ c_{i}
The expensive bit about addition is generating the c_{i}. Once all the c_{i} are generated, the z_{i} can be generated in constant time (not a negligible time, but at least it’s constant with operand length).
A ripple carry adder (RCA) implements [1] directly. You all understand ripple carry adders, so I’ll just say that they are cheap and slow.
The critical path in an RCA is from c_{0}, a_{0} or b_{0} through all the full adders to z_{n-1} or c_{n}.
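To make [1] concrete, here's a little behavioural sketch in Python, working bit by bit exactly as an RCA would (the function name and interface are mine, not anything standard):

```python
def ripple_add(a, b, c0, n):
    """Add two n-bit numbers a and b plus carry-in c0; return (z, carry_out)."""
    z = 0
    c = c0
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        z |= (ai ^ bi ^ c) << i              # z_i = a_i XOR b_i XOR c_i
        c = (ai & bi) | (c & (ai | bi))      # c_{i+1} = a_i.b_i + c_i.(a_i + b_i)
    return z, c
```

So ripple_add(9, 5, 0, 4) gives (14, 0), and ripple_add(15, 1, 0, 4) wraps round to (0, 1).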
For a four-bit adder we can expand [4] to give
c_{1} = g_{0} + c_{0}·p_{0} [5]
c_{2} = g_{1} + g_{0}·p_{1} + c_{0}·p_{0}·p_{1} [6]
c_{3} = g_{2} + g_{1}·p_{2} + g_{0}·p_{1}·p_{2} + c_{0}·p_{0}·p_{1}·p_{2} [7]
c_{4} = g_{3} + g_{2}·p_{3} + g_{1}·p_{2}·p_{3} + g_{0}·p_{1}·p_{2}·p_{3} + c_{0}·p_{0}·p_{1}·p_{2}·p_{3} [8]
This is reasonable for 4 bits, but will get grotesquely large in terms of area and fan-out if taken much further.
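Equations [5]-[8] transcribe directly into Python; a quick sketch (g and p are 4-element lists of per-bit generate/propagate bits, and the names are mine):

```python
def cla4_carries(g, p, c0):
    """Return (c1, c2, c3, c4) from per-bit generate/propagate and carry-in,
    using the expanded lookahead equations [5]-[8]."""
    c1 = g[0] | (c0 & p[0])
    c2 = g[1] | (g[0] & p[1]) | (c0 & p[0] & p[1])
    c3 = (g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2])
          | (c0 & p[0] & p[1] & p[2]))
    c4 = (g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
          | (g[0] & p[1] & p[2] & p[3])
          | (c0 & p[0] & p[1] & p[2] & p[3]))
    return c1, c2, c3, c4
```

The point, of course, is that in hardware each of these is a two-level sum of products, so all four carries appear in parallel rather than rippling.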
One option is to divide the operands into groups of four bits, use carry lookahead within each group, and ripple the carries between groups.
If we can improve on the speed of rippling carries within each group, then surely we can improve on the speed of rippling carries between groups by a similar approach.
Consider Group_{0}, consisting of (g_{0} … g_{3}, p_{0} … p_{3}).
Define a group generate g* and a group propagate p*:
For Group_{0}:
g*_{0} = g_{3} + g_{2}·p_{3} + g_{1}·p_{2}·p_{3} + g_{0}·p_{1}·p_{2}·p_{3}
p*_{0} = p_{0}·p_{1}·p_{2}·p_{3}
and identically for all other groups.
Now
c_{4} = g*_{0} + c_{0}·p*_{0}
c_{8} = g*_{1} + g*_{0}·p*_{1} + c_{0}·p*_{0}·p*_{1}
c_{12} = g*_{2} + g*_{1}·p*_{2} + g*_{0}·p*_{1}·p*_{2} + c_{0}·p*_{0}·p*_{1}·p*_{2}
c_{16} = [deleted  it’s huge….]
These are identical in form to equations [5]-[8] used to generate carries within groups.
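The same idea can be sketched in Python for a 12-bit adder built from three 4-bit groups; this is just the equations above transcribed, not a serious model, and all the names are mine:

```python
def group_gp(g, p):
    """Collapse four (g_i, p_i) pairs into one group (g*, p*) pair."""
    g_star = (g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
              | (g[0] & p[1] & p[2] & p[3]))
    p_star = p[0] & p[1] & p[2] & p[3]
    return g_star, p_star

def group_carries(gs, ps, c0):
    """Carries c4, c8, c12 from three groups' (g*, p*) and the carry-in,
    reusing the lookahead equations at the group level."""
    c4 = gs[0] | (c0 & ps[0])
    c8 = gs[1] | (gs[0] & ps[1]) | (c0 & ps[0] & ps[1])
    c12 = (gs[2] | (gs[1] & ps[2]) | (gs[0] & ps[1] & ps[2])
           | (c0 & ps[0] & ps[1] & ps[2]))
    return c4, c8, c12
```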
There’s no need to stop there.
The four-bit groups can be combined into sixteen-bit groups of groups.
Each can produce generate (g) and propagate (p) signals, combined as above to give c_{16}, c_{32}, c_{48} and c_{64}.
This divide and conquer approach will eventually generate c_{i} for the entire word width.
<This would be a lot clearer with diagrams>
For board-level adders CLA is pretty good, ’cos it can use standard MSI lookahead generator ICs for nearly everything.
It’s not bad in CMOS, but it’s not particularly good either. I’d use one of the following architectures in preference. The only exception might be in a cell-based design where an optimised lookahead cell is provided.
“In VLSI technology the carryskip adder is comparable in speed to the carry lookahead technique (for commonly used word lengths) while it requires less chip area and consumes less power.”
– Computer Arithmetic Algorithms, Israel Koren
…and that’s why it’s the adder I’d use for a 32-bit system, and probably for a 64-bit system too.
Carry propagation can skip any stage for which p_{i} = 1 (ie a_{i} != b_{i}). Several consecutive stages can be skipped if p_{i} = 1 for each stage.
A carryskip adder is divided into groups of consecutive stages, with a simple ripple carry scheme in each group.
Each group generates a group propagate signal, p*_{i}.
For Group_{i}, consisting of k stages j, j+1, … j+k-1
p*_{i} = p_{j} · p_{j+1} · … · p_{j+k-1}
This is used to allow an incoming carry into the group to skip over all the stages in the group and generate a group carry out.
Group_{i}Carry_{out} = c_{j+k} + p*_{i} · Group_{i}Carry_{in}
(c_{j+k} is the normal ripple carry out from the most significant stage in the group)
The critical path through a carry-skip adder is via the ripple carry through one of the groups, and via the skip carry chain through the remainder of the groups.
(Think about it. A carry coming out of a group via the ripple chain must have been generated within the group. If it had been generated before the group, p*_{i} would have been 1 and the group would have been skipped. So the critical path will travel through only one group).
In a carryskip adder the groups will not all be the same size. The optimal division of an nbit adder into carryskip groups depends on the characteristics of the target technology.
A 32-bit adder might have 10 groups of sizes {1, 2, 3, 4, 5, 6, 5, 3, 2, 1} for a typical technology.
VLSI implementations of carry-skip adders tend to be quite small: using the 32-bit adder given above, the extra cost over an RCA is about 20 extra gates.
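Behaviourally the whole thing can be sketched in a few lines of Python. Note that the skip path is a speed trick, not a functional one: when p* = 1 the ripple chain would deliver the same carry, just more slowly, so the OR on the skip path never changes the answer. (Group sizes are the ones above; the function name is mine.)

```python
def carry_skip_add(a, b, c0, groups=(1, 2, 3, 4, 5, 6, 5, 3, 2, 1)):
    """32-bit carry-skip adder sketch: ripple within each group, with the
    group skipped whenever its group propagate p* is 1."""
    z = 0
    c = c0
    bit = 0
    for size in groups:
        cin = c                 # the group's carry-in, feeding the skip path
        p_star = 1
        for i in range(bit, bit + size):
            ai = (a >> i) & 1
            bi = (b >> i) & 1
            z |= (ai ^ bi ^ c) << i
            c = (ai & bi) | (c & (ai | bi))   # ripple within the group
            p_star &= ai ^ bi                 # accumulate group propagate
        c = c | (p_star & cin)  # skip path: Cout = c_ripple + p*.Cin
        bit += size
    return z, c
```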
While a single level of carry-skip speeds things up a lot, a second level of skip, skipping over more than one group, can speed things up a little further for very little extra cost.
The reason carry propagation is slow in a ripple adder is that each stage needs to have a_{i}, b_{i} and c_{i} available before it can calculate c_{i+1}. One way of removing this dependency on c_{i} is to calculate both a_{i} + b_{i} + 0 and a_{i} + b_{i} + 1, then choose the appropriate result when c_{i} becomes available.
This is the basic trick of the carry select adder.
For 32-bit operands the 32-stage adder could consist of four 8-bit groups.
Group_{0} is purely an 8-bit ripple carry adder, with c_{0} as its carry input. The sum output from this RCA goes directly to the adder output.
Group_{1} has two 8-bit ripple carry adders, one with a C_{in} of 0, the other with a C_{in} of 1. The sum outputs from these two RCAs go to a 2:1 multiplexor controlled by the C_{out} of Group_{0}. The output of the mux goes to the adder output.
Group_{2} has two 8-bit RCAs, the same as Group_{1}. The sum outputs of these go to two 2:1 muxes. One is controlled by the C_{out} of the Group_{1} chain with a C_{in} of 0, so the mux output is what the sum would be if the C_{out} of Group_{0} were 0. The other is controlled by the C_{out} of the Group_{1} chain with a C_{in} of 1, giving the sum if the C_{out} of Group_{0} were 1.
These two mux outputs are fed to another 2:1 mux, controlled by the C_{out} of Group_{0}, to select the correct sum to send to the adder output.
Group_{3} is similar. The sums from the two RCAs go through two 2:1 muxes controlled by the two C_{out}s of the two RCAs in Group_{2}, giving two values, one if the C_{out} of Group_{1} is 1, one if it is 0.
These two values are passed to another pair of 2:1 muxes, controlled by the two C_{out}s of Group_{1}, giving two values, one if the C_{out} of Group_{0} is 1, the other if it is 0.
The correct one of these two values is selected by a final 2:1 mux controlled by the C_{out} of Group_{0}.
<This is far too complicated for ASCII art at this time of night. If there’s any interest I might HTMLify it>
Carry select adders tend to be slightly faster than skip adders, particularly for wide operands. They will typically consume a little over twice the area of a ripple carry adder. The layout for a select adder tends to be very regular, which can be a big advantage for datapath compilers/tilers.
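A behavioural Python sketch of the 32-bit select adder described above. The mux tree collapses into an if, which loses the hardware parallelism but shows the selection logic; the names are mine:

```python
def rca8(a, b, cin):
    """8-bit ripple-carry group, modelled behaviourally; returns (sum, carry_out)."""
    s = a + b + cin
    return s & 0xFF, s >> 8

def carry_select_add32(a, b, c0):
    """Four 8-bit groups; each upper group computed for both possible
    carry-ins, with the right result selected once the carry is known."""
    z, c = rca8(a & 0xFF, b & 0xFF, c0)     # Group_0: plain RCA
    for k in (1, 2, 3):
        ak = (a >> (8 * k)) & 0xFF
        bk = (b >> (8 * k)) & 0xFF
        s0, cout0 = rca8(ak, bk, 0)         # result if the carry-in is 0
        s1, cout1 = rca8(ak, bk, 1)         # result if the carry-in is 1
        if c:                               # the 2:1 mux, as an if
            z |= s1 << (8 * k)
            c = cout1
        else:
            z |= s0 << (8 * k)
            c = cout0
    return z, c
```

In hardware all four groups add in parallel and only the mux delays cascade, which is where the speed comes from.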
One of the DEC Alphas uses a slight variant on this approach. There’s a paper in Vol 27, No 11, November 1992 of the IEEE Journal of Solid-State Circuits giving a few paragraphs about the adder, and a nice diagram.
Prefix adders are bizarre to think about, but very powerful, particularly for longer word lengths.
It’s a little mathematical in places….
Define an operator ‘o’:
(g,p) o (g’,p’) = (g + (p · g’), p · p’)
Define G_{i} & P_{i}:
(G_{1},P_{1}) = (g_{1},p_{1})
(G_{i},P_{i}) = (g_{i},p_{i}) o (G_{i-1},P_{i-1}) for 2 <= i <= n
Then
c_{i} = G_{i} for 1 <= i <= n [1]
There’s a fairly easy, but not too interesting, inductive proof of [1].
Next, ‘o’ is associative, ie
(g_{1},p_{1}) o ( (g_{2},p_{2}) o (g_{3},p_{3}) )
= ( (g_{1},p_{1}) o (g_{2},p_{2}) ) o (g_{3},p_{3}) for all (g_{i},p_{i}) [2]
Again I’ll miss out the proof; it’s just an ‘expand both sides and notice they are identical’.
So to find c_{i} it suffices to calculate
(G_{i},P_{i}) = (g_{i},p_{i}) o (g_{i-1},p_{i-1}) o ··· o (g_{1},p_{1})
and by [2] this can be calculated in any order.
(The original reference is Brent & Kung, IEEE Transactions on Computers, Vol C-31, No 3, March 1982)
First consider the simple problem of calculating just c_{n}, for n=16. Since ‘o’ is associative (G_{16},P_{16}) can be generated by a binary tree:
Each wire in the diagram carries a pair of signals (g,p). (g_{i},p_{i}) are fed in at the base.
Each node ‘O’ performs this operation:
(g_{in},p_{in}) o (g’_{in},p’_{in})
      (g_{in},p_{in})   (g'_{in},p'_{in})
             \               /
              \             /
               \           /
                -----O-----
                     |

                    (G_{16},P_{16})
                          O
             ____________/ \____________
            O                           O
      _____/ \_____              _____/ \_____
     O             O            O             O
  __/ \__      __/ \__      __/ \__      __/ \__
  O      O      O      O      O      O      O      O
 / \    / \    / \    / \    / \    / \    / \    / \
15 14  13 12  11 10  9  8   7  6   5  4   3  2   1  0   (g_{i},p_{i})
This will give (G_{16},P_{16}) and hence c_{16}. To generate c_{15} down to c_{1} another similar tree is required:
<The second diagram hasn't survived the archiving: it showed another tree of 'O' nodes, fed with (g_{i},p_{i}) for bits 15 down to 0 at the base, producing every carry c_{16} down to c_{1} at the top>
There are many varying architectures for prefix adders, driven by speed/area/layout complexity tradeoffs.
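One well-known prefix architecture computes all the (G_{i},P_{i}) in log2(n) levels by having each position combine, at level d, with the position 2^d places below it (the Kogge-Stone pattern; my addition here, not from the original text). A sketch, again folding c_{0} in as a dummy stage:

```python
def kogge_stone_carries(a, b, c0, n):
    """Return [c_1 .. c_n] via a log-depth parallel prefix scan with 'o'.
    Index 0 is a dummy stage (g,p) = (c0, 0) that injects the carry-in."""
    g = [c0] + [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [0] + [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    d = 1
    while d <= n:
        new_g = g[:]
        new_p = p[:]
        for i in range(d, n + 1):
            # (g_i, p_i) o (g_{i-d}, p_{i-d})
            new_g[i] = g[i] | (p[i] & g[i - d])
            new_p[i] = p[i] & p[i - d]
        g, p = new_g, new_p
        d *= 2
    return g[1:]        # g[i] is now the full prefix generate, ie c_i
```

All the work at one level is independent, so in hardware each while-iteration is one gate level: all n carries in O(log n) time, at the cost of a lot of wiring, which is exactly the speed/area/layout trade-off mentioned above.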