### ADC-BASED RECEIVERS FOR WIRELINE COMMUNICATION

by

Luke Wang

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Electrical and Computer Engineering, University of Toronto

© Copyright 2019 by Luke Wang

#### Abstract


Power-efficient analog-to-digital converter (ADC) based receivers are desired for wireline communications as the industry transitions to 4-PAM at data-rates above 50Gb/s. A high-power time-interleaved successive approximation register (SAR) ADC with 7- or 8-bit resolution is usually used to cover high-loss (≥30dB at Nyquist frequency) channels with cross-talk. However, the majority of links within a data-center between servers and switches are around medium loss (20dB) or lower. To take advantage of the direct correlation between loss and power consumption, a novel greedy-search-based power scaling scheme with a BER metric is proposed so that the ADC can be adapted to work at the minimum power required by any given link. Unlike prior art, this strategy does not require finer threshold levels and works in conjunction with any equalizer. In addition, system-level simulations are developed to aid receiver design, and dynamic error sources such as jitter, skew and dynamic non-linearity are investigated. It is shown that a typical ADC design targeting an effective number of bits across the entire Nyquist band may be over-designed, as dynamic effects are reduced due to channel attenuation. A 64Gb/s 4-PAM ADC-based receiver prototype was tailored for link power scaling and fabricated in a 16nm FinFET technology. The receiver analog front-end (RX AFE) consists of a single-stage half-rate sampling continuous time linear equalizer and a 6-bit flash (1-bit folding) ADC that takes advantage of sampled input distribution symmetry to enable non-uniform quantization. For a channel with -8.6dB loss at Nyquist, the ADC can be configured in 2-bit mode, achieving BER $< 10^{-6}$ at an RX AFE power consumption of 100mW. For a -29.5dB high-loss channel, the RX AFE consumes 283.9mW and achieves a BER $< 10^{-4}$ in conjunction with a software digital equalizer. This corresponds to a power saving of 64.8%, the highest reported at this data-rate. For a -13.5dB loss channel, greedy search power scaling is used to optimize the quantizer, achieving an order of magnitude improvement in BER compared to uniform quantization.
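As a quick check of the quoted saving, and assuming it compares the RX AFE power in 2-bit mode (100mW) against the high-loss configuration (283.9mW), the figure can be reproduced as follows. The symbols $P_{\text{high}}$ and $P_{\text{low}}$ are introduced here only for illustration.

$$
\frac{P_{\text{high}} - P_{\text{low}}}{P_{\text{high}}} = \frac{283.9\,\text{mW} - 100\,\text{mW}}{283.9\,\text{mW}} \approx 0.648 = 64.8\%
$$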

### Acknowledgements

First and foremost, I would like to express my deepest gratitude to my supervisor Professor Anthony Chan Carusone for providing me with guidance and support. His advice, both technical and non-technical, has been invaluable from my time as a Master's student until now. I would not have been able to complete the work described here without his help. I would also like to thank Professor Ng, Professor Sheikholeslami, Professor Liscidini, and Professor Adve for their participation and comments as part of my PhD exam committee. Thanks also to Professor Elad Alon for acting as the external examiner and providing valuable comments.

This project was completed in collaboration with Huawei Canada. I would especially like to thank Dustin Dunwell, David Cassan, MarcAndre LaCroix, and Davide Tonietto for logistic support and Muhammad Ali Khan, Mark Roberts, and Trevor Monson for top-level layout integration & CAD support. I would also like to thank Yingying Fu for helping me with transmitter set-up and testing for use in the prototype measurements. Thanks also to Yingying for keeping my spirits up during initial bring-up in Ottawa, and to Rudy Beerkens and Walter Li for helping with testing needs when I was in Ottawa. Most importantly, thanks to Chris Feist for his superb debugging skills that made chip bring-up possible despite Huawei land-mines.

Thanks to Hossein Shakiba for technical discussion during testing in Markham. Thanks also to the early Huawei Markham adopters Shayan Shahramian and Behzad Dehlaghi for keeping me company, and to the rest of the Huawei gang (Ravi, Josh, Alireza, *et al.*) for making light of the peculiarities of the Markham office. It was definitely a unique experience, from doing layout in the dark and watching Al Jarezza in the sweltering lab to getting attacked by killer white-boards.

I would like to thank everyone in BA5000 for all the good times and hope to work with you again in the future (see you soon Jeff). I would like to thank NSERC for providing funding through the PGS-D scholarship and the Province of Ontario for providing the OGS scholarship. Finally, I would like to thank my parents and friends for the support during my study.

# Contents

| Section | Title |
|---------|-------|
|         | Table of Contents |
|         | List of Figures |
|         | List of Tables |
| 1       | Introduction |
| 1.1     | Motivation |
| 1.2     | Outline |
| 2       | Background |
| 2.1     | Wireline Modulation: 2-PAM and 4-PAM |
| 2.2     | Equalization |
| 2.2.1   | Feed-forward Equalizers (FFE) |
| 2.2.2   | Continuous Time Linear Equalizer (CTLE) |
| 2.2.3   | Decision Feedback Equalizer (DFE) |
| 2.3     | Common Receiver Architectures for 4-PAM |
| 2.4     | Power Efficient ADC Architecture |
| 2.5     | Power Minimization of ADC-based Receivers |
| 2.5.1   | Embedded Equalization |
| 2.5.2   | Non-Uniform Quantization and BER Metric |
| 2.6     | Summary |
| 3       | Link Power Scaling and System Level Design |
| 3.1     | Link Power Scaling Using Non-Uniform Level Selection |
| 3.1.1   | Greedy Search Approach |
| 3.2     | Advanced Link System Model and Design |
| 3.2.1   | Modelling Dynamic Effects |
| 3.2.1.1 | Skew and Jitter |
| 3.2.1.2 | Non-Linearity |
| 3.2.1.3 | Derivative Estimation |
| 3.2.1.4 | A Note on ENOB |
| 3.2.2   | Simulation Results |
| 3.3     | Summary |
| 4       | ADC-Based Receiver Circuit Design and Implementation |
| 4.1     | Proposed 64Gb/s ADC-Based Link and Receiver Architecture |
| 4.2     | Design Technology Considerations |
| 4.3     | Analog Front-end Design |
| 4.3.1   | Half-Rate Sampling CTLE |
| 4.3.1.1 | Design Considerations |
| 4.3.1.2 | Layout Considerations |
| 4.3.2   | Sampling Consideration for Sub-ADCs |
| 4.3.3   | Simulation Results |
| 4.4     | Sub-ADC Design |
| 4.4.1   | Architecture and Timing |
| 4.4.2   | VGA and FFE Design |
| 4.4.3   | Comparator Design |
| 4.4.4   | Reference Generation |
| 4.4.5   | Wallace Tree Adder Encoder |
| 4.4.6   | Layout and Floorplan |
| 4.4.7   | Simulation Results |
| 4.5     | Top Level Design |
| 4.5.1   | Floorplan |
| 4.5.2   | Simulation of 4-way TI ADC |
| 4.5.3   | Digital Back-End |
| 5       | Measurement Results and Discussion |
| 5.1     | ADC Performance |
| 5.1.1   | Static Performance |
| 5.1.2   | Dynamic Performance |
| 5.1.2.1 | Sub-ADC Performance |
| 5.1.2.2 | Time-Interleaved ADC Performance |
| 5.2     | Receiver Performance |
| 5.2.1   | Greedy Search |
| 5.3     | Comparison |
| 6       | Conclusion |
| 6.1     | Summary |
| 6.2     | Future Work |
| A       | Link Simulation Model for ADC-Based Receivers |

# **List of Figures**

| Figure | Caption |
|--------|---------|
| 1.1 | Global IP traffic forecast based on device type [1] |
| 1.2 | Servers inside a data-center at Google in Mayes County, Oklahoma, United States [2] |
| 1.3 | Illustration of link types as defined by the OIF-CEI-56G standard [3] |
| 1.4 | Estimates of total US data-center power consumption. The solid line represents historical data from 2000-2014, while the dashed lines are projections based on improved management (IM), best practices (BP), and hyperscale shift (HS) [4]. |
| 1.5 | A typical SerDes link including the transmitter (TX), receiver (RX), phase locked loop (PLL), clock & data recovery (CDR), and termination |
| 1.6 | Power efficiency trend for wire-line transceivers in literature: PAM-4 links indicated in circular markers |
| 2.1 | Modulation formats used in wireline communication in time and frequency domain |
| 2.2 | Example wireline channel for 64Gb/s data transmission using 2-PAM and 4-PAM showing insertion loss and corresponding pulse responses |
| 2.3 | Cross-talk sources NEXT & FEXT impacting received data at RX2 |
| 2.4 | Channel output for 2-PAM & 4-PAM 2Gb/s signal in time-domain and eye diagram format |
| 2.5 | Channel output for 2-PAM & 4-PAM 8Gb/s signal in time-domain and eye diagram format |
| 2.6 | Wireline transmission through channel causing high frequency attenuation and subsequent compensation through use of equalizer illustrated in frequency domain |
| 2.7 | Transmitter FFE pulse and frequency response showing the effect of equalization with DC attenuation. |
| 2.8 | Conceptual FFE implementation on TX (discrete time 1UI delay) and RX side (continuous time 1UI delay $T$) |
| 2.9 | Receiver FFE using discrete time delay |
| 2.10 | Passive continuous time linear equalizer (CTLE). |
| 2.11 | Active continuous time linear equalizer (CTLE) |
| 2.12 | Decision feedback equalizer implementation for single tap: direct feedback and loop-unrolled for both 2-PAM & 4-PAM |
| 2.13 | Look-ahead technique applied to (a) 2-PAM structure with look-ahead factor $LF = 2$, and (b) 4-PAM structure with $LF = 3$ |
| 2.14 | Common receiver architectures for 4-PAM: mixed-signal approach for <20dB loss and ADC-based approach for >20dB loss. |
| 2.15 | Time interleaved flash and SAR ADCs operating from 10GS/s to 64GS/s |
| 2.16 | Various hybrid ADC architectures for improved power efficiency: (a) Comparator based asynchronous binary search (CABS), (b) Multiple comparator based asynchronous SAR, (c) Folding flash ADC |
| 2.17 | A 10GS/s 32-way time-interleaved SAR ADC with a 3-tap FFE: (a) Time-interleaved structure, (b) FFE scaling built into the CDAC, post-cursor and previous cursor taps set by 5 bit $A_1$ and $A_{-1}$ |
| 2.18 | A time-interleaved pipeline ADC with embedded DFE: (a) DFE concept for multi-level signal, (b) DFE embedding at end of each pipeline stage for interleaved structure |
| 2.19 | A loop unrolled DFE utilizing non-uniform quantizing Flash architecture |
| 2.20 | Non-uniform selection ADC using high resolution DAC and modified LMS adaptation AMBER |
| 2.21 | Non-uniform selection ADC using edge samples to set fine references for flash comparators of center samples |
| 3.1 | Basic simulation model of ADC-based receiver for wireline communication |
| 3.2 | Greedy search progression illustration: starting from a symmetric 5-bit ADC, threshold levels are removed until the BER limit is no longer satisfied. |
| 3.3 | Threshold level selection for 4-PAM input for channel [0.12, 1, 0.49] and a SNR = 30dB, with a 1pre/2post tap FFE MMSE equalizer. |
| 3.4 | Threshold level selection for 4-PAM input for 10dB channel and a SNR = 30dB, with a 2 tap FFE MMSE equalizer |
| 3.5 | Threshold level selection for 4-PAM input for 20dB channel and a SNR = 30dB, with a 7 tap FFE, 1 tap DFE MMSE equalizer. |
| 3.6 | Advanced link model with system impairments for architecture investigation. |
| 3.7 | Transmitter level mismatch metric (RLM) for 4-PAM. |
| 3.8 | Illustration of skew error for an 8-way time-interleaved ADC |
| 3.9 | Illustration of dynamic non-linearity effects on ADC SNDR/SFDR with model coefficients $k_4 - k_6$ according to Eqn. 3.3 |
| 3.10 | Generation of channel derivative for dynamic non-linearity modelling |
| 3.11 | Simulation of minimal vertical eye opening for medium and high loss channels for different ADC resolutions and RX FFE length, annotations show settings for minimal EH of 5% |
| 3.12 | Example 6-bit ADC simulation with model parameters in Table 3.3. |
| 3.13 | Link simulation with model parameters in Table 3.3. |
| 3.14 | Eye diagram for link simulation for 6-bit ADC with 6 post-cursor FFE taps (1 pre-cursor tap) corresponding to only noise and all error sources cases |
| 4.1 | Proposed link architecture using a TX with a 3-tap FFE and 6-bit ADC-based receiver with greedy search enabled link power-scaling |
| 4.2 | Receiver architecture including front-end half-rate sampling CTLE and 8-way time-interleaved ADC. |
| 4.3 | Example layout for a 4-fin 6 finger ($W_{tot} = 924\,\text{nm}$, $L = 16\,\text{nm}$) device surrounded by 2 finger dummies on either side |
| 4.4 | Two-way time-interleaved CTLE sampler when CLK is high: left side is in track mode and right side is in hold mode |
| 4.5 | Clock generation for CTLE cascode sampler: (a) CM shift problem due to clock timing mismatch, (b) clock phase alignment to prevent CM shift, and reduced swing buffer for generating clock for cascode NMOS, (c) unit delay cell for alignment path. |
| 4.6 | Layout of input transistor (a) showing gate, source and drain connections, and bias tail device (b) showing "sea of gates" style layout. |
| 4.7 | Front-end timing possibilities: (a) Charge Redistribution, (b) Switch Multiplexing |
| 4.8 | Implemented sampling structure with front-end sampler passing samples to sub-ADCs |
| 4.9 | Clocking network for front-end sampler and sub-ADC phases |
| 4.10 | Optimizing AFE SNDR using clock alignment controls |
| 4.11 | AFE input amplitude and SNDR versus frequency with skew added |
| 4.12 | Sub-ADC circuit architecture: gm-stage implementing VGA followed by 1-bit folding and 5-bit full flash with 31 comparators with clock gating and Wallace tree adder |
| 4.13 | Sub-ADC timing diagram: folding switches are used to pipeline the conversion. |
| 4.14 | FFE implementation from prior art (a) and in this work (b) using current mode routing with common gate buffering. |
| 4.15 | VGA/FFE circuit implementation with relevant device sizes |
| 4.16 | Comparator design with key device sizes. |
| 4.17 | Kickback compensation techniques for LSB comparators by: (a) augmenting the dynamic pre-amplifier and (b) shifting the references to account for systematic offset. |
| 4.18 | Implementation of resistive ladder for LSB comparator reference generation. |
| 4.19 | Implementation of 7:3 (a) and 15:4 (b) Wallace encoders |
| 4.20 | Layout floorplan with sub-ADC components illustrated |
| 4.21 | Power breakdown of 4GS/s sub-ADC with a total power consumption of 32.6mW from 0.9/1.2V supplies |
| 4.22 | Sub-ADC FFT low frequency and Nyquist frequency spectrum (256 points), achieving an ENOB of 5 at Nyquist. |
| 4.23 | SNDR/SFDR across frequency for transient noise ON and OFF, showing approximately 1dB degradation in SNDR. |
| 4.24 | Top level ADC floorplan with CTLE sampler in the middle and 4 sub-ADCs on either side. |
| 4.25 | SNDR/SFDR RCC extracted simulation for 16GS/s 4-way TI ADC |
| 4.26 | Spectrum (1024 ppt FFT) illustrating effect of gain mismatch on Nyquist frequency input for 16GS/s 4-way TI ADC |
| 4.27 | Digital back-end implemented in software for enabling greedy search |
| 5.1 | Die photo and layout view with important blocks annotated of ADC-based receiver, core area of $650\,\mu\text{m} \times 250\,\mu\text{m}$ |
| 5.2 | Offset calibration of sub-ADC using off-chip DAC to generate $V_{cal}$. |
| 5.3 | Offset calibration example for a single sub-ADC showing CDAC convergence for both positive and negative $V_{cal} = V$. |
| 5.4 | Histogram of all LSB comparators across 8 sub-ADCs showing a standard deviation of 7.75 DAC codes |
| 5.5 | DNL and INL after comparator threshold calibration for all 8 sub-ADCs showing -0.97/+1.38 LSB and -1.6/+1.38 LSB respectively. |
| 5.6 | SNDR/SFDR of all 8 sub-ADCs: worst and best SNDR at Nyquist frequency is 27.71dB and 32.47dB respectively |
| 5.7 | Effect of resetting folder output to reduce ISI leading to a 6.6dB improvement in sub-ADC SFDR (2048 point FFT). |
| 5.8 | SNDR/SFDR vs DCC code and input amplitude for Nyquist frequency input of 32GS/s 8-way TI ADC. |
| 5.9 | Spectrum (16k point FFT) and SNDR/SFDR for different FSR and $f_s$ of 8-way TI ADC |
| 5.10 | CTLE gain relative to flat setting for all degeneration settings and RX noise as function of degeneration capacitance at maximum boost. |
| 5.11 | Eye diagrams illustrating FFE performance for 2-PAM input for a 21.8dB channel at 32.1875Gb/s |
| 5.12 | Test set-up and insertion loss for links used in 4-PAM measurements |
| 5.13 | ADC output eye diagrams ($\approx$52K samples) without DSP EQ, for (a) channel A (IL = $-8.6$dB) showing an open eye and (b) channel C (IL = $-21.7$dB) showing a closed eye, with TX FFE and CTLE at optimal settings |
| 5.14 | Autocorrelation of ADC output indicating a high number of FFE taps is needed for a 30dB loss channel, and resulting BER as a function of FFE tap length |
| 5.15 | Bathtub curves for channels A, C and D generated by sweeping TX phase-interpolator with all EQ coefficients frozen. |
| 5.16 | Non-linearity correction filter implementation and resulting BER bathtub before and after filter is turned on. |
| 5.17 | Link power scaling with greedy search illustrating non-uniform quantization selection: optimal 5-bit non-uniform levels shown in blue and ADC output eye PDF shown in black at optimal sampling point |
| 5.18 | Illustration of 5-bit non-uniform level selection using greedy search for different equalizer structures |
| 5.19 | RX-AFE power scaling with resolution for prototype 64Gb/s ADC-based receiver |
| A.1 | System model code structure illustration. |
| A.2 | MATLAB function *subADC_macro_generic.m* incorporating dynamic effects. |

# **List of Tables**

| Table | Caption |
|-------|---------|
| 3.1 | Simulation of greedy search performance for representative channels with 4-PAM (SNR = 30dB) |
| 3.2 | Summary of Time-Interleaved Mismatches for $N_{TI}$ sub-ADCs with Input Frequency of $\omega$, $k = 1, 2, \ldots, N_{TI}-1$ |
| 3.3 | ADC-based Link Simulation Parameters |
| 3.4 | Simulation Results for 6-bit ADC with 6 Post-Cursor FFE Taps (1 Pre-Cursor Tap) at BER < $10^{-6}$ |
| 5.1 | Comparison with Other ADC-Based 4-PAM Receivers >50Gb/s |

# **List of Acronyms**

- **ADC** Analog-to-Digital Converter
- **AFE** Analog Front-End
- **AI** Artificial Intelligence
- **AMBER** Adaptive Minimum Bit Error Rate
- **AR** Augmented Reality
- **AWGN** Additive White Gaussian Noise
- **BER** Bit Error Ratio
- **CABS** Comparator Based Asynchronous Binary Search
- **CAD** Computer Aided Design
- **CAGR** Compound Annual Growth Rate
- **CAL** Calibration
- **CDR** Clock and Data Recovery
- **CICC** Custom Integrated Circuits Conference
- **CML** Current Mode Logic
- **CTLE** Continuous-Time Linear Equalizer
- **DAC** Digital-to-Analog Converter
- **DCC** Duty-Cycle Corrector
- **DDJ** Data Dependent Jitter
- **DFE** Decision-Feedback Equalizer
- **DFT** Discrete Fourier Transform
- **DNL** Differential Non-Linearity
- **DR** Data-Rate
- **DRC** Design Rule Check
- **DSP** Digital Signal Processing (or Processor)
- **DUT** Device Under Test
- **EH** Eye Height
- **EM** Electro-Migration
- **ENOB** Effective Number Of Bits
- **EQ** Equalizer
- **EW** Eye Width
- **FEC** Forward Error Correction
- **FEXT** Far-End Cross-Talk
- **FFE** Feed Forward Equalizer
- **FFT** Fast Fourier Transform
- **FIR** Finite Impulse Response
- **FO4** Fanout of 4
- **FOM** Figure of Merit
- **FSR** Full Scale Range
- **Gb/s** Giga-Bits per Second
- **GS/s** Giga-Samples per Second
- **HD3** Third Harmonic Distortion
- **ICR** Insertion Loss to Cross-Talk Ratio
- **INL** Integral Non-Linearity
- **IL** Insertion Loss
- **ISCAS** International Symposium on Circuits and Systems
- **ISI** Inter-Symbol Interference
- **ISSCC** International Solid-State Circuits Conference
- **I/O** Input/Output
- **IoT** Internet of Things
- **JSSC** Journal of Solid-State Circuits
- **LM** Lloyd-Max
- **LMS** Least Mean Squares
- **LR** Long Reach
- **LSB** Least Significant Bit
- **LTI** Linear Time-Invariant
- **LUT** Look-Up Table
- **MLSD** Maximum Likelihood Sequence Detection
- **MMSE** Minimum Mean Squared Error
- **MOM** Metal Oxide Metal
- **MOSFET** Metal-Oxide-Semiconductor Field-Effect Transistor
- **MR** Medium Reach
- **MSB** Most Significant Bit
- **MSQE** Mean Squared Quantization Error
- **NEXT** Near-End Cross-Talk
- **NMOS** N-Channel MOSFET
- **NRZ** Non-Return to Zero
- **OE** Optical Engine
- **OIF** Optical Internetworking Forum
- **OSR** Oversampling Ratio
- **PAM** Pulse Amplitude Modulation
- **PCB** Printed Circuit Board
- **PCBA** Printed Circuit Board Assembly
- **PCIe** Peripheral Component Interconnect Express
- **PDF** Probability Density Function
- **PI** Phase Interpolator
- **PLL** Phase Locked Loop
- **PMOS** P-Channel MOSFET
- **PRBS** Pseudo-Random Binary Sequence
- **RCC** Resistor Capacitor Coupled Capacitor (Parasitics Extraction)
- **RJ** Random Jitter
- **RLM** Level Mismatch Ratio
- **RX** Receiver
- **S&H** Sample and Hold
- **SAR** Successive Approximation Register
- **SERDES** Serializer/De-Serializer
- **SF** Source Follower
- **SFDR** Spurious-Free Dynamic Range
- **SHE** Self-Heating Effect
- **SNDR** Signal-to-Noise-and-Distortion Ratio
- **SNR** Signal-to-Noise Ratio
- **TCAS** Transactions on Circuits and Systems
- **T&H** Track and Hold
- **TI** Time-Interleaved
- **TT** Typical-Typical (Corner)
- **TJ** Total Jitter
- **TX** Transmitter
- **UI** Unit Interval
- **ULVT** Ultra Low Threshold (VT)
- **USR** Ultra Short Reach
- **USB** Universal Serial Bus
- **VGA** Variable Gain Amplifier
- **VLSI** Very Large Scale Integration
- **VSR** Very Short Reach
- **XSR** Extra Short Reach
- **XT** Cross-Talk

# Chapter 1

# Introduction

### 1.1 Motivation

Growth in global internet traffic has been accelerated by the proliferation of smartphones, high-definition content streaming, the construction of the "cloud", and internet of things (IoT) devices that power emerging virtual reality (VR)/augmented reality (AR) and artificial intelligence (AI) applications. Cisco [1] estimates that the compound annual growth rate (CAGR) in IP traffic will be 24% from 2016-2021, with traffic being measured in exabytes (10<sup>18</sup> bytes) per month as shown in Figure 1.1.



Figure 1.1: Global IP traffic forecast based on device type [1].

This three-fold increase in traffic in 5 years will guarantee the emergence of faster wireline communications standards. Historically, these standards have doubled data-rates every 3-4 years; at the time of writing, 56Gb/s wireline links are in deployment while 112Gb/s links are in development. Recent standards at 56Gb/s like OIF-CEI-56G [3] released by the Optical Internetworking Forum (OIF) employ 4 level pulse-amplitude modulation (4-PAM) in addition to non-return-to-zero (NRZ) or 2-PAM modulation. System architecture with 4-PAM has also evolved to relax the link system requirement of bit error rate (BER) with the use of forward error correction (FEC) coding. OIF-CEI-56G accommodates different bit error rates at different distances, referred to as ultra-short-reach (USR), extra short reach (XSR), very-short-reach (VSR), medium-reach (MR), and long-reach (LR). Within a data-center, servers are arranged in racks as shown in Figure 1.2 where line cards within each chassis are connected at the front through modules and back through the backplane.



Figure 1.2: Servers inside a data-center at Google in Mayes County, Oklahoma, United States [2].

Figure 1.3 illustrates the XSR-LR links present in these servers as defined by the OIF-CEI-56G standard: XSR links connecting a chip to a nearby optical engine (OE) have a distance <50mm with an insertion loss of 5-10dB at Nyquist frequency; VSR links connecting a pluggable optical module to a nearby chip have a distance <150mm with an insertion loss of 10-15dB; MR links connecting nearby chips on the same printed circuit board assembly (PCBA) have a distance <500mm with a loss of 15-25dB; finally, LR links connecting different chips through the backplane have a distance <1000mm and >25dB of loss. Currently electrical links still dominate connections <1m, with optical links potentially replacing them beyond 100Gb/s. One problem with data-centers is the power consumption and thus cooling required to sustain proper operation.



Figure 1.3: Illustration of link types as defined by the OIF-CEI-56G standard [3].

A comprehensive study [4] of data-center energy use in the United States by the Department of Energy and several US higher-education institutions found that data-centers nation-wide consumed approximately 70 billion kWh/year, equivalent to 1.8% of total power consumption in the US or 7 billion USD as shown in Figure 1.4. Better energy practices being put into place have already greatly reduced the energy consumption from the predicted 2010 trends indicated by the dashed gray line. Additional hardware and software innovations may be necessary to reduce the energy footprint even further. Solutions for data-center cooling have ranged from building a floating platform to house them at sea to even fully submerging them in water through the use of a container [5]. Thus one can see that the power efficiency of wireline links within data-centers, measured in mW/Gb/s or pJ/bit, is a key constraint in design. Note that the power efficiency defined for wireline transceivers does not agree with the typical definition of efficiency, for example for power amplifiers. Here the higher the number, the more power is consumed by the link, i.e. the link is *less* efficient.



Figure 1.4: Estimates of total US data-center power consumption. The solid line represents historical data from 2000-2014, while the dashed lines are projections based on improved management (IM), best practices (BP), hyperscale shift (HS). [4].

The typical transmitter/receiver pair, known as the transceiver, is a serializer/de-serializer (SerDes) structure shown in Figure 1.5 where parallel streams are serialized before transmission and then de-serialized at the receiver. The reason for this is the need for efficient bump/pin usage for data coming off the chip/package. In addition to the serializer/de-serializer, a SerDes consists of a transmitter (TX), a phase locked loop (PLL) for clock generation, a receiver (RX) and a clock & data recovery (CDR) circuit to recover the clock to sample the data at the RX. Both the TX & RX are terminated with impedances matched to the transmission-line interconnects in order to minimize reflections at the channel interfaces. The RX usually contains an equalizer and digital signal processing unit (DSP) to tune it in order to compensate for the channel loss. Other variations of this structure exist, for instance in which a CDR is not needed due to clock forwarding, but Figure 1.5 is a common representation.

Figure 1.5: A typical SerDes link including the transmitter (TX), receiver (RX), phase locked loop (PLL), clock & data recovery (CDR), and termination.

Trends in high-speed SerDes design can be observed by plotting the power-efficiency (pJ/bit or mW/Gb/s) versus compensable channel loss (dB) for published transceiver links above 10Gb/s [6–59]. Figure 1.6a shows this and indicates the process technology used in each design in color, while Figure 1.6b indicates the data-rate in color. Transceivers using 4-PAM [19, 21, 41, 44, 48, 52, 57] as the modulation format are indicated by circular markers. Note that the total power consumption of a transceiver link includes the PLL, CDR and potential digital signal processor (DSP), for instance following an ADC-based receiver. However, this number may not be reported completely in the literature, therefore this plot is not entirely accurate. Nonetheless, Figure 1.6 shows that there is approximately a 10x power increase to compensate for an additional 30dB of loss, thus there is a huge incentive for a link to be power-scalable if the loss range is variable. This could be the case, for instance, if a single product is intended to cover a variety of link conditions. This is also the reason why standards specify different link requirements for different distances and therefore loss. Also note that as the data-rate increases, the design shifts from NRZ (2-PAM) modulation to 4-PAM. The reason for this will be examined in the next chapter. In addition, the 4-PAM designs at higher data-rates are enabled by technology scaling, with some transceivers being implemented in 16nm FinFET CMOS, the latest CMOS node in full production in 2017. The work outlined in this thesis is motivated by the need for an intelligent power-scalable receiver. Essentially this allows the link to slide along Figure 1.6, especially for the majority of server links, which are actually < 20dB in loss. Interestingly, at around 20dB of loss, Figure 1.6 shows an extreme spread of power efficiency; this will also be addressed when taking a look at link architecture in the next chapter.

Figure 1.6: Power efficiency trend for wire-line transceivers in literature: PAM-4 links indicated in circular markers.

### 1.2 Outline

This thesis is divided into 6 chapters. Chapter 2 will provide a background of wireline based communication and outline the relevant works on receivers in this area. This includes both existing ADC architectures used for receiver design and relevant power reduction strategies used in ADCs for wireline communication. Chapter 3 will introduce the greedy search based approach to link power scaling and link modelling for investigating receiver imperfections. It will show that for 4-PAM, non-uniform quantization based power scaling can be beneficial for short reach channels and moreover dynamic errors such as dynamic non-linearity in the receiver can be tolerated due to channel attenuation in a wireline setting. This leads to significant power savings in the design. Chapter 4 will present the receiver design using the results of Chapter 3 with attention directed towards the analog front-end and time-interleaved ADC. Chapter 5 consists of measurement results and discussion. Finally, conclusions and future work directions will be presented in Chapter 6.

# Chapter 2

# Background

This chapter introduces the basics of wireline communication, addressing the need for 4-PAM and channel compensation. It also introduces the typical receiver architectures for a SerDes link. Both mixed-signal and ADC-based receivers are examined in detail, along with existing techniques used to improve power efficiency.

### 2.1 Wireline Modulation: 2-PAM and 4-PAM

In contrast to wireless applications where a narrowband data signal is modulated onto a high frequency carrier, wireline transmission is broadband in nature, where the transmission media (usually copper) acts as a low-pass filter. The most common modulation format is the 2-PAM format, where data pulses are mapped to 2 distinct voltage levels, also referred to as the non-return-to-zero (NRZ) format, owing to the fact that the binary bits [0,1] are mapped to symbols [-1,1] that do not return to 0 amplitude. The other modulation format introduced in standards is the 4-PAM format, where the binary bits [0,1] are grouped by 2 and mapped to 4 distinct voltage levels as symbols [-3,-1,1,3] or [-1,-1/3,1/3,1]. These two transmission formats are shown both in time-domain and frequency-domain in Figure 2.1. Notice that the difference between 2-PAM and 4-PAM is that for the same bit rate (bits per second, bps or b/s) the symbol interval or unit interval (UI) is doubled for 4-PAM since 2 bits are transmitted per symbol. In the frequency domain this means the sinc envelope falls to its first null at half the frequency for 4-PAM: i.e. at 5GHz for 4-PAM instead of 10GHz in the 2-PAM case for a 10Gb/s data-rate. This has significant implications when the channel is considered. This advantage does not however come for free. In addition to the more complex circuitry needed for the TX & RX, the peak amplitude swing at the output of the TX is limited by the supply voltage, so the inherent signal-to-noise ratio (SNR) for 4-PAM is lower than that of 2-PAM by $20\log_{10}(3) \approx 9.5$ dB.
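The relationships in this paragraph can be checked numerically. The following is a minimal Python sketch, not part of the thesis tool-flow; the function name `pam_link_numbers` and its dictionary keys are illustrative assumptions, and the SNR penalty assumes equally spaced levels under a fixed peak swing.

```python
import math

def pam_link_numbers(bit_rate_bps: float, levels: int) -> dict:
    """Basic timing/spectral numbers for an M-PAM link at a given bit rate."""
    bits_per_symbol = math.log2(levels)
    symbol_rate = bit_rate_bps / bits_per_symbol      # symbols per second
    return {
        "unit_interval_s": 1.0 / symbol_rate,         # 1 UI
        "nyquist_hz": symbol_rate / 2.0,              # Nyquist frequency
        "first_null_hz": symbol_rate,                 # first null of the sinc envelope
        # Eye spacing shrinks by 1/(M-1) for a fixed peak swing (assumption:
        # equally spaced levels), giving the SNR penalty relative to 2-PAM.
        "snr_penalty_db": 20.0 * math.log10(levels - 1),
    }

nrz = pam_link_numbers(10e9, 2)
pam4 = pam_link_numbers(10e9, 4)
print(nrz["first_null_hz"], pam4["first_null_hz"], pam4["snr_penalty_db"])
```

For the 10Gb/s example in the text, this reproduces the 10GHz and 5GHz first nulls and the roughly 9.5dB penalty of 4-PAM.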

Figure 2.1: Modulation formats used in wireline communication in time and frequency domain.

Figure 2.2a shows the insertion loss (IL) of a measured wireline backplane channel provided by Cisco for the IEEE802.3cd Ethernet task force [60]. The time-domain effect of the channel is shown in Figure 2.2b in the form of a pulse response in red; this is the impulse response (inverse Fourier transform of the frequency response) convolved with a single data pulse of 1 UI, resulting in the output after the channel. Both 2-PAM and 4-PAM cases are shown for an example data-rate of 64Gb/s corresponding to the Nyquist frequency annotations on the frequency response. The IL experienced by 2-PAM at its Nyquist frequency of 32GHz would be -63dB, while for 4-PAM it would be -33dB at a Nyquist frequency of 16GHz. Note that in time domain, the low pass nature of the channel spreads out the transmitted pulse over multiple UIs, causing interference to other data symbols, a phenomenon known as inter-symbol interference (ISI). The point on the pulse response with the highest magnitude is termed the main-cursor, although in general this definition depends on the sampling position chosen by the clock and data recovery unit. The sampled data points to the left of the main-cursor spaced 1 UI apart are termed pre-cursors and those to the right of the main-cursor are termed post-cursors, with numbering as shown. For the data to be recovered without error at the receiver, one wants to maximize the main cursor while eliminating or minimizing the pre and post-cursors. The higher Nyquist frequency of 2-PAM causes the pre and post-cursors to be more similar in magnitude to the main-cursor compared to the 4-PAM case, thus causing more ISI. This increase in ISI, or equivalently the 30dB difference in IL between 2-PAM & 4-PAM, makes it much harder to compensate for the loss if 2-PAM is used, even though using 4-PAM incurs an SNR penalty of 9.5dB.
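To make the cursor terminology concrete, the sketch below builds a pulse response by convolving an impulse response with a 1-UI pulse and then samples it at 1-UI spacing around the peak. It is an illustration only: the two-pole toy channel, the time constant `TAU` and the helper names are assumptions and do not represent the measured backplane channel of Figure 2.2.

```python
import numpy as np

def pulse_response(impulse, samples_per_ui, dt):
    """Convolve a sampled impulse response with a 1-UI rectangular pulse."""
    pulse = np.ones(samples_per_ui)
    return np.convolve(impulse, pulse) * dt

def split_cursors(pr, samples_per_ui):
    """Sample the pulse response at 1-UI spacing, aligned to its peak."""
    main_idx = int(np.argmax(np.abs(pr)))
    first = main_idx % samples_per_ui          # earliest sample on the UI grid
    taps = pr[first::samples_per_ui]
    main_pos = main_idx // samples_per_ui      # position of the main cursor in taps
    return taps[:main_pos], taps[main_pos], taps[main_pos + 1:]

# Toy low-pass channel (assumption): two identical real poles, time in units of UI.
SAMPLES_PER_UI = 32
DT = 1.0 / SAMPLES_PER_UI
TAU = 0.5
t = np.arange(0.0, 10.0, DT)
impulse = (t / TAU**2) * np.exp(-t / TAU)

pr = pulse_response(impulse, SAMPLES_PER_UI, DT)
pre, main, post = split_cursors(pr, SAMPLES_PER_UI)
print("pre-cursors:", pre, "main:", main, "first post-cursors:", post[:3])
```

Here the main cursor is defined as the peak of the pulse response; as noted in the text, an actual CDR may settle on a different sampling position.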

(b) Channel pulse response

Figure 2.2: Example wireline channel for 64Gb/s data transmission using 2-PAM and 4-PAM showing insertion loss and corresponding pulse responses.

In addition to the channel loss (through), two types of interferers are shown in Figure 2.3: the near-end cross-talk (NEXT) and far-end cross-talk (FEXT). Figure 2.3 illustrates how these arise: NEXT is the coupling due to the adjacent transmitter (TX1) onto the receiver (RX2), while FEXT is the coupling due to the far-end transmitter (TX3) onto the receiver (RX2). Since the FEXT interferer goes through the channel, it is attenuated significantly at high frequency and does not pose as big of a problem as NEXT. Note that in general multiple NEXT and FEXT signals may be present at the receiver of interest depending on the number of simultaneously active data lanes. At 32GHz, the Nyquist frequency of the 2-PAM signal, the NEXT signal level is the same as the data signal level after channel attenuation, i.e. the insertion loss to cross-talk ratio (ICR) is approximately 0dB. Thus unless special cross-talk cancellation circuitry is introduced, the broad-band NEXT appearing as "random noise" will most likely cause the link to fail even if the through insertion loss can be compensated. For both of these reasons, at higher data-rates, 4-PAM has become a necessity despite the increased complexity in circuit design.

Figure 2.3: Cross-talk sources NEXT & FEXT impacting received data at RX2.

In order to see the impact of the channel for a series of transmitted data pulses, an eye diagram is typically constructed by cutting the data at the output of the channel into segments 1UI in length and placing them on top of each other. This way the impact of all data patterns can be seen in the same time frame where sampling in both time and voltage (decision device) at the receiver must take place. Using the same example channel as above, the output pulses and corresponding eye diagrams are plotted in Figure 2.4 for a data-rate of 2Gb/s for both 2-PAM & 4-PAM. In both cases, the eye diagram is open, i.e. there is margin to recover the data correctly if for instance the data is sampled at reference time zero and the decision device is set to certain voltage levels, e.g. zero for 2-PAM assuming no DC offset. The eye diagram is typically constructed with a certain number of bits, $N_{bit}$, and an open eye corresponds to error free operation over $N_{bit}$. To quantify the margins, the eye height (EH) and eye width (EW) are typically used and are specified at the system BER requirement as shown in Figure 2.4a. Note that the eye diagram has vertical symmetry since the binary bits are randomly distributed with equal probability and the channel is perfectly linear (LTI impulse response model). However, for this particular channel, the eye diagram is asymmetrical in the horizontal direction. In fact the zero time reference does not give the maximum eye height (EH), and indeed, depending on the compensation scheme used to improve the eye, referred to as equalization, the clock and data recovery (CDR) may choose a different sampling point in time. This may shift the sampling time to the right in this case to maximize EH. In any case, a symmetrical eye is desirable since interferences, such as jitter in time or voltage offset in the decision device, may shift the sampling point and are usually zero mean random processes with some variance. Thus one would want the same right/left margin and top/bottom margin. In the case of 4-PAM, one would also want the 3 eye openings to be roughly the same in order for symbols to have the same tolerance to error.
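The vertical eye opening at the sampling instant can be estimated directly from the UI-spaced cursors. The sketch below is a simplified illustration rather than the BER-referenced EH definition used in this thesis: it computes the worst-case (peak-distortion) opening for symbols normalized to a peak of ±1 and a positive main cursor, and cross-checks it by enumerating every symbol pattern. The cursor values and function names are hypothetical.

```python
import itertools
import numpy as np

def worst_case_eye_height(cursors, main_index, levels):
    """Peak-distortion eye opening for M-PAM symbols normalized to +/-1 peak."""
    c = np.asarray(cursors, dtype=float)
    isi = np.sum(np.abs(c)) - abs(c[main_index])   # worst-case ISI magnitude
    return 2.0 * (abs(c[main_index]) / (levels - 1) - isi)

def brute_force_eye_height(cursors, main_index, levels):
    """Same quantity found by enumerating all symbol patterns (small cases only)."""
    c = np.asarray(cursors, dtype=float)
    alphabet = np.linspace(-1.0, 1.0, levels)
    received = {a: [] for a in alphabet}
    for seq in itertools.product(alphabet, repeat=len(c)):
        received[seq[main_index]].append(float(np.dot(c, seq)))
    # Opening between each pair of adjacent levels; the smallest one limits the eye.
    openings = [min(received[hi]) - max(received[lo])
                for lo, hi in zip(alphabet[:-1], alphabet[1:])]
    return min(openings)

# Hypothetical cursors: 1 pre-cursor, main cursor, 2 post-cursors.
cursors = [0.04, 0.63, 0.26, 0.06]
eh_2pam = worst_case_eye_height(cursors, 1, 2)
eh_4pam = worst_case_eye_height(cursors, 1, 4)
print(eh_2pam, eh_4pam)
```

A negative result indicates a closed eye. With identical cursors (i.e. the same symbol rate), the 4-PAM eye closes while the 2-PAM eye stays open, reflecting the reduced level spacing of 4-PAM.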

Figure 2.4: Channel output for 2-PAM & 4-PAM 2Gb/s signal in time-domain and eye diagram format.

Figure 2.5: Channel output for 2-PAM & 4-PAM 8Gb/s signal in time-domain and eye diagram format.

When the data-rate is increased, the additional insertion loss of the channel causes the ISI to worsen and the eye to close. This is shown in Figure 2.5 for a data-rate of 8Gb/s. In both the 2Gb/s and 8Gb/s cases, we see that using 4-PAM does not actually increase the eye margins; in fact, it degrades them. This is due to the SNR penalty of 9.5dB mentioned previously. At 2Gb/s, the insertion loss difference between 2-PAM and 4-PAM is only 2dB for this channel, while at 8Gb/s it is 4.7dB. Thus, again, it is only at high data-rates that 4-PAM offers a significant advantage over 2-PAM. In order to open the eye in the 8Gb/s case, equalization must be employed. The next section briefly examines common equalization approaches.

### 2.2 Equalization

Typically wireline systems do not employ optimal detection in the form of a matched filter receiver with noise-whitening filter and Viterbi algorithm using maximum likelihood sequence detection (MLSD). MLSD has been employed in wireline receivers for long distance (>100m) optical needs [61]; however, for backplane applications it is still too power inefficient. Thus here we limit our discussion to the sub-optimal solution of equalization. Essentially the idea of equalization is to invert the channel response by using a high-pass like filter, illustrated conceptually in Figure 2.6. Note that in general the high pass filter must have a high-frequency roll-off to prevent noise enhancement above the Nyquist frequency ($f_{Nyq}$). This high frequency boost can be done both at the transmitter side and the receiver side.



Figure 2.6: Wireline transmission through channel causing high frequency attenuation and subsequent compensation through use of equalizer illustrated in frequency domain.

#### 2.2.1 Feed-forward Equalizers (FFE)

At the transmitter side, the most popular implementation [62] is a feed-forward equalizer (FFE) which is just a discrete time finite-impulse-response (FIR) filter W(z). Since the digital data stream is known and timing is defined completely by the transmitter clock, the unit interval delays and scaling are easy to generate with no noise enhancement. However because of the peak amplitude constraint, i.e. the peak swing at the output of the TX must be within the supply voltage rails of the chip, any equalization done attenuates the main cursor or low frequency content. For example, the TX filter  $W(z) = -0.13 + 0.6z^{-1} - 0.27z^{-2}$  is illustrated in time-domain and in frequency domain in Figure 2.7 where its equalizing effect is also shown. This filter has an attenuation of 0.2 or 14dB at DC and has 1 pre-cursor and 1 post-cursor tap. In addition the FFE taps are usually fixed, thus sub-optimal, as adapting them would require channel response information to be communicated from the receiver back to the transmitter through another path called the back-channel.
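The DC attenuation and peak-swing constraint of the example filter can be verified with a short calculation. The sketch below evaluates $W(z)$ on the unit circle at DC and at the Nyquist frequency; the helper `fir_response` and the variable names are illustrative, and the sum of absolute tap weights is used here as an assumed measure of the normalized peak output swing.

```python
import numpy as np

# Tap weights [w_-1, w_0, w_1] of the example filter W(z) quoted in the text.
W = np.array([-0.13, 0.6, -0.27])

def fir_response(taps, f_over_fsym):
    """Response of a UI-spaced FIR at frequency f, normalized to the symbol rate."""
    k = np.arange(len(taps))
    return np.sum(taps * np.exp(-2j * np.pi * f_over_fsym * k))

dc_gain = abs(fir_response(W, 0.0))            # z = 1
nyquist_gain = abs(fir_response(W, 0.5))       # z = -1
dc_gain_db = 20 * np.log10(dc_gain)
boost_db = 20 * np.log10(nyquist_gain / dc_gain)
peak_swing = np.sum(np.abs(W))                 # worst-case output for +/-1 data

print(dc_gain, nyquist_gain, dc_gain_db, boost_db, peak_swing)
```

The taps sum to 0.2 at DC (about -14dB) and to unit magnitude at Nyquist, while the absolute weights sum to 1, so the full boost is obtained by attenuating low frequencies rather than by exceeding the available swing.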

Figure 2.7: Transmitter FFE pulse and frequency response showing the effect of equalization with DC attenuation.

(a) TX FFE Implementation

(b) RX FFE Implementation

Figure 2.8: Conceptual FFE implementation on TX (discrete time 1UI delay) and RX side (continuous time 1UI delay T).

The TX FFE implementation is shown in Figure 2.8a where in addition to scaling the main cursor, it implements 1 pre and 1 post-tap with weights $[w_{-1}, w_0, w_1]$. This type of FIR filter can also be implemented on the RX side as shown in Figure 2.8b with weights $[c_{-1}, c_0, c_1]$, avoiding the need for a back-channel and peak amplitude constraint since the data is attenuated significantly at the channel output. However, the 1UI delay (*T*) is a broadband analog delay which is difficult to generate compared to the digital delay needed at the TX [63, 64]. Note that in both cases, additional taps may be implemented by simply adding more delays and summing.

Another possible implementation of an RX FIR equalizer is to use track and holds (T&H) which are uniformly sampling in time to generate the delays. This is usually done in a time-interleaved fashion to also relax the timing of the slicer, leading to a 1/N-rate receiver architecture for N T&Hs [65]. This is shown in Figure 2.9 for N = 3 and lends itself well to a receiver architecture where sampling at the front-end is required anyway, i.e. an analog-to-digital converter (ADC) based receiver. For example, for a 30Gb/s 2-PAM data signal, each slice would be sampling at 10GS/s, with the sampling edges of the 10GHz clocks (CLK) spaced 1UI (1/30GHz) apart, corresponding to CLK0, CLK120, and CLK240 as shown. One disadvantage is that the outputs of the T&Hs must be routed appropriately to the adjacent T&Hs, which may introduce additional parasitic loading through long metal interconnects. Before exploring receiver architectures, there are two other common equalizers: the continuous time linear equalizer (CTLE), and decision feedback equalizer (DFE).



Figure 2.9: Receiver FFE using discrete time delay

#### 2.2.2 Continuous Time Linear Equalizer (CTLE)

(a) Single-ended implementation

Figure 2.10: Passive continuous time linear equalizer (CTLE).

An analog continuous time filter can be used at the receiver side to act as the channel inversion filter instead of the FFE. These filters, known as continuous time linear equalizers (CTLE), can be either passive [66] or active [67]. One possible single-ended implementation of the passive filter and its frequency response is shown in Figure 2.10, where minimizing $C_2$ enables the gain to approach unity at high frequency. The lower limit on $C_2$ represents parasitic capacitance at the output. The corresponding transfer function is shown in Equation 2.1. The DC gain, $\frac{R_2}{R_1+R_2} < 1$, is less than the AC gain, $\frac{C_1}{C_1+C_2} < 1$, with the ratio of AC gain over DC gain or pole frequency $\omega_p$ over zero frequency $\omega_z$ known as the high frequency boost or peaking amount. Thus the passive filter also attenuates the low frequency content of the received signal to provide boost. The amount and frequency of peaking is usually tunable by adjusting the zero location and/or pole location. For large high frequency boost, large DC attenuation is required causing eye attenuation similar to the TX FFE, thus an active CTLE may be preferred.

$$H_{passive}(s) = \frac{R_2}{R_1 + R_2} \cdot \frac{1 + R_1 C_1 s}{1 + \frac{R_1 R_2}{R_1 + R_2} (C_1 + C_2) s} \tag{2.1}$$
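As a numerical illustration of Equation 2.1, the sketch below evaluates the passive CTLE response and confirms that the boost equals both the AC-to-DC gain ratio and the pole-to-zero frequency ratio. The component values are hypothetical and chosen only for readability; they are not taken from any design in this thesis.

```python
import numpy as np

def passive_ctle(f_hz, r1, r2, c1, c2):
    """Evaluate Equation 2.1 at frequency f_hz (scalar or array)."""
    s = 2j * np.pi * np.asarray(f_hz, dtype=float)
    dc_gain = r2 / (r1 + r2)
    r_par = r1 * r2 / (r1 + r2)                 # R1 in parallel with R2
    return dc_gain * (1 + r1 * c1 * s) / (1 + r_par * (c1 + c2) * s)

# Hypothetical component values.
R1, R2 = 300.0, 100.0          # ohms
C1, C2 = 100e-15, 25e-15       # farads

dc_gain = abs(passive_ctle(0.0, R1, R2, C1, C2))
hf_gain = abs(passive_ctle(1e15, R1, R2, C1, C2))          # far above the pole
f_zero = 1 / (2 * np.pi * R1 * C1)
f_pole = (R1 + R2) / (2 * np.pi * R1 * R2 * (C1 + C2))
boost_db = 20 * np.log10(hf_gain / dc_gain)

print(dc_gain, hf_gain, f_zero / 1e9, f_pole / 1e9, boost_db)
```

With these values the DC gain is 0.25, the high-frequency gain approaches 0.8, and the boost of about 10dB is obtained entirely by attenuating low frequencies.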

One popular fully-differential implementation of the active CTLE and its frequency response is shown in Figure 2.11 [67]. The corresponding transfer function is shown in Equation 2.2. In this case the bandwidth is limited by the 2nd pole at $\omega_{p2} = \frac{1}{R_D C_p}$, where $R_D$ is the load resistance and $C_p$ is the total capacitive load at the output. In general this is fixed and set close to the Nyquist frequency of data transmission. In case the parasitic capacitance on the output limits the bandwidth, it can be improved by various methods such as inductive peaking in the load. The peaking response can be adjusted by tuning the degeneration elements. Increasing $R_s$ moves the zero $\omega_z$ to a lower frequency and increases the peaking at the expense of lower DC gain (with minimal impact on the 1st pole $\omega_{p1}$). Increasing $C_s$ moves both $\omega_z$ and $\omega_{p1}$ to a lower frequency to adjust to the channel profile without impacting the peaking amount. One disadvantage of the CTLE, similar to the receive FFE, is that it cannot differentiate between noise/cross-talk and signal and thus noise amplification occurs. In order to alleviate this problem a decision feedback equalizer (DFE) may be used.

$$H_{active}(s) = \frac{g_m R_D}{1 + g_m R_s/2} \cdot \frac{1 + R_s C_s s}{\left(1 + \frac{R_s C_s}{1 + g_m R_s/2} s\right)\left(1 + R_D C_p s\right)} \tag{2.2}$$
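The tuning behaviour described for $R_s$ and $C_s$ follows directly from the pole and zero locations in Equation 2.2, as the sketch below illustrates. The function `active_ctle_params`, its dictionary keys and the device values are hypothetical; the "ideal peaking" is taken as the ratio $\omega_{p1}/\omega_z$ and ignores the bandwidth limit imposed by the 2nd pole.

```python
import math

def active_ctle_params(gm, rd, rs, cs, cp):
    """DC gain and pole/zero locations of Equation 2.2."""
    degeneration = 1 + gm * rs / 2
    return {
        "dc_gain": gm * rd / degeneration,
        "f_zero_hz": 1 / (2 * math.pi * rs * cs),
        "f_pole1_hz": degeneration / (2 * math.pi * rs * cs),
        "f_pole2_hz": 1 / (2 * math.pi * rd * cp),
        # Peaking if the 2nd pole were far away (assumption): w_p1 / w_z.
        "ideal_peaking_db": 20 * math.log10(degeneration),
    }

# Hypothetical device values: gm = 20mS, RD = 100ohm, Rs = 200ohm, Cs = 100fF, Cp = 50fF.
base = active_ctle_params(20e-3, 100.0, 200.0, 100e-15, 50e-15)
more_rs = active_ctle_params(20e-3, 100.0, 400.0, 100e-15, 50e-15)
more_cs = active_ctle_params(20e-3, 100.0, 200.0, 200e-15, 50e-15)

for name, p in (("base", base), ("2x Rs", more_rs), ("2x Cs", more_cs)):
    print(name, p)
```

Doubling $R_s$ lowers the DC gain and the zero frequency while raising the peaking, whereas doubling $C_s$ shifts the zero and 1st pole down together and leaves the peaking unchanged.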



(a) Differential implementation

Figure 2.11: Active continuous time linear equalizer (CTLE).

#### 2.2.3 Decision Feedback Equalizer (DFE)

The decision feedback equalizer (DFE) is another common equalizer structure implemented at the receiver side. Unlike the linear filters described above, the DFE is a non-linear equalizer. A conceptual implementation using 1 tap is shown for a 2-PAM signal on the left in Figure 2.12a, where the decision device or slicer determines the previous bit by comparing it to the threshold DZ (e.g. 0). The decision is then scaled by $\alpha$, fed back and summed with the current bit. If the slicer decided correctly, then by adjusting the scaling factor the ISI introduced by the previous bit can be completely removed. This can then be repeated with multiple taps; however, due to the feedback structure only post-cursor ISI can be cancelled. In practice for high data-rates, the first tap may not be implementable in this manner since the total delay for the feedback is only 1 UI: this includes the decision time for the slicer and the scaling and summation propagation delay. For instance at 50Gb/s for 2-PAM, 1 UI is only 20ps, and for most advanced CMOS processes the fanout of 4 (FO4) rise/fall-time at the typical corner is on the order of 10ps. The most common way to combat this is to perform loop unrolling as shown on the right in Figure 2.12a. Loop unrolling, also known as speculation, computes all possible outputs by comparing to thresholds DZ - $\alpha$ for the previous bit being -1, and DZ + $\alpha$ for the previous bit being 1, and selects the correct decision later. Thus the critical path delay is reduced to a multiplexer delay, at the cost of doubling the number of slicers. This problem is exacerbated for the 4-PAM case as shown in Figure 2.12b. Here the loop unrolled case needs 4 times the number of slicers compared to the one without loop unrolling. In addition, the loop unrolled critical path now contains a 4:1 multiplexer as opposed to a 2:1 multiplexer in the 2-PAM case.
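The equivalence between direct feedback and loop unrolling for a 1-tap 2-PAM DFE can be demonstrated with a short behavioural model. This is a sketch under stated assumptions, not the receiver implementation: the function names, the initial decision `d_init`, the tap weight of 0.8 and the noise level are all hypothetical, and the feedback is modelled as subtracting $\alpha$ times the previous decision before comparing with DZ.

```python
import numpy as np

def dfe_direct(y, alpha, dz=0.0, d_init=-1):
    """1-tap DFE with direct feedback: subtract alpha * previous decision, then slice."""
    d_prev, out = d_init, []
    for sample in y:
        d = 1 if sample - alpha * d_prev > dz else -1
        out.append(d)
        d_prev = d
    return np.array(out)

def dfe_unrolled(y, alpha, dz=0.0, d_init=-1):
    """Loop-unrolled DFE: two speculative slicers followed by a 2:1 mux."""
    y = np.asarray(y)
    if_prev_hi = np.where(y > dz + alpha, 1, -1)   # assumes previous bit was +1
    if_prev_lo = np.where(y > dz - alpha, 1, -1)   # assumes previous bit was -1
    d_prev, out = d_init, []
    for d_hi, d_lo in zip(if_prev_hi, if_prev_lo):
        d = int(d_hi) if d_prev == 1 else int(d_lo)   # mux select = previous decision
        out.append(d)
        d_prev = d
    return np.array(out)

# Toy channel (assumption): main cursor 1, first post-cursor 0.8, small Gaussian noise.
rng = np.random.default_rng(0)
bits = rng.choice([-1, 1], size=2000)
prev_bits = np.concatenate(([-1], bits[:-1]))
y = bits + 0.8 * prev_bits + 0.1 * rng.standard_normal(bits.size)

direct = dfe_direct(y, alpha=0.8)
unrolled = dfe_unrolled(y, alpha=0.8)
print(np.array_equal(direct, unrolled), np.count_nonzero(direct != bits))
```

Both structures produce identical decisions; the unrolled version only moves the slicing out of the feedback loop so that the loop contains a single multiplexer.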



(b) 4-PAM

Figure 2.12: Decision feedback equalizer implementation for single tap: direct feedback and loop-unrolled for both 2-PAM & 4-PAM.

At even higher data-rates, DFEs with look-ahead are often used, as first described in [68] and implemented for instance in [69] using a quarter-rate architecture to function up to 85Gb/s in simulation across PVT. A simple look-ahead example for 2-PAM is shown in Figure 2.13a, where the dependence on the output 1UI prior in the feedback loop is now transformed into a dependence on the output 2UI prior. The concept can of course be applied to any look-ahead factor (*LF*) and in general a 2-PAM *D*-tap DFE will need  $2^D(LF - 1) + \sum_{i=1}^D 2^{D-i}$  2:1 muxes while a 4-PAM *D*-tap DFE will need  $4^D(LF - 1) + \sum_{i=1}^D 4^{D-i} \times 2 \times 3$  2:1 muxes. An example for 4-PAM is shown in Figure 2.13b for D = 1 and LF = 3, where the feed-forward section may also be pipelined in an actual implementation. In general the number of slicers required by the DFE grows as a power of 4 as more taps are added for loop-unrolling, while the number of muxes grows by a factor  $4^D$  as the look-ahead factor is increased for a 4-PAM system. This presents a key constraint for using multi-tap DFEs for 4-PAM at high data-rates.
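To get a feel for how quickly the mux count grows, the two expressions above can be evaluated directly. The sketch below transcribes them literally with standard operator precedence (the factor $2 \times 3$ is applied inside the sum only); the function names are hypothetical.

```python
def mux_count_2pam(d_taps, lf):
    """2:1 mux count for a 2-PAM D-tap DFE with look-ahead factor LF."""
    return 2**d_taps * (lf - 1) + sum(2**(d_taps - i) for i in range(1, d_taps + 1))

def mux_count_4pam(d_taps, lf):
    """2:1 mux count for a 4-PAM D-tap DFE with look-ahead factor LF."""
    return 4**d_taps * (lf - 1) + sum(4**(d_taps - i) * 2 * 3 for i in range(1, d_taps + 1))

if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, mux_count_2pam(d, 3), mux_count_4pam(d, 3))
```

For a fixed look-ahead factor, the 4-PAM count grows roughly as $4^D$ while the 2-PAM count grows as $2^D$.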



Figure 2.13: Look-ahead technique applied to (a) 2-PAM structure with look-ahead factor LF = 2, and (b) 4-PAM structure with LF = 3.

The key advantage of the DFE is that it does not amplify noise/crosstalk due to the slicing operation, i.e. a "digital" value is being fed back. However, this means that the slicer must decide correctly most of the time for the feedback to be effective. Large noise and residual ISI can lead to error-propagation [70] where an erroneous decision leads to the wrong feedback value

and thus the next error and so on. Error-propagation for 2-PAM is generally not a big problem since the BER target is often lower at  $< 10^{-12}$  compared to  $< 10^{-6}$  for 4-PAM before FEC. A proper FEC engine that can handle DFE burst errors due to error propagation is necessary for 4-PAM in the current standards. A 4-PAM receiver without FEC is also a possibility, however this has not been adopted in any standard currently.

### 2.3 Common Receiver Architectures for 4-PAM

Given the background on equalizers, it is possible to arrive at the receiver architecture shown in Figure 2.14a [19, 52, 71]. This mixed-signal receiver architecture uses a CTLE usually implemented in multiple stages and a DFE. One potential advantage of a receive FFE is its channel-shaping capability, which is better than that of the single-stage CTLE. However, with multiple stages, the CTLE is capable of implementing both mid-band shaping and long-tail cancellation [44] without the analog delay generation required of the FFE. Additional shaping is also provided on the TX side with a few taps to limit the penalty imposed by the peak amplitude constraint. The TX FFE and CTLE are also necessary to cancel any pre-cursors that the DFE cannot handle. In terms of the DFE, due to complexity, the number of taps is limited to 1 in [52] and 3 in [19]. There has been an effort in [71] to increase the number of taps to 10 at 56Gb/s in a 16nm FinFET CMOS process without loop-unrolling; however, the power consumption is significant at 12.88mW/tap, with taps 2-10 improving the eye opening at a BER of  $10^{-6}$  by 25%. This barely allows the link to function at 0.2UI margin for a VSR channel with 10dB loss.

Given the difficulties in implementing mixed-signal receivers for 4-PAM, ADC-based receivers have become increasingly popular [21, 41, 48, 57]. Currently they are the only viable solution for 4-PAM links at data-rates of 56Gb/s and above, and with channel losses of 20dB and above. The primary reason is that digital circuit solutions have become more attractive for equalization functions compared to their analog counterparts due to the ease of digital abstraction, and the reduced circuit area and power afforded by technology scaling. Therefore in the case of an ADC-based receiver, the majority of the equalization function is implemented digitally in the digital signal processor (DSP). The DSP also takes care of any adaptation loops and incorporates a digital CDR for baud-rate timing recovery. Such an architecture is shown in Figure 2.14b. The front-end multi-stage CTLE does not need to provide as much boost as in the mixed-signal case due to the DSP equalizer and can be implemented in fewer stages with lower power. Generally the CTLE has a hard time fitting the channel response for a high loss channel, leading to higher residual ISI. For 4-PAM, the effect of residual ISI is 3 times worse since the adjacent symbol could be 3 times the magnitude of the current symbol, e.g. +/-3 to



(a) Mixed-signal Architecture



(b) ADC-based Architecture

Figure 2.14: Common receiver architectures for 4-PAM: mixed-signal approach for <20dB loss and ADC-based approach for >20dB loss.

+/-1. The highly flexible DSP is capable of addressing this with many taps of FFE and a few taps of DFE: 14 tap FFE and 1 tap (loop-unrolled) DFE in [57] and 24 tap FFE and 1 tap DFE in [48].

One disadvantage of the ADC-based receiver is its power efficiency when compared with traditional analog/mixed signal receivers. In terms of analog complexity, an extra bit of resolution requires approximately 2 times more power (empirical as expressed in Walden's Figure of Merit) and typically ADCs used in wireline communication require a resolution of 6 bits or above. The resolution is dependent on the channel loss characteristics, with high loss channels requiring higher resolution such that the quantization does not affect the equalization. In the author's Master's thesis [72], time-interleaving, and the necessary error correction as a result, was explored as a means of improving power efficiency. However, this alone does not solve the power efficiency problem as link requirements become more stringent. For instance, the 2-PAM receiver works in [73] & [74] address this problem by using two paths. In both works, the ADC path consumes approximately 3 times the power of the mixed signal path and is used for high loss/crosstalk channels (>25dB at Nyquist frequency). A unified approach would be to enable power-scaling in the ADC-based receiver such that its power consumption is similar to a mixed-signal receiver for low loss channels, thereby removing the need for another

path. Another issue is the DSP power consumption, which can be in fact comparable to the analog power; for instance 40.4% of the total power is consumed by the DSP in [57]. However as shown in [57], scaling digital power corresponding to link loss is much easier than analog power since equalizer taps can be easily disabled. For analog power saving, due to the ADC structure being a successive approximation register (SAR) architecture, the saving is only 9.2%, compared to the digital power saving of 70.5% when scaling from a channel loss of 32dB to 7.4dB. In the next sections, alternative ADC architectures are discussed and recent power scaling link ideas in literature are presented.

### 2.4 Power Efficient ADC Architecture

Regarding ADC architecture, CMOS scaling has benefited ADC power efficiency the most where the architecture contains significant digital hardware. This includes both the successive approximation register (SAR) and flash ADC. For high speed (>10Gb/s) links, while SiGe and InP processes allow converters to operate without time interleaving, time interleaving becomes a necessity to preserve power efficiency. For instance the 3-bit single channel ADC in [75] fabricated in a  $0.12\mu$ m SiGe process consumes 3.8W while operating at 40GS/s. Figure 2.15 shows time-interleaved CMOS flash and SAR ADCs operating from 10GS/s to 64GS/s [48, 73, 74, 76–92] using Walden's figure of merit (FOM). Notice that both architectures achieve similar energy efficiency at a resolution of 6 bits and below, with the SAR architecture at lower FOM. For instance the 20GS/s 6 bit Flash [88] achieves 124fJ/conv-step at 10GHz input frequency, while the 36GS/s 6 bit SAR [86] achieves 105fJ/conv-step at 14.1GHz, both fabricated in 32nm SOI CMOS. For 4-PAM needs, the resolution required may be larger than 6 bits for 30dB loss channels, and the SAR architecture offers more opportunities since the power of a flash ADC scales exponentially with resolution, whereas the power of a SAR ADC scales linearly.
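For reference, Walden's FOM is the energy per conversion step, $P / (2^{ENOB} \cdot f_s)$. The short sketch below evaluates it; the function name and the numeric operating point are hypothetical and are not taken from any of the cited converters.

```python
def walden_fom(power_w, enob_bits, fs_hz):
    """Walden figure of merit in joules per conversion step."""
    return power_w / (2.0**enob_bits * fs_hz)

if __name__ == "__main__":
    # Hypothetical operating point: 100 mW, 5 effective bits, 20 GS/s.
    fom = walden_fom(0.1, 5.0, 20e9)
    print(f"{fom * 1e15:.2f} fJ/conv-step")
```

At constant FOM and sampling rate, adding one effective bit doubles the power, which is the empirical trend referred to in Section 2.3.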

In addition, various SAR/flash hybrids have been used to leverage both the power efficiency of the SAR (especially at high resolution >6bits) and the speed of the flash architecture. These include the comparator based asynchronous binary search (CABS) architecture [93] shown in Figure 2.16a. In this approach the binary search is implemented using a comparator tree, where the next stage is asynchronously triggered by the output of the previous stage, allowing it to be faster than the traditional SAR architecture without the need of a DAC and digital controller. It has the exponential hardware complexity of a flash ADC but the power scales linearly with the number of bits as only 1 comparator in each step is activated. Multiple comparator based asynchronous SAR [94] shown in Figure 2.16b can be used to avoid the exponential complexity (in terms of hardware not power) of CABS, while still performing faster than the traditional



Figure 2.15: Time interleaved flash and SAR ADCs operating from 10GS/s to 64GS/s.

SAR due to the overlap of DAC settling time and comparator digital ready generation. The digital logic is also simplified significantly (no need for a state machine since the comparator ready signal is used as a state). Finally the folding flash architecture [95] in Figure 2.16c uses a single comparator to resolve the most significant bit (MSB) and rectify the differential signal. This allows a saving of half of the number of comparators compared to the traditional flash, reducing power consumption while sacrificing conversion speed.

## 2.5 **Power Minimization of ADC-based Receivers**

In the context of ADCs for wireline receivers, the ADCs in [79, 96] enable power scaling by reducing the resolution and/or speed for various link conditions. In [96] the ADC was designed to be of a flash architecture working in 3 possible configurations: single channel 5 bit at 2.5GS/s, 2-way time-interleaved 4 bit at 5GS/s, and 4-way time-interleaved 3 bit at 10GS/s. By reducing the resolution at higher speeds, the track and hold design becomes easier for handling all configurations.

Two other methods of reducing the ADC power are: through the use of embedded equalization within the ADC architecture itself [85, 97, 98] and utilizing non-uniform quantization [99–102], in some cases specifically in conjunction with a DFE [103–105].



Figure 2.16: Various hybrid ADC architectures for improved power efficiency: (a) Comparator based asynchronous binary search (CABS), (b) Multiple comparator based asynchronous SAR, (c) Folding flash ADC.

#### 2.5.1 Embedded Equalization

An analog equalizer can be added in front of the ADC in the form of the CTLE or RX FFE as described in Section 2.2 to partially equalize the data before quantizing. In this way the quantization noise of the ADC is not amplified by the front-end equalization and consequently the digital equalizer following the ADC can be made simpler, i.e. there is a power advantage to partitioning some equalization before quantizing despite the benefits of a fully digital equalizer. Since the ADCs are time-interleaved, the T&Hs sample the data at 1/N rate for an N-way interleave, enabling the use of the sampled RX FFE as was shown in Figure 2.9. In the case of the flash ADC, one way the RX FFE function can be built is by adding another sampler & variable gain amplifier (VGA) path to the main path of each interleave and summing [104]. In the case of the pipeline ADC, one approach [106] is to modify the decoding logic to allow concurrent samples to be present at the same time (at lower resolution due to pipeline stages) to implement a register-less FFE. For the SAR ADC, the FFE can be built by exploiting the existing capacitive digital to analog converter (CDAC) which performs analog summing [85, 97] as shown in Figure 2.17 for the 10GS/s 32-way time-interleaved SAR in [85]. Note that this is possible because the CDAC manipulates the analog signal *before* quantizing. The disadvantage is mainly the extra routing and parasitics introduced by the FFE connections seen in Figure 2.17a.

It is also possible to embed a DFE into the ADC architecture as done for a 6GS/s 4-bit time-interleaved pipeline ADC in [98]. For multi-level signalling such as 4-PAM, the slicer is simply a multi-level quantizer, and a feedback DAC is needed for a 1-tap DFE as shown in Figure 2.18a. The pipeline architecture can be used to implement the DFE as shown in Figure 2.18b by subtracting a fraction of the ISI according to the partial decision at the end of each pipeline stage.

#### 2.5.2 Non-Uniform Quantization and BER Metric

The loop-unrolled DFE structure in Figure 2.12 is also reminiscent of a flash ADC architecture where the comparator thresholds are non-uniformly spaced, clustering around the decision thresholds. In [103, 104], a loop unrolled DFE for 2-PAM is constructed by using a non-uniform quantizing flash architecture as shown in Figure 2.19. The observation made is that subtracting an offset to compensate for ISI in a traditional DFE is equivalent to determining which quantization threshold  $T_j$  is closest to the offset. The bit error rate (BER) expression, where ISI is normalized to the signal bit (±1) for the *ith* bit can be written as

$$BER_{i} = \begin{cases} Q\left(\frac{(1+y_{ISI,i})-T_{j}}{\sigma}\right) & bit_{i} = 1\\ Q\left(\frac{T_{j}-(-1+y_{ISI,i})}{\sigma}\right) & bit_{i} = 0 \end{cases} \tag{2.3}$$



Figure 2.17: A 10GS/s 32-way time-interleaved SAR ADC with a 3-tap FFE: (a) Time-interleaved structure, (b) FFE scaling built into the CDAC, with post-cursor and pre-cursor taps set by 5-bit  $A_1$  and  $A_{-1}$.



Figure 2.18: A time-interleaved pipeline ADC with embedded DFE: (a) DFE concept for multilevel signal, (b) DFE embedding at end of each pipeline stage for interleaved structure.



Figure 2.19: A loop unrolled DFE utilizing non-uniform quantizing Flash architecture

Notice that the BER is minimized when  $T_j = y_{ISI}$ . Interestingly this is *not* the same as minimizing the quantization error by placing  $T_j = 1 + y_{ISI}$ . For an *N* bit ISI history, if the number of comparators *M* is equal to  $2^N$ , then the thresholds of the comparators can simply be placed at the ISI values. If  $M < 2^N$ , the optimal thresholds of the comparators can be obtained by ordering the  $2^N$  ISI values, dividing them into *M* groups, and placing the thresholds in the middle of each group as shown, i.e. mapping more than one ISI value to a comparator. The look-up table (LUT) performs this function in conjunction with the stored *N* bit history in the shift-register. In some cases, if the comparators have uniformly spaced threshold levels (as in a normal ADC construction), depending on the ISI values, it is possible that some comparators will never be used, leading to the conclusion that a non-uniform quantizer will perform better. This is true especially if there are ISI values that are very similar. It is however difficult to determine which levels are not needed without channel characterization, and this approach requires the use of a DFE, which is generally avoided for 4-PAM at a relaxed BER target with FEC.
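The threshold placement argument can be illustrated numerically. The sketch below assumes equiprobable bits received as $+1 + y_{ISI}$ and $-1 + y_{ISI}$ with Gaussian noise, and it interprets "the middle of each group" as the midpoint between the smallest and largest ISI value of equally sized groups; both the grouping rule and the function names are assumptions made for illustration.

```python
import math

def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def ber_for_threshold(t, y_isi, sigma):
    """Average BER of equiprobable +/-1 bits with ISI offset y_isi and threshold t."""
    p_err_one = q_func(((1.0 + y_isi) - t) / sigma)     # sent +1, sample falls below t
    p_err_zero = q_func((t - (-1.0 + y_isi)) / sigma)   # sent -1, sample rises above t
    return 0.5 * (p_err_one + p_err_zero)

def group_thresholds(isi_values, m):
    """Sort ISI values, split them into m contiguous groups, return group midpoints."""
    ordered = sorted(isi_values)
    n = len(ordered)
    if m > n:
        raise ValueError("need at least one ISI value per comparator")
    bounds = [(i * n) // m for i in range(m + 1)]
    return [0.5 * (ordered[bounds[i]] + ordered[bounds[i + 1] - 1]) for i in range(m)]
```

Sweeping `t` around `y_isi` in `ber_for_threshold` shows the minimum at `t == y_isi`, and `group_thresholds` returns the ISI values themselves when `m` equals the number of ISI values.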

In [99–101], the non-uniform quantization concept is generalized to accommodate both FFE and DFE following the quantizer. For a linear equalizer like the FFE, at first glance, the quantization error introduced by the quantizer must be minimized to improve its performance. The quantization error is dependent on the statistics of the sampled input signal (i.e. the vertically-sliced eye distribution), formally the probability density function (PDF). Conceptually the regions with higher signal activity should be allocated more levels. The minimum mean squared quantization error (MSQE) threshold levels can be determined using the iterative Lloyd-Max algorithm [107, 108]. However, in [100] it was shown that minimizing MSQE does not correspond to minimizing the BER, even if only an FFE (and not a DFE) is present. This means that an algorithm for level selection should instead use the system metric BER as the goal. Luckily for 4-PAM this is possible since the BER target is relaxed  $(10^{-6})$  unlike 2-PAM, thus allowing the system to output a BER estimate in reasonable time. As to why MSQE may not give the best result, intuitively, the Lloyd-Max quantizer will still allocate some levels to the outer regions of the eye, for instance for strong 1 and 0 in 2-PAM; however, the "BER-aware" ADC does not need these levels. In other words the system can tolerate a large quantization error even though the signal activity is high in that region since it will have little impact on the BER.
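A sample-based version of the Lloyd-Max iteration is sketched below for illustration. It alternates between placing thresholds midway between representation levels and moving each level to the mean of the samples in its region. The uniform starting point, the fixed iteration count, and the function names are assumptions, not details taken from [107, 108].

```python
import numpy as np

def lloyd_max(samples, n_levels, n_iter=100):
    """Return (levels, thresholds) that locally minimize the MSQE of the samples."""
    samples = np.asarray(samples, dtype=float)
    levels = np.linspace(samples.min(), samples.max(), n_levels)  # uniform start
    for _ in range(n_iter):
        thresholds = 0.5 * (levels[:-1] + levels[1:])  # nearest-level partition
        idx = np.searchsorted(thresholds, samples)     # region index of each sample
        for j in range(n_levels):
            members = samples[idx == j]
            if members.size:                           # empty regions keep their level
                levels[j] = members.mean()             # centroid update
    return levels, 0.5 * (levels[:-1] + levels[1:])

def msqe(samples, levels):
    """Mean squared quantization error for nearest-level quantization."""
    samples = np.asarray(samples, dtype=float)
    thresholds = 0.5 * (levels[:-1] + levels[1:])
    return np.mean((samples - levels[np.searchsorted(thresholds, samples)]) ** 2)
```

Running this on samples of a noisy 4-PAM signal concentrates the levels where the PDF is dense, including the outer eye regions, which is the behaviour the BER-aware selection avoids.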

A hardware implementation [101] of this idea in a 2-PAM 4Gb/s flash ADC based non-uniform quantizing receiver is shown in Figure 2.20. The flash ADC is implemented as a 4-bit converter with 15 comparators; however, the thresholds of the comparators are modified by an 8-bit resistive DAC. In order to tune the thresholds to their optimal location, a modified least mean squares (LMS) algorithm called adaptive minimum BER (AMBER) [109] is used. The update equation is shown in Equation 2.4 where  $x_r$  is the output of the ADC after encoding, b is the transmitted data,  $r_i$  is the set of representation levels,  $\mu$  is the LMS step size, I is an indicator function which is 1 when an error is detected and 0 otherwise, and w is the set of weights for the equalizer. Note that this is exactly the same as the update equation for sign-magnitude LMS except the update is only done when there is a bit error. The use of the 8-bit DAC allows the non-uniform 3-bit ADC after AMBER update to achieve  $10^{9}$  times lower BER than a 4-bit uniform ADC despite the fact that the effective number of bits (ENOB) for the non-uniform 3-bit case is only 1, while it is 2 for the 4-bit case at Nyquist frequency. Thus ENOB, traditionally measured with a sine wave input and calculated using the signal to noise and distortion ratio (SNDR), may not be a good indicator of ADC-based wireline receiver performance. This of course is due to the fact that maximizing ENOB assumes a uniform quantization noise distribution and that minimum MSQE is best for the system. The disadvantage of this approach is that the construction of a high resolution DAC is difficult, leading to an inefficient, large and slow comparator, and the LMS algorithm is not guaranteed to converge given the complex nature of the BER as a function of the quantization levels.

$$\begin{aligned} r_{i}[n+1] &= r_{i}[n] + \mu I[n]\,\mathrm{sign}(e[n]) \sum_{k:x_{r}[n-k]=r_{i}} w[k] \\ e[n] &= b[n] - \sum_{k=0}^{M-1} w[k]x_{r}[n-k] \end{aligned} \tag{2.4}$$
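A single step of the update in Equation 2.4 can be written out as follows. This is only a sketch of the equation as printed: the argument names, the representation of the ADC history as level indices, and the externally supplied error flag are assumptions, and the code does not reproduce the hardware in [101].

```python
import numpy as np

def amber_update(r, idx_hist, w, b_n, mu, bit_error):
    """One AMBER-style update of the representation levels r.

    idx_hist[k] is the index of the level output by the ADC at time n-k,
    w[k] is the equalizer weight applied to that sample, and bit_error
    plays the role of the indicator I[n].
    """
    r = np.array(r, dtype=float)
    idx_hist = np.asarray(idx_hist, dtype=int)
    w = np.asarray(w, dtype=float)
    x_hist = r[idx_hist]               # x_r[n-k] for k = 0..M-1
    e = b_n - np.dot(w, x_hist)        # e[n] = b[n] - sum_k w[k] x_r[n-k]
    if not bit_error:                  # I[n] = 0: levels are left untouched
        return r
    for i in range(len(r)):
        # Sum the weights of all history positions currently mapped to level i.
        r[i] += mu * np.sign(e) * np.sum(w[idx_hist == i])
    return r
```

Only the levels that actually appear in the equalizer history are moved, and each moves in proportion to the weights of the taps it feeds.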



Figure 2.20: Non-uniform selection ADC using high resolution DAC and modified LMS adaptation AMBER.

A different approach to non-uniform quantization is taken in [92] where an architecture similar to a sub-ranging flash without input subtraction is used as shown in Figure 2.21. The 28Gb/s PAM-4 ADC-based receiver is implemented as a 4-way time-interleaved structure using clock phases CLK0, CLK90, CLK180, and CLK270. The key idea is to recognize that ISI in the channel causes adjacent samples to be correlated. In a normal baud-rate ADC, only center samples of the eye are obtained after the CDR is locked. In [92], the edge samples 1/2 UI from the center are also obtained using a series of edge samplers, which also provide information to the CDR. As shown in Figure 2.21, the extra edge sample at clock phase CLK315 for the 0 degree slice is used to select the references for the comparators in the coarse comparator bank, generating an M-bit output. The outputs of the coarse comparators are then used to select references for the fine comparator bank, which generates an L-bit output. This allows the ADC to dynamically quantize the signal based on where it is most probable according to the edge sample, reducing the quantization error introduced. In order to meet the timing requirement, the fine comparators are implemented in half-rate slices; considering the 4-way TI, this leads them to be octal-rate. The T&Hs take 1UI for track, thus a total of 7UI is available for reference selection and comparator decision. The total time is allocated as follows: 2UI for coarse reference

selection/settling, 3UI for fine reference selection/settling, and 2UI for comparator decision. This partitioned architecture allows the receiver analog front-end (AFE) to scale its power aggressively from 130mW for a 30dB channel to 45mW for a 15dB channel. The drawback of this approach is the timing constraint imposed by reference selection and settling, which may become problematic as data-rate increases and thus the UI shrinks. Another drawback is the need to generate additional phases for the edge clocks.



Figure 2.21: Non-uniform selection ADC using edge samples to set fine references for flash comparators of center samples.

## 2.6 Summary

This chapter introduced basic wireline modulation in the form of 2-PAM and 4-PAM, demonstrating the necessity of 4-PAM as data-rates increase beyond 50Gb/s. Various equalizer options were introduced and two receiver architectures, the mixed signal receiver and the ADC-based receiver, were discussed. It was shown that ADC-based receivers are generally needed for 4-PAM transmission through channels with loss of 20dB or more. A literature survey indicates that SAR and flash architectures are generally used in wireline ADC-based receivers, with several modifications also briefly discussed. In order to minimize power in ADC-based receivers, two key strategies were reviewed: embedded equalization and non-uniform quantization. By modifying the ADC architecture slightly it is possible to arrive at several solutions with their own advantages and disadvantages. In the next chapter, a link model is presented to further investigate the use of ADC-based receivers and a novel power scaling solution is presented in contrast to existing work.

## Chapter 3

# Link Power Scaling and System Level Design

This chapter introduces a proposed link power scaling strategy based on non-uniform quantization and presents an ADC-based receiver system level design. General-purpose ADC designs strive to realize an "ideal" uniform quantizer; however, in the case of a wireline receiver it is shown that this can be overly onerous. Dynamic non-linearities can be tolerated, and non-uniform quantization can provide better performance and/or lower power consumption. These observations may be used to lower the power consumption of ADCs in wireline receivers. In this chapter, system modelling and behavioural simulation are used to quantify the impact of these effects on link BER.

## 3.1 Link Power Scaling Using Non-Uniform Level Selection

The question of link power scaling presents the following scenario: if an ADC-based receiver is designed to handle a certain loss range, what is a practical way of reducing power consumption at the lower end of the range? In other words, imagine a 7-bit ADC is designed to handle a 30dB loss channel. If the same ADC is also used for a 20dB channel, what would be the minimum resolution/power possible and how would one go about reconfiguring such an ADC-based receiver from the 30dB case? In order to investigate this, a simple time-domain model is used as shown in Figure 3.1. A discrete-time baud-rate channel representation is used after sampling the channel to maximize the main cursor. The model also includes a variable gain amplifier (VGA) which adjusts the channel output to fill the ADC full-scale range (FSR). In reality such a VGA is often adaptive as the additive white Gaussian noise (AWGN) causes the input distribution to be unbounded. Thus there is some optimal value where the distortion

due to clipping (when the input falls outside of the ADC FSR) and quantization noise error are both considered [110]. In this model, the VGA is simply adjusted to introduce almost no clipping since no analytical solution exists due to channel shaping. The sampled channel response is  $\sum_{n=0}^{N_{ch}-1} c[n]z^{-n}$ , where  $c[n]$  are the  $N_{ch}$  channel taps, and the signal to noise ratio (SNR) of the link is defined to be  $\frac{\sum_{n=0}^{N_{ch}-1} c[n]^2}{\sigma^2}$ , where  $\sigma$  is the standard deviation of the additive Gaussian noise. The ADC is  $B_{ADC}$  bits with  $2^{B_{ADC}}$  quantization/digital representation levels or  $2^{B_{ADC}} - 1$  analog threshold levels, and the equalizer consists of an  $N_{FFE}$  tap (total including main cursor) FFE with  $N_{PRE}$  pre-cursor taps and an  $N_{DFE}$  tap DFE.
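A minimal behavioural sketch of the front half of this model (channel, noise, VGA, uniform ADC) is given below. The 4-PAM alphabet $\{\pm 1, \pm 3\}$, the FSR of $[-1, 1]$, the peak-based VGA setting, and all function and variable names are assumptions made for illustration; the equalizer is omitted.

```python
import numpy as np

def simulate_adc_input(channel, snr_db, n_sym, b_adc, rng):
    """Return (symbols, vga_output, quantized_output, thresholds, levels)."""
    channel = np.asarray(channel, dtype=float)
    symbols = rng.choice([-3.0, -1.0, 1.0, 3.0], size=n_sym)  # assumed 4-PAM alphabet
    rx = np.convolve(symbols, channel)[:n_sym]                # baud-rate channel
    # SNR = sum(c[n]^2) / sigma^2, following the definition in the text.
    sigma = np.sqrt(np.sum(channel**2) / 10 ** (snr_db / 10))
    rx = rx + sigma * rng.standard_normal(n_sym)
    vga_gain = 1.0 / np.max(np.abs(rx))    # fill the FSR with almost no clipping
    x = vga_gain * rx
    n_levels = 2**b_adc
    step = 2.0 / n_levels
    thresholds = -1.0 + step * np.arange(1, n_levels)         # 2^B - 1 thresholds
    levels = -1.0 + step * (np.arange(n_levels) + 0.5)        # 2^B representation levels
    codes = np.searchsorted(thresholds, x)                    # flash-style comparison
    return symbols, x, levels[codes], thresholds, levels
```

Calling `simulate_adc_input([0.12, 1.0, 0.49], 30.0, 10000, 5, np.random.default_rng(0))` produces a 5-bit quantized sequence that could then be fed to an FFE/DFE model.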



Figure 3.1: Basic simulation model of ADC-based receiver for wireline communication

From Section 2.5.2, the benefits of non-uniform quantization in saving power are evident as explored in [103–105] for 2-PAM in conjunction with a DFE, and more generally in [99–102] to include 4-PAM. The link scaling strategy proposed here takes advantage of non-uniform quantization to fluidly trade quantizer resolution against the system metric of BER. In order to do this, an algorithm must choose the locations of the quantization levels, which is a non-trivial task. In [103] the levels were found through channel characterization by ordering the ISI values. In [100] the levels were found using an iterative modified LMS algorithm which updates in the event of a bit error. This assumes the levels are malleable and can be adjusted with a much better resolution than the quantizer resolution. In link power scaling, the problem is framed a bit differently: we start with the high resolution quantizer and gradually reduce the resolution while optimizing the link performance until the BER requirement can no longer be satisfied. Therefore the problem becomes choosing a subset of the existing quantization levels, rather than moving them around while maintaining the same number of levels. Note that in an actual implementation, analog *threshold* levels are chosen by modifying analog circuitry to achieve power reduction. For instance in a flash ADC, the decision thresholds of comparators can be removed by turning them off. Luckily even when non-uniform quantization is used, the optimal relationship between quantization levels (r) and threshold levels (t) is the same, where  $t_j = \frac{r_j + r_{j+1}}{2}$ , i.e. the threshold level is the average of two adjacent quantization levels.
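The midpoint relation between representation levels and thresholds can be illustrated with a small helper that quantizes against an arbitrary subset of levels. This sketch only demonstrates the relation $t_j = (r_j + r_{j+1})/2$; the function names are hypothetical and the code does not model how comparators are disabled in hardware.

```python
import numpy as np

def thresholds_from_levels(levels):
    """Thresholds are the midpoints of adjacent (sorted) representation levels."""
    levels = np.sort(np.asarray(levels, dtype=float))
    return 0.5 * (levels[:-1] + levels[1:])

def quantize(x, levels):
    """Map each input to the representation level of the region it falls in."""
    levels = np.sort(np.asarray(levels, dtype=float))
    codes = np.searchsorted(thresholds_from_levels(levels), np.asarray(x, dtype=float))
    return levels[codes]
```

For example, keeping only the levels `[-7, -1, 1, 7]` out of the eight uniform levels `-7, -5, ..., 7` leaves three thresholds at `[-4, 0, 4]`.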

The simplest strategy would be to perform a brute-force search, going through all possible combinations. However, the search space is quite large: for instance, scaling a  $B_{ADC} = 6$  bit quantizer down to 5-bit would result in  $\binom{63}{31} \approx 9.1631 \times 10^{17}$  threshold combinations. To reduce this search space, one approximation that can be made is to assume the non-linearity introduced by the channel and AFE is low and thus the sampled input distribution is almost symmetrical around 0. This means the optimal levels would be symmetrical, leading to  $\binom{31}{15} = 300540195$  threshold combinations instead. This is still an extremely large number and, considering the use of BER as the cost function guiding the search, is not practical due to the adaptation time needed. To shorten the search time, greedy search is proposed to tackle this problem.
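The two counts can be reproduced with Python's standard library. The interpretation of the symmetric case as choosing 15 of the 31 positive/negative threshold pairs (with the centre threshold always kept) is an assumption used here to explain the binomial coefficient.

```python
from math import comb

full_search = comb(63, 31)       # any 31 of the 63 thresholds of a 6-bit ADC
symmetric_search = comb(31, 15)  # 15 of the 31 symmetric threshold pairs

print(f"{full_search:.4e}")      # 9.1631e+17
print(symmetric_search)          # 300540195
```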

#### **3.1.1 Greedy Search Approach**

The iterative greedy search approach works by selecting a subset of the candidate levels at each step, and using that subset as candidates for the subsequent iteration. Figure 3.2 illustrates the progression of a greedy search beginning with 31 threshold levels (5-bit ADC resolution) symmetrically arranged so that levels may be removed in pairs, e.g.  $\pm 15$ ,  $\pm 14$ , etc. The search begins with all threshold levels active and the link operating at a BER well below target. In the 1st iteration, after removing each pair of levels one at a time (15 trials), it is observed that removal of levels  $\pm 14$  causes the smallest increase in BER. Hence, levels  $\pm 14$  are deactivated and a tolerable increase in BER results. In the next iteration, 14 trials are made removing an additional pair of levels and identifying  $\pm 11$  for deactivation. This process is repeated until iteration 6, where after removing levels  $\pm 2$ , the BER exceeds the target. Therefore the search is terminated, and the set of levels corresponding to iteration 5 may be used.
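The procedure described above can be sketched as a short loop. The callback `ber_of`, which stands in for a BER measurement of the link with a given set of active threshold pairs, and the other names are hypothetical; the centre threshold is assumed to stay active and is not modelled.

```python
def greedy_level_search(n_pairs, ber_of, ber_target):
    """Remove symmetric threshold pairs one at a time while the BER target holds.

    ber_of(active) must return a BER estimate for a frozenset of active
    pair indices (1..n_pairs). Returns the final active set and the history
    of (active set, BER) after each accepted removal.
    """
    active = frozenset(range(1, n_pairs + 1))
    history = [(active, ber_of(active))]
    while active:
        # Trial-remove every remaining pair and keep the least harmful removal.
        trials = [(ber_of(active - {p}), p) for p in sorted(active)]
        best_ber, best_pair = min(trials)
        if best_ber > ber_target:
            break                      # next removal would violate the target
        active = active - {best_pair}
        history.append((active, best_ber))
    return active, history
```

Starting from 15 pairs and ending with 7, this loop needs 15 + 14 + ... + 8 = 92 BER evaluations, compared with the thousands of combinations an exhaustive symmetric search would visit.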

The number of required trials using greedy search is far less than in an exhaustive search, but greedy search does not necessarily result in the global optimum. In order to see the performance degradation compared to an exhaustive search, a 4-PAM receiver using a non-uniform quantizer was simulated using the model in Figure 3.1. Using a baud-rate discrete-time channel  $0.12 + z^{-1} + 0.49z^{-2}$ , with AWGN at 30dB SNR and a 1 precursor plus 2 postcursor-tap FFE ( $N_{FFE} = 4, N_{PRE} = 1, N_{DFE} = 0$ ), the quantizer is reduced from 5-bit (31) uniformly-spaced threshold levels to 15 levels. The equalizer coefficients are set at the beginning of the search using least mean square (LMS) adaptation for minimum mean squared error (MMSE). Figure 3.3a shows the number of combinations satisfying each BER limit, where the total number of combinations, assuming symmetry, is  $\binom{15}{7} = 6435$ . Among these, 1 combination achieves the lowest BER of  $1.2 \times 10^{-5}$ , while greedy search achieves a BER of  $12 \times 10^{-5}$ , which places it among the best 15 combinations, well above the 99.7th percentile. Moreover, this BER is an order of magnitude lower than for a uniform 4-bit quantizer (also having 15 threshold levels), which achieves a BER of  $250 \times 10^{-5}$ , and the 4-bit Lloyd-Max (LM) quantizer (BER =  $180 \times 10^{-5}$ ), which minimizes the MSQE. The corresponding levels selected and the input PDF are shown in Figure 3.3b, where x markers indicate LM levels, circular markers indicate greedy search levels, and diamond markers indicate BER-optimal levels. Note that the simulation parameters are chosen for a relatively high BER so that an exhaustive search of all combinations is possible. In practical scenarios, with more combinations and lower BER, simulation of an exhaustive search is impractical.

Figure 3.2: Greedy search progression illustration: starting from a symmetric 5-bit ADC, threshold levels are removed until the BER limit is no longer satisfied.

Figure 3.3: Threshold level selection for 4-PAM input for channel [0.12, 1, 0.49] and a SNR = 30dB, with a 1pre/2post tap FFE MMSE equalizer.

To investigate the performance of greedy search further, measured Cisco channel profiles provided for the IEEE802.3cd Ethernet task force [60] were used. The frequency response and corresponding sampled impulse response for 56Gb/s 4-PAM are shown in Figure 3.4a, indicating a loss of 10dB at the Nyquist frequency of 14GHz. The corresponding levels chosen are shown in Figure 3.4b using a SNR of 30dB and an equalizer with  $N_{FFE} = 2, N_{PRE} = 0, N_{DFE} = 0$  while scaling the ADC from 4-bit to 3-bit. Under these simulation conditions it can be seen that the BER-optimal levels found via brute-force search correspond exactly to the greedy search levels, resulting in a BER of  $6.4 \times 10^{-5}$ . The uniform 3-bit case results in a BER of  $58 \times 10^{-5}$ , while 3-bit Lloyd-Max performs the worst at a BER of  $230 \times 10^{-5}$ . Thus greedy search is extremely effective for channel losses of 10dB or less for 4-PAM, where the input PDF can be seen to be quite non-uniform. Note that this is also the case when the loss of the combined response of an analog equalizer such as a CTLE or TX FFE and the channel is around 10dB or less. Hence, when preceded by an analog equalizer, non-uniform quantization level selection has a clear advantage even for channels having higher loss.

(a) Frequency and sampled impulse response of measured channel

(b) Level selection with input PDF

Figure 3.4: Threshold level selection for 4-PAM input for 10dB channel and a SNR = 30dB, with a 2 tap FFE MMSE equalizer.

As the loss of the combined response of the channel and AFE increases, assuming 4-PAM signaling, the channel causes the PDF to assume $4^{N_{ch}}$ modes with $N_{ch}$ possibly equal to 100 or more. When this PDF is combined with AWGN, the result is more or less a Gaussian distribution, where a non-uniform quantizer no longer has an advantage over a uniform quantizer. To see how greedy search will perform in this scenario, a channel with 20dB loss at Nyquist and SNR = 30dB was simulated. The equalizer length was set to $N_{FFE} = 7, N_{PRE} = 2, N_{DFE} = 1$ and the ADC was scaled from 5-bit to 4-bit. In this case, due to the distribution profile, the BER achieved using non-uniform level selection is close to that achieved by uniform selection. Nonetheless, Figure 3.5a shows that greedy search allows the BER to increase gradually with level removal, allowing power scaling depending on the link BER target. Figure 3.5b shows the Gaussian-like input distribution with LM and greedy search levels, which achieve a similar BER of $2.2 \times 10^{-5}$ and $2.6 \times 10^{-5}$ respectively, in comparison with $3.3 \times 10^{-5}$ achieved with uniform 4-bit levels. Note that in these examples the common FFE/DFE equalizer is used; however, this technique is applicable to any equalizer architecture, including for instance a discrete Fourier transform (DFT)/inverse-DFT (IDFT) based architecture [111].

Table 3.1 presents simulation results for more channels from the IEEE802.3cd Ethernet task force [60], comparing greedy search levels to uniform and LM levels. To make the comparison fairer, the LM levels are quantized to the same precision as the greedy search levels, unlike in the previous comparison cases where LM levels were allowed floating point precision (essentially infinite precision). As expected, the advantage of greedy search over LM diminishes as the channel loss increases. However, in all cases greedy search performs better than both the LM and uniform quantizers.



Figure 3.5: Threshold level selection for 4-PAM input for 20dB channel and a SNR = 30dB, with a 7 tap FFE, 1 tap DFE MMSE equalizer.

Table 3.1: Simulation of greedy search performance for representative channels with 4-PAM (SNR = 30dB)

| N<sub>FFE</sub><br>(N<sub>PRE</sub>) | N<sub>DFE</sub> | Loss @<br>Nyquist (dB) | BER<br>5-bit<br>Uniform | BER<br>4-bit<br>Uniform | BER<br>4-bit<br>Greedy | BER<br>4-bit<br>LM |
|-----------------------------------------|------|-------------------|-------------------------|-------------------------|------------------------|--------------------|
| 9(2)                                    | 1    | 25                | 5.5e-6                  | 2.1e-2                  | 7.4e-3                 | 1.1e-2             |
| 4(1)                                    | 0    | 20                | 3.5e-6                  | 1e-3                    | 4.87e-4                | 7.8e-4             |
| 3(1)                                    | 0    | 15                | 1.1e-6                  | 3.9e-5                  | 1.2e-5                 | 2.3e-4             |
| 1(0)                                    | 1    | 10                | 2e-7                    | 1.62e-5                 | 2e-7                   | 1.56e-5            |

## 3.2 Advanced Link System Model and Design

In this section, the basic link system in Figure 3.1 is augmented in order to study the effects of various impairments on the ideal ADC-based receiver. This advanced baud-rate time domain link model is shown in Figure 3.6 and described in more detail in Appendix A. The transmitter (TX) is defined with a signal to noise ratio ($SNR_{TX}$) and non-linearity level mismatch ratio metric (RLM). RLM is defined in standards such as OIF-CEI-56G-PAM4 [3] according to Figure 3.7 as the minimum eye height divided by the average eye height. Unlike 2-PAM, for 4-PAM non-linearity on the TX or RX side can cause noticeable BER degradation. For instance, simulations in [112] report that a RLM of 0.92 results in a BER of $1.17 \times 10^{-14}$ while a RLM of 0.85 results in a BER of $3.67 \times 10^{-8}$. For recently published voltage mode transmitters, RLM is usually kept above 0.9 [41, 44, 57, 113]. One FEXT aggressor with a dummy transmitter is also included and the FEXT response is scaled to achieve a particular standard deviation, usually quoted in $mV_{rms}$. The rms value of FEXT is calculated according to Equation 3.1, where $p_{XT}$ is the sampled pulse response of the cross-talk (XT) channel and *P* denotes the discrete time PDF of the sampled pulse response.



Figure 3.6: Advanced link model with system impairments for architecture investigation.



Figure 3.7: Transmitter level mismatch metric (RLM) for 4-PAM.

$$\sigma_{XT} = \sqrt{\sum [p_{XT}^2 P(p_{XT})] - \mu^2} \tag{3.1}$$

On the receiver side a CTLE provides a boost of $G_{CTLE}$ at Nyquist frequency and the VGA provides a gain of $G_{VGA}$ to adjust the output of the CTLE to fill the ADC FSR. The ADC is implemented in $N_{TI}$ time-interleaved channels. Time-interleaving mismatches including gain G, offset O, and skew $\Delta t$ are modelled and augmented in each channel. Table 3.2 summarizes the effect of each TI mismatch, which introduces a periodic modulation effect on the aggregate output, leading to distortion tones which fall into known frequency locations for a single frequency sine-wave input [72]. Note that out of the 3 mismatches caused by time-interleaving, timing skew is the only dynamic effect (error magnitude depends on input frequency). Jitter is also added to the sampling operation in this model. In addition to noise $(N_{RX})$ and static non-linearity in the form of differential non-linearity (DNL), dynamic non-linearity is added as would be introduced in a high speed track and hold.

| Type of Mismatch | Distortion Frequency | Dependency on Input Magnitude | Dependency on Input Frequency |
|------------------|----------------------|-------------------------------|-------------------------------|
| Offset           | $\frac{k}{N_{TI}}\omega_{s}$ | Independent | Independent |
| Gain             | $\frac{k}{N_{TI}}\omega_{s} \pm \omega$ | Linearly dependent | Independent |
| Timing Skew      | $\frac{k}{N_{TI}}\omega_{s} \pm \omega$ | Linearly dependent | Linearly dependent |

Table 3.2: Summary of Time-Interleaved Mismatches for $N_{TI}$ sub-ADCs with Input Frequency of $\omega$, $k = 1, 2, ..., N_{TI}-1$
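The tone locations in Table 3.2 can be enumerated with a few lines of Python. This is only an illustrative sketch: the function name `ti_spur_frequencies` is hypothetical, frequencies are expressed in Hz (or any consistent unit) rather than rad/s, and the folding of each tone back into the first Nyquist zone is an assumption added here for convenience of reading a spectrum plot.

```python
def ti_spur_frequencies(f_s, n_ti, f_in):
    """List time-interleaving spur locations folded into [0, f_s/2].

    f_s  : aggregate sampling rate
    n_ti : number of interleaved sub-ADCs (N_TI)
    f_in : single-tone input frequency
    Returns (offset_spurs, gain_and_skew_spurs) as sorted lists.
    """
    def fold(f):
        # Alias a frequency into the first Nyquist zone
        f = f % f_s
        return f_s - f if f > f_s / 2 else f

    ks = range(1, n_ti)  # k = 1, 2, ..., N_TI - 1
    offset = sorted({round(fold(k * f_s / n_ti), 9) for k in ks})
    gain_skew = sorted(
        {round(fold(k * f_s / n_ti + sign * f_in), 9) for k in ks for sign in (1, -1)}
    )
    return offset, gain_skew


# Example: 32 GS/s, 8-way interleaving, 1 GHz input (all values in GHz)
offset_spurs, gain_skew_spurs = ti_spur_frequencies(32.0, 8, 1.0)
```

Offset spurs land at fixed multiples of $f_s/N_{TI}$ regardless of the input, whereas gain and skew spurs move with the input frequency, which is why they share a row pattern in the table.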

#### 3.2.1 Modelling Dynamic Effects

#### 3.2.1.1 Skew and Jitter

Clock jitter is usually modelled as a Gaussian random variable resulting from device thermal and flicker noise in the clock generator, such as a PLL, which impacts the timing of the sampling clock. In a wireline setting, additional jitter may also be introduced by the CDR. Skew is defined as the deterministic timing error resulting from deviation from the ideal clock edge location between sub-rate clocks (at different phases) for the samplers of the ADC time-interleaved channels. Skew is introduced mostly by device mismatch, routing and supply voltage differences between the clock paths to each ADC channel. Because routing delays and voltage variations are equalized as much as possible, in reality skew may also follow a Gaussian distribution due to device mismatch. An $N_{TI} = 8$ ADC is shown in Figure 3.8, where skew $\Delta t_{skew}$ is introduced on the 8th phase. The amplitude error introduced by this skew can be modelled using a 1st order Taylor series approximation as shown. To estimate the error e, a derivative estimate is needed. Note that amplitude error due to jitter can be modelled in the same way assuming the derivative is available. The effect of jitter is to increase the noise floor of the ADC, while the effect of skew is to introduce tones according to Table 3.2. If the number of channels $N_{TI}$ tends to a very large number, the effect of skew becomes the same as that of jitter, as jitter is essentially sampling with a large set of skews chosen from a probability distribution. In both cases, the amplitude error increases as the frequency of the input signal increases.
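As a rough numerical illustration of the 1st order Taylor approximation, the sketch below samples a sine wave with an 8-way interleaved clock in which only the 8th phase is skewed, and compares the exact amplitude error with $e \approx \Delta t_{skew} \cdot dx/dt$. The input frequency, skew value and the function name `skew_error_demo` are assumptions chosen for illustration, not values taken from this work.

```python
import numpy as np


def skew_error_demo(f_in=5e9, f_s=32e9, n_ti=8, dt_skew=0.5e-12, n=64):
    """Compare exact skew-induced error with its 1st order Taylor estimate."""
    t = np.arange(n) / f_s                 # ideal sampling instants
    skew = np.zeros(n)
    skew[n_ti - 1::n_ti] = dt_skew         # only the last (8th) phase is skewed

    ideal = np.sin(2 * np.pi * f_in * t)
    actual = np.sin(2 * np.pi * f_in * (t + skew))
    exact_err = actual - ideal

    # Taylor estimate: e ~ dt * dx/dt evaluated at the ideal instant
    dxdt = 2 * np.pi * f_in * np.cos(2 * np.pi * f_in * t)
    taylor_err = skew * dxdt
    return exact_err, taylor_err


exact_err, taylor_err = skew_error_demo()
residual = np.max(np.abs(exact_err - taylor_err))  # 2nd order term that was dropped
```

Replacing the deterministic `skew` vector with Gaussian random values on every sample turns the same model into a jitter model, in line with the observation that jitter behaves like skew drawn from a probability distribution.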



Figure 3.8: Illustration of skew error for an 8-way time-interleaved ADC.

#### 3.2.1.2 Non-Linearity

Static non-linearity in each ADC channel is incorporated through a unique DNL profile drawn from a uniform distribution: as per design this is usually limited to  $\left[-\frac{LSB}{2}, \frac{LSB}{2}\right]$ , where LSB is  $\frac{FSR}{2^{B_{ADC}}}$ . Static effects such as 2nd and 3rd harmonic can also simply be added by scaling  $x^{2}$  and  $x^{3}$  where x is the input to the converter. To model dynamic non-linearity additional terms can be added using a discrete time Volterra series [114] as shown in Equation 3.2, where h multipliers are in effect multi-dimensional impulse responses which have memory depth M. This model is however very complex to generate and implement.

$$y[n] = h_0 + \sum_{k_1=0}^{M-1} h_1[k_1]x[n-k_1] + \sum_{k_1=0}^{M-1} \sum_{k_2=0}^{M-1} h_2[k_1,k_2]x[n-k_1]x[n-k_2] + \sum_{k_1=0}^{M-1} \sum_{k_2=0}^{M-1} \sum_{k_3=0}^{M-1} h_3[k_1,k_2,k_3]x[n-k_1]x[n-k_2]x[n-k_3] + \cdots \tag{3.2}$$

A major source of dynamic non-linearity is the high speed track and hold structure at the front of each ADC channel and, as shown in [115], its non-linearity can in fact be modelled with a simpler structure to 1st order using only the signal derivative. The corresponding model, assuming 2nd order non-linearity is suppressed in a fully differential implementation, is expressed in Equation 3.3, where $T'_s = N_{TI}T_s$ is the sampling period of the ADC channel. The coefficient $k_2$ actually incorporates the amplitude error due to timing uncertainty, having units of seconds [s], while coefficient $k_3$ is the static 3rd order non-linearity term with units of $[1/V^2]$. The coefficients $k_4 - k_6$ scale dynamic non-linearity components and cause frequency dependent degradation in signal to noise and distortion ratio (SNDR).

$$y[n] = k_1 x[n] + k_2 \frac{dx}{dt}\Big|_{t=nT'_s} + k_3 x[n]^3 + k_4 x[n]^2 \frac{dx}{dt}\Big|_{t=nT'_s} + k_5 x[n] \left(\frac{dx}{dt}\right)^2\Big|_{t=nT'_s} + k_6 \left(\frac{dx}{dt}\right)^3\Big|_{t=nT'_s} \tag{3.3}$$
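Equation 3.3 maps directly onto a vectorized function once the sampled input and its derivative are available. The following is a minimal sketch, assuming the derivative has already been estimated at the sampling instants; the function name `track_hold_model` and the default coefficient values are hypothetical.

```python
import numpy as np


def track_hold_model(x, dxdt, k1=1.0, k2=0.0, k3=0.0, k4=0.0, k5=0.0, k6=0.0):
    """Derivative-based track-and-hold model following Equation 3.3.

    x    : sampled input x[n]
    dxdt : input derivative evaluated at the same sampling instants
    k1..k6 : model coefficients (k2 in seconds, k3 in 1/V^2, k4..k6 dynamic terms)
    """
    x = np.asarray(x, dtype=float)
    d = np.asarray(dxdt, dtype=float)
    return (k1 * x            # linear gain
            + k2 * d          # timing-error term
            + k3 * x**3       # static 3rd order non-linearity
            + k4 * x**2 * d   # dynamic non-linearity terms
            + k5 * x * d**2
            + k6 * d**3)


# Example: a single sample with all terms active (arbitrary illustrative values)
y = track_hold_model([0.5], [2.0], k1=1.0, k2=0.1, k3=-1.0, k4=1.0, k5=1.0, k6=1.0)
```

Setting $k_4$ to $k_6$ to zero reduces the model to static gain and 3rd order distortion plus the timing term, which makes it easy to separate static from dynamic contributions in simulation.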

#### **3.2.1.3** Derivative Estimation

As seen in Sections 3.2.1.1 and 3.2.1.2, dynamic error models simply involve estimating the derivative of the input to the ADC. Starting at the TX output, the entire system is linear up to the output of the CTLE *y* (input to the non-linear ADC). Therefore one can estimate the derivative of the channel impulse response *h* instead of estimating the derivative of *y*. This way the derivative estimate can be computed once and stored for each channel. This is shown conceptually in Figure 3.10a, where *b* is the transmitted symbol sequence through the channel with impulse response *h* and $b_{XT}$ is the transmitted symbol sequence through the FEXT channel with impulse response $h_{XT}$. One caveat is that the baud-rate derivative estimate is very poor, as shown in Figure 3.10b, therefore the impulse response must be interpolated by an oversampling ratio (OSR) to $h_{OSR}$. The derivative can then be computed using $h_{OSR}$, down-sampled to the simulator step time and stored for later use.
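The interpolate, differentiate and down-sample steps can be sketched as follows. The interpolation method is not specified in the text, so band-limited interpolation by FFT zero-padding is assumed here; it treats the response as periodic, which is reasonable only when the stored response has decayed to near zero at both ends. The function name `derivative_via_oversampling` and the toy pulse are hypothetical.

```python
import numpy as np


def derivative_via_oversampling(h, t_step, osr=16):
    """Estimate dh/dt on the original grid using an oversampled copy of h.

    h      : impulse response sampled every t_step seconds
    osr    : oversampling ratio used for the intermediate response h_OSR
    Returns an array with the same length as h.
    """
    h = np.asarray(h, dtype=float)
    n = len(h)
    # Band-limited interpolation: irfft zero-pads the spectrum up to n*osr points;
    # the factor osr restores the amplitude lost to the longer inverse transform.
    h_osr = np.fft.irfft(np.fft.rfft(h), n * osr) * osr
    # Finite-difference derivative on the fine grid
    dh_osr = np.gradient(h_osr, t_step / osr)
    # Down-sample back to the simulator step
    return dh_osr[::osr]


# Toy pulse standing in for a sampled channel response (illustrative only)
k = np.arange(32)
h_toy = np.exp(-0.5 * ((k - 8) / 1.5) ** 2)
dh_toy = derivative_via_oversampling(h_toy, t_step=31.25e-12, osr=16)
```

Because the result depends only on the channel, it can be computed once per channel and reused across simulations, as described above.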



Figure 3.9: Illustration of dynamic non-linearity effects on ADC SNDR/SFDR with model coefficients  $k_4 - k_6$  according to Eqn. 3.3.

#### 3.2.1.4 A Note on ENOB

Standalone ADC design characterization involves sine-wave testing to arrive at a signal to noise and distortion ratio (SNDR), which can be translated into an effective number of bits (ENOB) assuming uniform quantization noise according to Equation 3.4. One may ask why this SNDR or ENOB definition is not simply used in simulating the ADC-based receiver. The reason is that Equation 3.4 penalizes the converter significantly due to its assumed sinusoidal input. This leads to a large difference between the low frequency SNDR/ENOB and the high frequency SNDR/ENOB, especially for state of the art converters that are limited by dynamic errors such as jitter. For instance, [48] has a low frequency ENOB of 6.5 and a high frequency ENOB of 4.9 while sampling at 28GS/s, and [21] has a low frequency ENOB of 5.7 and a high frequency ENOB of 4.9 sampling at 32GS/s. Thus there is approximately 1 bit or more of degradation in effective resolution at high frequency.

$$ENOB = \frac{SNDR - 1.76}{6.02} \tag{3.4}$$

Figure 3.10: Generation of channel derivative for dynamic non-linearity modelling.

This penalty is less severe in a wireline environment because, after transmission through a wireline channel, the input to the ADC has significantly attenuated high frequency content, thus on average the data transitions are slower than those of a sine wave. The work in [117] shows that additive jitter causes a SNR dependent on input statistics according to Equation 3.5, where $r_y$ is the autocorrelation of the input to the sampler and $r_j$ is the autocorrelation of the discrete time sampled jitter sequence. Note that the denominator contains the autocorrelation of the input derivative $r_{y'}$, which increases with faster data transitions and thereby decreases the SNR. The work further shows that for a uniformly distributed input band-limited to $f_{Nyq}$, the SNR is $10\log_{10}(3) = 4.77$dB better than if computed for a Nyquist rate sine wave input.

$$SNR = 10\log_{10}\left(\frac{r_{y}(0)}{r_{y'}(0)r_{j}(0)}\right) \tag{3.5}$$
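The 4.77dB figure can be checked numerically from Equation 3.5. The sketch below assumes white jitter, so that $r_j(0) = \sigma_j^2$, and interprets the uniform band-limited input as one with a flat power spectral density from DC to $f_{Nyq}$; both are assumptions made for this illustration, and all function names are hypothetical. The lag-zero autocorrelations are obtained by integrating the spectrum, using $r_y(0) = \int S(f)\,df$ and $r_{y'}(0) = \int (2\pi f)^2 S(f)\,df$.

```python
import numpy as np


def trapezoid(y, x):
    """Plain trapezoidal integration (avoids NumPy version differences)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def jitter_snr_db(r_y0, r_dy0, r_j0):
    """Equation 3.5 evaluated at lag zero."""
    return 10.0 * np.log10(r_y0 / (r_dy0 * r_j0))


def snr_sine_db(f_in, sigma_j, amplitude=1.0):
    # For A*sin(2*pi*f*t): r_y(0) = A^2/2 and r_y'(0) = (2*pi*f)^2 * A^2/2
    r_y0 = amplitude**2 / 2
    r_dy0 = (2 * np.pi * f_in) ** 2 * r_y0
    return jitter_snr_db(r_y0, r_dy0, sigma_j**2)


def snr_flat_band_db(f_nyq, sigma_j, n_points=100001):
    # Assumed flat one-sided PSD on [0, f_nyq]; its absolute level cancels out
    f = np.linspace(0.0, f_nyq, n_points)
    psd = np.ones_like(f)
    r_y0 = trapezoid(psd, f)
    r_dy0 = trapezoid((2 * np.pi * f) ** 2 * psd, f)
    return jitter_snr_db(r_y0, r_dy0, sigma_j**2)


f_nyq = 16e9                 # Nyquist frequency for a 32 GBd link
sigma_j = 0.01 * 31.25e-12   # rms jitter of 1% of a 31.25 ps UI
advantage_db = snr_flat_band_db(f_nyq, sigma_j) - snr_sine_db(f_nyq, sigma_j)
```

The advantage is independent of the jitter magnitude, since $\sigma_j$ appears identically in both cases; only the ratio of $r_{y'}(0)$ to $r_y(0)$ differs.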

For 4-PAM input through a high loss channel, the distribution quantized by the ADC appears Gaussian. While the impact of dynamic errors is reduced as discussed, the unbounded Gaussian signal results in clipping, which requires an increase of FSR, leading to a reduction in quantizer efficiency. This is captured by the crest factor (CF), which is defined as the peak-to-rms ratio of the signal. A sinusoidal signal has a $CF = \sqrt{2} = 3.01 dB$, while a Gaussian signal has a practical CF >=3. The higher CF of the Gaussian signal results in a reduced SNDR as captured by Equation 3.6 [118]. Thus a low frequency ENOB of 5 corresponds to a SNDR of 31.86dB for a sinusoidal input, but a reduced SNDR of 25.33dB for a Gaussian input (CF = 3 = 9.54 dB). This means the impact of static errors such as DNL/INL and noise is actually magnified, the opposite of the effect experienced by dynamic errors.

$$ENOB_{arb} = \frac{SNDR + CF - 4.77}{6.02} \tag{3.6}$$
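The two SNDR values quoted above follow from rearranging Equations 3.4 and 3.6 for SNDR. A short sketch of that arithmetic is given below; the function names are hypothetical.

```python
import math


def crest_factor_db(cf_linear):
    """Convert a peak-to-rms ratio to dB."""
    return 20.0 * math.log10(cf_linear)


def sndr_sine_db(enob):
    """Equation 3.4 rearranged for SNDR (sinusoidal input)."""
    return 6.02 * enob + 1.76


def sndr_arbitrary_db(enob, cf_db):
    """Equation 3.6 rearranged for SNDR (input with crest factor cf_db)."""
    return 6.02 * enob + 4.77 - cf_db


sndr_sine = sndr_sine_db(5)                               # about 31.86 dB
sndr_gauss = sndr_arbitrary_db(5, crest_factor_db(3.0))   # about 25.33 dB
# Consistency check: a sine wave (CF = sqrt(2)) in Equation 3.6 recovers Equation 3.4
sndr_sine_via_cf = sndr_arbitrary_db(5, crest_factor_db(math.sqrt(2)))
```

The roughly 6.5dB gap between the two cases is simply the difference in crest factor (9.54dB versus 3.01dB).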

#### **3.2.2** Simulation Results

In this section results using the simulation model described above are presented. The model can also be adjusted to give an eye opening estimate by shifting the sampling phase used to produce the baud-rate impulse response from its ideal position. The first set of simulations in Figure 3.11 shows the channel responses (both through and FEXT) and worst case vertical eye openings for different ADC resolutions and numbers of post-cursor taps in the RX FFE following the ADC. Figure 3.11a shows this at a target BER of $10^{-4}$ for a 31dB IL $1mV_{rms}$ cross-talk channel with $G_{CTLE} = 12dB$ and a digital equalizer containing 2 precursor FFE taps and 1 DFE tap. Figure 3.11b shows this at a target BER of $10^{-6}$ for a 20dB IL $1mV_{rms}$ cross-talk channel with $G_{CTLE} = 6dB$ and a digital equalizer containing 1 precursor FFE tap. The BER targets are set corresponding to those outlined in OIF-CEI-56G and the equalizer gain/length are set to commonly used values. The data rate used in both cases is 64Gb/s, targeting the highest data-rate currently used in testing 56G standards, and the only additional error sources are transmitter thermal noise at $SNR_{TX} = 37dB$ and receiver thermal noise $N_{RX}$ (AWGN) referred to the input of the quantizer. This RX noise is set pessimistically to be equal to the quantization noise, i.e. $\frac{LSB}{\sqrt{12}}V_{rms}$, as lowering the thermal noise limit is expensive in design.

For the 31dB high loss channel, an ADC resolution of 6 bits including a FFE with 9 post-cursor taps (total FFE length 12) allows the link to achieve 7.7% eye height (EH) and approximately 0.1UI eye width (EW) at a BER of $10^{-4}$. In contrast, the 20dB medium loss channel only requires an ADC resolution of 5 bits and a FFE with 7 post-cursor taps (total FFE length 9) to achieve 6.8% EH and approximately 0.13UI EW at a BER of $10^{-6}$. To further improve the EH beyond 10%, a resolution of 6 bits and a FFE with 6 post-cursor taps is required. Note that the CTLE profile here is a simple 1 zero (0.106 *f*<sub>baud</sub>), 2 pole (0.331 *f*<sub>baud</sub>, 0.625 *f*<sub>baud</sub>) transfer function with a DC gain of -2dB that is not optimized to the channel frequency response in any way.



(a) 31dB channel with  $1mV_{rms}$  cross-talk

(b) 20dB channel with  $1mV_{rms}$  cross-talk

Figure 3.11: Simulation of minimal vertical eye opening for medium and high loss channels for different ADC resolutions and RX FFE length, annotations show settings for minimal EH of 5%.

For the 20dB medium loss channel, an ADC resolution of 5 bits is needed to reach BER $< 10^{-6}$ with some margin. This is however not enough once additional receiver imperfections are included according to Table 3.3. The channel mismatch parameters due to time interleaving are assumed to be calibrated to within the levels indicated in the table, as such levels are unrealistic to achieve by design alone. Note that many parameters are scaled according to the LSB size, as would be done in a typical design, such that a $B_{ADC}$ bit ADC is not over-designed.

An example 6-bit ADC is simulated with these parameters to illustrate the difference between low frequency ENOB and high frequency ENOB. Figure 3.12a shows a spectrum (4096 point FFT) of a low frequency input showing the effects of static errors, not including jitter/skew and dynamic non-linearity. The SNDR is 31.73dB and the SFDR is 44.09dB, dominated by the 3rd harmonic (HD3); both are constant over input frequency since the error sources are static only. Once jitter/skew and dynamic non-linearity are added, the SNDR/SFDR degrades as input frequency increases as seen in Figure 3.12b, with SFDR degradation dominated by non-linearity in this case. Overall this ADC achieves an ENOB of 4.77 at low frequency and an ENOB of 4 at high frequency. If the ENOB at high frequency is considered, then this converter should not be able to achieve BER < $10^{-6}$ in this wireline link (this can be seen clearly in Figure 3.11b); however, this is not the case.



(a) Spectrum (4096 point FFT), static errors only

(b) SNDR vs frequency including dynamic errors

Figure 3.12: Example 6-bit ADC simulation with model parameters in Table 3.3.

Figure 3.13 shows the EH surface corresponding to the case where only static errors are added and when all error sources are added. Clearly, once non-idealities are added, a 5-bit raw ADC resolution is no longer sufficient to achieve BER < $10^{-6}$. Table 3.4 further summarizes this for a raw ADC resolution of 6-bits and 6 post-cursor FFE taps (1 pre-cursor tap) which achieves better than 5% EH considering all error sources. The noise only configuration corresponds to the scenario simulated in Figure 3.11b. Note that EW was simulated at a step of 7% of UI, thus the low jitter/skew effects corresponding to the simulation parameters do not register any degradation even in the horizontal direction. The vertical degradation on the other hand is also minimal due to the effect explained previously in Section 3.2.1.4. The eye diagrams corresponding to Table 3.4 are also shown in Figure 3.14 for the noise only case and after all impairments are added. The non-linearity (TX RLM and RX non-linearity) causes the outer eyes (top and bottom) to shrink relative to the middle eye, reducing the minimal EH/EW significantly.

| Parameter                            | Setting                                                                                                        |
|--------------------------------------|----------------------------------------------------------------------------------------------------------------|
| 4-PAM Data-Rate                      | 64Gb/s (1UI = 31.25ps)                                                                                         |
| IL at Nyquist                        | 20dB                                                                                                           |
| Crosstalk $\sigma$                   | $1mV_{rms}$                                                                                                    |
| $SNR_{TX}$                           | 37dB                                                                                                           |
| TX RLM                               | 0.95 (outer eyes compression)                                                                                  |
| TX Output Swing                      | $1V_{ppd}$                                                                                                     |
| CTLE                                 | $G_{CTLE} = 6 dB, A_{DC} = -2 dB$                                                                              |
| VGA Gain                             | Adjusted to no clipping                                                                                        |
| ADC FSR                              | $500mV_{ppd}$                                                                                                  |
| $SNR_{Ideal}$                        | $[6.02B_{ADC} + 1.76]  \mathrm{dB}$                                                                            |
| $N_{TI}$                             | 8                                                                                                              |
| Static Non-Linearity                 | $HD3 = -(SNR_{Ideal} + 6dB)$                                                                                   |
| Dynamic Non-Linearity                | 10dB SFDR degradation over Nyquist band                                                                        |
|                                      | $\left(k_4 = \frac{-0.04T_s}{FSR^2}, k_5 = \frac{-0.0128T_s^2}{FSR^2}, k_6 = \frac{-0.004T_s^3}{FSR^2}\right)$ |
| DNL                                  | $\frac{LSB}{2}$                                                                                                |
| Gaussian Jitter $\sigma$             | 1% of UI                                                                                                       |
| ADC Channel Skew Mismatch $\sigma$   | 0.5% of UI                                                                                                     |
| ADC Channel Gain Mismatch $\sigma$   | 1%                                                                                                             |
| ADC Channel Offset Mismatch $\sigma$ | $\frac{LSB}{4}$                                                                                                |
| $N_{RX}$                             | $\frac{LSB}{\sqrt{12}}V_{rms}$                                                                                 |
| Digital LMS EQ                       | $N_{FFE}$-tap FFE (1 pre, $N_{FFEpost}$ post)                                                                  |

Table 3.3: ADC-based Link Simulation Parameters
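Since several entries in Table 3.3 are expressed relative to the LSB or the UI, it can help to evaluate them for a specific resolution. The helper below is a hypothetical convenience function that only restates the table's formulas in absolute units; the dictionary keys are names chosen for this sketch.

```python
import math


def link_parameters(b_adc, fsr_vppd=0.5, data_rate_bps=64e9, bits_per_symbol=2):
    """Evaluate the LSB- and UI-relative entries of Table 3.3 for a B_ADC-bit ADC."""
    ui = bits_per_symbol / data_rate_bps       # 4-PAM carries 2 bits per symbol
    lsb = fsr_vppd / 2**b_adc
    snr_ideal_db = 6.02 * b_adc + 1.76
    return {
        "ui_ps": ui * 1e12,
        "lsb_mv": lsb * 1e3,
        "n_rx_mvrms": lsb / math.sqrt(12) * 1e3,   # RX noise set equal to quantization noise
        "dnl_mv": lsb / 2 * 1e3,
        "offset_sigma_mv": lsb / 4 * 1e3,
        "jitter_sigma_fs": 0.01 * ui * 1e15,       # 1% of UI
        "skew_sigma_fs": 0.005 * ui * 1e15,        # 0.5% of UI
        "snr_ideal_db": snr_ideal_db,
        "hd3_dbc": -(snr_ideal_db + 6.0),
    }


params_6bit = link_parameters(6)
```

For the 6-bit case the HD3 setting evaluates to roughly -43.9dBc, which is in line with the HD3-dominated SFDR of 44.09dB reported for the static-error spectrum.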



Figure 3.13: Link simulation with model parameters in Table 3.3.

Table 3.4: Simulation Results for 6-bit ADC with 6 Post-Cursor FFE Taps (1 Pre-Cursor Tap) at BER < $10^{-6}$

| Config                                       | EH (%) | EW (UI)       |
|----------------------------------------------|--------|---------------|
| Noise Only                                   | 10.2   | $\approx 0.27$ |
| Static Error                                 | 7.2    | $\approx 0.2$  |
| Static + Jitter/Skew                         | 7.1    | $\approx 0.2$  |
| Static + Jitter/Skew + Dynamic Non-Linearity | 6.5    | $\approx 0.13$ |

## 3.3 Summary

Figure 3.14: Eye diagram for link simulation for 6-bit ADC with 6 post-cursor FFE taps (1 pre-cursor tap) corresponding to only noise and all error sources cases.

In this chapter a greedy-search based power-scaling solution is proposed for 4-PAM. The scheme allows for granular power reduction corresponding to link loss while taking advantage of non-uniform quantization. In general for 4-PAM, non-uniform threshold level selection in the quantizer enables an order of magnitude improvement in BER compared to uniform selection for short reach channels. A time-domain model with dynamic error elements is presented for ADC-based link design. Augmenting the model with realistic parameters allows the designer to specify an ADC-based receiver architecture with a certain margin at the required BER. The model further shows that dynamic effects including skew, jitter and non-linearity do not impact the BER as significantly as static errors such as DNL/INL and noise. For designing a receiver for a medium reach channel of 20dB insertion loss and $1mV_{rms}$ crosstalk, an ADC with 6-bit resolution and an 8 tap FFE (1 pre-cursor, 6 post-cursors) is needed to provide >5% EH and >0.1UI EW at a BER of $10^{-6}$. The circuit implementation in the next chapter is centred around this point.

## Chapter 4

# ADC-Based Receiver Circuit Design and Implementation

## 4.1 Proposed 64Gb/s ADC-Based Link and Receiver Architecture

The complete transceiver architecture targeting 4-PAM at a data-rate of 64Gb/s is envisioned as shown in Figure 4.1. On the transmitter side, a voltage mode TX with a 1 precursor and 1 postcursor-tap FFE is included, as is common in literature [21,48]. The receiver implementation is based on centering the design for a 20dB loss channel and incorporating the greedy search based power scaling described in Section 3.1.1. The receiver has a CTLE incorporating a half-rate sampler, as will be described in Section 4.3.1 on analog front-end (AFE) design. The CTLE is then followed by a 32GS/s 6-bit ADC implemented as a one-bit folding stage, as first described in Section 2.4, followed by a 5-bit full flash. The 6-bit resolution was chosen from the system model simulation in Section 3.2.2 to generate sufficient margin for a 20dB loss channel. Indeed, with a resolution of only 6 bits, a flash architecture is feasible, which enables aggressive power scaling according to link loss.

Vertical symmetry of the input PDF, typical for a wireline link, is assumed for the differential input, hence the non-uniform threshold selection may be applied after the 1-bit folding stage to both reduce power consumption of the flash and reduce the search space of threshold levels. The DSP and feedback loop using greedy search are implemented off-chip in software. The DSP computes a BER estimate based on the PRBS checker output after the data is equalized. In a real 4-PAM system, a forward-error-correction (FEC) unit may provide this BER estimate. The BER is then used to guide greedy search to de-activate the correct level for each iteration. In this implementation, the equalizer coefficients can also be re-adapted for each trial in the greedy search in order to capture any co-dependence between the quantizer threshold level selection and equalizer coefficients.
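The level de-activation loop described above can be sketched in software as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: `greedy_level_removal` and `toy_ser` are hypothetical names, and the metric is a toy symbol error rate on synthetic 4-PAM data with AWGN only (no ISI, no equalizer re-adaptation), standing in for the BER estimate obtained from the PRBS checker.

```python
import numpy as np

PAM4 = np.array([-3.0, -1.0, 1.0, 3.0])


def toy_ser(thresholds, y, symbols):
    """Toy symbol error rate: quantize y, map each bin to its mean, slice to 4-PAM."""
    edges = np.sort(np.asarray(thresholds, dtype=float))
    bins = np.digitize(y, edges)                             # quantizer bin per sample
    sums = np.bincount(bins, weights=y, minlength=len(edges) + 1)
    counts = np.bincount(bins, minlength=len(edges) + 1)
    recon = (sums / np.maximum(counts, 1))[bins]             # bin-mean reconstruction
    decided = PAM4[np.argmin(np.abs(recon[:, None] - PAM4[None, :]), axis=1)]
    return float(np.mean(decided != symbols))


def greedy_level_removal(levels, metric, n_remove):
    """Remove one threshold per iteration, choosing the removal with the lowest metric."""
    active = [float(level) for level in levels]
    history = []
    for _ in range(n_remove):
        # Trial: de-activate each remaining level in turn and evaluate the metric
        trials = [(metric([l for l in active if l != c]), c) for c in active]
        best_metric, victim = min(trials)
        active.remove(victim)
        history.append((victim, best_metric))
    return active, history


rng = np.random.default_rng(0)
symbols = rng.choice(PAM4, size=20000)
y = symbols + 0.4 * rng.standard_normal(symbols.size)
levels = np.linspace(-3.5, 3.5, 15)                          # 15 thresholds = 4-bit quantizer
active, history = greedy_level_removal(levels, lambda th: toy_ser(th, y, symbols), 8)
```

Each iteration costs one metric evaluation per remaining level, so scaling from $N$ to $N - r$ levels needs on the order of $rN$ evaluations rather than the combinatorial number required by an exhaustive search.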



Figure 4.1: Proposed link architecture using a TX with a 3-tap FFE and 6-bit ADC-based receiver with greedy search enabled link power-scaling.

The receiver is shown in more detail in Figure 4.2. The half-rate sampling CTLE is driven by complementary phases of a 16GHz master clock ( $CLK_M$ ) derived from the PLL output. The ADC is implemented in an 8-way time-interleaved structure, with the sampling CTLE driving 4 channels on each side. In each ADC channel, another track and hold samples the data and delivers it to the VGA. The VGA also includes an additional branch (not shown) added to implement a 2-tap FFE intended for 2-PAM mode. Following the VGA is the 1-bit folding stage and 5-bit flash with non-uniform threshold selection enabled for link power scaling.



Figure 4.2: Receiver architecture including front-end half-rate sampling CTLE and 8-way time-interleaved ADC.

## 4.2 Design Technology Considerations

In order to reach the target data-rate of 64Gb/s with good power efficiency, the ADC-based receiver was designed and implemented in a TSMC 16nm FinFET CMOS process. In a FinFET process, the width of the transistor is quantized due to the construction of fins, with a 2 fin device having a width of 58nm, and an increase in width of 48nm/fin thereafter. For example, an 8 fin device would have a width equal to 58 + 6(48) = 346nm. The strength of NMOS and PMOS in this process is roughly equal, with $\frac{g_{mN}}{g_{mP}} = 1.08$ for a 4-fin minimum length ultra-low threshold voltage (ULVT) device. The ULVT devices have a typical threshold voltage approximately equal to 250mV and operate at a nominal supply of 0.8V, while handling a maximum voltage ($v_{ds}$ and $v_{gs}$) of 1.05V. The nominal analog supply (LVCC) was increased to 0.9V in this design in order to achieve sufficient linearity. An additional higher 1.2V supply (HVCC) was also used while respecting the maximum voltage rating of the ULVT devices.
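The quantized width rule stated above is easy to capture in a helper for sizing calculations. The function name `finfet_width_nm` is hypothetical; the 58nm and 48nm/fin constants are those quoted in the text.

```python
def finfet_width_nm(n_fins, n_fingers=1):
    """Total effective width in nm: 58 nm for a 2 fin device plus 48 nm per extra fin."""
    if n_fins < 2:
        raise ValueError("the width rule quoted in the text starts at 2 fins")
    return n_fingers * (58 + 48 * (n_fins - 2))


w_8fin = finfet_width_nm(8)        # 58 + 6(48) = 346 nm
w_4fin_6f = finfet_width_nm(4, 6)  # 6 fingers of a 4-fin device = 924 nm
```

The second value matches the $W_{tot} = 924$nm quoted for the 4-fin 6 finger example layout.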

Layout rules are much more restrictive in 16nm, requiring uniformity and additional dummy surroundings, with the number of DRC rules approximately double that of a planar 28nm CMOS process. An example layout illustration for a 4-fin 6 finger ($W_{tot} = 924$nm, L = 16nm) device is shown in Figure 4.3, surrounded by 2 finger dummies on either side. In addition to layers such as poly (PO), dummy PO (DPO) and cut-PO (CPO) present in planar 28nm CMOS, additional layers such as the FIN GRID, M0OD and M0PO are needed. FIN GRID is an extra layer that specifies FinFET construction when interacting with OD. There may be multiple FIN GRIDs available in a particular process to accommodate various fin pitches, for instance in high density digital circuits versus analog circuits. The local interconnect metal layer (M0) is needed to contact up to M1 and beyond. When interacting with OD, in order to connect source (S) and drain (D), it is known as M0OD. When interacting with PO, in order to connect gate (G), it is known as M0PO. Thus it is essentially like a routable contact layer. This layer is highly resistive and the number of contacts to M1 may be limited due to layout rules and style. For instance in this example, in order to minimize drain source capacitance $C_{ds}$ the contacts are staggered, lowering their number. Nevertheless, it may be useful for instance in digital standard cells to minimize area and for dummy connection, where the G, D, S are shorted together through M0OD extension as shown. Lower metals and contacts also limit performance due to electro-migration (EM) requirements. In particular, FinFETs suffer from self-heating effect (SHE), which can cause up to a 20°C increase in temperature [119], proportional to the number of fins. For minimum length devices, this design restricts the number of fins to be $\leq 8$, targeting a maximum temperature of 80°C without SHE. Note that due to the lower supply and stringent EM requirements, biasing schemes such as peak transit frequency biasing are no longer applicable in FinFET processes.


Figure 4.3: Example layout for a 4-fin 6 finger ( $W_{tot} = 924nm, L = 16nm$ ) device surrounded by 2 finger dummies on either side.

# 4.3 Analog Front-end Design

## 4.3.1 Half-Rate Sampling CTLE

### 4.3.1.1 Design Considerations

Several considerations are taken into account when designing the analog front-end (AFE). First, the number of stages is minimized to save power. Second, timing skew calibration is simplified by using a half-rate sampling structure before sub-sampling at individual ADC channels. This allows the skew calibration to be equivalent to duty cycle correction of the master 16GHz clock, which is generally required anyway. To satisfy both requirements and guarantee a bandwidth larger than the Nyquist frequency of 16GHz, a structure based on the cascode sampler in [116] is used. The fully differential implementation is shown in Figure 4.4. The structure allows one side to track the input while the other side is holding the input, illustrated here for when CLK is high. When tracking, the transistors on the bottom form a cascode, which provides a high impedance. Thus, the PMOS triode load, which has programmable width and a maximum resistance of  $50\Omega$ , together with the capacitance at the output determines the bandwidth. In hold mode, both the PMOS triode load and cascode transistor are off, thus the output is held.

The difference between the original sampler in [116] and the one in this work is that the differential pair trans-conductor is resistively and capacitively degenerated to provide boost at 16GHz, thus merging the CTLE function with the sampling operation and reducing the number of stages in the AFE. One drawback of the original structure is its non-linearity due to device stacking and open-loop operation; however, this is no longer a problem given degeneration. The supply used is only 0.9V, compared to 1.2V in the original work to meet linearity requirements. A variant of the original sampler was also used in [120], where the load devices were replaced with PMOS devices that are reset, leading to an integrating front-end. This was possible due to the mixed signal architecture following the sampler; in this design, however, the common mode must be defined for the following ADC. The degeneration resistor ( $R_s$ ) is implemented as triode MOS switches in 7 slices to save area. There is also a low impedance switch (not shown) in parallel for non-peaking/flat mode. The degeneration capacitor ( $C_s$ ) is implemented as a metal-oxide-metal (MOM) capacitor using M2 as the bottom layer and M7 as the top layer. The clocking structure is simplified by removing the large RC-based level shifters of the original work, as will be discussed next.



Figure 4.4: Two-way time-interleaved CTLE sampler when CLK is high: left side is in track mode and right side is in hold mode.

There are some disadvantages of combining the sampler with the CTLE operation, namely the boost provided by the CTLE is low (<6dB) due to voltage headroom considerations, and extra power must be dissipated in the clocking network. The first point is not a big issue for equalizing medium loss channels. The second point may be an issue because the thermal noise of the sampler in track mode, in addition to sizing of the cascode transistor (to keep it in saturation) results in a large  $M_2$  device size that must be driven by the 16GHz master clock. The clock swing however is reduced significantly from rail-to-rail for the cascode transistor to keep it in saturation. The PMOS load on the other hand uses a full-swing clock to reduce its size for the same on-resistance.

Another consideration is any timing mismatch between the clock driving the cascode (*CLK<sub>s</sub>* /*CLKB<sub>s</sub>*) and the one driving the load (*CLK/CLKB*) will cause common mode problems. For instance as shown in Figure 4.5a, if the cascode device turns off first, the output common mode ( $V_{ocm}$ ) will be pulled high. To remedy these issues, Figure 4.5b shows the added alignment control between clocks (+/-4ps in 1ps resolution, unit delay cell shown in Figure 4.5c) to stabilize the common mode and a CMOS-style reduced swing buffer, which provides similar rise/fall time to the full swing buffered clock, generating *CLK<sub>s</sub>* for the cascode transistor. This reduced swing buffer works by preventing pull-down and pull-up through  $M_n$  and  $M_p$  respectively. For instance, if the main clock driver (INV1) is driving the clock from "0" to "1", node X will fall due to INV2, turning on  $M_p$  and preventing pull-up to solid 1.



Figure 4.5: Clock generation for CTLE cascode sampler: (a) CM shift problem due to clock timing mismatch, (b) clock phase alignment to prevent CM shift, and reduced swing buffer for generating clock for cascode NMOS, (c) unit delay cell for alignment path.

#### 4.3.1.2 Layout Considerations

As stated in Section 4.2, layout in 16nm FinFET is much more restrictive than older planar technologies. Considering EM restrictions and higher metal & contact resistances, lower metal layers should be restricted to  $3\mu$m or less in length in analog designs. An illustrative layout of the input pair of the sampler is shown in Figure 4.6a. Double continuous gate contacts are used as the transistor has 8 fins, however double gate contacts may not be optimal for 4 fins or less as it will require gate extension. The active source (S) and drain (D) regions are extended to provide more contacts to higher horizontal metals. The source and drain vias are staggered to reduce  $C_{ds}$  and rectangular vias are used wherever possible as opposed to square vias. In order to fit rectangular vias, the poly pitch must be increased to 96nm (centre to centre), increasing source and drain area but an overall better trade-off for high speed design.

Analog layout is also more sensitive to its surroundings. Whenever possible, it is good practice to adopt a "sea of gates" approach as shown for the biasing tail transistor (L = 96nm) surrounded by a guard-ring in Figure 4.6b. The individual fingers are divided and arrayed so that the surroundings are uniform with similar gate, source and drain connections, and metal densities. Similarly, a large input device may also need to be arrayed to benefit from the best matching. This becomes even more important as the process technology scales to 7nm and beyond.

## 4.3.2 Sampling Consideration for Sub-ADCs

Two timing schemes are possible for passing the sample from the front-end half-rate sampler (1st rank sampler) to the individual ADC channels: the charge redistribution scheme and the switch multiplexing scheme as shown in Figure 4.7a and 4.7b respectively. In charge redistribution, the input is first sampled by the master clock  $CLK_M$  onto the explicit capacitor  $C_I$ , then sampled again onto  $C_s$  by the sub-ADC clock. There are two kT/C noise events and the signal experiences attenuation. This attenuation can be avoided by placing a buffer after the 1st rank sampler, but this leads to additional power consumption and possible signal distortion. Another disadvantage of the charge redistribution approach is the duty cycle of  $CLK_M$  must be reduced to <50% to allow sufficient margin for  $CLK_{subADC}$ . For instance, in [84], the master clock has a duty cycle of 25%. This duty-cycle in reality is difficult to generate if the master clock is at the highest frequency generated by the PLL. In switch multiplexing,  $CLK_M$  and  $CLK_{subADC}$  are aligned so that the switches are on at the same time, sampling only on to  $C_s$ , avoiding one noise event and signal attenuation at the expense of reduced bandwidth. In this case the capacitance  $C_I$  is the interconnect capacitance and not an explicit sampling capacitor.

(a) Input Device

(b) Bias Device

Figure 4.6: Layout of input transistor (a) showing gate, source and drain connections, and bias tail device (b) showing "sea of gates" style layout.

Figure 4.7: Front-end timing possibilities: (a) charge redistribution, (b) switch multiplexing.

In this work, switch multiplexing is used in order to allow the 1st rank sampler maximum time for tracking while maintaining a 50% duty cycle on  $CLK_M$  as shown in Figure 4.8, illustrating one side of the half-rate CTLE sampler driving 4 sub-ADCs. To minimize the bandwidth penalty, the vertical dimension of each sub-ADC slice is made as small as possible so that the longest route to the sub-ADC T&H is approximately  $100\mu$m. The PMOS (to accept high common mode from CTLE sampler) T&H switch inside each sub-ADC is also sized aggressively for an ON resistance of 45 $\Omega$  at SS corner. In addition to improved PMOS drive strength, FinFET processes benefit greatly from reduced  $R_{on}$  compared to planar processes. Note that from Figure 4.8, it can be clearly seen that as long as  $CLK_M$  falls before  $CLK_{subADC}$ , the skew will only be the duty cycle distortion (DCD) of  $CLK_M$ . The track width of  $CLK_{subADC}$  needs to be wider than  $CLK_M$  (i.e. >31.25ps) across PVT variations, but not so wide that it overlaps with the pulses from other sub-ADCs (i.e. <62.5ps), which would lead to crosstalk between sub-ADCs and degraded bandwidth. In this work, a tracking width of 42ps is used for the sub-ADCs, and  $CLK_M$  and  $CLK_{subADC}$  are aligned by adjusting a delay line in the master clock path. There is no need for a separate delay adjustment on each sub-ADC clock as long as the skew is <11ps, which can be guaranteed by design. A sampling capacitor of approximately 20fF, comprised of layout parasitics and the input capacitance of the next stage, is used for  $C_s$ .
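The window bounds quoted above follow directly from the 32GS/s aggregate rate implied by the 16GHz half-rate clock. The short sketch below reproduces that arithmetic; the variable names and the simple model in `window_ok` (skew treated as a straight reduction of the usable window) are illustrative assumptions, not part of the design.

```python
# Minimal sketch of the track-window constraints; names are hypothetical.
F_AGG_GHZ = 32.0                  # aggregate rate implied by the 16 GHz half-rate clock
ui_ps = 1e3 / F_AGG_GHZ           # 31.25 ps: high time of CLK_M at 50% duty cycle
max_window_ps = 2 * ui_ps         # 62.5 ps: wider windows overlap other sub-ADC pulses
track_ps = 42.0                   # tracking width used in this work

def window_ok(width_ps, skew_ps=0.0):
    """Assumed model: skew shrinks the usable window, which must still cover CLK_M."""
    return (width_ps - skew_ps) > ui_ps and width_ps < max_window_ps

# Margin between the chosen window and the CLK_M pulse width.
skew_margin_ps = track_ps - ui_ps  # 10.75 ps, consistent with the ~11 ps tolerance

print(ui_ps, max_window_ps, skew_margin_ps)
```

Under this model, the 42ps window leaves 10.75ps of margin, which matches the <11ps skew tolerance stated in the text.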

The entire clocking network is shown in Figure 4.9. A 16GHz master clock ( $CLK_M$ ) is provided by an on-chip PLL, delivered by a clock spine and buffered to the AFE. Two paths are created: one for the sub-ADC clock generation and one for the front-end sampling CTLE. The front-end sampling CTLE path contains a duty cycle corrector (DCC), acting as skew control, designed with a resolution of 250fs to cover a range of  $\pm 8\%$  DCD. The clock jitter contributed by the buffers and the DCC is approximately 114fs-rms in RCC (resistor, capacitor, coupling capacitor) layout extracted simulation. This results in a total jitter of 230fs-rms considering 200fs-rms from the PLL clock after distribution. The clock then goes into the circuitry shown previously in Figure 4.5b for alignment (sampler-align) and conditioning to drive the sampling CTLE.
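The total jitter figure is the root-sum-square of the two contributions, which assumes they are uncorrelated. The sketch below reproduces the number; the jitter-limited SNR helper uses the standard $-20\log_{10}(2\pi f \sigma)$ expression and is added only for orientation, as the text does not quote that value.

```python
import math

def rss(*sigmas_fs):
    """Root-sum-square of uncorrelated rms jitter contributions (fs)."""
    return math.sqrt(sum(s * s for s in sigmas_fs))

def jitter_limited_snr_db(f_in_hz, sigma_s):
    """Standard jitter-limited SNR for a full-scale sine at f_in (an added illustration)."""
    return -20.0 * math.log10(2.0 * math.pi * f_in_hz * sigma_s)

total_fs = rss(114.0, 200.0)                           # buffers + DCC, PLL after distribution
snr_db = jitter_limited_snr_db(16e9, total_fs * 1e-15)

print(round(total_fs, 1), round(snr_db, 1))
```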

Figure 4.8: Implemented sampling structure with front-end sampler passing samples to sub-ADCs.

On the sub-ADC path, a delay stage is inserted with 2-bit coarse control (20ps step) (track-coarse-align) and 3-bit fine control (2.5ps step) (track-fine-align) to align the tracking window of the sub-ADC clocks to the master clock to enable switch-multiplexing. The clock then goes into a divide-by-4 to generate 8 phases of a 4GHz clock for 8 sub-ADCs with a simulated jitter of 644fs-rms post-extraction. Two adjacent phases of the 8 phases are routed to each sub-ADC and the tracking pulse is generated locally at each sub-ADC with a width of 42ps as previously stated. The entire block as shown, excluding the local duty cycle modifier at the sub-ADCs, consumes approximately 9.9mW.

## 4.3.3 Simulation Results

Figure 4.9: Clocking network for front-end sampler and sub-ADC phases.

In order to verify the performance of the AFE, the entire network in Figure 4.8 including clock generation, mock top level route to sub-ADCs, and T&H switches was RCC extracted and simulated at TT corner 80°C. The skew between the 8 phases due to layout was verified to be approximately  $\sigma = 1.2$ ps due to the divider and 6ps due to routing length difference to sub-ADCs, where an H-tree was not used in order to minimize routing length at the expense of skew. The resulting systematic skew is still less than the tolerable amount of 11ps. In the following simulations, an aggregate sampling frequency of 32.5GS/s was used with CTLE load set to 5% from maximum, input common mode set to 550mV and duty cycle corrector tuned to optimal setting. An input of -1dBFS corresponding to a FSR of 400mV<sub>ppd</sub> was used and the peaking on the sampler was disabled.

Using the optimal coarse delay setting for track-coarse-align, Figure 4.10 shows the SNDR of the aggregate output for a Nyquist frequency input for all track-fine-align settings and 3 sampler-align settings. It is clear that the SNDR achieves a maximum when the track phase and sampler alignment are swept and optimized. In order to verify that the half-rate front-end sampler can minimize the effect of skew, the AFE was simulated with a skew vector of  $[0.8, 0.05, 0.7, 0.3, 0.4, 0.8, 0.6, 0.25] \times 8\text{ps}$  ( $\sigma = 2.2\text{ps}$ ), corresponding to added delay on each channel with code setting frozen (not optimized). Figure 4.11 shows the input fundamental amplitude and SNDR across frequency. It is clear that the bandwidth of the AFE is much higher than the Nyquist frequency of 16GHz and that skew causes <3dB degradation in SNDR. On the other hand, without the front-end sampler, this skew would result in approximately  $\text{SNR} = -20\log_{10}(2\pi \times 16\text{GHz} \times 2.2\text{ps}) = 13.1\text{dB}$ .
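The skew figures in this paragraph can be checked numerically. The sketch below recomputes the standard deviation of the skew vector (the sample standard deviation reproduces the quoted 2.2ps) and evaluates the skew-limited SNR expression from the text; function and variable names are illustrative.

```python
import math
import statistics

# Skew vector from the text, scaled by 8 ps.
weights = [0.8, 0.05, 0.7, 0.3, 0.4, 0.8, 0.6, 0.25]
skew_ps = [w * 8.0 for w in weights]

# Sample standard deviation (n - 1 in the denominator), about 2.23 ps.
sigma_ps = statistics.stdev(skew_ps)

def skew_limited_snr_db(f_in_hz, sigma_s):
    """SNR = -20*log10(2*pi*f_in*sigma), as used in the text."""
    return -20.0 * math.log10(2.0 * math.pi * f_in_hz * sigma_s)

# Evaluated with the rounded 2.2 ps value quoted in the text.
snr_db = skew_limited_snr_db(16e9, 2.2e-12)

print(round(sigma_ps, 2), round(snr_db, 1))
```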

# 4.4 Sub-ADC Design

## 4.4.1 Architecture and Timing

Figure 4.10: Optimizing AFE SNDR using clock alignment controls.

Figure 4.11: AFE input amplitude and SNDR versus frequency with skew added.

As shown in Figure 4.12, each 4GS/s ( $f_{sn} = f_s/8$ ) 6-bit sub-ADC consists of a PMOS sampling switch, a trans-conductor (gm) stage which also acts as variable gain amplifier (VGA), a 1-bit folding stage which is triggered by the MSB comparator decision, and a 5-bit full flash buffered by a PMOS source follower (SF) followed by a Wallace tree adder. The VGA is the only circuit in the RX-AFE which uses a 1.2V (HVCC) supply in order to accept the high output common mode ( $\approx 0.7V$ ) of the CTLE sampler. The use of the 1.2V supply also allows the VGA to incorporate an additional branch to implement a 2-tap sampled FFE, which will be discussed later. A 0.9V (LVCC) supply is used for all other blocks. Each comparator in the full-flash includes a clock-gated enable (EN) so that a subset may be de-activated for non-uniform quantization to save power.



Figure 4.12: Sub-ADC circuit architecture: gm-stage implementing VGA followed by 1-bit folding and 5-bit full flash with 31 comparators with clock gating and Wallace tree adder.

Figure 4.13 illustrates the complete timing diagram. A divide-by-4 generates 8 phases of a 50% duty-cycle 4GHz clock from the 16GHz master clock as was shown previously. All clocks within each sub-ADC are generated from 2 adjacent phases of the 4GHz clock which are routed to each sub-ADC. These 2 adjacent phases in conjunction with a variable delay allows the track pulse width to be set >1/32GHz. All other clocks are generated with simple combinatorial logic as well. The decision of the comparator activates the folding switches in the appropriate direction immediately after it is ready. The switches are then opened slightly before *CLK<sub>MSB</sub>* falls low. This allows the voltage to remain unaffected by the kickback caused by the reset of the MSB comparator. Because the switches are open during MSB comparator reset, they are also used to pipeline the conversion as highlighted in red; the tracking window, the settling time of the VGA, as well as the initial triggering of the MSB comparator all take place at the same time as the LSB comparison of the previous sample. This works as long as the folding switches are inactive during LSB comparison, which is guaranteed by design. As a safety precaution, the MSB comparator activation time is also tunable, i.e. the settling time of the VGA is variable by +/-10ps to potentially allow the folding activation to occur later. The pipelining used in this work is different from prior art [11] in which an additional S&H is added between the VGA and the input to MSB comparator.



Figure 4.13: Sub-ADC timing diagram: folding switches are used to pipeline the conversion.

In order to reduce intersymbol interference (ISI), the input to the SF is reset at the end of the LSB comparison cycle. This improves the SNDR significantly, especially when the input is at  $f_{sn}/4$. This is because the 1-bit folding operation essentially doubles the highest frequency content of the waveform. This can be easily seen by considering a triangular wave input: the 1-bit folded waveform is simply a triangular wave at double the frequency with a DC offset. Similarly, a Nyquist signal at input frequency near  $f_{sn}/2$  now appears near  $f_{sn}$, which when aliased becomes a very "slow" signal near DC, while a signal at input frequency near  $f_{sn}/4$  now appears near  $f_{sn}/2$, the fastest-varying sequence the sub-ADC can see. One disadvantage of this reset is that it increases the meta-stability probability of the LSB comparator, as it overlaps the decision time as shown in Figure 4.13.
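The frequency-doubling argument can be illustrated with a small aliasing calculation. This is a simplified sketch that tracks only the dominant component of the folded waveform (twice the input frequency) and folds it back into the first Nyquist zone of the 4GS/s sub-ADC; the function names are hypothetical.

```python
def alias(f, fs):
    """Apparent frequency of a tone at f after sampling at fs (folded into 0..fs/2)."""
    f = f % fs
    return min(f, fs - f)

def folded_apparent(f_in, fs):
    """Dominant apparent frequency after ideal 1-bit folding, which doubles f_in."""
    return alias(2.0 * f_in, fs)

FSN_GHZ = 4.0  # sub-ADC sample rate

near_nyquist = folded_apparent(1.9, FSN_GHZ)   # input near fsn/2 -> slow signal near DC
quarter_rate = folded_apparent(1.0, FSN_GHZ)   # input at fsn/4 -> lands at fsn/2

print(near_nyquist, quarter_rate)
```

An input near $f_{sn}/2$ therefore changes slowly from sample to sample after folding, whereas an input at $f_{sn}/4$ alternates at the highest possible rate, which is where residual memory at the SF input hurts most.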

## 4.4.2 VGA and FFE Design

The VGA operates off a 1.2V supply to accept the high common mode (600-700mV) output from the CTLE sampler. In order to implement the FFE functionality, several approaches can be used. In prior art [104] the approach shown in Figure 4.14a is used, where a separate T&H is added to each sub-ADC channel to sample the data. In this example, the previous sample is used to implement a post-cursor tap. This increases the complexity of clock generation needed at the sub-ADC and also increases the load of the previous stage driving the sub-ADCs by a factor of 2. In order to prevent bandwidth degradation at this critical point, the approach in Figure 4.14b is used instead. Here the samples are converted to current by a gm stage and then routed to the appropriate adjacent sub-ADCs. Taking advantage of the 1.2V supply, a common gate (CG) device is added to buffer the potentially long route so that bandwidth degradation at the summing voltage output node is minimal. In addition, the CG device helps isolate the output when the FFE is turned off. This is similar to [121], in which cascode devices are clocked to implement a switching matrix.



Figure 4.14: FFE implementation from prior art (a) and in this work (b) using current mode routing with common gate buffering.

Figure 4.15: VGA/FFE circuit implementation with relevant device sizes.

Figure 4.15 shows the circuit implementing 1 post-cursor tap with relevant device sizes. Here the post-cursor tap is hard-wired for subtraction as the sign of the post-cursor in the pulse response in a wire-line setting is almost always positive. When the FFE is not in use, both the bias ( $M_B$ ) and cascode device ( $M_C$ ) are turned off using *ENB* for the best isolation. In order to lower thermal noise in VGA mode, a resistive load is used instead of an active load. Note that thermal noise in FinFET processes is much higher than in planar processes. In this 16nm FinFET process, the excess noise factor ( $\gamma$ ) is around 1.8 (cf.  $\gamma = 2/3$  for long channel devices).

The use of the resistive load has a disadvantage of reducing the gain by 6dB in FFE mode in order to keep the output common mode the same by switching in another resistive load branch in parallel. The degeneration is provided by a triode PMOS device whose gate voltage *Vtune* is controlled by a 5-bit resistive DAC (RDAC). In FFE mode, the main cursor is automatically set to maximum gain and the RDAC is used to control the post-cursor with a range of 3dB. The maximum gain achieved in VGA mode is 5.4dB at an input common mode of 700mV with a bandwidth of 8.9GHz after RCC layout extraction with appropriate loading. The integrated output noise is approximately  $1.1mV_{rms}$ .

## 4.4.3 Comparator Design

The MSB comparator, shown in Figure 4.16a, is a modified double-tail latch [122] utilizing a single phase clock with a dynamic pre-amplifier as the 1st stage and latch as the 2nd stage. Offset cancellation is implemented using a signed 5-bit C-DAC (2 bit thermometer, 3 bit binary) at nodes  $D_{IN}$  and  $D_{IP}$ . The C-DAC covers >3 $\sigma$  ( $\sigma$ = 1.5LSB, LSB = 6.25mV) of comparator offset with a nominal step size of approximately  $\frac{1}{4}$  LSB. The C-DAC provides lower thermal noise than adding an additional input pair or current DAC for calibration. The disadvantage is added propagation delay in the comparator due to extra loading. The LSB comparators are of similar structure as shown in Figure 4.16b, where the input is changed to NMOS devices and an additional input pair is added for differential references. The devices are sized smaller since their operation is less critical than that of the MSB comparator. The LSB comparator achieves an input referred noise of  $1.24mV_{rms}$  at a power consumption of 0.73mW (including clock drivers). Note the LSB comparator is sized for a thermal noise level lower than the quantization noise level ( $1.8mV_{rms}$ ). Similarly the MSB comparator achieves an input referred noise of  $0.7mV_{rms}$ .
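The noise and calibration-range numbers above can be cross-checked with a few lines of arithmetic. The sketch below assumes the usual uniform-quantizer noise of LSB/$\sqrt{12}$ and a C-DAC code range of ±31, the range quoted in the measurement chapter; variable names are illustrative.

```python
import math

LSB_MV = 6.25            # LSB size from the text
SIGMA_OFFSET_LSB = 1.5   # comparator offset sigma from the text
STEP_LSB = 0.25          # nominal C-DAC step, approximately 1/4 LSB
MAX_CODE = 31            # assumed +/- code range (quoted later for the measured C-DAC)

# Quantization noise of an ideal uniform quantizer.
q_noise_mv = LSB_MV / math.sqrt(12.0)                 # about 1.80 mV rms

# Offset sigma in C-DAC codes and the number of sigmas the range covers.
sigma_codes = SIGMA_OFFSET_LSB / STEP_LSB             # 6 codes
coverage_sigma = MAX_CODE * STEP_LSB / SIGMA_OFFSET_LSB

print(round(q_noise_mv, 2), sigma_codes, round(coverage_sigma, 2))
```

With these assumptions the 1.24mV rms comparator noise sits below the 1.8mV rms quantization noise, and the offset sigma corresponds to 6 C-DAC codes.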

In this technology, the LSB comparators achieve a simulated regeneration time constant of approximately 3ps after layout extraction. With a propagation delay of  $\approx$ 45ps and a folder reset time of  $\approx$ 30ps, the regeneration time is  $\approx$ 50ps with half of a period (125ps) of total decision time. This corresponds to a metastability probability of  $\approx 10^{-6}$. This is fine since the SR latch and digital logic following the dynamic comparator provide additional regeneration and for 4-PAM the BER target is 10<sup>-6</sup>. Due to high thermal noise in FinFET technologies compared to planar, the LSB comparators, given their sizes, cause significant kickback onto the source follower driving them. In order to alleviate this, two techniques are used as shown in Figure 4.17. In Figure 4.17a, the dynamic pre-amplifier is augmented so that  $M1_k \& M2_k$  reset the source node so that the sampled voltage doesn't drift,  $M3_k \& M4_k$  convert differential kickback to common-mode, and  $M5_k \& M6_k$  are added at the gate of the input devices to supply charge to restore common mode [123]. Since the input to the full-flash is folded, only positive differential voltages are present, which causes unidirectional kickback. Thus, as shown in Figure 4.17b, the references to the comparators are shifted by 2 LSBs to account for this systematic shift. This allows the CDAC for offset cancellation to have additional margin.

### 4.4.4 Reference Generation

For reference generation, the R-ladder shown in Figure 4.18 draws only  $50\mu$ A, with a 1.2pF decoupling capacitor at each node for a maximum single-ended droop of  $\frac{1}{2}$ LSB. A single-stage op-amp is used in feedback with a loop gain of 40dB to set the center tap voltage of the ladder. This center tap voltage is adjustable through an R-DAC, and the bias current of the ladder is also adjustable. This enables LSB size and therefore input full scale range adjustment, or additional calibration range for PVT variations.



| Parameter           | Value        |
|---------------------|--------------|
| (W/L) <sub>M1</sub> | 6x8fin/16nm  |
| (W/L) <sub>M2</sub> | 10x8fin/16nm |
| (W/L) <sub>M3</sub> | 6x8fin/16nm  |
| (W/L) <sub>M4</sub> | 6x8fin/16nm  |
| (W/L) <sub>M5</sub> | 12x8fin/16nm |
| (W/L) <sub>M6</sub> | 4x8fin/16nm  |
| (W/L) <sub>M7</sub> | 12x8fin/16nm |
| (W/L) <sub>M8</sub> | 10x8fin/16nm |

(a) MSB Comparator



| Parameter           | Value       |
|---------------------|-------------|
| (W/L) <sub>M1</sub> | 2x8fin/16nm |
| (W/L) <sub>M2</sub> | 6x8fin/16nm |
| (W/L) <sub>M3</sub> | 4x8fin/16nm |
| (W/L) <sub>M4</sub> | 4x8fin/16nm |
| (W/L) <sub>M5</sub> | 8x8fin/16nm |
| (W/L) <sub>M6</sub> | 2x8fin/16nm |
| (W/L) <sub>M7</sub> | 8x8fin/16nm |
| (W/L) <sub>M8</sub> | 6x8fin/16nm |

(b) LSB Comparator

Figure 4.16: Comparator design with key device sizes.



Figure 4.17: Kickback compensation techniques for LSB comparators by: (a) augmenting the dynamic pre-amplifier and (b) shifting the references to account for systematic offset.



Figure 4.18: Implementation of resistive ladder for LSB comparator reference generation.

## 4.4.5 Wallace Tree Adder Encoder

A Wallace tree adder is used to encode the thermometer outputs of the LSB comparators into 5-bit binary. This encoder, a ones adder, has the advantage that the order of the comparators doesn't matter, which is beneficial given that comparators are being turned off for power scaling. Another advantage is that the structure is regular, allowing pipelining to be done easily. The theoretical maximum clock frequency is set by the minimum clock period  $T_{clk} > T_{cq} + T_{adder} + T_{su}$, where  $T_{cq}$  and  $T_{su}$  are the clock-to-q delay and setup time of the register respectively, and  $T_{adder}$  is the propagation delay of the adder. The unit adder is implemented as a transmission gate based full adder and the full structure is built using full custom design with sub-blocks that are pipelined with a minimal number of stages to meet the required clock rate of 4GHz. The simulated power consumption is only 1.5mW. An example 7:3 encoder and 15:4 encoder sub-block is shown in Figure 4.19.
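The order-independence of a ones adder can be seen in a small behavioural model. The following is a sketch of one possible 7:3 encoder built from four full adders; the specific wiring is an assumption chosen for illustration and is not taken from Figure 4.19.

```python
from itertools import product

def full_adder(a, b, c):
    """One-bit full adder returning (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def wallace_7to3(bits):
    """Count the ones in 7 comparator outputs using four full adders."""
    x0, x1, x2, x3, x4, x5, x6 = bits
    s1, c1 = full_adder(x0, x1, x2)     # weight-1 sum, weight-2 carry
    s2, c2 = full_adder(x3, x4, x5)
    s3, c3 = full_adder(s1, s2, x6)     # s3 is output bit 0
    s4, c4 = full_adder(c1, c2, c3)     # s4 is bit 1, c4 is bit 2
    return s3 | (s4 << 1) | (c4 << 2)

# A thermometer code and a shuffled version of it give the same count.
print(wallace_7to3((1, 1, 1, 0, 0, 0, 0)), wallace_7to3((0, 1, 0, 1, 0, 0, 1)))
```

Because the output depends only on how many inputs are high, disabling or reordering comparators does not require rewiring the encoder, only a remapping of the resulting count.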



Figure 4.19: Implementation of 7:3 (a) and 15:4 (b) Wallace encoders.

### 4.4.6 Layout and Floorplan

The layout of the sub-ADC is shown in Figure 4.20 with key components labelled. As stated previously, the vertical dimension is minimized so that the routing from the front-end sampler to the sub-ADCs is minimized when they are stacked vertically. This results in an active area of  $60\mu m \times 290\mu m$, the majority of which is taken by the LSB comparator array, the R-ladder and decap. The LSB comparators are arranged in two rows, with their input and clock routed in between the two rows on high metals. The references are then provided from the right through the R-ladder. The Wallace encoder area is minimized through full custom layout and placed below the LSB comparators; its output is then routed and available at the bottom left corner of the sub-ADC.



Figure 4.20: Layout floorplan with sub-ADC components illustrated.

## 4.4.7 Simulation Results

Figure 4.21 shows the simulated power breakdown of the sub-ADC at 4GS/s after RCC layout extraction at TT corner 80°C with a total power consumption of 32.6mW. The majority of dynamic power is consumed by the LSB comparators (0.73mW/comparator) and the majority of static power by the source follower (SF) (3.6mW) driving them. The entire sub-ADC was also RCC layout extracted and simulated using parasitic reduction ( $f_{max} = 150$ GHz). A rough calibration was performed to set the comparator thresholds to remove the effects of systematic offset. Figure 4.22 shows the low frequency and high frequency spectrum (256 point FFT) with a low frequency SNDR of 31.88dB and high frequency SNDR of 31.9dB (ENOB = 5.01). The SFDR is limited by the third harmonic, which can be improved through better calibration.
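The ENOB value follows from the standard relation ENOB = (SNDR − 1.76)/6.02, and the comparator share of the power budget from the 31 LSB comparators shown in Figure 4.12. A brief sketch of both calculations, with illustrative names:

```python
def enob(sndr_db):
    """Effective number of bits from SNDR in dB (standard sine-wave relation)."""
    return (sndr_db - 1.76) / 6.02

enob_low = enob(31.88)    # low-frequency input
enob_high = enob(31.9)    # high-frequency input

# Dynamic power of the 5-bit flash: 31 LSB comparators at 0.73 mW each.
lsb_comparator_mw = 31 * 0.73
share_of_total = lsb_comparator_mw / 32.6

print(round(enob_high, 2), round(lsb_comparator_mw, 2), round(share_of_total, 2))
```

Under these figures the LSB comparators account for roughly two thirds of the 32.6mW total, which is why gating them off is the main lever for power scaling.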

Transient noise was turned on with  $f_{min} = 10$ MHz (limited by simulation time) and  $f_{max} = 200$ GHz. Figure 4.23 compares the performance with transient noise on and off: turning noise on results in only about 1dB degradation in SNDR. This means the comparators could possibly be sized smaller to save power. Note that the SNDR is also consistently flat across frequency, meaning that the folding effect is well compensated for.

# 4.5 Top Level Design

## 4.5.1 Floorplan

At the top-level, 4 sub-ADCs are arranged vertically on each side of the CTLE sampler, which is placed in the middle as shown in Figure 4.24. The clock is distributed from the right, from the clock spine to the duty cycle corrector (DCC) and divide-by-4 (DIV4) used to generate sub-ADC phases. The differential input is distributed from the left, from the RX input bumps.



Figure 4.21: Power breakdown of 4GS/s sub-ADC with a total power consumption of 32.6mW from 0.9/1.2V supplies.



Figure 4.22: Sub-ADC FFT low frequency and Nyquist frequency spectrum (256 points), achieving an ENOB of 5 at Nyquist.

Figure 4.23: SNDR/SFDR across frequency for transient noise ON and OFF, showing approximately 1dB degradation in SNDR.

The primary and secondary ESD devices and the route to the CTLE sampler cause the input bandwidth to be limited to 25GHz (RCC extraction simulated). The outputs of the 8 sub-ADCs are routed down through the midsection while being retimed to a common clock phase and finally sent to a de-serializer gearbox. This gearbox converts 8 streams of data (4GHz each) into 28 streams ( $\approx$ 1.14GHz each) in order to store the ADC data in memory. The parallelization factor is chosen to conform to the existing memory block provided by research collaborator Huawei and would not be chosen in a real design. The power of this gearbox is not optimized and was simulated to be 23mW.

## 4.5.2 Simulation of 4-way TI ADC

To verify the performance, 4 sub-ADCs with retiming were RCC extracted and simulated at TT 80°C with a dummy master switch of 50 $\Omega$  resistance and the same track time as the CTLE sampler. This was to verify that the systematic offset, gain, and skew (bandwidth) mismatch due to layout were within design tolerance. Figure 4.25a shows the SNDR/SFDR of all 4 sub-ADCs as well as input amplitude across the Nyquist band. The maximum systematic offset is negligible at 0.1LSB, and the maximum systematic gain mismatch observed is <1.5dB, much smaller than the range covered by the VGA and R-ladder LSB tuning. Figure 4.25b shows the SNDR/SFDR of the 4-way time-interleaved ADC before and after gain mismatch is removed. Gain mismatch severely restricts SFDR and in turn SNDR, as can be seen in Figure 4.26, which shows the spectrum (1024 point FFT) with a Nyquist frequency input before and after gain mismatch removal. The degradation in SNDR at high frequency can be improved by increasing the tracking window of the sub-ADCs.



Figure 4.24: Top level ADC floorplan with CTLE sampler in the middle and 4 sub-ADCs on either side.



Figure 4.25: SNDR/SFDR RCC extracted simulation for 16GS/s 4-way TI ADC.



Figure 4.26: Spectrum (1024 point FFT) illustrating effect of gain mismatch on Nyquist frequency input for 16GS/s 4-way TI ADC.

## 4.5.3 Digital Back-End

In order to implement greedy search and enable non-uniform quantization, post-processing blocks were constructed in software as shown in Figure 4.27. An encoder maps the output of the Wallace adder, depending on the comparator enable signals, to the correct new quantization level. For a hardware implementation, it can be constructed as a look-up table (LUT) with a simulated power consumption of  $\approx 1.5$ mW per sub-ADC in this technology. The digital equalizer implementing FFE/DFE taps follows the LUT and its software construction enables investigation of various structures depending on link loss. For measurement purposes, two types of adaptation are used to converge the taps of the equalizer. The first is the typical LMS decision directed blind equalizer, used when the BER is low. Its update equation is shown in Equation 4.1, where *w* is the set of equalizer tap weights, *x* is the equalizer input, *y* is the equalizer output, $\mu$ is the step size, and  $\hat{y}$  is the output of the slicer (decision device). The second is a simple Sato's blind equalizer [124], shown in Equation 4.2 and used for high BER, where *s* is the transmitted symbol sequence and *E* is the expectation operator. Note that for 2-PAM, the coefficient  $R_1$  is equal to 1 and Sato's equalizer is the same as its LMS decision directed counterpart.

$$w[n+1] = w[n] - \mu(y[n] - \hat{y}[n])x[n]$$
(4.1)

$$w[n+1] = w[n] - \mu(y[n] - R_1 sign(y[n]))x[n], \quad R_1 = \frac{E[s[n]^2]}{E[|s[n]|]}$$
(4.2)
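The two update rules can be written side by side in a few lines. The sketch below is a minimal software illustration, not the thesis code: it assumes a real-valued FFE, normalized equiprobable symbol levels (±1, ±3 for 4-PAM), and that `x` holds the input samples currently in the tap delay line. All function names are hypothetical.

```python
def sato_r1(levels):
    """R1 = E[s^2] / E[|s|] for equiprobable symbol levels."""
    return sum(s * s for s in levels) / sum(abs(s) for s in levels)

def equalizer_output(w, x):
    """y[n]: dot product of tap weights and the samples in the delay line."""
    return sum(wi * xi for wi, xi in zip(w, x))

def slicer(y, levels):
    """Decision device: nearest symbol level."""
    return min(levels, key=lambda s: abs(y - s))

def lms_dd_update(w, x, mu, levels):
    """Equation 4.1: error measured against the slicer decision."""
    y = equalizer_output(w, x)
    err = y - slicer(y, levels)
    return [wi - mu * err * xi for wi, xi in zip(w, x)]

def sato_update(w, x, mu, r1):
    """Equation 4.2: error measured against R1 * sign(y)."""
    y = equalizer_output(w, x)
    sign = (y > 0) - (y < 0)
    err = y - r1 * sign
    return [wi - mu * err * xi for wi, xi in zip(w, x)]

PAM4 = [-3, -1, 1, 3]
PAM2 = [-1, 1]
print(sato_r1(PAM4), sato_r1(PAM2))
```

With these normalized levels, $R_1 = 2.5$ for 4-PAM and $R_1 = 1$ for 2-PAM, so for 2-PAM the Sato error term reduces to the decision-directed error, as noted in the text.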



Figure 4.27: Digital back-end implemented in software for enabling greedy search.

# **Chapter 5**

# **Measurement Results and Discussion**

The prototype receiver was fabricated in a TSMC 16nm FinFET CMOS process. The RX occupies an active area of  $650\mu$m×250 $\mu$m as shown in Figure 5.1 with die photo and layout floorplan. In order to test the RX, a TX designed by research collaborator Huawei Canada on the same chip was used. The TX achieves an RLM>0.99 at a maximum  $1.1V_{ppd}$  full swing. It also features a 3-tap FFE (1 pre-tap, 1 post-tap) which was used in link measurements. De-embedded TX eye diagrams show a minimum EH and EW of 328mV and 11.47ps respectively at 64.375Gb/s for a PRBS15 pattern. Clock pattern jitter decomposition at 32.1875Gb/s shows an RJ of 162fs-rms and a TJ of 2.82ps at  $10^{-12}$ . As the TX is located closer to the PLL, the RJ at the RX is expected to be higher than this number.



Figure 5.1: Die photo and layout view with important blocks annotated of ADC-based receiver, core area of  $650\mu$ m $\times 250\mu$ m.

# 5.1 ADC Performance

### 5.1.1 Static Performance

The ADC was first characterized with sine-wave testing. In order to calibrate the comparators, the output of an off-chip DAC with $\pm 2$mV differential error ($\approx 1/3$ LSB) was applied to the input and stepped. The VGA gain was adjusted to approximately center the comparator CDAC codes. Figure 5.2 shows the conceptual calibration scheme with offset sources. Note that in this prototype no additional calibration was present in the VGA or folding switches. This leads to a residual offset from the folding switches, for instance due to threshold mismatch and because the impedance presented in the different folding modes is not the same. The MSB comparator is calibrated first; the LSB comparators are then calibrated by averaging the comparator offset CDAC codes obtained for a $V_{cal} = +V$ input and a $V_{cal} = -V$ input to account for this effect. At each step 1000 comparator outputs are averaged; a binary search is performed first, followed by a linear search, for a total of 10 steps.



Figure 5.2: Offset calibration of sub-ADC using off-chip DAC to generate  $V_{cal}$ .
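The search procedure described above (1000 averaged decisions per step, a binary search followed by a linear search, 10 steps in total, and averaging of the codes found for $+V$ and $-V$) can be sketched as follows. This is only an illustration under stated assumptions: the comparator model (offset expressed in CDAC codes plus Gaussian noise), the noise level, the split into 5 binary and 5 linear steps, and all function names are hypothetical and are not taken from the test software.

```python
import random


def count_ones(code, offset, noise, n_avg, rng):
    """Number of '1' decisions out of n_avg for a comparator whose residual
    offset (in CDAC codes) is offset - code, disturbed by Gaussian noise."""
    return sum(1 for _ in range(n_avg)
               if offset - code + rng.gauss(0.0, noise) > 0.0)


def calibrate_cdac(offset, noise=0.5, n_avg=1000, n_binary=5, n_linear=5,
                   max_code=31, seed=0):
    """Binary search then linear search for the CDAC code that brings the
    averaged comparator output closest to 50% ones."""
    rng = random.Random(seed)
    code, step = 0, (max_code + 1) // 2
    best_code, best_err = code, float("inf")
    for i in range(n_binary + n_linear):
        ones = count_ones(code, offset, noise, n_avg, rng)
        err = abs(ones - n_avg / 2)
        if err < best_err:
            best_code, best_err = code, err
        # Move toward the trip point: up if mostly ones, down otherwise
        direction = 1 if ones > n_avg / 2 else -1
        code = max(-max_code, min(max_code, code + direction * step))
        if i < n_binary:
            step = max(1, step // 2)  # halve the step, then stay at 1 code
    return best_code


def calibrate_lsb_comparator(offset_pos, offset_neg, **kwargs):
    """Average the codes found for +V and -V inputs, which differ because of
    the (assumed) folding-switch offset."""
    return (calibrate_cdac(offset_pos, **kwargs)
            + calibrate_cdac(offset_neg, **kwargs)) / 2


# Hypothetical comparator: 7.3 codes of offset, seen as 9.2 / 5.4 codes
# in the two folding modes
single_mode_code = calibrate_cdac(7.3)
averaged_code = calibrate_lsb_comparator(9.2, 5.4)
```

In this toy model the folding-switch contribution shifts the apparent offset in opposite directions for the two input polarities, so the averaged code lands near the comparator's own offset.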

Figure 5.3 shows an example calibration with comparator CDAC codes converging for all comparators in one sub-ADC, where diamond markers indicate an input of -V and round markers indicate an input of +V. If the error in the folding operation were negligible, the codes for both situations would be the same; however, this is not the case. The CDAC code histogram is shown in Figure 5.4, with a standard deviation ($\sigma$) of 7.75 codes for all 248 LSB comparators across the 8 sub-ADCs. This is higher than the simulated Monte Carlo offset of 6 DAC codes; however, the CDAC has more than enough range (±31) to cover it. The DNL/INL for all 8 sub-ADCs, measured with a low-frequency sine wave using the histogram method, is plotted in Figure 5.5. The DNL is bounded between -0.97/+1.38 LSB and the INL is bounded between -1.6/+1.38 LSB, corresponding to a full-scale range of $400mV_{ppd}$.
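For readers unfamiliar with the histogram method, the sketch below shows one common formulation: the cumulative code histogram of a slightly overdriven sine wave is mapped through a cosine to estimate the code transition levels, from which DNL and INL follow. The exact post-processing used for Figure 5.5 is not described in the text, so this is an assumed textbook variant; the 6-bit quantizer with one threshold deliberately shifted by 0.5 LSB is a hypothetical test case.

```python
import bisect
import math


def sine_histogram_dnl_inl(codes, n_codes):
    """Return (dnl, inl) in LSB for the interior codes 1 .. n_codes-2,
    estimated from ADC output codes captured with a sine-wave input."""
    hist = [0] * n_codes
    for c in codes:
        hist[c] += 1
    total = len(codes)
    # Cumulative histogram -> transition levels of a unit-amplitude sine;
    # transitions[k] is the upper edge of code k
    transitions, running = [], 0
    for k in range(n_codes - 1):
        running += hist[k]
        transitions.append(-math.cos(math.pi * running / total))
    widths = [transitions[k] - transitions[k - 1]
              for k in range(1, n_codes - 1)]
    lsb = sum(widths) / len(widths)          # average code width = 1 LSB
    dnl = [w / lsb - 1.0 for w in widths]
    inl, acc = [], 0.0
    for d in dnl:
        acc += d
        inl.append(acc)
    return dnl, inl


# Hypothetical 6-bit quantizer over [-1, 1) with one threshold moved up
N_CODES = 64
LSB = 2.0 / N_CODES
thresholds = [-1.0 + LSB * (k + 1) for k in range(N_CODES - 1)]
thresholds[31] += 0.5 * LSB                  # upper edge of code 31

# Coherently sampled sine, 5% overdrive so the end codes are exercised
N_SAMPLES, CYCLES = 1 << 16, 1021
codes = [bisect.bisect_right(
             thresholds,
             1.05 * math.sin(2 * math.pi * CYCLES * n / N_SAMPLES))
         for n in range(N_SAMPLES)]
dnl, inl = sine_histogram_dnl_inl(codes, N_CODES)
```

Because `dnl[j]` corresponds to code `j + 1`, the shifted threshold appears as a DNL of about +0.5 LSB at `dnl[30]` (code 31) and about -0.5 LSB at `dnl[31]` (code 32), with all other codes close to zero.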

Figure 5.3: Offset calibration example for a single sub-ADC showing CDAC convergence for both positive and negative  $V_{cal} = |V|$ .



Figure 5.4: Histogram of all LSB comparators across 8 sub-ADCs showing a standard deviation of 7.75 DAC codes

## 5.1.2 Dynamic Performance

#### 5.1.2.1 Sub-ADC Performance

Figure 5.6 shows the SNDR/SFDR for all 8 sub-ADCs across their Nyquist band at a sampling rate of 4.0234GS/s (rate defined by PLL master clock). The worst and best SNDR at Nyquist frequency are 27.71dB and 32.47dB respectively. Notice that the odd-numbered sub-ADCs in general perform better than the even-numbered ones, pointing to a systematic error caused by placement, as they were arranged on either side of the CTLE sampler depending on whether they were odd or even. This may be due to the supply distribution network, which was stronger on the odd-numbered side because of blockage caused by the RX input bumps on the even-numbered side. The SFDR is limited by the third harmonic and the overall performance is very close to the simulated results presented in Section 4.4.7. Due to the folding operation, the SNDR/SFDR curves have a slight droop at mid-band; however, they do not drop significantly at the half-Nyquist frequency of $\approx$1GHz thanks to the reset introduced in this prototype. Figure 5.7 shows the spectrum before and after reset is enabled, clearly illustrating its effectiveness by improving the SFDR by 6.6dB.



Figure 5.5: DNL and INL after comparator threshold calibration for all 8 sub-ADCs showing -0.97/+1.38 LSB and -1.6/+1.38 LSB respectively.

#### 5.1.2.2 Time-Interleaved ADC Performance

After analog-only correction using the VGA and comparator offset calibration, residual gain and offset mismatch between the 8 sub-ADCs of 2.2% and 2.18LSB respectively are corrected off-chip using a single-tone low-frequency calibration step. The skew is calibrated by adjusting the duty cycle corrector (DCC) code (250fs resolution) to maximize the SNDR at the Nyquist frequency of $\approx$16GHz. Figure 5.8a shows the resulting SNDR as a function of DCC code, with a duty cycle distortion of 4.8% observed and the skew tone reduced to -45dBFS after correction. The SNDR as a function of input amplitude (relative to the FSR of 400mV<sub>ppd</sub>) is shown in Figure 5.8b, showing compression when the input reaches around -2dBFS.
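To give a feel for why sub-picosecond timing control matters at a $\approx$16GHz input, the well-known first-order bound $\mathrm{SNDR} = -20\log_{10}(2\pi f_{in}\sigma_t)$ for a full-scale sine sampled with an rms timing error $\sigma_t$ can be evaluated. It follows from the same "amplitude error = timing error × dx/dt" approximation used in the model of Appendix A. The sketch below is illustrative only: the 250fs value simply reuses the DCC step size as a hypothetical rms timing error and is not a measured residual skew, and the function names are assumptions.

```python
import math


def timing_limited_sndr_db(f_in_hz, sigma_t_s):
    """SNDR ceiling (dB) of a full-scale sine at f_in_hz when the only
    error source is an rms sampling-time error of sigma_t_s seconds."""
    return -20.0 * math.log10(2.0 * math.pi * f_in_hz * sigma_t_s)


def enob_bits(sndr_db):
    """Standard SNDR-to-ENOB conversion for a sine-wave input."""
    return (sndr_db - 1.76) / 6.02


# Hypothetical example: 250fs rms timing error at a 16GHz input
sndr_250fs = timing_limited_sndr_db(16e9, 250e-15)
# Worst-case sub-ADC SNDR at Nyquist reported in Section 5.1.2.1
enob_worst_sub_adc = enob_bits(27.71)
```

Under these assumptions a 250fs rms error alone would cap the SNDR at about 32dB, and the 27.71dB worst-case sub-ADC SNDR converts to roughly 4.31 bits.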

The spectrum for the Nyquist frequency input is shown in Figure 5.9a, showing an SNDR of 27.8dB and an SFDR of 36.6dB (limited by the 3rd harmonic). Figure 5.9b shows SNDR/SFDR versus frequency for input full-scale ranges (FSR) of $400mV_{ppd}$ and $500mV_{ppd}$, and sampling frequencies of 30GS/s and 31.1875GS/s. At higher FSR, the high-frequency linearity is degraded; however, the relative error due to DNL/INL and noise is reduced, thus keeping the SNDR relatively constant. Note that since there is no easy way to de-embed the CTLE, as it is combined with the sampler, all these measurements include the effects of the front-end T&H.



Figure 5.6: SNDR/SFDR of all 8 sub-ADCs: worst and best SNDR at Nyquist frequency is 27.71dB and 32.47dB respectively.



Figure 5.7: Effect of resetting folder output to reduce ISI leading to a 6.6dB improvement in sub-ADC SFDR (2048 point FFT).



Figure 5.8: SNDR/SFDR vs DCC code and input amplitude for Nyquist frequency input of 32GS/s 8-way TI ADC.



(a) Spectrum FSR =  $500mV_{ppd}$  at 31.1875GS/s (b) SNDR/SFDR vs FSR and Sampling Frequency

Figure 5.9: Spectrum (16k point FFT) and SNDR/SFDR for different FSR and  $f_s$  of 8-way TI ADC.

# 5.2 Receiver Performance

Figure 5.10a shows the performance of the CTLE sampler for all degeneration settings. The sampler exhibits a maximum boost of $\approx$6dB at the Nyquist frequency of 16GHz, with the 2nd pole over 20GHz. The RX noise is estimated by observing the code histogram of the ADC output with zero input. This measurement may be pessimistic due to DNL influence. Figure 5.10b shows the noise as a function of the CTLE boost (degeneration capacitor setting, with the degeneration resistor set to provide maximum boost); it reaches a maximum value of $3.85mV_{rms}$ when the maximum peaking is activated over the broadest frequency range.

Figure 5.10: CTLE gain relative to flat setting for all degeneration settings and RX noise as function of degeneration capacitance at maximum boost.

Unfortunately, the FFE performance is poor due to several design choices. First, the implementation of the degeneration using a triode MOSFET with an analog control voltage causes its value to be very sensitive to the input common mode of the VGA/FFE. This common mode is hard to stabilize given the switching front-end T&H; as a result, the lowest measured gain range out of the 8 sub-ADCs is only 2dB. Since the secondary branch for the FFE shares the same degeneration setting, the FFE can only cancel a large post-cursor value (anywhere from equal to the main cursor down to 2dB below it). This may be fine for a high loss channel; however, due to the use of a resistive load, the gain in FFE mode is reduced by 6dB, meaning the swing of the input is reduced even further. Furthermore, the main cursor is automatically set to maximum in FFE mode and cannot be tuned to compensate for gain mismatch. Given these limitations, the FFE has difficulty even for NRZ/2-PAM input, as shown in Figure 5.11 for a 21.8dB loss channel at a data-rate of 32.1875Gb/s. The eye diagrams were constructed by sweeping the transmitter clock phase through the phase-interpolator (PI) code (resolution $\approx$1ps). Since 4-PAM shrinks the amplitude even further, the FFE was not used for 4-PAM input. To remedy these issues in future work, separate main and post-cursor controls would be needed, in addition to a digitally controlled degeneration implemented in slices.

Figure 5.11: Eye diagrams illustrating FFE performance for 2-PAM input for a 21.8dB channel at 32.1875Gb/s.

In order to perform link measurements for 4-PAM input, channels with varying insertion loss (IL) at Nyquist frequency were used, as shown in Figure 5.12a. The shortest channel (IL = -8.6dB), channel A, is a direct loop-back consisting of a cable, the package and PCB loss on both the TX and RX side. Channels B, C, and D use an off-chip test PCB to increase the loss to 13.5dB, 21.7dB, and 29.5dB respectively, as shown in the setup in Figure 5.12b. A PRBS15 data pattern (single stream, one bit mapped to the 4-PAM MSB, the next to the 4-PAM LSB and so on) is used for all measurements. For channel A, the TX FFE is set to [-13.6%, 77.3%, -9.1%] and the CTLE is set to provide a boost of 5.6dB with a DC gain of -1.5dB. The RX-AFE is configured with an FSR of $500mV_{ppd}$. The eye diagram is open at the output of the ADC as shown in Figure 5.13a, indicating that the ADC can be configured as a slicer with only 2 comparators active per sub-ADC (due to folding). The ADC in slicer mode achieves a BER < $10^{-6}$ while consuming 100mW (excluding de-serializer, PLL, CDR). For channel C, the eye is completely closed at the output of the ADC as shown in Figure 5.13b, even with the TX FFE set to [-13.6%, 63.7%, -22.7%] and the CTLE set to a maximum boost of 6dB. An 8-tap FFE (2 pre/5 post-taps) in software is used to open the eye at a BER < $10^{-6}$ with the ADC in 6-bit mode, which consumes 283.9mW.

For channel D with the highest loss, with the same TX FFE and CTLE settings, the FFE filter length is increased in order to open the eye at a BER  $< 10^{-4}$ . The BER floor is caused by metastability in the memory interface and LSB comparator, RX noise and residual long tail ISI that needs a higher boost and stronger shaping CTLE to cancel. Figure 5.14a shows the autocorrelation of the ADC output samples indicating that there is correlation even at a lag of 30. The FFE loses its effectiveness as the tap length increases since it also amplifies RX thermal and quantization noise [125]. The BER reaches a value  $< 10^{-4}$  and plateaus after a total length of 11 taps (including main cursor) as can be seen in Figure 5.14b. The overall bathtub curves, shown in Figure 5.15 for channels A, C & D, are generated by finding and freezing the optimal TX-FFE, CTLE, and DSP EQ coefficients at a chosen sampling phase and then sweeping the TX phase-interpolator (PI) to cover the entire UI in the same way as the eye diagrams.
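The autocorrelation used above as an indicator of residual ISI can be computed as in the following sketch. It is a toy illustration only: the two-tap channel `[1.0, 0.5]`, the random 4-PAM symbol stream and the function name are assumptions, and the measured channel D response is of course much longer.

```python
import random


def normalized_autocorrelation(samples, max_lag):
    """Autocorrelation of a sample sequence for lags 0..max_lag,
    normalized so that the lag-0 value is 1."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    r0 = sum(c * c for c in centered)
    return [sum(centered[i] * centered[i + lag] for i in range(n - lag)) / r0
            for lag in range(max_lag + 1)]


# Hypothetical link: random 4-PAM symbols through a 2-tap ISI channel
rng = random.Random(1)
symbols = [rng.choice((-3, -1, 1, 3)) for _ in range(20000)]
channel = [1.0, 0.5]
received = [sum(h * symbols[n - k] for k, h in enumerate(channel) if n - k >= 0)
            for n in range(len(symbols))]
rho = normalized_autocorrelation(received, 5)
```

For this toy channel the expected lag-1 value is $0.5/(1+0.5^2) = 0.4$ and the correlation vanishes beyond the channel length, so a slowly decaying autocorrelation such as the one in Figure 5.14a points to ISI spanning many symbols and hence to a long equalizer.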

Figure 5.12: Test set-up and insertion loss for links used in 4-PAM measurements.

Figure 5.13: ADC output eye diagrams ( $\approx$ 52K samples) without DSP EQ, for (a) channel A (IL= -8.6dB) showing an open eye and (b) channel C (IL = -21.7dB) showing a closed eye, with TX FFE and CTLE at optimal settings.

Figure 5.14: Autocorrelation of ADC output indicating a high number of FFE taps is needed for a 30dB loss channel, and resulting BER as a function of FFE tap length.

Figure 5.15: Bathtub curves for channels A, C and D generated by sweeping TX phase-interpolator with all EQ coefficients frozen.

In order to see if non-linearity plays a major role in limiting the BER, the data-rate was lowered to 56Gb/s for channel D and a digital correction filter was implemented off-chip for each sub-ADC. These correction filters, shown in Figure 5.16a, have the same structure as those implemented in [126] and are tuned using the same error generated by the LMS loop as the FFE coefficients. The coefficients $\vec{k}$ are separate for each sub-ADC and scale $[xx', x'^2, x^2, x^3, x^2x', xx'^2]$, where *x* is the output of the ADC and *x'* is its derivative. Here the derivative is generated simply using adjacent sub-ADC samples, i.e. $x'[n] = x[n] - x[n-1]$, rather than a complex interpolation filter. Figure 5.16b shows the resulting bathtub with an 8-tap FFE EQ before and after non-linearity correction. Clearly, for this prototype, non-linearity does not play a large role in BER degradation, as the improvement is minimal.
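The structure of the correction described above can be sketched as follows. The basis terms and the first-difference derivative are taken from the text; everything else is an assumption made for illustration, in particular that the weighted terms are added to the sample, that sample $n$ of the interleaved stream belongs to sub-ADC $n \bmod 8$, and the function names. The LMS adaptation of the coefficients is omitted.

```python
def basis_terms(x, dx):
    """Non-linear basis [x*x', x'^2, x^2, x^3, x^2*x', x*x'^2]."""
    return (x * dx, dx * dx, x * x, x ** 3, x * x * dx, x * dx * dx)


def correct_nonlinearity(stream, coeffs, n_sub_adcs=8):
    """Apply a per-sub-ADC polynomial correction to an interleaved stream.

    stream: ADC output samples in time order
    coeffs: one 6-element coefficient vector per sub-ADC
    """
    out = []
    for n, x in enumerate(stream):
        # Derivative approximated from the previous sample, which comes
        # from the adjacent sub-ADC: x'[n] = x[n] - x[n-1]
        dx = x - stream[n - 1] if n > 0 else 0.0
        k = coeffs[n % n_sub_adcs]
        out.append(x + sum(ki * ti for ki, ti in zip(k, basis_terms(x, dx))))
    return out


# Hypothetical coefficients: only the x*x' term of sub-ADC 1 is non-zero
zero_coeffs = [[0.0] * 6 for _ in range(8)]
demo_coeffs = [[0.0] * 6 for _ in range(8)]
demo_coeffs[1][0] = 0.1
demo_out = correct_nonlinearity([1.0, 2.0], demo_coeffs)
```

With all coefficients at zero the stream passes through unchanged, which corresponds to the "before correction" curve of Figure 5.16b.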

## 5.2.1 Greedy Search

Channel B is used to investigate greedy search based power scaling. The TX FFE and RX CTLE are set to the optimal settings. The ADC is configured and calibrated for full 6-bit operation with a software 6-tap FFE and power-scaled for a target BER of $10^{-5}$ (to save testing time reading ADC memory from chip). Figure 5.17a shows the experimental result, with active comparators indicated by solid circles and inactive comparators indicated in white. After iteration 18, the BER starts increasing dramatically above $10^{-5}$ and the search can be terminated. The threshold levels at iteration 16 correspond to the same number of levels as a uniform 5-bit ADC and are used for comparison purposes. These levels are illustrated in Figure 5.17b with the ADC output PDF (eye) plotted in conjunction. The BER obtained using greedy search thresholds is almost an order of magnitude better: $4.2 \cdot 10^{-6}$ compared to $2.8 \cdot 10^{-5}$ obtained using a 5-bit ADC configuration with uniformly-spaced threshold levels.

Figure 5.16: Non-linearity correction filter implementation and resulting BER bathtub before and after filter is turned on.

A link with similar loss profile to channel B was then used to see the performance of the algorithm for various digital equalizer structures. Figure 5.18 shows the 5-bit non-uniform levels chosen in the case of a 6-tap FFE, 1-tap DFE, and 2-tap DFE following the ADC. For the DFE cases, the levels are more concentrated towards the center of the eye as expected. Clearly, the greedy search is able to co-optimize the equalizer architecture and ADC quantization levels.


(b) Resulting Thresholds

Figure 5.17: Link power scaling with greedy search illustrating non-uniform quantization selection: optimal 5-bit non-uniform levels shown in blue and ADC output eye PDF shown in black at optimal sampling point
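The control flow of the greedy search used in these experiments can be sketched as below. This is a simplified illustration, not the measurement code: it assumes one comparator is disabled per iteration and that the search stops as soon as every remaining removal would exceed the target BER. The callback `measure_ber` stands in for the real procedure (reconfigure the ADC, re-adapt the equalizer, measure the BER), and the toy BER model with three hypothetical "critical" comparators exists only to make the example runnable.

```python
def greedy_prune(n_comparators, measure_ber, target_ber):
    """Disable comparators one at a time, each time removing the one whose
    removal gives the lowest BER, until the target BER would be exceeded.

    Returns the set of comparators left active and the removal history as
    (comparator index, BER after removal) pairs.
    """
    active = set(range(n_comparators))
    history = []
    while len(active) > 1:
        # One BER trial per remaining comparator
        trial = {c: measure_ber(active - {c}) for c in active}
        best = min(trial, key=lambda c: (trial[c], c))
        if trial[best] > target_ber:
            break
        active.remove(best)
        history.append((best, trial[best]))
    return active, history


def make_toy_ber(importance, base_ber):
    """Hypothetical BER model: every disabled comparator degrades the BER
    by 10**importance[c]. A stand-in for a real link measurement."""
    def measure(active):
        penalty = sum(w for c, w in enumerate(importance) if c not in active)
        return base_ber * 10.0 ** penalty
    return measure


# 31 LSB comparators, three of which (assumed) matter most for the BER
importance = [1.5 if c in (7, 15, 23) else 0.05 for c in range(31)]
active, history = greedy_prune(31, make_toy_ber(importance, 1e-7), 1e-5)
```

In this toy run the 28 low-impact comparators are removed first and the search stops with only the three critical ones active. In a real link the equalizer is retrained at every trial, which is what lets the search adapt the chosen levels to the equalizer structure.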

Figure 5.18: Illustration of 5-bit non-uniform level selection using greedy search for different equalizer structures.

Note that in a real link environment, periodic recalibration may be required to combat threshold drift due to temperature, channel environment changes and other effects. In terms of convergence time, if the feedback loop is implemented on-chip, assuming the BER estimate requires 1e9 bits and the equalizer training requires 1e6 bits, the total time for greedy search in this case would be approximately 6 seconds at this data-rate. Therefore any impairment with a time constant larger than this can be captured with recalibration. The number of bits required for the BER estimate of course depends on the BER itself and is greatly reduced by the inclusion of FEC in 4-PAM standards. For an application with a raw BER of 1e-12, a proxy for the BER such as eye-height may be required to lessen the convergence time.
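The text does not spell out how the figure of approximately 6 seconds is obtained. One reading that reproduces it is sketched below, under the following assumptions: 31 LSB comparators per sub-ADC (248 across 8 sub-ADCs), 18 greedy iterations as in the channel B experiment, one BER trial per remaining comparator at each iteration, and 1e9 + 1e6 bits per trial at 64.375Gb/s. The function names are hypothetical.

```python
def greedy_trials(n_comparators, n_iterations):
    """Total BER trials if iteration i evaluates every comparator still
    active (n_comparators - i candidates)."""
    return sum(n_comparators - i for i in range(n_iterations))


def greedy_search_time_s(n_comparators, n_iterations, bits_per_trial,
                         data_rate_bps):
    """Wall-clock time of the search when it is limited by the number of
    received bits needed per trial."""
    trials = greedy_trials(n_comparators, n_iterations)
    return trials * bits_per_trial / data_rate_bps


trials = greedy_trials(31, 18)
search_time = greedy_search_time_s(31, 18, 1e9 + 1e6, 64.375e9)
```

This gives 405 trials and a little over 6 seconds, consistent with the estimate above. The result scales linearly with the number of bits per BER estimate, which is why a lower target BER quickly makes a direct BER metric impractical.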

#### 5.3 Comparison

Table 5.1 compares the performance of this receiver to other 4-PAM ADC-based receivers operating above 50Gb/s. The TX performance is also included to highlight the parameters used in the link measurements. The RX in this work is the only one which enables non-uniform quantization based power scaling at a data-rate >50Gb/s with a novel greedy search approach. It allows for more power savings at low channel loss compared to other works; for instance, the work by Upadhyaya published in ISSCC 2018 achieves only $\approx$10% power reduction when link loss is decreased by more than 20dB. In this work, an RX-AFE power saving of 64.8% is obtained when the link loss is reduced by 21dB. The power scaling property of the RX-AFE in this work is a result of the flash architecture, as shown in Figure 5.19; however, it is even more granular than simple integer resolution scaling thanks to greedy search. Note that the recent work presented in VLSI 2018 by Hudner [127] is the fastest 4-PAM CMOS receiver ever published at 112Gb/s, achieving a BER of $2 \cdot 10^{-5}$ using a 7-bit SAR ADC with a 31-tap FFE and 1-tap DFE equalizer for a 20dB loss channel. The power consumption of the RX-AFE is 590mW, which means the total power consumption of the link including DSP, PLL, CDR, and TX would be over 1W for a single lane. Considering its SAR architecture, the power scaling capability of the RX-AFE will be poor, and the work clearly illustrates the need for link optimization moving forward depending on the distribution of channel loss in a link environment.
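As a quick check, the quoted saving follows directly from the RX AFE power entries of Table 5.1 for this work (100mW at 8.6dB and 283.9mW at 29.5dB):

$$\text{saving} = \frac{283.9 - 100}{283.9} \approx 64.8\%, \qquad \Delta\text{loss} = 29.5\,\text{dB} - 8.6\,\text{dB} = 20.9\,\text{dB} \approx 21\,\text{dB}$$

Applying the same arithmetic to the Upadhyaya entries (215mW at 7.4dB and 245mW at 32dB) gives a reduction of $(245 - 215)/245 \approx 12\%$ over a 24.6dB change in loss, in line with the $\approx$10% figure quoted above.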

|                                  | Hudner (VLSI 2018) | Gopalakrishnan (ISSCC 2016) | Frans (VLSI 2016) | Upadhyaya (ISSCC 2018)  | This work                   |
|----------------------------------|--------------------|-----------------------------|-------------------|-------------------------|-----------------------------|
| Data Rate (Gb/s)                 | 112                | 50                          | 56                | 56                      | 64.375                      |
| RX ADC Res (bits)                | 7                  | 7                           | 8                 | 7                       | 6                           |
| RX AFE ENOB at Nyquist (bits)    | -                  | 4.9**                       | 4.9               | 4.43                    | 4.31                        |
| RX AFE Power (mW)                | 590 @ 20dB         | -                           | 370*              | 215 @ 7.4dB, 245 @ 32dB | 100 @ 8.6dB, 283.9 @ 29.5dB |
| TX Swing (V <sub>pp-diff</sub> ) | 0.4                | 1.4                         | 1.2               | 1                       | 1.1                         |
| TX RJ (fs <sub>rms</sub> )       | -                  | 240                         | 200               | 180                     | 162                         |
| TX RLM (%)                       | -                  | -                           | 97                | 98                      | 99                          |
| TX Power (mW)                    | -                  | -                           | 140               | 80.08                   | 89.7                        |
| Supplies (V)                     | 0.9/1.2/1.8        | 0.9/1.2                     | 0.9/1.2/1.8       | 0.85/0.9/1.2/1.8        | 0.9/1.2                     |
| Chip Area (mm <sup>2</sup> )     | 0.674              | 30.87***                    | 2.8               | 8.81****                | TX: 0.09, RX: 0.1625        |
| Technology                       | 16nm FinFET        | 28nm                        | 16nm FinFET       | 16nm FinFET             | 16nm FinFET                 |

Table 5.1: Comparison with Other ADC-Based 4-PAM Receivers >50Gb/s

\*Including retimer, \*\*at 10GHz in figure

\*\*\*area including 2 RX/TX/PLL & DSP, \*\*\*\*area including 4 RX/TX/PLL & DSP



Figure 5.19: RX-AFE power scaling with resolution for prototype 64Gb/s ADC-based receiver.

## Chapter 6

## Conclusion

#### 6.1 Summary

Chapter 1 motivated the need for a power-scalable solution in link transceiver design to improve the energy efficiency of data-centres. Chapter 2 presented the need for 4-PAM modulation as the industry moves to data-rates of 100Gb/s and beyond. Key receiver equalization architectures were introduced, and the mixed-signal and ADC-based receivers were differentiated. ADC-based receivers were seen to be needed to satisfy BER requirements for 4-PAM transmission across channels with 20dB or more of loss at Nyquist frequency and $1mV_{rms}$ or more of crosstalk. Relevant ADC architectures were discussed and current power-scaling efforts were presented with their advantages and disadvantages. Chapter 3 discussed the proposed greedy search power scaling solution and its application to 4-PAM. A link system model including dynamic effects was constructed, and simulations show a need for a maximum ADC resolution of 6 bits for 20dB loss channels, the space occupied by the majority of links. In Chapter 4, the circuit design of the prototype 6-bit ADC-based receiver was presented. The RX-AFE was simplified by combining the functionality of the sampler and the CTLE, and the folding flash architecture allows for aggressive power-scaling and non-uniform quantization enabled by greedy search. Chapter 5 showed the measurement results of the prototype, demonstrating its fidelity in link environments with channel losses ranging from 8dB to 30dB, along with its strong power-scaling capability. To summarize, the contributions outlined in this thesis are as follows:

1. A 64.375Gb/s 4-PAM ADC-based receiver was implemented in TSMC 16nm FinFET CMOS and measured to achieve state of the art performance. It was presented in [41] and [128]. This receiver features several key components:
   - A combined half-rate CTLE sampler for moderate equalization of medium loss channels achieving 6dB boost at 16GHz.
   - A proposed non-uniform quantization scheme according to BER metric through greedy search for power-scaling, achieving an order of magnitude improvement in BER compared to uniform quantization for a 13.5dB channel.
   - A folding flash architecture which takes advantage of vertical eye symmetry to speed up optimal quantization level selection and allows for aggressive power scaling.

   Overall the RX is capable of achieving a BER < 10<sup>-6</sup> at 100mW for an 8.6dB channel and a BER < 10<sup>-4</sup> at 283.9mW for a 29.5dB channel.
2. A separate 4GS/s sub-ADC prototype was presented at ISCAS in May 2017 and published in TCAS-II [129], achieving 300fJ/conv-step. It was also designed and fabricated in a 16nm FinFET CMOS process and is one of the earliest circuit design works in this technology to be published by an academic institution.
3. A link system model including the dynamic effects of skew, jitter and non-linearity for architecture development of ADC-based receivers, and a subsequent application of non-linearity correction to the ADC prototype to improve link performance.

#### 6.2 Future Work

Many exciting research opportunities exist for wireline links given their rapid growth in data-rate and the associated challenges. In connection to this work, there are various avenues still to be explored. The first would be to expand this work to take advantage of finer threshold level selection as seen in prior art [101]. One limitation of this work is that since the quantizer starts from the maximum resolution (with uniform levels) and levels are removed, it is not possible to achieve a BER better than that achieved by the quantizer at its full resolution. Since threshold levels can usually be moved with a finer precision, due to the inherent need for offset compensation to combat circuit variation/mismatch, it is possible to improve the BER even further by considering this space. In this case, the search space becomes very large: imagine a 5-bit flash ADC with 31 comparators, with each comparator threshold moving at a step of 1/4 LSB. The number of combinations when scaling from 5 to 4 bits would then be 124 choose 15, or approximately 8e18. The emerging field of machine learning and artificial intelligence may help address this search space. In addition, as shown in Chapter 3, greedy search is a sub-optimal approach. There may be an improvement from operating the search in a combined manner: for instance, trying all combinations of removing 2 levels at a time and then, once the optimum is found, proceeding with greedy search as usual to remove additional levels. This would allow some trade-off between speed and performance. There may also be advantages in investigating quantizer optimization based on other information metrics aside from BER, such as mutual information and entropy, to improve search speed.
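The size of the search space quoted above is easy to verify. The short sketch below evaluates the binomial coefficient for the example in the text (31 comparators, 4 threshold positions per LSB, 15 thresholds kept) and, for comparison, the same count when thresholds stay on the uniform 1 LSB grid; the function name is an assumption for illustration.

```python
import math


def threshold_combinations(n_comparators, steps_per_lsb, n_kept):
    """Number of ways to choose n_kept threshold positions out of all the
    positions available to n_comparators at the given step granularity."""
    return math.comb(n_comparators * steps_per_lsb, n_kept)


fine_grid = threshold_combinations(31, 4, 15)    # 124 choose 15
coarse_grid = threshold_combinations(31, 1, 15)  # 31 choose 15
```

The fine grid gives roughly 8e18 combinations against about 3e8 on the uniform grid, which illustrates why an exhaustive search is out of reach and why a greedy or learned strategy is attractive.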

One could also apply a similar BER-based strategy to timing recovery. BER-directed CDR tuning has been seen in [130] for 2-PAM; however, 4-PAM offers unique challenges in CDR design. In terms of receiver architecture, expanding this work to handle 30dB loss channels with higher margins is necessary to complete it for use across all channels. Other hybrid ADC architectures provided in Section 2.4 may be of relevance, where a combination of flash and SAR-based designs may yield the best partitioning for power-scaling. Finally, the goal of a digital-friendly architecture cannot be overstated, as CMOS technology scaling has mostly stopped benefiting analog performance while significantly improving digital performance. For instance, the power saving in digital logic going from 16nm to 7nm is approximately 50%. This enables the DSP to be extremely powerful, potentially allowing correction of ADC imperfections previously constrained by analog design, such as dynamic non-linearity and other memory effects, which may impact future wireline links at higher modulation formats and data-rates. In addition, investigation of ADC-based receiver architectures in conjunction with MLSD may be necessary, potentially moving wireline design towards a system reminiscent of long-haul optical links today. Work such as [131], which combines a digitally enabled ADC architecture (stochastic flash, where potentially digital P&R can be used to place hundreds of comparators) and MLSD, may be of relevance in the near future.

## Appendix A

# Link Simulation Model for ADC-Based Receivers

This appendix describes the time-domain system model used to simulate the ADC-based receiver in Section 3.2. The code structure is shown in Figure A.1. By generating derivative information, dynamic effects including jitter, skew and dynamic non-linearity can be modelled. This is done at the sub-ADC level using *subADC\_macro\_generic.m*, which is also presented below in Figure A.2.



Figure A.1: System model code structure illustration.

```
function [out_sig] = subADC_macro_generic(in_sig,in_sig_diff,h3mat,vga_gain,vga_off,tlvl,qlvl,sigma_noise,skew,sigma_jitter)
%Model of generic subADC
%inputs:
% in_sig: input signal to be quantized
% in_sig_diff: derivative of input signal
% h3mat: nonlinearity matrix terms x^3, x^2*x', x*x'^2, x'^3
% vga_gain, vga_off: gain and offset of VGA
% tlvl, qlvl: threshold and quantization levels of ADC
% sigma_noise: Gaussian noise sigma
% skew: skew value, amplitude error is skew*dx/dt
% sigma_jitter: Gaussian jitter sigma
%outputs
% out_sig: output of the subADC channel
dterr = skew + sigma_jitter.*randn(size(in_sig));
%Apply skew, jitter & nonlinearity
err_nonlin = h3mat(1).*in_sig.^3 + h3mat(2).*in_sig.^2.*in_sig_diff + h3mat(3).*in_sig.*in_sig_diff.^2 + h3mat(4).*in_sig_diff.^3;
out_sig = in_sig + in_sig_diff.*dterr + err_nonlin;
%Models VGA gain offset, ADC as (possibly) non-uniform quantizer
out_sig = vga_gain.*out_sig + vga_off + sigma_noise.*randn(size(out_sig));
%quantize the rest of the bits using 2^(numbits)-1 comparators
for i = 1:numel(in_sig)
  comp_out = out_sig(i) > tlvl;
  out_sig(i) = qlvl(sum(comp_out)+1);
end
end
```

Figure A.2: MATLAB function *subADC\_macro\_generic.m* incorporating dynamic effects.

## **Bibliography**

- [1] Cisco, "The Zettabyte Era: Trends and Analysis," 2017. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/vni-hyperconnectivity-wp.html
- [2] Google, "Google Data Centers," 2018. [Online]. Available: https://www.google.com/about/datacenters/
- [3] Optical Internetworking Forum, "IA Title: Common Electrical I/O (CEI) Electrical and Jitter Interoperability agreements for 6G+ bps, 11G+ bps and 25G+ bps and 56G+ bps," Optical Internetworking Forum, Fremont, CA, Tech. Rep., 2017. [Online]. Available: http://www.oiforum.com/wp-content/uploads/OIF-CEI-04.0.pdf
- [4] A. Shehabi, S. Smith, D. Sartor, R. Brown, M. Herrlin, J. Koomey, E. Masanet, N. Horner, I. Azevedo, and W. Lintner, "United States Data Center Energy Usage Report," Ernest Orlando Lawrence Berkeley National Laboratory, Tech. Rep., 2016. [Online]. Available: http://eta-publications.lbl.gov/sites/default/files/lbnl-1005775_v2.pdf
- [5] B. Cutler, S. Fowers, J. Kramer, and E. Peterson, "Want an Energy-Efficient Data Center? Build It Underwater," *IEEE Spectrum*, 2017. [Online]. Available: http://spectrum.ieee.org/computing/hardware/want-an-energyefficient-data-center-build-it-underwater
- [6] P. Upadhyaya, A. Bekele, D. T. Melek, H. Zhao, J. Im, J. Cho, K. H. Tan, S. McLeod, S. Chen, W. Zhang, Y. Frans, and K. Chang, "A Fully-Adaptive Wideband 0.5-32.75Gb/s FPGA Transceiver in 16nm FinFET CMOS Technology," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2016.
- [7] G. Gangasani, J. F. Bulzacchelli, M. Wielgos, W. Kelly, V. Sharma, A. Prati, G. Cervelli, D. Gardellini, M. Baecher, M. Shannon, T. Beukema, J. Garlett, H. H. Xu, T. Toifl, M. Meghelli, J. Ewen, and D. Storaska, "A 28.05Gb/s Transceiver using Quarter-Rate Triple-Speculation Hybrid-DFE Receiver with Calibrated Sampling Phases in 32nm CMOS," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2017, pp. 326–327.

- [8] H. Miyaoka, F. Terasawa, M. Kudo, H. Kano, A. Matsuda, N. Shirai, S. Kawai, T. Shibasaki, T. Danjo, Y. Ogata, Y. Sakai, H. Yamaguchi, T. Mori, Y. Koyanagi, H. Tamura, Y. Ide, K. Terashima, H. Higashi, T. Higuchi, and N. Naka, "A 28.3 Gb/s 7.3 pJ/bit 35 dB Backplane Transceiver with Eye Sampling Phase Adaptation in 28 nm CMOS," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2016.
- [9] Y. Frans, M. Elzeftawi, H. Hedayati, J. Im, V. Kireev, T. Pham, J. Shin, P. Upadhyaya, L. Zhou, S. Asuncion, C. Borrelli, G. Zhang, H. Zhang, and K. Chang, "A 56-Gb/s PAM4 Wireline Transceiver Using a 32-Way Time-Interleaved SAR ADC in 16-nm Fin-FET," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2016.
- [10] H. Kimura, P. Aziz, T. Jing, A. Sinha, R. Narayan, H. Gao, P. Jing, G. Hom, A. Liang, E. Zhang, A. Kadkol, R. Kothari, G. Chan, Y. Sun, B. Ge, J. Zeng, K. Ling, M. Wang, A. Malipatil, S. Kotagiri, L. Li, C. Abel, and F. Zhong, "28Gb/s 560mW Multi-Standard SerDes with Single-Stage Analog Front-End and 14-Tap Decision-Feedback Equalizer in 28nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2014, pp. 38–39.
- [11] B. Zhang, K. Khanoyan, H. Hatamkhani, H. Tong, K. Hu, S. Fallahi, K. Vakilian, and A. Brewster, "A 28Gb/s Multi-Standard Serial-Link Transceiver for Backplane Applications in 28nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2015, pp. 52–53.
- [12] J. Jaussi, G. Balamurugan, S. Hyvonen, T. C. Hsueh, T. Musah, G. Keskin, S. Shekhar, J. Kennedy, S. Sen, R. Inti, M. Mansuri, M. Leddige, B. Horine, C. Roberts, R. Mooney, and B. Casper, "A 205mW 32Gb/s 3-Tap FFE/6-tap DFE Bidirectional Serial Link in 22nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2014, pp. 440–441.
- [13] V. Balan, O. Oluwole, G. Kodani, C. Zhong, S. Maheswari, R. Dadi, A. Amin, G. Bhatia, P. Mills, A. Ragab, and E. Lee, "A 130mW 20Gb/s Half-Duplex Serial Link in 28nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2014, pp. 438–439.
- [14] M. Mansuri, J. E. Jaussi, J. T. Kennedy, T. C. Hsueh, S. Shekhar, G. Balamurugan, F. O'Mahony, C. Roberts, R. Mooney, and B. Casper, "A Scalable 0.128-to-1Tb/s 0.8-to-2.6pJ/b 64-Lane Parallel I/O in 32nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2013, pp. 402–403.

- [15] B. Raghavan, D. Cui, U. Singh, H. Maarefi, D. Pi, A. Vasani, Z. Huang, A. Momtaz, and J. Cao, "A Sub-2W 39.8-to-44.6Gb/s Transmitter and Receiver Chipset with SFI-5.2 Interface in 40nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2013, pp. 32–33.
- [16] J. F. Bulzacchelli, C. Menolfi, T. J. Beukema, D. W. Storaska, J. Hertle, D. R. Hanson, P. H. Hsieh, S. V. Rylov, D. Furrer, D. Gardellini, A. Prati, T. Morf, V. Sharma, R. Kelkar, H. A. Ainspan, W. R. Kelly, L. R. Chieco, G. A. Ritter, J. A. Sorice, J. D. Garlett, R. Callan, M. Brandli, P. Buchmann, M. Kossel, T. Toifl, and D. J. Friedman, "A 28-Gb/s 4-Tap FFE/15-Tap DFE Serial Link Transceiver in 32-nm SOI CMOS Technology," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2012, pp. 324–325.
- [17] A. Ramachandran, A. Natarajan, and T. Anand, "A 16Gb/s 3.6pJ/b Wireline Transceiver with Phase Domain Equalization Scheme: Integrated Pulse Width Modulation (iPWM) in 65nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2017, pp. 488–489.
- [18] J. W. Poulton, W. J. Dally, X. Chen, J. G. Eyles, T. H. Greer, S. G. Tell, and C. T. Gray, "A 0.54pJ/b 20Gb/s Ground-Referenced Single-Ended Short-Haul Serial Link in 28nm CMOS for Advanced Packaging Applications," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2013, pp. 404–405.
- [19] P. J. Peng, J. F. Li, L. Y. Chen, and J. Lee, "A 56Gb/s PAM-4/NRZ Transceiver in 40nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2017, pp. 110–111.
- [20] J. Han, Y. Lu, N. Sutardja, and E. Alon, "A 60Gb/s 288mW NRZ Transceiver with Adaptive Equalization and Baud-Rate Clock and Data Recovery in 65nm CMOS Technology," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2017, pp. 112–113.
- [21] K. Gopalakrishnan, A. Ren, A. Tan, A. Farhood, A. Tiruvur, B. Helal, C. F. Loi, C. Jiang, H. Cirit, I. Quek, J. Riani, J. Gorecki, J. Wu, J. Pernillo, L. Tse, M. Le, M. Ranjbar, P. S. Wong, P. Khandelwal, R. Narayanan, R. Mohanavelu, S. Herlekar, S. Bhoja, and V. Shvydun, "A 40/50/100Gb/s PAM-4 Ethernet Transceiver in 28nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2016, pp. 62–63.

- [22] T. Shibasaki, T. Danjo, Y. Ogata, Y. Sakai, H. Miyaoka, F. Terasawa, M. Kudo, H. Kano, A. Matsuda, S. Kawai, T. Arai, H. Higashi, N. Naka, H. Yamaguchi, T. Mori, Y. Koyanagi, and H. Tamura, "A 56Gb/s NRZ-Electrical 247mW/lane Serial-Link Transceiver in 28nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2016, pp. 64–65.
- [23] T. Norimatsu, T. Kawamoto, K. Kogo, N. Kohmu, F. Yuki, N. Nakajima, T. Muto, J. Nasu, T. Komori, H. Koba, T. Usugi, T. Hokari, T. Kawamata, Y. Ito, S. Umai, M. Tsuge, T. Yamashita, M. Hasegawa, and K. Higeta, "A 25Gb/s Multistandard Serial Link Transceiver for 50dB-Loss Copper Cable in 28nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2016, pp. 60–61.
- [24] T. Kawamoto, T. Norimatsu, K. Kogo, F. Yuki, N. Nakajima, M. Tsuge, T. Usugi, T. Hokari, H. Koba, T. Komori, J. Nasu, T. Kawamata, Y. Ito, S. Umai, J. Kumazawa, H. Kurahashi, T. Muto, T. Yamashita, M. Hasegawa, and K. Higeta, "Multi-Standard 185fsrms 0.3-to-28Gb/s 40dB Backplane Signal Conditioner with Adaptive Pattern-Match 36-Tap DFE and Data-Rate-Adjustment PLL in 28nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2015, pp. 54–55.
- [25] P. Upadhyaya, J. Savoj, F. T. An, A. Bekele, A. Jose, B. Xu, D. Wu, D. Turker, H. Aslanzadeh, H. Hedayati, J. Im, S. W. Lim, S. Chen, T. Pham, Y. Frans, and K. Chang, "A 0.5-to-32.75Gb/s Flexible-Reach Wireline Transceiver in 20nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2015, pp. 56–57.
- [26] Y. Hidaka, T. Horie, Y. Koyanagi, T. Miyoshi, H. Osone, S. Parikh, S. Reddy, T. Shibuya, Y. Umezawa, and W. W. Walker, "A 4-Channel 10.3Gb/s Transceiver with Adaptive Phase Equalizer for 4-to-41dB Loss PCB Channel," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2011, pp. 346–347.
- [27] K. Fukuda, H. Yamashita, G. Ono, R. Nemoto, E. Suzuki, T. Takemoto, F. Yuki, and T. Saito, "A 12.3mW 12.5Gb/s Complete Transceiver in 65nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2010, pp. 314–315.
- [28] M. Meghelli, S. Rylov, J. Bulzacchelli, W. Rhee, A. Rylyakov, H. Ainspan, B. Parker, M. Beakes, A. Chung, T. Beukema, P. Pepeljugoski, L. Shan, Y. Kwark, S. Gowda, and D. Friedman, "A 10Gb/s 5-Tap-DFE/4-Tap-FFE Transceiver in 90nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2006, pp. 213–222.

- [29] Y. Hidaka, W. Gai, T. Horie, J. H. Jiang, Y. Koyanagi, and H. Osone, "A 4-Channel 10.3Gb/s Backplane Transceiver Macro with 35dB Equalizer and Sign-Based Zero-Forcing Adaptive Control," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2009, pp. 188–189.
- [30] T. O. Dickson, Y. Liu, A. Agrawal, J. F. Bulzacchelli, H. Ainspan, Z. Toprak-deniz, B. D. Parker, M. Meghelli, and D. J. Friedman, "A 1.8-pJ/bit 16x16-Gb/s Source Synchronous Parallel Interface in 32nm SOI CMOS with Receiver Redundancy for Link Recalibration," in *Proceedings of the Custom Integrated Circuits Conference*, 2015.
- [31] B. Casper, J. Jaussi, F. O'Mahony, M. Mansuri, K. Canagasaby, J. Kennedy, E. Yeung, and R. Mooney, "A 20Gb/s Forwarded Clock Transceiver in 90nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2006, pp. 263–272.
- [32] S. D. Vamvakos, C. Boecker, E. Groen, A. Wang, S. Desai, S. Irwin, V. Rao, A. Bottelli, J. Chen, X. Chen, P. Choudhary, K. C. Hsieh, P. Jennings, H. Lin, D. Pechiu, C. Rao, and J. Yeung, "A 8.125-15.625Gb/s SerDes Using a Sub-Sampling Ring-Oscillator Phase-Locked Loop," in *Proceedings of the Custom Integrated Circuits Conference*, 2014.
- [33] J. Savoj, H. Aslanzadeh, D. Carey, M. Erett, W. Fang, Y. Frans, K. Hsieh, J. Im, A. Jose, D. Turker, P. Upadhyaya, D. Wu, and K. Chang, "Wideband Flexible-Reach Techniques for a 0.5-16.3Gb/s Fully-Adaptive Transceiver in 20nm CMOS," in *Proceedings of the Custom Integrated Circuits Conference*, 2014.
- [34] M. Harwood, S. Nielsen, A. Szczepanek, R. Allred, S. Batty, M. Case, S. Forey, K. Gopalakrishnan, L. Kan, B. Killips, P. Mishra, R. Pande, H. Rategh, A. Ren, J. Sanders, A. Schoy, R. Ward, M. Wetterhorn, and N. Yeung, "A 225mW 28Gb/s SerDes in 40nm CMOS With 13dB of Analog Equalization for 100GBASE-LR4 and optical transport Lane 4.4 Applications," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2012, pp. 326–327.
- [35] D. Cui, B. Raghavan, U. Singh, A. Vasani, Z. Huang, D. Pi, M. Khanpour, A. Nazemi, H. Maarefi, T. Ali, N. Huang, W. Zhang, B. Zhang, A. Momtaz, and J. Cao, "A Dual 23Gb/s CMOS Transmitter/Receiver Chipset for 40Gb/s RZ-DQPSK and CS-RZ-DQPSK Optical Transmission," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2012, pp. 330–331.

- [36] M. Ramezani, M. Abdalla, A. Shoval, M. V. Ierssel, A. Rezayee, A. McLaren, C. Holdenried, J. Pham, E. So, D. Cassan, and S. Sadr, "An 8.4mW/Gb/s 4-Lane 48Gb/s Multi-Standard-Compliant Transceiver in 40nm Digital CMOS Technology," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2011, pp. 352–353.
- [37] A. K. Joy, H. Mair, H. C. Lee, A. Feldman, C. Portmann, N. Bulman, E. C. Crespo, P. Hearne, P. Huang, B. Kerr, P. Khandelwal, F. Kuhlmann, S. Lytollis, J. Machado, C. Morrison, S. Morrison, S. Rabii, D. Rajapaksha, V. Ravinuthula, and G. Surace, "Analog-DFE-based 16Gb/s SerDes in 40nm CMOS That Operates Across 34dB Loss Channels at Nyquist with a Baud Rate CDR and 1.2Vpp Voltage-Mode Driver," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2011, pp. 350–351.
- [38] S. Quan, F. Zhong, W. Liu, P. Aziz, T. Jing, J. Dong, C. Desai, H. Gao, M. Garcia, G. Hom, T. Huynh, H. Kimura, R. Kothari, L. Li, C. Liu, S. Lowrie, K. Ling, A. Malipatil, R. Narayan, T. Prokop, C. Palusa, A. Rajashekara, A. Sinha, C. Zhong, and E. Zhang, "A 1.0625-to-14.025Gb/s Multimedia Transceiver with Full-rate Source-Series-Terminated Transmit Driver and Floating-Tap Decision-Feedback Equalizer in 40nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2011, pp. 162–163.
- [39] F. Spagna, L. Chen, M. Deshpande, Y. Fan, D. Gambetta, S. Gowder, S. Iyer, R. Kumar, P. Kwok, R. Krishnamurthy, C. C. Lin, R. Mohanavelu, R. Nicholson, J. Ou, M. Pasquarella, K. Prasad, H. Rustam, L. Tong, A. Tran, J. Wu, and X. Zhang, "A 78mW 11.8Gb/s Serial Link Transceiver with Adaptive RX Equalization and Baud-Rate CDR in 32nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2010, pp. 366–367.
- [40] F. O'Mahony, J. Kennedy, J. E. Jaussi, G. Balamurugan, M. Mansuri, C. Roberts, S. Shekhar, R. Mooney, and B. Casper, "A 47x10Gb/s 1.4mW/(Gb/s) Parallel Interface in 45nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2010, pp. 156–157.
- [41] L. Wang, Y. Fu, M. LaCroix, E. Chong, and A. C. Carusone, "A 64Gb/s PAM-4 Transceiver Utilizing an Adaptive Threshold ADC in 16nm FinFET," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2018, pp. 110–111.
- [42] G. Balamurugan, F. O'Mahony, M. Mansuri, J. E. Jaussi, J. T. Kennedy, and B. Casper, "A 5-to-25Gb/s 1.6-to-3.8mW/(Gb/s) Reconfigurable Transceiver in 45nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2010, pp. 372–373.

- [43] M. Erett, D. Carey, J. Hudner, R. Casey, K. Geary, P. Neto, M. Raj, S. Mcleod, H. Zhang, A. Roldan, H. Zhao, P.-C. Chiang, H. Zhao, K. Tan, Y. Frans, and K. Chang, "A 126mW 56Gb/s NRZ Wireline Transceiver for Synchronous Short-Reach Applications in 16nm," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2018, pp. 274–275.
- [44] E. Depaoli, E. Monaco, G. Steffan, M. Mazzini, H. Zhang, W. Audoglio, O. Belotti, A. A. Rossi, G. Albasini, M. Pozzoni, S. Erba, and A. Mazzanti, "A 4.9pJ/b 16-to-64Gb/s PAM-4 VSR Transceiver in 28nm FDSOI CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2018, pp. 112–113.
- [45] K. Huang, D. Luo, Z. Wang, X. Zheng, F. Li, C. Zhang, and Z. Wang, "A 190mW 40Gbps SerDes Transmitter and Receiver Chipset in 65nm CMOS Technology," in *Proceedings of the Custom Integrated Circuits Conference*, 2015.
- [46] T. O. Dickson, Y. Liu, S. V. Rylov, A. Agrawal, S. Kim, P.-H. Hsieh, J. F. Bulzacchelli, M. Ferriss, H. Ainspan, A. Rylyakov, B. D. Parker, C. Baks, L. Shan, Y. Kwark, J. Tierno, and D. J. Friedman, "A 1.4-pJ/b, Power-Scalable 16x12-Gb/s Source-Synchronous I/O with DFE Receiver in 32nm SOI CMOS Technology," in *Proceedings of the Custom Integrated Circuits Conference*, 2014.
- [47] S. Hwang, S. Moon, J. Song, and C. Kim, "A 32 Gb/s Rx Only Equalization Transceiver with 1-tap Speculative FIR and 2-tap Direct IIR DFE," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2016.
- [48] Y. Frans, J. Shin, L. Zhou, P. Upadhyaya, J. Im, V. Kireev, M. Elzeftawi, H. Hedayati, T. Pham, S. Asuncion, C. Borrelli, G. Zhang, H. Zhang, and K. Chang, "A 56-Gb/s PAM4 Wireline Transceiver Using a 32-Way Time-Interleaved SAR ADC in 16-nm FinFET," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 4, pp. 1101–1110, 2017.
- [49] T. Ali, L. Rao, U. Singh, M. Abdul-latif, Y. Liu, A. A. Hafez, H. Park, A. Vasani, Z. Huang, A. Iyer, B. Zhang, A. Momtaz, and N. Kocaman, "A 3.8 mW/Gbps Quad-Channel 8.5-13 Gbps Serial Link with a 5-Tap DFE and a 4-Tap Transmit FFE in 28 nm CMOS," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2015, pp. 348–349.

- [50] E. H. Chen, M. Hossain, B. Leibowitz, R. Navid, J. Ren, A. Chou, B. Daly, M. Aleksic, B. Su, S. Li, M. Shirasgaonkar, F. Heaton, J. Zerbe, and J. Eble, "A 40-Gb/s Serial Link Transceiver in 28-nm CMOS Technology," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2014.
- [51] S. Saxena, G. Shu, R. K. Nandwana, M. Talegaonkar, A. Elkholy, T. Anand, S. J. Kim, W. S. Choi, and P. K. Hanumolu, "A 2.8mW/Gb/s 14Gb/s Serial Link Transceiver in 65nm CMOS," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2015, pp. 352–353.
- [52] L. Tang, W. Gai, L. Shi, X. Xiang, K. Sheng, and A. He, "A 32Gb/s 133mW PAM-4 Transceiver with DFE Based on Adaptive Clock Phase and Threshold Voltage in 65nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2018, pp. 114–115.
- [53] T. Hashida, Y. Tomita, Y. Ogata, K. Suzuki, S. Suzuki, T. Nakao, Y. Terao, S. Honda, S. Sakabayashi, R. Nishiyama, A. Konmoto, Y. Ozeki, H. Adachi, H. Yamaguchi, Y. Koyanagi, and H. Tamura, "A 36 Gbps 16.9 mW/Gbps Transceiver in 20-nm CMOS with 1-tap DFE and Quarter-Rate Clock Distribution," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2014.
- [54] A. Nazemi, H. Maarefi, B. Çatlı, M. R. Ahmadi, S. Fallahi, T. Ali, M. Abdul-latif, Y. Liu, J. Kim, A. Momtaz, and N. Kocaman, "A 2.8 mW/Gb/s Quad-Channel 8.5-11.4 Gb/s Quasi-Digital Transceiver in 28 nm CMOS," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2013, pp. 276–277.
- [55] G. R. Gangasani, C. M. Hsu, J. F. Bulzacchelli, T. Beukema, W. Kelly, H. H. Xu, D. Freitas, A. Prati, D. Gardellini, R. Reutemann, G. Cervelli, J. Hertle, M. Baecher, J. Garlett, P. A. Francese, J. F. Ewen, D. Hanson, D. W. Storaska, and M. Meghelli, "A 32 Gb/s Backplane Transceiver With On-Chip AC-Coupling and Low Latency CDR in 32 nm SOI CMOS Technology," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 11, pp. 2474–2489, 2014.
- [56] J. Savoj, K. Hsieh, P. Upadhyaya, F. T. An, A. Bekele, S. Chen, X. Jiang, K. W. Lai, C. F. Poon, A. Sewani, D. Turker, K. Venna, D. Wu, B. Xu, E. Alon, and K. Chang, "A Wide Common-Mode Fully-Adaptive Multi-Standard 12.5Gb/s Backplane Transceiver in 28nm CMOS," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2012, pp. 104–105.

- [57] P. Upadhyaya, C. F. Poon, S. W. Lim, J. Cho, A. Roldan, W. Zhang, J. Namkoong, T. Pham, B. Xu, W. Lin, H. Zhang, N. Narang, K. H. Tan, G. Zhang, Y. Frans, and K. Chang, "A Fully Adaptive 19-to-56Gb/s PAM-4 Wireline Transceiver with a Configurable ADC in 16nm FinFET," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2018, pp. 108–109.
- [58] J. M. Wilson, W. J. Turner, J. W. Poulton, B. Zimmer, X. Chen, S. S. Kudva, S. Song, S. G. Tell, N. Nedovic, W. Zhao, S. R. Sudhakaran, C. T. Gray, and W. J. Dally, "A 1.17pJ/b 25Gb/s/pin Ground Referenced Single-Ended Serial Link for Off- and On-Package Communication in 16nm CMOS Using a Process- and Temperature-Adaptive Voltage Regulator," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2018, pp. 276–277.
- [59] M. S. Jalali, M. H. Taghavi, A. McLaren, J. Pham, K. Farzan, D. Diclemente, M. Van Ierssel, W. Song, S. Asgaran, C. Holdenried, and S. Sadr, "A 4-Lane 1.25-to-28.05Gb/s Multi-Standard 6pJ/b 40dB Transceiver in 14nm FinFET with Independent TX/RX Rate Support," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2018, pp. 106–108.
- [60] IEEE 802.3cd Ethernet Task Group, "IEEE 802.3cd 50Gb/s, 100Gb/s, and 200 Gb/s Ethernet Task Force Contributed Channel Data," 2018. [Online]. Available: http://www.ieee802.org/3/cd/public/channel/index.html
- [61] O. E. Agazzi, M. R. Hueda, D. E. Crivelli, H. S. Carrer, A. Nazemi, G. Luna, F. Ramos, R. Lopez, C. Grace, B. Kobeissy, C. Abidin, M. Kazemi, M. Kargar, C. Marquez, S. Ramprasad, F. Bollo, V. Posse, S. Wang, G. Asmanis, G. Eaton, N. Swenson, T. Lindsay, and P. Voois, "A 90 nm CMOS DSP MLSD transceiver with integrated AFE for electronic dispersion compensation of multimode optical fibers at 10 Gb/s," *IEEE Journal of Solid-State Circuits*, vol. 43, no. 12, pp. 2937–2957, 2008.
- [62] J. Kim, A. Balankutty, A. Elshazly, Y. Y. Huang, H. Song, K. Yu, and F. O'Mahony, "A 16-to-40Gb/s Quarter-Rate NRZ/PAM4 Dual-Mode Transmitter in 14nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2015, pp. 60–61.
- [63] M. S. Chen, Y. N. Shih, C. L. Lin, H. W. Hung, and J. Lee, "A Fully-Integrated 40-Gb/s Transceiver in 65-nm CMOS Technology," *IEEE Journal of Solid-State Circuits*, vol. 47, no. 3, pp. 627–640, 2012.

- [64] E. Mammei, F. Loi, F. Radice, A. Dati, M. Bruccoleri, M. Bassi, and A. Mazzanti, "Analysis and Design of a Power-Scalable Continuous-Time FIR Equalizer for 10 Gb/s to 25 Gb/s Multi-Mode Fiber EDC in 28 nm LP CMOS," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 12, pp. 3130–3140, 2014.
- [65] A. Agrawal, J. F. Bulzacchelli, T. O. Dickson, Y. Liu, J. A. Tierno, and D. J. Friedman, "A 19-Gb/s Serial Link Receiver With Both 4-Tap FFE and 5-Tap DFE Functions in 45nm SOI CMOS," *IEEE Journal of Solid-State Circuits*, vol. 47, no. 12, pp. 3220–3231, 2012.
- [66] M. S. Chen, M. C. F. Chang, and C. K. K. Yang, "A Low-PDP and Low-Area Repeater Using Passive CTLE for On-Chip Interconnects," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2015, pp. 244–245.
- [67] S. Gondi and B. Razavi, "Equalization and Clock and Data Recovery Techniques for 10-Gb/s CMOS Serial-Link Receivers," *IEEE Journal of Solid-State Circuits*, vol. 42, no. 9, pp. 1999–2011, 2007.
- [68] K. K. Parhi, "Design of Multigigabit Multiplexer-Loop-Based Decision Feedback Equalizers," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 13, no. 4, pp. 489–493, 2005.
- [69] I. Ozkaya, A. Cevrero, P. A. Francese, C. Menolfi, T. Morf, M. Brandli, D. M. Kuchta, L. Kull, C. W. Baks, J. E. Proesel, M. Kossel, D. Luu, B. G. Lee, F. E. Doany, M. Meghelli, Y. Leblebici, and T. Toifl, "A 64-Gb/s 1.4-pJ/b NRZ Optical Receiver Data-Path in 14-nm CMOS FinFET," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 12, pp. 3458–3473, 2017.
- [70] M. Russell and J. W. M. Bergmans, "A Technique to Reduce Error Propagation in M-ary Decision Feedback Equalization," *IEEE Transactions on Communications*, vol. 43, no. 12, pp. 2878–2881, 1995.
- [71] J. Im, D. Freitas, A. Roldan, R. Casey, S. Chen, A. Chou, T. Cronin, K. Geary, S. Mcleod, L. Zhou, I. Zhuang, J. Han, S. Lin, P. Upadhyaya, G. Zhang, Y. Frans, and K. Chang, "A 40-to-56Gb/s PAM-4 Receiver with 10-Tap Direct Decision-Feedback Equalization in 16nm FinFET," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2017, pp. 114–115.

- [72] L. Wang, "Timing Skew Calibration for Time Interleaved Analog to Digital Converters," MASc. Thesis, 2014. [Online]. Available: https://tspace.library.utoronto.ca/handle/1807/67846
- [73] B. Zhang, A. Nazemi, A. Garg, N. Kocaman, M. R. Ahmadi, M. Khanpour, H. Zhang, J. Cao, and A. Momtaz, "A 195mW/55mW Dual-Path Receiver AFE for Multistandard 8.5-to-11.5 Gb/s Serial Links in 40nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*, 2013, pp. 34–36.
- [74] B. Raghavan, A. Varzaghani, L. Rao, H. Park, X. Yang, Z. Huang, Y. Chen, R. Kattamuri, C. Wu, B. Zhang, J. Cao, A. Momtaz, and N. Kocaman, "A 125 mW 8.5-11.5 Gb/s Serial Link Transceiver with a Dual Path 6-bit ADC/5-tap DFE receiver and a 4-tap FFE Transmitter in 28 nm CMOS," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2016.
- [75] W. Cheng, W. Ali, M.-J. Choi, K. Liu, T. Tat, D. Devendorf, L. Linder, and R. Stevens, "A 3b 40GS/s ADC-DAC in 0.12um SiGe," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2004, pp. 262–263.
- [76] S. Verma, S. Zogopoulos, and S. Sidiropoulos, "A 10.3-GS/s, 6-Bit Flash ADC for 10G Ethernet Applications," *IEEE Journal of Solid-State Circuits*, vol. 48, no. 12, pp. 1–11, 2013.
- [77] D. Crivelli, M. Hueda, H. Carrer, J. Zachan, V. Gutnik, M. Del Barco, R. Lopez, G. Hatcher, J. Finochietto, M. Yeo, A. Chartrand, N. Swenson, P. Voois, and O. Agazzi, "A 40nm CMOS Single-Chip 50Gb/s DP-QPSK/BPSK Transceiver With Electronic Dispersion Compensation for Coherent Optical Channels," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*, 2012, pp. 328–330.
- [78] Y. M. Greshishchev, J. Aguirre, M. Besson, R. Gibbins, C. Falt, P. Flemke, N. Ben-Hamida, D. Pollex, P. Schvan, and S. C. Wang, "A 40GS/s 6b ADC in 65nm CMOS," in *IEEE International Solid-State Circuits Conference (ISSCC)*, 2010, pp. 390–391.
- [79] J. Cao, B. Zhang, U. Singh, D. Cui, A. Vasani, A. Garg, W. Zhang, N. Kocaman, D. Pi, B. Raghavan, H. Pan, I. Fujimori, and A. Momtaz, "A 500 mW ADC-Based CMOS AFE with Digital Calibration for 10Gb/s Serial Links Over KR-Backplane and Multimode Fiber," *IEEE Journal of Solid-State Circuits*, vol. 45, no. 6, pp. 1172–1185, 2010.

- [80] P. Schvan, R. Gibbins, Y. Greshishchev, and N. Ben-hamida, "A 24GS/s 6b ADC in 90nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*, 2008, pp. 544–546.
- [81] E. Z. Tabasy, A. Shafik, K. Lee, S. Hoyos, and S. Palermo, "A 6b 10GS/s TI-SAR ADC with Embedded 2-Tap FFE/1-Tap DFE in 65nm CMOS," in *VLSI Circuits Symposium, Digest of Technical Papers*, 2013, pp. 274–275.
- [82] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Braendli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 35mW 8b 8.8GS/s SAR ADC with Low-Power Capacitive Reference Buffers in 32 nm Digital SOI CMOS," in *VLSI Circuits Symposium, Digest of Technical Papers*, 2013, pp. 260–261.
- [83] M. El-Chammas and B. Murmann, "A 12-GS/s 81-mW 5-bit Time-Interleaved Flash ADC With Background Timing Skew Calibration," *IEEE Journal of Solid-State Circuits*, vol. 46, no. 4, pp. 838–847, Apr. 2011.
- [84] S. Le Tual, P. N. Singh, C. Curis, and P. Dautriche, "A 20GHz-BW 6b 10GS/s 32mW Time-Interleaved SAR ADC with Master T&H in 28nm UTBB FDSOI Technology," in *IEEE International Solid-State Circuits Conference*, 2014, pp. 382–383.
- [85] A. Shafik, E. Z. Tabasy, S. Cai, K. Lee, S. Hoyos, and S. Palermo, "A 10Gb/s Hybrid ADC-Based Receiver with Embedded 3-Tap Analog FFE and Dynamically-Enabled Digital Equalization in 65nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*, 2015, pp. 62–64.
- [86] L. Kull, J. Pliva, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Braendli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 110mW 6 bit 36GS/s interleaved SAR ADC for 100 GBE occupying 0.048mm$^2$ in 32nm SOI CMOS," in *IEEE Asian Solid-State Circuits Conference*, 2014, pp. 89–92.
- [87] X. Yang, R. Payne, and J. Liu, "A 10GS/s 6b Time-Interleaved ADC with Partially Active Flash sub-ADCs," in *Custom Integrated Circuits Conference (CICC)*, 2013, pp. 6–9.
- [88] V. H. Chen and L. Pileggi, "A 69.5 mW 20GS/s 6b Time Interleaved ADC With Embedded Time-to-Digital Calibration in 32 nm CMOS SOI," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 12, pp. 2891–2901, 2014.

- [89] Y. Duan and E. Alon, "A 6b 46GS/s ADC with >23GHz BW and Sparkle-Code Error Correction," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2015, pp. 162–163.
- [90] D. Cui, H. Zhang, N. Huang, A. Nazemi, B. Catli, H. G. Rhew, B. Zhang, A. Momtaz, and J. Cao, "A 320mW 32Gb/s 8b ADC-Based PAM-4 Analog Front-End with Programmable Gain Control and Analog Peaking in 28nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2016, pp. 58–59.
- [91] J. Cao, D. Cui, A. Nazemi, T. He, G. Li, B. Catli, M. Khanpour, K. Hu, T. Ali, H. Zhang, H. Yu, B. Rhew, S. Sheng, Y. Shim, B. Zhang, and A. Momtaz, "A Transmitter and Receiver for 100Gb/s Coherent Networks with Integrated 4×64GS/s 8b ADCs and DACs in 20nm CMOS," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2017, pp. 484–485.
- [92] Aurangozeb, A. D. Hossain, and M. Hossain, "Channel Adaptive ADC and TDC for 28 Gb/s PAM-4 Digital receiver," in *Proceedings of the Custom Integrated Circuits Conference*, 2017.
- [93] G. Van Der Plas and B. Verbruggen, "A 150 MS/s 133 µW 7 bit ADC in 90 nm digital CMOS," *IEEE Journal of Solid-State Circuits*, vol. 43, no. 12, pp. 2631–2640, 2008.
- [94] T. Jiang, W. Liu, F. Y. Zhong, C. Zhong, K. Hu, and P. Y. Chiang, "A single-channel, 1.25-GS/s, 6-bit, 6.08-mW asynchronous successive-approximation ADC with improved feedback delay in 40-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 47, no. 10, pp. 2444–2453, 2012.
- [95] B. Verbruggen, J. Craninckx, M. Kuijk, P. Wambacq, and G. Van Der Plas, "A 2.2mW 5b 1.75GS/s 5 Bit Folding Flash ADC in 90nm Digital CMOS," *IEEE Journal of Solid-State Circuits*, vol. 44, no. 3, pp. 874–882, 2009.
- [96] R. Yousry, H. Park, E. H. Chen, and C. K. K. Yang, "A Digitally-Calibrated 10GS/s Reconfigurable Flash ADC in 65-nm CMOS," in *IEEE International Symposium on Circuits and Systems (ISCAS)*, 2013, pp. 2444–2447.
- [97] E. Z. Tabasy, A. Shafik, K. Lee, S. Hoyos, and S. Palermo, "A 6 bit 10 GS/s TI-SAR ADC With Low-Overhead Embedded FFE/DFE Equalization for Wireline Receiver Applications," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 11, pp. 2560–2574, 2014.

- [98] A. Varzaghani and C. K. K. Yang, "A 6-GSamples/s Multi-Level Decision Feedback Equalizer Embedded in a 4-bit Time-Interleaved Pipeline A/D Converter," *IEEE Journal of Solid-State Circuits*, vol. 41, no. 4, pp. 935–944, 2006.
- [99] R. Narasimha, N. Shanbhag, and A. Singer, "BER-Based Adaptive ADC-Equalizer Based Receiver for Communication Links," in *IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation*, 2010, pp. 64–69.
- [100] R. Narasimha, M. Lu, N. R. Shanbhag, and A. C. Singer, "BER-Optimal Analog-to-Digital Converters for Communication Links," *IEEE Transactions on Signal Processing*, vol. 60, no. 7, pp. 3683–3691, 2012.
- [101] Y. Lin, M. S. Keel, A. Faust, A. Xu, N. R. Shanbhag, E. Rosenbaum, and A. C. Singer, "A Study of BER-Optimal ADC-Based Receiver for Serial Links," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 63, no. 5, pp. 693–704, 2016.
- [102] Aurangozeb, A. D. Hossain, M. Mohammad, and M. Hossain, "Channel-Adaptive ADC and TDC for 28 Gb/s PAM-4 Digital Receiver," *IEEE Journal of Solid-State Circuits*, vol. 53, no. 3, pp. 772–788, 2018.
- [103] J. Kim, E. H. Chen, J. Ren, B. S. Leibowitz, P. Satarzadeh, J. L. Zerbe, and C. K. K. Yang, "Equalizer Design and Performance Trade-Offs in ADC-Based Serial Links," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 58, no. 9, pp. 2096–2107, 2011.
- [104] E. H. Chen, R. Yousry, and C. K. K. Yang, "Power optimized ADC-based serial link receiver," *IEEE Journal of Solid-State Circuits*, vol. 47, no. 4, pp. 938–951, 2012.
- [105] A. D. Hossain, Aurangozeb, M. Mohammad, and M. Hossain, "A 35 mW 10 Gb/s ADC-DSP less Direct Digital Sequence Detector and Equalizer in 65nm CMOS," in *IEEE Symposium on VLSI Circuits, Digest of Technical Papers*, 2016.
- [106] S. Shahramian and T. Chan Carusone, "Hardware Reduction by Combining Pipelined A/D Conversion and FIR Filtering for Channel Equalization," in *IEEE International Symposium on Circuits and Systems (ISCAS)*, 2004, pp. 489–492.
- [107] S. P. Lloyd, "Least Squares Quantization in PCM," *IEEE Transactions on Information Theory*, vol. 28, no. 2, pp. 129–137, 1982.
- [108] J. Max, "Quantizing for Minimum Distortion," *IRE Transactions on Information Theory*, vol. 6, no. 1, pp. 7–12, 1960.

- [109] C. C. Yeh and J. R. Barry, "Adaptive Minimum Bit-Error Rate Equalization for Binary Signaling," *IEEE Transactions on Communications*, vol. 48, no. 7, pp. 1226–1235, 2000.
- [110] M. Bernhard, D. Rörich, T. Handte, and J. Speidel, "Analytical and numerical studies of quantization effects in coherent optical OFDM transmission with 100 Gbit/s and beyond," in *ITG-Fachbericht 233: Photonische Netze*, 2012.
- [111] G. Kim, L. Kull, D. Luu, M. Braendli, C. Menolfi, P. A. Francese, C. Aprile, T. Morf, M. Kossel, A. Cevrero, I. Ozkaya, T. Toifl, and Y. Leblebici, "Parallel Implementation Technique of Digital Equalizer for Ultra-High-Speed Wireline Receiver," in *International Symposium on Circuits and Systems (ISCAS)*, 2018.
- [112] H. Zhang, B. Jiao, Y. Liao, and G. Zhang, "PAM4 Signaling for 56G Serial Link Applications - A Tutorial," in *DesignCon*, 2016. [Online]. Available: www.designcon.com
- [113] G. Steffan, E. Depaoli, E. Monaco, N. Sabatino, W. Audoglio, A. A. Rossi, S. Erba, M. Bassi, and A. Mazzanti, "A 64Gb/s 4-PAM Transmitter with 4-Tap FFE and 2.26pJ/b Energy Efficiency in 28nm CMOS FDSOI," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2017, pp. 116–117.
- [114] S. P. Boyd, "Volterra Series: Engineering Fundamentals," Ph.D. dissertation, University of California, Berkeley, 1980.
- [115] P. Nikaeen, "Digital Compensation of Dynamic Acquisition Errors at the Front-End of High-Performance A/D Converters," Ph.D. dissertation, Stanford University, 2008.
- [116] Y. Duan and E. Alon, "A 12.8 GS/s Time-Interleaved ADC With 25 GHz Effective Resolution Bandwidth and 4.6 ENOB," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 8, pp. 1725–1738, 2014.
- [117] N. Da Dalt, M. Harteneck, C. Sandner, and A. Wiesbauer, "On the Jitter Requirements of the Sampling Clock for Analog-to-Digital Converters," *IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications*, vol. 49, no. 9, pp. 1354–1360, Sep. 2002.
- [118] R. A. Belcher, "ADC Standard IEC 60748-4-3: Precision Measurement of Alternative ENOB Without a Sine Wave," *IEEE Transactions on Instrumentation and Measurement*, vol. 64, no. 12, pp. 3183–3200, 2015.

- [119] E. Bury, B. Kaczer, D. Linten, L. Witters, H. Mertens, N. Waldron, X. Zhou, N. Collaert, N. Horiguchi, A. Spessot, and G. Groeseneken, "Self-heating in FinFET and GAA-NW using Si, Ge and III/V channels," in *IEEE International Electron Devices Meeting (IEDM)*, 2016, pp. 15.6.1–15.6.4.
- [120] J. Han, Y. Lu, N. Sutardja, K. Jung, and E. Alon, "Design Techniques for a 60 Gb/s 173 mW Wireline Receiver Frontend in 65 nm CMOS Technology," *IEEE Journal of Solid-State Circuits*, vol. 51, no. 4, pp. 871–880, 2016.
- [121] C. Thakkar, N. Narevsky, C. D. Hull, and E. Alon, "Design Techniques for a Mixed-Signal I/Q 32-Coefficient Rx-Feedforward Equalizer, 100-Coefficient Decision Feedback Equalizer in an 8 Gb/s 60 GHz 65 nm LP CMOS receiver," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 11, pp. 2588–2607, 2014.
- [122] M. Miyahara and A. Matsuzawa, "A Low-Offset Latched Comparator Using Zero-Static Power Dynamic Offset Cancellation Technique," in *Proceedings of Technical Papers - IEEE Asian Solid-State Circuits Conference (ASSCC)*, 2009, pp. 233–236.
- [123] K. M. Lei, P. I. Mak, and R. P. Martins, "Systematic analysis and cancellation of kickback noise in a dynamic latched comparator," *Analog Integrated Circuits and Signal Processing*, vol. 77, no. 2, pp. 277–284, 2013.
- [124] Y. Sato, "A Method of Self-Recovering Equalization for Multilevel Amplitude-Modulation Systems," *IEEE Transactions on Communications*, vol. 23, no. 6, pp. 679–682, 1975.
- [125] K. Zheng, B. Murmann, H. Zhang, and G. Zhang, "Feedforward Equalizer Location Study for High Speed Serial Systems," in *DesignCon*, 2018, pp. 1–25.
- [126] B. Setterberg, K. Poulton, S. Ray, D. J. Huber, V. Abramzon, G. Steinbach, J. P. Keane, B. Wuppermann, M. Clayson, M. Martin, R. Pasha, E. Peeters, A. Jacobs, F. Demarsin, A. Al-Adnani, and P. Brandt, "A 14b 2.5GS/s 8-Way-Interleaved Pipelined ADC with Background Calibration and Digital Dynamic Linearity Correction," in *IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2013, pp. 466–467.
- [127] J. Hudner, D. Carey, R. Casey, K. Hearne, P. Wilson, P. W. D. A. F. Neto, I. Chlis, M. Erett, C. Poon, A. Laraba, H. Zhang, S. L. C. Ambatipudi, D. Mahashin, P. Upadhyaya, Y. Frans, and K. Chang, "A 112Gb/s PAM4 Wireline Receiver using a 64-way Time-Interleaved SAR ADC in 16nm FinFET," in *VLSI Circuits Symposium, Digest of Technical Papers*, 2018, pp. 47–48.

- [128] L. Wang, Y. Fu, M. LaCroix, E. Chong, and A. Chan Carusone, "A 64Gb/s PAM-4 Transceiver Utilizing An Adaptive Threshold ADC in 16nm FinFET," *IEEE Journal of Solid-State Circuits*.
- [129] L. Wang, M. A. Lacroix, and A. C. Carusone, "A 4-GS/s Single Channel Reconfigurable Folding Flash ADC for Wireline Applications in 16-nm FinFET," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 64, no. 12, pp. 1367–1371, 2017.
- [130] E.-H. Chen, W. Leven, N. Warke, A. Joy, S. Hubbins, A. Amerasekera, and C.-K. K. Yang, "Adaptation of CDR and full scale range of ADC-based SerDes receiver," in *VLSI Circuits Symposium, Digest of Technical Papers*, 2009, pp. 12–13.
- [131] S. Song, K. D. Choo, T. Chen, S. Jang, M. P. Flynn, and Z. Zhang, "A Maximum-Likelihood Sequence Detection Powered ADC-Based Serial Link," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 65, no. 7, pp. 2269–2278, 2018.