Download Design and Implementation High Speed Low Power Cam

ISSN 2319-8885 Vol.04,Issue.01 January-2015, Pages:0088-0094 www.ijsetr.com Design and Implementation High Speed Low Power Cam with A Parity Bit and Power-Gated ML Sensing PERIKALA MAHESH BABU1, M. LAVANYA LATHA2, SHARAD KULKARNI3 1 PG Scholar, Dept of ECE, Audi Sankara Institute of Technology, Gudur, AP, India, Email: [email protected]. 2 Assoc Prof, Dept of ECE, Audi Sankara Institute of Technology, Gudur, AP, India. 3 HOD, Dept of ECE, Audi Sankara Institute of Technology, Gudur, AP, India. Abstract: Content addressable memory (CAM) offers high-speed search function in a single clock cycle. Due to its parallel match-line ML comparison, CAM is power-hungry. Thus, robust, high-speed and low-power ML sense amplifiers are highly sought-after in CAM designs. In this paper, we introduce a parity bit that leads to 39% sensing delay reduction at a cost of less than 1% area and power overhead. Furthermore, we propose an effective gated-power technique to reduce the peak and average power consumption and enhance the robustness of the design against process variations. A feedback loop is employed to autoturn off the power supply to the comparison elements and hence reduces the average power consumption by 64%. The proposed design can work at a supply voltage down to 0.5 V. Keywords: CMOS, Content Addressable Memory (CAM), Match-Line. I. INTRODUCTION Content addressable memory (CAM) is a type of solidstate memory in which data are accessed by their contents rather than physical locations. It receives input search data, i.e., a search word, and returns the address of a similar word that is stored in its data-bank [2]. In general, a CAM has three operation modes: READ, WRITE, and COMPARE, among which “COMPARE” is the main operation as CAM rarely reads or writes [5]. Fig. 1(a) shows a simplified block diagram of a CAM core with an incorporated search data register and an output encoder. It starts a compare operation by loading an n-bit input search word into the search data register. The search data are then broadcast into the memory banks through n pairs of complementary search-lines (SLs) and directly compared with every bit of the stored words using comparison circuits. Each stored word has a ML that is shared between its bits to convey the comparison result. Location of the matched word will be identified by an output encoder, as shown in Fig. 1(a). During a pre-charge stage, the MLs are held at ground voltage level while both SL and ~SL are at VDD. During evaluation stage, complementary search data is broadcast to the SLs and ~SLs. When mismatch occurs in any CAM cell (for example at the first cell of the row D = “1”; ~D = “0”; SL = “1”; ~SL = “0”), transistor P 3 and P4 will be turned on, charging up the ML to a higher voltage level. A sense amplifier (MLSA) is used to detect the voltage change on the ML and amplifies it to a full CMOS voltage output. If mismatch happens to none of the cells on a row, no charge up path will be formed and the voltage on the ML will remain unchanged, indicating a match. Since all available words in the CAMs are compared in parallel, result can be obtained in a single clock cycle. Hence, CAMs are faster than other hardware- and software-based search systems [2]. They are therefore preferred in high-throughput applications such as network routers and data compressors. However, the full parallel search operation leads to critical challenges in designing a low-power system for high-speed high-capacity CAMs [2]: 1) the power hungry nature due to the high switching activity of the SLs and the MLs and 2) a huge surge-on current (i.e., peak current) occurs at the beginning of the search operation due to the concurrent evaluation of the MLs may cause a serious IR drop on the power grid, thus affecting the operational reliability of the chip [2]. As a result, numerous efforts have been put forth to reduce both the peak and the total dynamic power consumption of the CAMs [3]–[9]. For example, Zukowski et al. and Pagiamtzis et al. introduced selective pre-charge and pipe-line architecture, respectively to reduce the peak and average power consumption of the CAM [9]. [6], [7] and [4] utilized the ML pre-charge low scheme (i.e., low ML swing) to reduce the average power consumption. These designs however are sensitive to process and supply voltage variations. As will be shown later in Section IV, they can hardly be scaled down to sub-65-nm CMOS process. In this work, a paritybit is introduced to boost the search speed of the parallel CAM with less than 1% power and area overhead. Concurrently, a power-gated ML sense amplifier is proposed to improve the performance of the CAM ML Copyright @ 2014 IJSETR. All rights reserved. PERIKALA MAHESH BABU, M. LAVANYA LATHA, SHARAD KULKARNI comparison in terms of power and robustness. It also operations just like an SRAM cell. Next operation is reduces the peak turn-on current at the beginning of each MATCH operation. search cycle. The rest of paper is organized as follows. B. Cam Architecture with Modified Cell Section II Modified CAM Design .In Section III, Match The Modified block diagram of a CAM core contains Line Structures .Performance analyses are presented in incorporated search data register and an output encoder. Section IV. Section V concludes this paper. Content Addressable Memory starts comparing operation by loading an n-bit input search word into the search data register. The search data that stored in SRAM are then broadcast into the memory banks through n pairs of complementary search-lines and directly compared with every bit of the stored words using comparison circuits as shown in Fig.1. Fig.1. Block diagram of a conventional CAM. II. MODIFIED CAM DESIGN A. CAM Cell Basic CAM cell operates is often discovered as twofold one for bit storage in Random Access Memory (RAM) and another for bit comparison i.e. distinctive to CAM as shown in Fig.2. At junction transistor i.e. circuit level CAM structure enforced as NAND-type and its variants has been explained by [0000]. However, at branch of knowledge level bit storage uses easy SRAM cell and comparison operate is equivalent to XOR i.e. XNOR logical operation. So our elementary chip cell style is abstracted as a vector product of SRAM and XNOR circuits. Fig.2. Basic CAM cell. The basic CAM cell is based on the static memory cell. Data is stored in two cross coupled inverters (SRAM). The word line used to control the two NMOS transistors and permit the CAM to be writing the data as shown in Fig.3. The remaining additional transistors are used in CAM for matching the stored and given input data. Static RAM contains two transistors that help to store the bit used for searching. This cell performs browse associated WRITE 1. Static RAM Array There is a wide range of techniques to improve SRAM cell (in fig as MC) performance and power in literature, e.g. dynamic voltage scaling, lower retention voltage etc. But these techniques are not directly related to the CAM operation, and can be applied almost regardless of the chosen CAM cell architecture. Considering SRAM oriented techniques common to all CAM improvement methods in this report, we will not include SRAM optimization in this study, but rather elaborate on CAM specific methods. 2. Match lines (ML) For the conventional CAM design, the ML is precharged, and then evaluated to discharge or stay high based on the match or not match decision of the evaluation circuit. The area, power and delay considerations of the ML routing limits the CAM array size and performance for many applications. In this study we will investigate methods to reduce ML delay, activity factor and power. 3. Search Lines (SL) Conventional CAM design uses complementary search lines, which results in a SL activity factor of 1. This high activity factor shows that SL improvement techniques are crucial for minimizing the CAM power and delay. In this study we present a way to avoid using complementary search lines, and reducing the activity factor to obtain area, power and delay benefits. C. NAND cell The NAND cell helps to implement the comparison between the stored bit D in the array, and corresponding search data on the corresponding search lines , (SL and SL̅̅̅ ), using the three comparison transistors , MD, MD̅̅̅ and M 1, which are all typically minimum-size to maintain high cell density those stored in the CAM array the bit-comparison operation of a NAND cell examined by using an example. By considering the case of a match when SL=1 and D=1 and Pass transistor MD is ON and passes the logic “1” on the search line (SL) to node B where the Node B is the bitmatch node which is logic “1” if there is a match in the cell . The logic “ 1” on node B turns ON transistor M 1.The transistor M 1 is also turned ON in the other match case when SL=0 and D=0 should be noted . During this case , the transistor MD̅̅̅ passes logic high to raise node B . In other cases in NAND, SL≠D result in a miss-match condition and International Journal of Scientific Engineering and Technology Research Volume.04, IssueNo.01, January-2015, Pages: 0088-0094 Design and Implementation High Speed Low Power Cam with A Parity Bit and Power-Gated ML Sensing accordingly node B is logic “0” and the transistor M1 is consumption of the Content Addressable Memory. A feature OFF. Pass-transistor in node B implementation of the of the NAND match line is that a miss stops signal XNOR function the NAND nature of this cell becomes clear propagation specified there is no consumption of power past when multiple NAND cells are connected in series. The the ultimate matching electronic transistor within the serial MLn and MLn+1 node are joined to form a word during this nMOS chain. Typically, only one match line is within the condition. A serial nMOS chain of all the M1 transistors match state, consequently most match line have solely a tiny resembles the pull-down path of a CMOS NAND logic gate. low range of transistors within the chain that area unit ON The match condition occurs only if every cell in a word is in and so solely a tiny low quantity of power is consumed. In the match condition for the entire word. the new CAM cell incorporates a similar topology of that of the traditional style, their layouts are similar. These two cell layouts have a similar length however completely different heights. Within the new design, VDDML can't be shared between two adjacent rows, leading to a taller cell layout. III. MATCH LINE STRUCTURES Basically, match line is one of the key structures in CAMs. The NOR cell and NAND cell are used to construct a CAM match line. Fig.3. Block Diagram of CAM cell. D. Match-Line Sensing Technique Each stored word has a Match-Line (ML) that is shared between the bits to convey the comparison process. Location of the matched word can be identified by an encoder as an output. In the first stage, i.e., pre-charge stage, the MLs is held at the ground voltage level while both SL and 𝑆𝐿̅̅̅̅̅̅ are at Vdd. Within the Evaluation stage, complementary search data is broadcast to the SLs and 𝑆𝐿̅̅̅s̅̅̅ . Cell transistor and will be turned on, charging up the to a next voltage level, if any mismatch occurs in any CAM. A Match-Line Sense Amplifier (MLSA) is used to detect the voltage change and amplifies it to a full CMOS voltage output. If any does not mismatch happens to cells in a row, no charge up path will be formed and the voltage on the will remain unchanged, indicating a match. Since all available words in the CAMs are compared in parallel, the output should be obtained in a single clock cycle. So, CAMs are faster than other hardware based and software based search applications. They are therefore preferred in highthroughput applications such as network routers and data compressors, where full parallel search operation leads to critical challenges in designing a low-power system for high-speed high-capacity CAMs 1)the power hungry nature due to the high switching activity 2) a huge surge-on current (i.e., Peak current) occurs at the beginning of the search operation due to the concurrent evaluation of the may cause a serious IR drop on the power grid, that affects the operational reliability of the chip. As a result, numerous efforts have been put forth to reduce both the peak and the total dynamic power A. NOR Match Line The schematic form of NOR match line is shown in Fig.4. The NOR cells are connected in parallel to form a NOR match line. A typical NOR search cycle operates in three phases: search line pre-charge, match line pre-charge, and match line evaluation. First, the search lines are precharged low to disconnect the match lines from ground by disabling the pull down paths in each CAM cell. Second, with the pull down paths disconnected, the Mpre transistor pre-charges the match lines high. Finally, the search lines are driven to the search word values, triggering the match line evaluation phase. In the case of a match, the ML voltage, VML, stays high as there is no discharge path to ground. In the case of a miss, there is at least one path to ground that discharges the match line. The match line sense amplifier (MLSA) senses the voltage on ML, and generates a corresponding full-rail output match result. The main feature of the NOR match line is its high speed of operation. In the slowest case of a one-bit miss in a word, the critical evaluation path is through the two series transistors in the cell that form the pull down path. Even in this worst case, NOR-cell evaluation is faster than the NAND case, where between 8 and 16 transistors form the evaluation path. Fig.4. Structure of a NOR match line with n cells. International Journal of Scientific Engineering and Technology Research Volume.04, IssueNo.01, January-2015, Pages: 0088-0094 PERIKALA MAHESH BABU, M. LAVANYA LATHA, SHARAD KULKARNI transistors for the comparison circuitry. NOR cells avoid B. NAND Match Line Fig.5 shows the NAND match line. A number of n cells this problem by applying maximum gate voltage to all CAM are cascaded to form the NOR match line. The pre-charge cell transistors when conducting. pMOS transistor, Mpre, sets the initial voltage of the match line, ML, to the supply voltage, VDD. Next, the evaluation IV. PERFORMANCE COMPARISONS nMOS transistor, Meval, turns ON. In the case of a match, In this section, performance of the proposed design will all nMOS transistors, M1through Mn are ON, effectively be evaluated using the conventional circuit and those in [6], creating a path to ground from the ML node, hence [7] as references. In [6], the power consumption is limited discharging ML to ground. In the case of a miss, at least one by the amount of charge injected to the ML at the beginning of the series nMOS transistors, M1 through Mn, is OFF, of the search. In [7], a similar concept is utilized with a leaving the ML voltage high. A sense amplifier, MLSA, positive feedback loop to boost the sensing speed. Both detects the difference between the match (low) voltage and designs are very power efficient (fig 6). the miss (high) voltage. The NAND match line has an explicit evaluation transistor, Meval, unlike the NOR match line, where the CAM cells themselves perform the evaluation. There is a potential charge-sharing problem in the NAND match line. Charge sharing can occur between the ML node and the intermediate MLi nodes. This charge sharing may cause the ML node voltage to drop sufficiently low such that the MLSA detects a false match. A technique that eliminates charge sharing is to pre-charge high, in addition to ML, the intermediate match nodes ML1 through MLn-1. This procedure eliminates charge sharing, since the intermediate match nodes and the ML node are initially shorted. However, there is an increase in the power consumption due to the search line pre-charge. Fig.5. Structure of a NAND match line. A feature of the NAND match line is that a miss stops signal propagation such that there is no consumption of power past the final matching transistor in the serial nMOS chain. Typically, only one match line is in the match state, consequently most match lines have only a small number of transistors in the chain that are ON and thus only a small amount of power is consumed. Two drawbacks of the NAND match line are a quadratic delay dependence on the number of cells, and a low noise margin. The quadratic delay-dependence comes from the fact that adding a NAND cell to a NAND match line adds both a series resistance due to the series nMOS transistor and a capacitance to ground due to the nMOS diffusion capacitance. These elements form an RC ladder structure whose overall time constant has a quadratic dependence on the number of NAND cells. The low noise margin is caused by the use of nMOS pass Fig.6. Waveforms of some important nodes during evaluation of three rows of 128-bit of the proposed design. Fig.7. Layout of the proposed CAM cell Nets ML and VDDML are routed horizontally on Metal 4—(i.e., purple) while net VDD is routed vertically (i.e., blue). International Journal of Scientific Engineering and Technology Research Volume.04, IssueNo.01, January-2015, Pages: 0088-0094 Design and Implementation High Speed Low Power Cam with A Parity Bit and Power-Gated ML Sensing As will be shown latter, the proposed design consumes Our simulation result has shown that for an 8K×128 slightly higher power consumption when compared with [6] CAM array implemented in a 65-nm CMOS process, the worst-case IR drop at the center of the conventional CAM and [7] but is more robust against PVT variations. can be as large as 0.18 V (i.e., 18% V DD) while that of the A. Peak Current and IR Drop Attenuation proposed design is only 8 mV (i.e., 0.8% VDD). Also, it only The proposed power controller demonstrates a great requires the VDDML net to have a width of only 150 nm reduction in the transient peak current. This can be instead of 2 μm vertical VDD. The new vertical VDD now explained by the bottleneck effect of transistor Px. Fig.8 only supply the leakage current to the SRAM cell and thus shows the transient current as a function of the number of does not require a large metal width. mismatches occurring in a row of 128 CAM cells during the COMPARE cycle of the proposed and the conventional B. Dynamic Power Consumption designs. The conventional design’s peak current increases Because the power-gated transistor is turned off after the almost linearly from 25 μA (1 mismatch) to 1.45 mA (64 output is obtained at the sense amplifier, the proposed mismatches) and finally 2.8 mA (128 mismatches). technique renders lower average power consumption. This Although the overall transient ML charge up current of the is mainly due to the reduced voltage swing on the ML bus. proposed design also increases with the number of Another contributing factor to the reduced average power mismatches, it will soon reach its limit due to the presence consumption is that the new design does not need to preof the gated-power transistor Px. For instance, when 128 charge the SL buses because the EN signal turns off mismatches occurs, the peak current is capped at 155 μA, transistor Px of each row and hence the SL buses do not which is less than eight times as compared to the case when need to be pre-charged, which in turn saves 50% power on only one mismatch occurs (i.e., 21 μA). This drastic the SL buses. Fig.9 illustrates the average energy reduction in the peak current translates to a vast consumption (divided into ML power and SL power) of the proposed design as compared to other three benchmark improvement in operation reliability. designs, including all the power overhead of the control circuitry. Since [6], [7], and the proposed design do not precharge the SLs before each compare cycle, their SLs energy consumption is only half of that of the conventional circuit. As for the ML energy, at 1V supply voltage the proposed design only dissipates 0.41 fJ/search/bit while that of the conventional design is 1.148 fJ/search/bit. Our ML energy consumption is higher than that of [6] (10.8%) and [7] (32%) but as will be shown below, our proposed design is much more robust against process and environment variations. Fig.8. Simulated transient current occurred on a row of 128 CAM cells during the compare cycle of the (left) conventional and the (right) proposed designs. C. Supply Voltage Scaling Analysis We investigate the ability of the four designs to work at low supply voltage, by re-implementing the designs in [6], [7] and the conventional one into the same 65-nm technology. Fig.9. Total average energy consumption of the four designs in consideration. Fig.10. (a) Search energy per bit. (b) ML sensing delay of the four designs in consideration against supply voltage scaling from 1 to 0.5 V. Sensing delay is defined as the sensing delay of the 1-mismatch ML, i.e., the worst-case scenario. International Journal of Scientific Engineering and Technology Research Volume.04, IssueNo.01, January-2015, Pages: 0088-0094 PERIKALA MAHESH BABU, M. LAVANYA LATHA, SHARAD KULKARNI Fig.11. Standby leakage current of the four designs in consideration against supply voltage scaling from 1 to 0.5 V. Designs in [6] and [7] demonstrate poor adaptability to voltage scaling. They cannot operate at a supply voltage lower than 0.9V. On the contrary, when the supply voltage scales to 0.5V, both the proposed and the conventional design can work well. First, the search energy of the four designs in consideration is presented in Fig. 10(a). It can be seen that at 1V supply voltage, [6] and [7] have the lowest energy consumption per search, followed by the proposed design. However, they cease to work when the supply voltage scales down to below 0.9V. Between the conventional and the proposed design, the proposed design consumes 62% less power consumption at any supply voltage value. Second, the sensing delay comparison is shown in Fig.10 where the proposed design has 39% improvement when compared to the conventional design and is the fastest design. This figure also suggests that sensing delay increases dramatically when supply voltage enters the near-sub threshold region. Finally, the corresponding leakage a current of the four designs against voltage scaling is shown in Fig. 11. The proposed design is the second-best circuit after the conventional design. Both of them have about 20% and 37% lower leakage current when compared to [6] and [7] at 1V, respectively. This feature confirms that the proposed design is more suitable for ultra-low power applications in 65-nm CMOS process and beyond. D. Temperature Variation Analysis We also carry out the temperature variation analysis on the four designs (see Fig. 12). It can be seen that the [7] is the most vulnerable design and thus can only work in a narrow range of temperature variation. [6] Can work throughout the whole temperature range but having more than 30% speed fluctuation. In contrast, the proposed and the conventional design are much more stable with less than 4% sensing delay variation. Fig.12. ML sensing delay of the four designs in consideration against temperature variations. Sensing delay is defined as the sensing delay of the 1-mismatch ML, i.e., the worst-case scenario. E. Process Variation Analysis Process variation is a critical issue in nano-scale CMOS technologies. We simulate the performance of the proposed design against empirical process variation data from the foundry. It is worth mentioning here that the feedback loop to turn off the gated-power transistor Px operates digitally and hence is almost insensitive to process variations. Similar to the conventional design, there are two scenarios where the proposed design may sense the results wrongly: 1) the sense amplifier is enabled too early, the 1-mismatch ML has not been pulled up to a voltage higher than the threshold value and thus trigger the output inverter and 2) the delay of the enable signal is too long, resulting in the matched ML to be pulled up by the leakage current, indicating wrong miss. We use 50000-cycle Monte Carlo simulations on these designs at different supply voltages and count the number of errors accordingly. The [6] and [7] are very sensitive to process variations with more than 1000 and 10 000 errors count, respectively. Also, they stop working at 0.9V supply. On the contrary, the proposed and the conventional design have no sensing error even if more than 30% speed fluctuation. In contrast, the proposed and the conventional design are much more stable with less than 4% sensing delay variation. VDD scales down to 0.7V. At lower supply voltage, the conventional design continues to work 100% correctly while the proposed design has 51 and 298 error counts at 0.6 and 0.5V, respectively. This is because both a design operates at the same frequency but the proposed design has a smaller pull-up current due to the gated-power transistor Px and hence sometimes error happens. We have carried out a separate simulation for the proposed design with a slightly slower frequency and have confirmed that no error occurs. It is worth mentioning here that extending the period for [6] and [7] does not result in any error count International Journal of Scientific Engineering and Technology Research Volume.04, IssueNo.01, January-2015, Pages: 0088-0094 Design and Implementation High Speed Low Power Cam with A Parity Bit and Power-Gated ML Sensing reduction since these designs are based on feedback loop structure and decisions are made at the very beginning of the sensing cycle. V. CONCLUSION We proposed an effective gated-power technique and a parity-bit based architecture that offer several major advantages, namely reduced peak current (and thus IR drop), average power consumption (36%), boosted search speed (39%) and improved process variation tolerance. It is much more stable than recently published designs while maintain their low-power consumption property. When compared to the conventional design, its stability is degraded by 0.6% only at extremely low supply voltages. At 1V operating condition, both designs are equally stable with no sensing errors, according to our Monte Carlo simulations. Its area overhead is about 11%. It is therefore the most suitable design for implementing high capacity parallel CAM in sub-65-nm CMOS technologies. VI. REFERENCES [1] Anh-Tuan Do, Shoushun Chen, Zhi-Hui Kong, and Kiat Seng Yeo, “A High Speed Low Power CAM with a Parity Bit and Power-Gated ML Sensing”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 21, No. 1, January 2013. [2] K. Pagiamtzis and A. Sheikholeslami, “Contentaddressable memory (CAM) circuits and architectures: A tutorial and survey,” IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 712–727, Mar. 2006. [3] A. T. Do, S. S. Chen, Z. H. Kong, and K. S. Yeo, “A low-power CAM with efficient power and delay trade-off,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2011, pp. 2573–2576. [4] I. Arsovski and A. Sheikholeslami, “A mismatchdependent power allocation technique for match-line sensing in content-addressable memories,” IEEE J. SolidState Circuits, vol. 38, no. 11, pp. 1958–1966, Nov. 2003. [5] N. Mohan and M. Sachdev, “Low-leakage storage cells for ternary content addressable memories,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 5, pp. 604–612, May 2009. [6] O. Tyshchenko and A. Sheikholeslami, “Match sensing using match line stability in content addressable memory’s (CAM),” IEEE J. Solid-State Circuits, vol. 43, no. 9, pp. 1972–1981, Sep. 2008. [7] N. Mohan, W. Fung, D. Wright, and M. Sachdev, “A low-power ternary CAM with positive-feedback match-line sense amplifiers,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 3, pp. 566–573, Mar. 2009. [8] S. Baeg, “Low-power ternary content-addressable memory design using a segmented match line,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 6, pp. 1485–1494, Jul. 2008. [9] K. Pagiamtzis and A. Sheikholeslami, “A low-power content-addressable memory (CAM) using pipelined hierarchical search scheme,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1512–1519, Sep. 2004. International Journal of Scientific Engineering and Technology Research Volume.04, IssueNo.01, January-2015, Pages: 0088-0094

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Design and Implementation High Speed Low Power Cam