www.it-ebooks.info Cassandra High Availability www.it-ebooks.info Table of Contents Cassandra High Availability Credits About the Author About the Reviewers www.PacktPub.com Support files, eBooks, discount offers, and more Why subscribe? Free access for Packt account holders Preface What this book covers What you need for this book Who this book is for Conventions Reader feedback Customer support Errata Piracy Questions 1. Cassandra’s Approach to High Availability ACID The monolithic architecture The master-slave architecture Sharding Master failover Cassandra’s solution Cassandra’s architecture Distributed hash table Replication Replication across data centers Tunable consistency The CAP theorem Summary 2. Data Distribution Hash table fundamentals Distributing hash tables Consistent hashing The mechanics of consistent hashing Token assignment Manually assigned tokens www.it-ebooks.info vnodes How vnodes improve availability Adding and removing nodes Node rebuilding Heterogeneous nodes Partitioners Hotspots Effects of scaling out using ByteOrderedPartitioner A time-series example Summary 3. Replication The replication factor Replication strategies SimpleStrategy NetworkTopologyStrategy Snitches Maintaining the replication factor when a node fails Consistency conflicts Consistency levels Repairing data Balancing the replication factor with consistency Summary 4. Data Centers Use cases for multiple data centers Live backup Failover Load balancing Geographic distribution Online analysis Analysis using Hadoop Analysis using Spark Data center setup RackInferringSnitch PropertyFileSnitch GossipingPropertyFileSnitch Cloud snitches Replication across data centers Setting the replication factor Consistency in a multiple data center environment The anatomy of a replicated write Achieving stronger consistency between data centers Summary 5. Scaling Out www.it-ebooks.info Choosing the right hardware configuration Scaling out versus scaling up Growing your cluster Adding nodes without vnodes Adding nodes with vnodes How to scale out Adding a data center How to scale up Upgrading in place Scaling up using data center replication Removing nodes Removing nodes within a data center Decommissioning a data center Other data migration scenarios Snitch changes Summary 6. High Availability Features in the Native Java Client Thrift versus the native protocol Setting up the environment Connecting to the cluster Executing statements Prepared statements Batched statements Caution with batches Handling asynchronous requests Running queries in parallel Load balancing Failing over to a remote data center Downgrading the consistency level Defining your own retry policy Token awareness Tying it all together Falling back to QUORUM Summary 7. Modeling for High Availability How Cassandra stores data Implications of a log-structured storage Understanding compaction Size-tiered compaction Leveled compaction Date-tiered compaction CQL under the hood Single primary key Compound keys www.it-ebooks.info Partition keys Clustering columns Composite partition keys The importance of the storage model Understanding queries Query by key Range queries Denormalizing with collections How collections are stored Sets Lists Maps Working with time-series data Designing for immutability Modeling sensor data Queries Time-based ordering Using a sentinel value Satisfying our queries When time is all that matters Working with geospatial data Summary 8. 
Antipatterns Multikey queries Secondary indices Secondary indices under the hood Distributed joins Deleting data Garbage collection Resurrecting the dead Unexpected deletes The problem with tombstones Expiring columns TTL antipatterns When null does not mean empty Cassandra is not a queue Unbounded row growth Summary 9. Failing Gracefully Knowledge is power Monitoring via Java Management Extensions Using OpsCenter Choosing a management toolset Logging www.it-ebooks.info Cassandra logs Garbage collector logs Monitoring node metrics Thread pools Column family statistics Finding latency outliers Communication metrics When a node goes down Marking a downed node Handling a downed node Handling slow nodes Backing up data Taking a snapshot Incremental backups Restoring from a snapshot Summary Index www.it-ebooks.info Cassandra High Availability www.it-ebooks.info Cassandra High Availability Copyright © 2014 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: December 2014 Production reference: 1221214 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78398-912-6 www.packtpub.com www.it-ebooks.info Credits Author Robbie Strickland Reviewers Richard Low Jimmy Mårdell Rob Murphy Russell Spitzer Commissioning Editor Kunal Parikh Acquisition Editors Richard Harvey Owen Roberts Content Development Editors Samantha Gonsalves Azharuddin Sheikh Technical Editor Ankita Thakur Copy Editors Pranjali Chury Merilyn Pereira Project Coordinator Sanchita Mandal Proofreaders Simran Bhogal Maria Gould Ameesha Green Paul Hindle Indexer Rekha Nair Graphics www.it-ebooks.info Sheetal Aute Disha Haria Abhinash Sahu Production Coordinator Alwin Roy Cover Work Alwin Roy www.it-ebooks.info About the Author Robbie Strickland got involved in the Apache Cassandra project in 2010, and he initially went into production with the 0.5 release. He has made numerous contributions over the years, including his work on drivers for C# and Scala, and multiple contributions to the core Cassandra codebase. In 2013, he became the very first certified Cassandra developer, and in 2014, DataStax selected him as an Apache Cassandra MVP. While this is Robbie’s first published technical book, he has been an active speaker and writer in the Cassandra community and is the founder of the Atlanta Cassandra Users Group. Other examples of his writing can be found on the DataStax blog, and he has conducted numerous webinars and spoken at many conferences over the years. 
I would like to thank my wife for encouraging me to go forward with this project and for continuing to be supportive throughout the significant time commitment required to write a book. Also, I am truly appreciative of my excellent reviewers: Richard Low, Jimmy Mårdell, Rob Murphy, and Russell Spitzer. They helped keep me honest, and their deep expertise added materially to the quality of the content. I would also like to thank the entire staff at Packt Publishing who were involved in the book’s publishing process. Lastly, I want to thank Logan Johnson who initially pointed me toward Cassandra. The risk has paid off, and Logan is responsible for starting me off on this path. www.it-ebooks.info About the Reviewers Richard Low has worked with Cassandra since Version 0.6 and has managed and supported some of the largest Cassandra deployments. He has contributed fixes and features to the project and has helped many users build their first Cassandra deployment. He is a regular speaker at Cassandra events and a contributor to Cassandra online forums. Jimmy Mårdell is a senior software engineer and Cassandra contributor who has spent the last 4 years working with large distributed systems using Cassandra. Since 2013, he has been leading a database infrastructure team at Spotify, focusing on improving the Cassandra ecosystem at Spotify and empowering other teams to operate Cassandra clusters. Jimmy likes algorithms and competitive programming and won the programming competition Google Code Jam in 2003. Rob Murphy is a solutions engineer at DataStax with more than 16 years of experience in the field of data-driven application development and design. Rob’s background includes work with most RDMS platforms as well as DataStax/Apache Cassandra, Hadoop, MongoDB, Apache Accumulo, and Apache Spark. His passion for solving “data problems” goes beyond the system level to the data itself. Rob has a Master’s degree in Predictive Analytics from Northwestern University with a specific research interest in machine learning and predictive algorithms at the “Internet scale”. Russell Spitzer received his PhD in Bioinformatics from UCSF in 2013, where he became increasingly interested in data analytics and distributed computation. He followed these interests and joined DataStax, the enterprise company behind the Apache Cassandra distributed database. At DataStax, he works on the testing and development of the integration between Cassandra and other groundbreaking open source technologies, such as Spark, Solr, and Hadoop. I would like to thank my wife, Maggie, who put up with a lot of late-night laptop screen glow so that I could help out with this book. www.it-ebooks.info www.PacktPub.com www.it-ebooks.info Support files, eBooks, discount offers, and more For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. https://www2.packtpub.com/books/subscription/packtlib Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital book library. 
Here, you can search, access, and read Packt’s entire library of books. www.it-ebooks.info Why subscribe? Fully searchable across every book published by Packt Copy and paste, print, and bookmark content On demand and accessible via a web browser www.it-ebooks.info Free access for Packt account holders If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access. www.it-ebooks.info Preface Cassandra is a fantastic data store and is certainly well suited as the foundation for a highly available system. In fact, it was built for such a purpose: to handle Facebook’s messaging service. However, it hasn’t always been so easy to use, with its early Thrift interface and unfamiliar data model causing many potential users to pause—and in many cases for a good reason. Fortunately, Cassandra has matured substantially over the last few years. I used to advise people to use Cassandra only if nothing else would do the job because the learning curve for it was quite high. However, the introduction of newer features such as CQL and vnodes has changed the game entirely. What once appeared complex and overly daunting now comes across as deceptively simple. A SQL-like interface masks the underlying data structure, whose familiarity can lure an unsuspecting new user into dangerous traps. The moral of this story is that it’s not a relational database, and you still need to know what it’s doing under the hood. Imparting this knowledge is the core objective of this book. Each chapter attempts to demystify the inner workings of Cassandra so that you no longer have to work blindly against a black box data store. You will learn to configure, design, and build your system based on a fundamentally solid foundation. The good news is that Cassandra makes the task of building massively scalable and incredibly reliable systems relatively straightforward, presuming you understand how to partner with it to achieve these goals. Since you are reading this book, I presume you are either already using Cassandra or planning to do so, and that you’re interested in building a highly available system on top of it. If so, I am confident that you will meet with success if you follow the principles and guidelines offered in the chapters that follow. www.it-ebooks.info What this book covers Chapter 1, Cassandra’s Approach to High Availability, is an introduction to concepts related to system availability and the problems that have been encountered historically while trying to make data stores highly available. This chapter outlines Cassandra’s solutions to these problems. Chapter 2, Data Distribution, outlines the core mechanisms that underlie Cassandra’s distributed hash table model, including consistent hashing and partitioner implementations. Chapter 3, Replication, offers an in-depth look at the data replication architecture used in Cassandra, with a focus on the relationship between consistency levels and replication factors. Chapter 4, Data Centers, enables you to thoroughly understand Cassandra’s robust data center replication capabilities, including deployment on EC2 and building separate clusters for analysis using Hadoop or Spark. Chapter 5, Scaling Out, is a discussion on the tools, processes, and general guidance required to properly increase the size of your cluster. 
Chapter 6, High Availability Features in the Native Java Client, covers the new native Java driver and its availability-related features. We'll discuss node discovery, cluster-aware load balancing, automatic failover, and other important concepts.

Chapter 7, Modeling for High Availability, explains the important concepts you need to understand while modeling highly available data in Cassandra. CQL, keys, wide rows, and denormalization are among the topics that will be covered.

Chapter 8, Antipatterns, complements the data modeling chapter by presenting a set of common antipatterns that proliferate among inexperienced Cassandra developers. These include queues, joins, high delete volumes, and high-cardinality secondary indexes, among others.

Chapter 9, Failing Gracefully, helps the reader to understand how to deal with various failure cases, as failure in a large distributed system is inevitable. We'll examine a number of possible failure scenarios and discuss how to detect and resolve them.

What you need for this book

This book assumes you have access to a running Cassandra installation that's at least as new as release 1.2.x. Some features discussed will be applicable only to the 2.0.x series, and we will point these out when this applies. Users of versions older than 1.2.x can still gain a lot from the content, but there will be some portions that do not directly translate to those versions.

For Chapter 6, High Availability Features in the Native Java Client, which covers the Java driver, you will need the Java Development Kit 1.7 and a suitable text editor to write Java code. All command-line examples assume a Linux environment, since this is the only supported operating system for use with a production Cassandra system.

Who this book is for

This book is for developers and system administrators who are interested in building an advanced understanding of Cassandra's internals for the purpose of deploying high availability services using it as a backing data store. This is not an introduction to Cassandra, so those who are completely new would be well served to find a suitable tutorial before diving into this book.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The PropertyFileSnitch configuration allows an administrator to precisely configure the topology of the network by means of a properties file named cassandra-topology.properties."

A block of code is set as follows:

CREATE KEYSPACE AddressBook
WITH REPLICATION = {
  'class' : 'NetworkTopologyStrategy',
  'dc1' : 3,
  'dc2' : 2
};

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

CREATE KEYSPACE AddressBook
WITH REPLICATION = {
  'class' : 'SimpleStrategy',
  'replication_factor' : 3
};

Any command-line input or output is written as follows:

# nodetool status

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Then, fill in the host, port, and your credentials in the dialog box and click on the Connect button."

Note
Warnings or important notes appear in a box like this.
Tip Tips and tricks appear like this. www.it-ebooks.info Reader feedback Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of. To send us general feedback, simply send an e-mail to , and mention the book title through the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors. www.it-ebooks.info Customer support Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase. www.it-ebooks.info Errata Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section. www.it-ebooks.info Piracy Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at with a link to the suspected pirated material. We appreciate your help in protecting our authors, and our ability to bring you valuable content. www.it-ebooks.info Questions If you have a problem with any aspect of this book, you can contact us at , and we will do our best to address the problem. www.it-ebooks.info Chapter 1. Cassandra’s Approach to High Availability What does it mean for a data store to be “highly available”? When designing or configuring a system for high availability, architects typically hope to offer some guarantee of uptime even in the presence of failure. Historically, it has been sufficient for the vast majority of systems to be available for less than 100 percent of the time, with some attempting to achieve the “five nines”, or 99.999, percent uptime. The exact definition of high availability depends on the requirements of the application. This concept has gained increasing significance in the context of web applications, real- time systems, and other use cases that cannot afford any downtime. Database systems must not only guarantee system uptime, the ability to fulfill requests, but also ensure that the data itself remains available. Traditionally, it has been difficult to make databases highly available, especially the relational database systems that have dominated the scene for the last couple of decades. 
These systems are most often designed to run on a single large machine, making it challenging to scale out to multiple machines. Let’s examine some of the reasons why many popular database systems have difficulty being deployed in high availability configurations, as this will allow us to have a greater understanding of the improvements that Cassandra offers. Exploring these reasons can help us to put aside previous assumptions that simply don’t translate to the Cassandra model. Therefore, in this chapter, we’ll cover the following topics: The atomicity, consistency, isolation and durability (ACID) properties Monolithic architecture Master-slave architecture, covering sharding and leader election Cassandra’s approach to achieve high availability www.it-ebooks.info ACID One of the most significant obstacles that prevents traditional databases from achieving high availability is that they attempt to strongly guarantee the ACID properties: Atomicity: This guarantees that database updates associated with a transaction occur in an all-or-nothing manner. If some part of the transaction fails, the state of the database remains unchanged. Consistency: This assures that the integrity of data will be preserved across all instances of that data. Changes to a value in one location will definitely be reflected in all other locations. Isolation: This attempts to ensure that concurrent transactions that manipulate the same data do so in a controlled manner, essentially isolating in-process changes from other clients. Most traditional relational database systems provide various levels of isolation with different guarantees at each level. Durability: This ensures that all writes are preserved in nonvolatile storage, most commonly on disk. Database designers most commonly achieve these properties via write masters, locks, elaborate storage area networks, and the like—all of which tend to sacrifice availability. As a result, achieving some semblance of high availability frequently involves bolt-on components, log shipping, leader election, sharding, and other such strategies that attempt to preserve the original design. www.it-ebooks.info The monolithic architecture The simplest design approach to guarantee ACID properties is to implement a monolithic architecture where all functions reside on a single machine. Since no coordination among nodes is required, the task of enforcing all the system rules is relatively straightforward. Increasing availability in such architectures typically involves hardware layer improvements, such as RAID arrays, multiple network interfaces, and hot-swappable drives. However, the fact remains that even the most robust database server acts as a single point of failure. This means that if the server fails, the application becomes unavailable. This architecture can be illustrated with the following diagram: A common means of increasing capacity to handle requests on a monolithic architecture is to move the storage layer to a shared component such as a storage area network (SAN) or network attached storage (NAS). Such devices are usually quite robust with large numbers of disks and high-speed network interfaces. This approach is shown in a modification of the previous diagram, which depicts two database servers using a single NAS. www.it-ebooks.info You’ll notice that while this architecture increases the overall request handling capacity of the system, it simply moves the single failure point from the database server to the storage layer. 
As a result, there is no real improvement from an availability perspective. www.it-ebooks.info The master-slave architecture As distributed systems have become more commonplace, the need for higher capacity distributed databases has grown. Many distributed databases still attempt to maintain ACID guarantees (or in some cases only the consistency aspect, which is the most difficult in a distributed environment), leading to the master-slave architecture. In this approach, there might be many servers handling requests, but only one server can actually perform writes so as to maintain data in a consistent state. This avoids the scenario where the same data can be modified via concurrent mutation requests to different nodes. The following diagram shows the most basic scenario: However, we still have not solved the availability problem, as a failure of the write master would lead to application downtime. It also means that writes do not scale well, since they are all directed to a single machine. www.it-ebooks.info Sharding A variation on the master-slave approach that enables higher write volumes is a technique called sharding, in which the data is partitioned into groups of keys, such that one or more masters can own a known set of keys. For example, a database of user profiles can be partitioned by the last name, such that A-M belongs to one cluster and N-Z belongs to another, as follows: An astute observer will notice that both master-slave and sharding introduce failure points on the master nodes, and in fact the sharding approach introduces multiple points of failure—one for each master! Additionally, the knowledge of where requests for certain keys go rests with the application layer, and adding shards requires manual shuffling of data to accommodate the modified key ranges. Some systems employ shard managers as a layer of abstraction between the application and the physical shards. This has the effect of removing the requirement that the application must have knowledge of the partition map. However, it does not obviate the need for shuffling data as the cluster grows. www.it-ebooks.info Master failover A common means of increasing availability in the event of a failure on a master node is to employ a master failover protocol. The particular semantics of the protocol vary among implementations, but the general principle is that a new master is appointed when the previous one fails. Not all failover algorithms are equal; however, in general, this feature increases availability in a master-slave system. Even a master-slave database that employs leader election suffers from a number of undesirable traits: Applications must understand the database topology Data partitions must be carefully planned Writes are difficult to scale A failover dramatically increases the complexity of the system in general, and especially so for multisite databases Adding capacity requires reshuffling data with a potential for downtime Considering that our objective is a highly available system, and presuming that scalability is a concern, are there other options we need to consider? www.it-ebooks.info Cassandra’s solution The reality is that not every transaction in every application requires full ACID guarantees, and ACID properties themselves can be viewed as more of a continuum where a given transaction might require different degrees of each property. Cassandra’s approach to availability takes this continuum into account. 
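Before turning to Cassandra's answer, a brief sketch makes the sharding burden described above concrete. This is a hypothetical application-side router (the host names and boundaries are invented for illustration, not taken from the book): the application itself must know the partition map, and changing the boundaries later means manually moving data between shards.

import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical application-side shard router for a user-profile database
// partitioned by last name. Assumes names start with A-Z.
public class ShardRouter {
    // Shard boundaries keyed by the first letter each shard starts at.
    private final NavigableMap<Character, String> shardMap = new TreeMap<>();

    public ShardRouter() {
        shardMap.put('A', "shard-1.example.com"); // A-M
        shardMap.put('N', "shard-2.example.com"); // N-Z
    }

    public String shardFor(String lastName) {
        char first = Character.toUpperCase(lastName.charAt(0));
        // The application, not the database, decides where the key lives.
        return shardMap.floorEntry(first).getValue();
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter();
        System.out.println(router.shardFor("Smith"));    // shard-2.example.com
        System.out.println(router.shardFor("Anderson")); // shard-1.example.com
    }
}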
In contrast to its relational predecessors—and even most of its NoSQL contemporaries—its original architects considered availability as a key design objective, with the intent to achieve the elusive goal of 100 percent uptime. Cassandra provides numerous knobs that give the user highly granular control of the ACID properties, all with different trade-offs. The remainder of this chapter offers an introduction to Cassandra’s high availability attributes and features, with the rest of the book devoted to help you to make use of these in real-world applications. www.it-ebooks.info Cassandra’s architecture Unlike either monolithic or master-slave designs, Cassandra makes use of an entirely peer- to-peer architecture. All nodes in a Cassandra cluster can accept reads and writes, no matter where the data being written or requested actually belongs in the cluster. Internode communication takes place by means of a gossip protocol, which allows all nodes to quickly receive updates without the need for a master coordinator. This is a powerful design, as it implies that the system itself is both inherently available and massively scalable. Consider the following diagram: www.it-ebooks.info Note that in contrast to the monolithic and master-slave architectures, there are no special nodes. In fact, all nodes are essentially identical, and as a result Cassandra has no single point of failure—and therefore no need for complex sharding or leader election. But how does Cassandra avoid sharding? www.it-ebooks.info Distributed hash table Cassandra is able to achieve both availability and scalability using a data structure that allows any node in the system to easily determine the location of a particular key in the cluster. This is accomplished by using a distributed hash table (DHT) design based on the Amazon Dynamo architecture. As we saw in the previous diagram, Cassandra’s topology is arranged in a ring, where each node owns a particular range of data. Keys are assigned to a specific node using a process called consistent hashing, which allows nodes to be added or removed without having to rehash every key based on the new range. The node that owns a given key is determined by the chosen partitioner. Cassandra ships with several partitioner implementations or developers can define their own by implementing a Java interface. These topics will be covered in greater detail in the next chapter. www.it-ebooks.info Replication One of the most important aspects of a distributed data store is the manner in which it handles replication of data across the cluster. If each partition were only stored on a single node, the system would effectively possess many single points of failure, and a failure of any node could result in catastrophic data loss. Such systems must therefore be able to replicate data across multiple nodes, making the occurrence of such loss less likely. Cassandra has a sophisticated replication system, offering rack and data center awareness. This means it can be configured to place replicas in such a manner so as to maintain availability even during otherwise catastrophic events such as switch failures, network partitions, or data center outages. Cassandra also includes a mechanism that maintains the replication factor during node failures. Replication across data centers Perhaps the most unique feature Cassandra provides to achieve high availability is its multiple data center replication system. 
This system can be easily configured to replicate data across either physical or virtual data centers. This facilitates geographically dispersed data center placement without complex schemes to keep data in sync. It also allows you to create separate data centers for online transactions and heavy analysis workloads, while allowing data written in one data center to be immediately reflected in others. Chapters 3, Replication, and Chapter 4, Data Centers, will provide a complete discussion of Cassandra’s extensive replication features. www.it-ebooks.info Tunable consistency Closely related to replication is the idea of consistency, the C in ACID that attempts to keep replicas in sync. Cassandra is often referred to as an eventually consistent system, a term that can cause fear and trembling for those who have spent many years relying on the strong consistency characteristics of their favorite relational databases. However, as previously discussed, consistency should be thought of as a continuum, not as an absolute. With this in mind, Cassandra can be more accurately described as having tunable consistency, where the precise degree of consistency guarantee can be specified on a per- statement level. This gives the application architect ultimate control over the trade-offs between consistency, availability, and performance at the call level—rather than forcing a one-size-fits-all strategy onto every use case. The CAP theorem Any discussion of consistency would be incomplete without at least reviewing the CAP theorem. The CAP acronym refers to three desirable properties in a replicated system: Consistency: This means that the data should appear identical across all nodes in the cluster Availability: This means that the system should always be available to receive requests Partition tolerance: This means that the system should continue to function in the event of a partial failure In 2000, computer scientist Eric Brewer from the University of California, Berkeley, posited that a replicated service can choose only two of the three properties for any given operation. The CAP theorem has been widely misappropriated to suggest that entire systems must choose only two of the properties, which has led many to characterize databases as either AP or CP. In fact, most systems do not fit cleanly into either category, and Cassandra is no different. Brewer himself addressed this misguided interpretation in his 2012 article, CAP Twelve Years Later: How the “Rules” Have Changed: … all three properties are more continuous than binary. Availability is obviously continuous from 0 to 100 percent, but there are also many levels of consistency, and even partitions have nuances, including disagreement within the system about whether a partition exists In that same article, Brewer also pointed out that the definition of consistency in ACID terms differs from the CAP definition. In ACID, consistency refers to the guarantee that all database rules will be followed (unique constraints, foreign key constraints, and the like). The consistency in CAP, on the other hand, as clarified by Brewer refers only to single-copy consistency, a strict subset of ACID consistency. www.it-ebooks.info Note When considering the various trade-offs of Cassandra’s consistency level options, it’s important to keep in mind that the CAP properties exist on a continuum rather than as binary choices. The bottom line is that it’s important to bear this continuum in mind when designing a system based on Cassandra. 
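To make the per-statement control described above concrete, the following sketch uses the DataStax Java driver (covered in Chapter 6) to request a different consistency level on each statement. The contact point, keyspace, and table names are placeholders rather than examples from the book.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

// Sketch: tuning consistency per statement rather than cluster-wide.
public class TunableConsistencyExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .build();
        Session session = cluster.connect("addressbook"); // placeholder keyspace

        // A critical read can demand QUORUM...
        SimpleStatement read = new SimpleStatement(
                "SELECT * FROM contacts WHERE last_name = 'Smith'");
        read.setConsistencyLevel(ConsistencyLevel.QUORUM);
        ResultSet rows = session.execute(read);

        // ...while a less critical write settles for ONE, trading consistency
        // for availability and latency on that call only.
        SimpleStatement write = new SimpleStatement(
                "INSERT INTO contacts (last_name, address) VALUES ('Jones', '12 Main St')");
        write.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(write);

        cluster.close();
    }
}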
Refer to Chapter 3, Replication, for additional details on properly tuning Cassandra’s consistency level under a variety of circumstances. www.it-ebooks.info Summary By now, you should have a solid understanding of Cassandra’s approach to availability and why the fundamental design decisions were made. In the later chapters, we’ll take a deeper look at the following ideas: Configuring Cassandra for high availability Designing highly available applications on Cassandra Avoiding common antipatterns Handling various failure scenarios By the end of this book, you should possess a solid grasp of these concepts and be confident that you’ve successfully deployed one of the most robust and scalable database platforms available today. However, we need to take it a step at a time, so in the next few chapters, we will build a deeper understanding of how Cassandra manages data. This foundation will be necessary for the topics covered later in the book. We’ll start with a discussion of Cassandra’s data placement strategy in the next chapter. www.it-ebooks.info Chapter 2. Data Distribution Cassandra’s peer-to-peer architecture and scalability characteristics are directly tied to its data placement scheme. Cassandra employs a distributed hash table data structure that allows data to be stored and retrieved by a key quickly and efficiently. Consistent hashing is the core of this strategy as it enables all nodes to understand where data exists in the cluster without complicated coordination mechanisms. In this chapter, we’ll cover the following topics: The fundamentals of distributed hash tables Cassandra’s consistent hashing mechanism Token assignment, both manual and using virtual nodes (vnodes) The implications of Cassandra’s partitioner implementations Formation of hotspots in the cluster By the time you finish this chapter, you should have a deep understanding of these concepts. Let’s begin with some basics about hash tables in general, and then we can delve deeper into Cassandra’s distributed hash table implementation. www.it-ebooks.info Hash table fundamentals Most developers have experience with hash tables in some form, as nearly all programming languages include hash table implementations. Hash tables store data by applying a hash function to the object, which determines its placement in an underlying array. While a detailed description of hashing algorithms is out of scope of this book, it is sufficient for you to understand that a hash function simply maps any input data object (which might be any size) to some expected output. While the input might be large, the output of the hash function will be a fixed number of bits. In a typical hash table design, the result of the hash function is divided by the number of array slots; the remainder then becomes the assigned slot number. Thus, the slot can be computed using hash(o) % n, where o is the object and n is the number of slots. Consider the following hash table with names as keys and addresses as values: In the preceding diagram, the values in the table on the left represent keys, which are hashed using the hash function to produce the index of the slot where the value is stored. Our input objects (John, Jane, George, and Sue), are put through the hash function, which results in an integer value. This value becomes the index in an array of street addresses. We can look up the street address of a given name by computing its hash, then accessing the resulting array index. 
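As a minimal illustration of the hash(o) % n calculation just described, the following sketch stores and retrieves values by computing a slot index. It uses Java's hashCode() as a stand-in hash function and ignores collision handling for brevity.

// slot = hash(o) % n, where o is the key and n is the number of slots.
public class SimpleHashTableDemo {
    public static void main(String[] args) {
        String[] addresses = new String[8];       // n = 8 slots
        String[] names = {"John", "Jane", "George", "Sue"};

        for (String name : names) {
            int slot = Math.abs(name.hashCode()) % addresses.length;
            addresses[slot] = name + "'s street address";
            System.out.println(name + " -> slot " + slot);
        }

        // A lookup repeats the same computation, so it lands on the same slot.
        int slot = Math.abs("Jane".hashCode()) % addresses.length;
        System.out.println("Lookup Jane: " + addresses[slot]);
    }
}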
This method works well when the number of slots is stable or when the order of the elements can be managed in a predictable way by a single owner. There are additional complexities in hash table design, specifically around avoiding hash collisions, but the basic concept remains straightforward. However, the situation gets a bit more complicated when multiple clients of the hash table need to stay in sync. All these clients need to consistently produce the same hash result even as the elements themselves might be moving around. Let’s examine the distributed hash table architecture and the means by which it solves this problem. www.it-ebooks.info Distributing hash tables When we take the basic idea of a hash table and partition it out to multiple nodes, it gives us a distributed hash table (DHT). Each node in the DHT must share the same hash function so that hash results on one node match the results on all others. In order to determine the location of a given piece of data in the cluster, we need some means to associate an object with the node that owns it. We can ask every node in the cluster, but this will be problematic for at least two important reasons: first, this strategy doesn’t scale well as the overhead will grow with the number of nodes; second, every node in the cluster will have to be available to answer requests in order to definitively state that a given item does not exist. A shared index can address this, but the result will be additional complexity and another point of failure. Therefore, a key objective of the hash function in a DHT is to map a key to the node that owns it, such that a request can be made to the correct node. However, the simple hash function discussed previously is no longer appropriate to map data to a node. The simple hash is problematic in a distributed system because n translates to the number of nodes in the cluster—and we know that n changes as nodes are added or removed. To illustrate this, we can modify our hash table to store pointers to machine IP addresses instead of street addresses. In this case, keys are mapped to a specific machine in the distributed hash table that holds the value for the key. Now, each key in the table can be mapped to its location in the cluster with a simple lookup. However, if we alter the cluster size (by adding or removing nodes), the result of the computation—and therefore the node mapping—changes for every object! Let’s see what happens when a node is removed from the cluster. www.it-ebooks.info When a node is removed from the cluster, the result is that the subsequent hash buckets are shifted, which causes the keys to point to different nodes. Note that after removing node 3, the number of buckets is reduced. As previously described, this changes the result of the hash function, causing the old mappings to become unusable. This will be catastrophic as all key lookups will point to the wrong node. www.it-ebooks.info Consistent hashing To solve the problem of locating a key in a distributed hash table, we use a technique called consistent hashing. Introduced as a term in 1997, consistent hashing was originally used as a means of routing requests among large numbers of web servers. It’s easy to see how the Web can benefit from a hash mechanism that allows any node in the network to efficiently determine the location of an object, in spite of the constant shifting of nodes in and out of the network. This is the fundamental objective of consistent hashing. 
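Before digging into the mechanics, a short sketch shows the remapping problem that consistent hashing is designed to avoid: with the naive hash(key) % n scheme, shrinking the cluster from four nodes to three moves most keys to different nodes. The key set and hash function here are illustrative only.

// Demonstrates why hash(key) % n breaks down when n (the node count) changes.
public class NaiveDistributionProblem {
    static int nodeFor(String key, int nodeCount) {
        return Math.abs(key.hashCode()) % nodeCount;
    }

    public static void main(String[] args) {
        String[] keys = {"John", "Jane", "George", "Sue", "Bob", "Alice"};

        for (String key : keys) {
            int before = nodeFor(key, 4); // four-node cluster
            int after = nodeFor(key, 3);  // one node removed
            System.out.printf("%-7s node %d -> node %d%s%n",
                    key, before, after, before == after ? "" : "  (moved!)");
        }
    }
}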
The mechanics of consistent hashing

With consistent hashing, the buckets are arranged in a ring with a predefined range; the exact range depends on the partitioner being used. Keys are then hashed to produce a value that lies somewhere along the ring. Nodes are assigned a range, which is computed as follows:

Range start    Range end
Token value    Next token value - 1

Note
The following examples assume that the default Murmur3Partitioner is used. For more information on this partitioner, take a look at the documentation at http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePartitionerM3P_c.html

For a five-node cluster, a ring with evenly distributed token ranges would look like the following diagram, presuming the default Murmur3Partitioner is used:

In the preceding diagram, the primary replica for each key is assigned to a node based on its hashed value. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive). This diagram represents data ranges (the letters) and the nodes (the numbers) that own these ranges.

It might also be helpful to visualize this in a table form, which might be more familiar to those who have used the nodetool ring command to view Cassandra's topology:

Node    Range start              Range end
1       5534023222112865485      -9223372036854775808
2       -9223372036854775807     -5534023222112865485
3       -5534023222112865484     -1844674407370955162
4       -1844674407370955161     1844674407370955161
5       1844674407370955162      5534023222112865484

When Cassandra receives a key for either a read or a write, the hash function is applied to the key to determine where it lies in the range. Since all nodes in the cluster are aware of the other nodes' ranges, any node can handle a request for any other node's range. The node receiving the request is called the coordinator, and any node can act in this role. If a key does not belong to the coordinator's range, it forwards the request to replicas in the correct range.

Following the previous example, we can now examine how our names might map to a hash, using the Murmur3 hash algorithm. Once the values are computed, they can be matched to the range of one of the nodes in the cluster, as follows:

Name      Hash value               Node assignment
John      -3916187946103363496     3
Jane      4290246218330003133      5
George    -7281444397324228783     2
Sue       -8489302296308032607     2

The placement of these keys might be easier to understand by visualizing their position in the ring.

The hash value of the name keys determines their placement in the cluster

Now that you understand the basics of consistent hashing, let's turn our focus to the mechanism by which Cassandra assigns data ranges.

Token assignment

In Cassandra terminology, the start of the hash range is called a token, and until version 1.2, each node was assigned a single token in the manner discussed in the previous section. Version 1.2 introduced the option to use vnodes, as the feature is officially termed. vnodes became the default option in the 2.0 release.

Cassandra determines where to place data using the tokens assigned to each node. Nodes learn about these token assignments via gossip. Additional replicas are then placed based on the configured replication strategy and snitch. More details about replica placement can be found in Chapter 3, Replication.
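Pulling together the token table and the name assignments above, the following sketch shows how any coordinator can resolve the owner of a hash. Following the range convention in the table, the owner is the node with the largest token less than or equal to the hash, wrapping around to the highest token when the hash falls before the first one. The tokens and hash values are copied from the preceding tables; in a real cluster they come from the configured partitioner.

import java.util.Map;
import java.util.TreeMap;

// A sketch of the token ring lookup implied by the tables above.
public class TokenRingLookup {
    private final TreeMap<Long, Integer> tokenToNode = new TreeMap<>();

    void addNode(int node, long token) { tokenToNode.put(token, node); }

    int ownerOf(long hash) {
        // Largest token <= hash owns the key...
        Map.Entry<Long, Integer> entry = tokenToNode.floorEntry(hash);
        if (entry == null) {
            entry = tokenToNode.lastEntry(); // ...wrapping around the ring.
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        TokenRingLookup ring = new TokenRingLookup();
        ring.addNode(1, 5534023222112865485L);
        ring.addNode(2, -9223372036854775807L);
        ring.addNode(3, -5534023222112865484L);
        ring.addNode(4, -1844674407370955161L);
        ring.addNode(5, 1844674407370955162L);

        System.out.println(ring.ownerOf(-3916187946103363496L)); // John   -> 3
        System.out.println(ring.ownerOf(4290246218330003133L));  // Jane   -> 5
        System.out.println(ring.ownerOf(-7281444397324228783L)); // George -> 2
        System.out.println(ring.ownerOf(-8489302296308032607L)); // Sue    -> 2
    }
}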
www.it-ebooks.info Manually assigned tokens If you’re running a version prior to 1.2 or if you have chosen not to use vnodes, you will have to assign tokens manually. This is accomplished by setting the initial_token in cassandra.yaml. Manual token assignment introduces a number of potential issues: Adding and removing nodes: When the size of the ring changes, all tokens must be recomputed and then assigned to their nodes using nodetool move. This causes a significant amount of administrative overhead for a large cluster. Node rebuilds: In case of a node rebuild, only a few nodes can participate in bootstrapping the replacement, leading to significant service degradation. We’ll discuss this in detail later in this chapter. Hotspots: In some cases, the relatively large range assigned to each node can cause hotspots if data is not evenly distributed. Heterogeneous clusters: With every node assigned a single token, the expectation is that all nodes will hold the same amount of data. Attempting to subdivide ranges to deal with nodes of varying sizes is a difficult and error-prone task. Because of these issues, the use of vnodes is highly recommended for any new installation. For existing installations, migrating to vnodes will improve the performance, reliability, and administrative requirements of your cluster, especially during topology changes and failure scenarios. Tip Use vnodes whenever possible to avoid issues with topology changes, node rebuilds, hotspots, and heterogeneous clusters. If you must continue to manually assign tokens, make sure to set the correct value for initial_token whenever any nodes are added or removed. Failure to do so will almost always result in an unbalanced ring. For information about how to generate tokens, refer to the DataStax documentation at http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configGenTokens_c.html You can then use the values you generate as the initial_token settings for your nodes, with each node getting one of the values. It’s best to always assign your tokens to the nodes in the same order to avoid unnecessary shuffling of data. www.it-ebooks.info vnodes The concept behind vnodes is straightforward. Instead of a single token assigned to each node, it is now possible to specify the number of tokens using the num_tokens configuration property in cassandra.yaml. The default value is 256, which is sufficient for most use cases. Note When using vnodes, use nodetool status instead of nodetool ring as the latter will output a row for every token across the cluster. Using nodetool status results in a much more readable output. The following diagram illustrates a cluster without vnodes compared to one with vnodes enabled: In the preceding diagram, each numbered node is represented as a slice of the ring, where the tokens are represented as letters. Note that tokens are assigned randomly. Remember that the letters represent ranges of data. You’ll notice that there are more ranges than nodes after enabling vnodes, and each node now owns multiple ranges. How vnodes improve availability While technically the cluster remains available during topology changes and node rebuilds, the level of degraded service has the potential to impact availability if the system remains under significant load. vnodes offer a simple solution to the problems associated with manually assigned tokens. Let’s examine the reasons why this is the case. www.it-ebooks.info Adding and removing nodes There are many reasons to change the size of a cluster. 
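If you do generate tokens by hand for the Murmur3Partitioner, the commonly documented approach is to space them evenly across its range of -2^63 to 2^63 - 1 using token_i = i * (2^64 / n) - 2^63. The following sketch applies that formula; treat it as an illustration and confirm the exact procedure against the DataStax documentation referenced above.

import java.math.BigInteger;

// Evenly spaced initial_token values for n nodes under the Murmur3Partitioner.
public class TokenGenerator {
    public static void main(String[] args) {
        int nodeCount = 5;
        BigInteger ringSize = BigInteger.valueOf(2).pow(64);
        BigInteger offset = BigInteger.valueOf(2).pow(63);

        for (int i = 0; i < nodeCount; i++) {
            BigInteger token = ringSize.divide(BigInteger.valueOf(nodeCount))
                    .multiply(BigInteger.valueOf(i))
                    .subtract(offset);
            // Each value becomes the initial_token setting for one node.
            System.out.println("Node " + (i + 1) + ": " + token);
        }
    }
}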
Perhaps you’re increasing capacity for an anticipated growth in data or transaction volume, or maybe you’re adding a data center for increased availability. Considering that the objective is to handle greater load or provide additional redundancy, any significant performance degradation while adding or bootstrapping a new node is unacceptable as it counteracts these goals. Often in modern high-scale applications, slow is the same as unavailable. Equally important is to ensure that new nodes receive a balanced share of the data. vnodes improve the bootstrapping process substantially because: More nodes can participate in data transfer: Since the token ranges are more dispersed throughout the cluster, adding a new node involves ranges from a greater number of the existing nodes. As a result, machines involved in the transfer end up under less load than without vnodes, thus increasing availability of those ranges. Token assignment is automatic: Cassandra handles the allocation of tokens, so there’s no need to manually recalculate and reassign a new token for every node in the cluster. As a result, the ring becomes naturally balanced on its own. Node rebuilding Rebuilding a node is a relatively common operation in a large cluster, as nodes will fail for a variety of reasons. Cassandra provides a mechanism to automatically rebuild a failed node using replicated data. When each node owns only a single token, that node’s entire data set is replicated to a number of nodes equal to the replication factor minus one. For example, with a replication factor of three, all the data on a given node will be replicated to two other nodes (replication will be covered in detail in Chapter 3, Replication). However, Cassandra will only use one replica in the rebuild operation. So in this case, a rebuild operation involves three nodes, placing a high load on all three. Imagine that we have a six-node cluster, and node 2 has failed, requiring a rebuild. In the following diagram, note that each node only contains replicas for three tokens, preventing two of the nodes from participating in the rebuild: www.it-ebooks.info In the rebuilding of node 2, only nodes 1, 3, and 4 can participate because they contain the required replicas. We can assume that reads and writes continue during this process. With one node down and three working hard to rebuild it, we now have only two out of six nodes operating at full capacity! Even worse, token ranges A and B reside entirely on nodes that are being taxed by this process, which can result in overburdening the entire cluster due to slow response times for these operations. vnodes provide significant benefits over manual token management for the rebuild process, as they distribute the load over many more nodes. This concept is the same as the benefit gained during the bootstrapping process. Since each node contains replicas for a larger (and random) variety of the available tokens, Cassandra can use these replicas in the rebuild process. Consider the following diagram of the same rebuild using vnodes: www.it-ebooks.info With vnodes, all nodes can participate in rebuilding node 2 because the tokens are spread more evenly across the cluster. In the preceding diagram, you can see that rebuilding node 2 now involves the entire cluster, thus distributing the workload more evenly. This means each individual node is doing less work than without vnodes, resulting in greater operational stability. 
Heterogeneous nodes While it might be straightforward to initially build your Cassandra cluster with machines that are all identical, at some point older machines will need to be replaced with newer ones. This can create issues while manually assigning tokens since it can become difficult to effectively choose the right tokens to produce a balanced result. This is especially problematic when adding or removing nodes, as it would become necessary to recompute the tokens to achieve a proper balance. vnodes ease this effort by allowing you to specify a number of tokens, instead of having to determine specific ranges. It is much easier to choose a proportionally larger number for newer, more powerful nodes than it is to determine proper token ranges. www.it-ebooks.info Partitioners You might recall from the earlier discussion of distributed hash tables that keys are mapped to nodes via an implementation-specific hash function. In Cassandra’s architecture, this function is determined by the partitioner you choose. This is a cluster- wide setting specified in cassandra.yaml. As of version 1.2, there are three options: Murmur3Partitioner: This produces an even distribution of data across the cluster using the MurmurHash algorithm. This is the default as of version 1.2, and should not be changed as it is measurably faster than the RandomPartitioner. RandomPartitioner: This is similar to the Murmur3Partitioner, except that it computes an MD5 hash. This was the default prior to version 1.2. ByteOrderedPartitioner: This places keys in byte order (lexically) around the ring. This partitioner should generally be avoided for reasons explained in this section. The only reason to switch from the default Murmur3Partitioner to ByteOrderedPartitioner would be to enable range queries on keys (range queries on columns are always possible). However, this decision must be carefully weighed as there is a high likelihood that you’ll end up with hotspots. www.it-ebooks.info Hotspots Let’s assume, for example, that you’re storing an address book, where the keys represent the last name of the contact. You want to use ByteOrderedPartitioner so you can search for all names between Smith and Watson. Using 2000 United States Census data as a guide, let’s assume the distribution is as follows: As one would expect, last names in the United States are not evenly distributed by the first letter. In fact, the distribution is quite uneven and this imbalance translates directly to the data stored in Cassandra. If we presume that each node owns a subset of the keys alphabetically, the result will resemble the following diagram: www.it-ebooks.info When using the ByteOrderedPartitioner, a table with the last name as the key is likely to result in uneven data distribution. The preceding diagram clearly shows that the resulting distribution produces hotspots in nodes 1 and 4, while node 5 is significantly underutilized. One perhaps less obvious side effect of this imbalance is the impact on reads and writes. If we presume that both reads and writes follow the same distribution as the data itself (which is a logical assumption in this specific case), the heavier data nodes will also be required to handle more operations than the lighter data nodes. Effects of scaling out using ByteOrderedPartitioner As is often the case in large systems, scaling out does not help to address this problem. In fact, the imbalance only gets worse when nodes are added. 
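Before scaling the example out, here is a small illustrative sketch of the same effect: bucketing a set of common surnames by their first letter (which is effectively what the ByteOrderedPartitioner does) versus by a hash of the whole key. The name list and hash function are stand-ins rather than real partitioner output, but the skew pattern mirrors the census-based example above.

import java.util.Arrays;

// Contrasts lexical (byte-ordered) placement with hash-based placement
// of last-name keys across five nodes.
public class HotspotDemo {
    public static void main(String[] args) {
        String[] lastNames = {"Adams", "Baker", "Brown", "Davis", "Garcia",
                "Harris", "Jackson", "Johnson", "Jones", "Martin", "Miller",
                "Moore", "Robinson", "Smith", "Taylor", "Thomas", "Thompson",
                "Walker", "White", "Williams", "Wilson", "Young"};
        int nodes = 5;

        int[] ordered = new int[nodes];
        int[] hashed = new int[nodes];
        for (String name : lastNames) {
            // Byte-ordered: bucket by the first letter's position in A-Z.
            int letterIndex = name.charAt(0) - 'A';        // 0..25
            ordered[letterIndex * nodes / 26]++;
            // Hashed: bucket by a hash of the whole key.
            hashed[Math.abs(name.hashCode()) % nodes]++;
        }

        System.out.println("Byte-ordered: " + Arrays.toString(ordered));
        System.out.println("Hashed:       " + Arrays.toString(hashed));
    }
}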
Using the same data distribution from the previous example, let's increase the size of the cluster to 13 nodes to illustrate this point:

The effects of hotspotting increase with the cluster size

Obviously, we now have a significant problem. While in the five-node cluster only one node was significantly underutilized, the larger cluster has eight out of 13 nodes doing half or less than half of the work compared to the other nodes! In fact, two of the nodes own almost no data at all.

A time-series example

Perhaps the most common use case for Cassandra is storing time-series data. Let's assume our use case involves writing log-style data, where we're always writing current timestamps and reading from relatively recent ranges of time. These are typical operations in time-series use cases, so it's natural to ask, "How can I query my data by date range?" You'll recall that range queries on columns in Cassandra are possible using any partitioner, but only the ByteOrderedPartitioner allows key-based range queries. Thus it's a common mistake to build a time-series model using time as a key, and to rely on ordering from the ByteOrderedPartitioner to perform range queries.

Let's assume we have a six-node cluster where the key corresponds to the time of day. If you are always writing the current time, your writes will always go to a single node! Even worse, presuming you are reading recent ranges, your reads will also go to that same node. The following diagram illustrates what happens when log data is being written while the application is also requesting recent logs:

Time-series reads and writes using the ByteOrderedPartitioner will concentrate on a small subset of nodes

As you can see, node 2 is the only node doing any work. Each time the hour shifts, the workload will move to the next node in the ring. While the distribution of data in this model might be balanced (or it might not, depending on whether the application is busier at certain times), the workload will always experience hotspots.

We will discuss more appropriate time-series data modeling techniques in detail in Chapter 7, Modeling for High Availability. For now, it is sufficient that you understand the implications of choosing the ByteOrderedPartitioner over one of the other options that uses a random hash function.

Note
In almost all cases, the Murmur3Partitioner is the right choice. The ByteOrderedPartitioner (or the OrderPreservingPartitioner prior to version 1.2) should be used with great caution, and its use can usually be avoided by altering the data model.

If you choose to use the ByteOrderedPartitioner, just remember that you will need to keep a close watch on your data distribution, and you will have to ensure that your reads and writes can be accomplished without overloading a subset of your nodes. In practice, it's rarely necessary to store keys in order if you model your data correctly. In Chapter 7, Modeling for High Availability, we'll discuss a number of data modeling strategies that enable range queries without the drawbacks of the ByteOrderedPartitioner. For now, it's best to assume that the Murmur3Partitioner is the safest choice; this follows the recommendation made by Cassandra's core developers.
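As a preview of the modeling techniques covered in Chapter 7, the following sketch shows one way to keep the default partitioner and still serve time-range queries: put a sensor identifier and a coarse time bucket in the partition key, and use the timestamp as a clustering column, so current writes hash to many nodes while range queries by time still work within a partition. The keyspace and table names are hypothetical, and the keyspace is assumed to already exist.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Sketch: a time-series table that avoids ByteOrderedPartitioner hotspots.
public class TimeSeriesSchemaExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute(
            "CREATE TABLE IF NOT EXISTS logs.sensor_readings (" +
            "  sensor_id text," +
            "  day text," +                 // coarse bucket, e.g. '2014-12-01'
            "  reading_time timestamp," +   // clustering column for range queries
            "  value double," +
            "  PRIMARY KEY ((sensor_id, day), reading_time)" +
            ")");

        cluster.close();
    }
}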
Your understanding of these fundamentals will help you to make sound design decisions that enable you to scale your cluster effectively and get the most out of your infrastructure investment. In this chapter and the previous one, we made reference to replication and its related concepts a number of times. In the next chapter, we'll discuss replication in depth, as it is central to determining the availability of your data.

Chapter 3. Replication

Replication is perhaps the most critical feature of a distributed data store, as it would otherwise be impossible to make any sort of availability guarantee in the face of a node failure. As you learned in Chapter 1, Cassandra's Approach to High Availability, Cassandra employs a sophisticated replication system that allows fine-grained control over replica placement and consistency guarantees. In this chapter, we'll explore Cassandra's replication mechanism in depth, including the following topics:

The replication factor
How replicas are placed
How Cassandra resolves consistency issues
Maintaining the replication factor during node failures
Consistency levels
Choosing the right replication factor and consistency level

At the end of this chapter, you'll understand how to configure replication and tune consistency for your specific use cases. You'll be able to intelligently choose options that will provide the fault tolerance and consistency guarantees that are appropriate for your application. Let's start with the basics: how Cassandra determines the number of replicas to be created and where to locate them in the cluster. We'll begin the discussion with a feature that you'll encounter the very first time you create a keyspace: the replication factor.

The replication factor

On the surface, setting the replication factor seems to be a fundamentally straightforward idea. You configure Cassandra with the number of replicas you want to maintain (during keyspace creation), and the system dutifully performs the replication for you, thus protecting you when something goes wrong. So by defining a replication factor of three, you will end up with a total of three copies of the data. There are a number of variables in this equation, and we'll cover many of them in detail in this chapter. Let's start with the basic mechanics of setting the replication factor.

Replication strategies

One thing you'll quickly notice is that the semantics used to set the replication factor depend on the replication strategy you choose. The replication strategy tells Cassandra exactly how you want replicas to be placed in the cluster. There are two strategies available:

SimpleStrategy: This strategy is used for single data center deployments. It is fine to use for testing, development, or simple clusters, but it is discouraged if you ever intend to expand to multiple data centers (including virtual data centers such as those used to separate analysis workloads).

NetworkTopologyStrategy: This strategy is to be used when you have multiple data centers, or if you think you might have multiple data centers in the future. In other words, you should use this strategy for your production cluster.

SimpleStrategy

As a way of introducing this concept, we'll start with an example using SimpleStrategy.
The following Cassandra Query Language (CQL) block will allow us to create a keyspace called AddressBook with three replicas:

CREATE KEYSPACE AddressBook
WITH REPLICATION = {
  'class' : 'SimpleStrategy',
  'replication_factor' : 3
};

You will recall from the previous chapter's section on token assignment that data is assigned to a node via a hash algorithm, resulting in each node owning a range of data. Let's take another look at the placement of our example data on the cluster. Remember that the keys are first names, and we determined the hash using the Murmur3 hash algorithm.

The primary replica for each key is assigned to a node based on its hashed value. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive). While using SimpleStrategy, Cassandra will locate the first replica on the owner node (the one determined by the hash algorithm), then walk the ring in a clockwise direction to place each additional replica, as follows:

Additional replicas are placed on adjacent nodes when using manually assigned tokens.

In the preceding diagram, the keys in bold represent the primary replicas (the ones placed on the owner nodes), with subsequent replicas placed on adjacent nodes, moving clockwise from the primary. Although each node owns a set of keys based on its token range(s), there is no concept of a master replica. In Cassandra, unlike many other database designs, every replica is equal. This means reads and writes can be made to any node that holds a replica of the requested key.

If you have a small cluster where all nodes reside in a single rack inside one data center, SimpleStrategy will do the job. This makes it the right choice for local installations, development clusters, and other similar simple environments where expansion is unlikely, because there is no need to configure a snitch (which will be covered later in this section). For production clusters, however, it is highly recommended that you use NetworkTopologyStrategy instead. This strategy provides a number of important features for more complex installations where availability and performance are paramount.

NetworkTopologyStrategy

When it's time to deploy your live cluster, NetworkTopologyStrategy offers two additional properties that make it more suitable for this purpose:

Rack awareness: Unlike SimpleStrategy, which places replicas naively, this feature attempts to ensure that replicas are placed in different racks, thus preventing service interruption or data loss due to failures of switches, power, cooling, and other similar events that tend to affect single racks of machines.

Configurable snitches: A snitch helps Cassandra to understand the topology of the cluster. There are a number of snitch options for any type of network configuration. We'll cover snitches in detail later in this chapter.

Here's a basic example of a keyspace using NetworkTopologyStrategy:

CREATE KEYSPACE AddressBook
WITH REPLICATION = {
  'class' : 'NetworkTopologyStrategy',
  'dc1' : 3,
  'dc2' : 2
};

In this example, we're telling Cassandra to place three replicas in a data center called dc1 and two replicas in a second data center called dc2. We'll spend more time discussing data centers in Chapter 4, Data Centers, but for now it is sufficient to point out that the data center names must match those configured in the snitch.
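Replication settings are not fixed at keyspace creation time. As a quick sketch (reusing the AddressBook keyspace and the dc1/dc2 data center names from the examples above, plus a hypothetical third data center dc3), the per-data center replica counts can be adjusted later with an ALTER KEYSPACE statement:

ALTER KEYSPACE AddressBook
WITH REPLICATION = {
  'class' : 'NetworkTopologyStrategy',
  'dc1' : 3,
  'dc2' : 2,
  'dc3' : 2
};

Note that changing these values only updates the keyspace metadata; existing data is not copied automatically, so a repair (or a rebuild, when bringing a new data center online) is needed afterward to populate the new replicas.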
Snitches

As discussed earlier, Cassandra is able to intelligently place replicas across the cluster if you provide it with enough information about your topology. You give this insight to Cassandra through a snitch, which is set using the endpoint_snitch property in cassandra.yaml. The snitch is also used to help Cassandra route client requests to the closest nodes to reduce network latency. As of version 2.0, there are eight available snitch options (and you can write your own as well):

SimpleSnitch: This snitch is a companion to the SimpleStrategy replication strategy. It is designed for simple single data center configurations.

RackInferringSnitch: As the name implies, this snitch attempts to infer your network topology. Using this snitch is discouraged because it assumes that your IP addressing scheme reflects your data center and rack configuration. For this to work properly, the second, third, and fourth octets of each node's IP address must correspond to its data center, rack, and node, respectively.

PropertyFileSnitch: Using this snitch allows the administrator to define which nodes belong in certain racks and data centers. You can configure this using cassandra-topology.properties. Each node in the cluster must be configured identically. You should generally prefer GossipingPropertyFileSnitch because it handles the addition or removal of nodes without the need to update every node's properties file.

GossipingPropertyFileSnitch: Unlike PropertyFileSnitch, where the entire topology must be defined on every node, this snitch allows you to configure each node with its own rack and data center, and then Cassandra gossips this information to the other nodes.

CloudstackSnitch: This snitch sets data centers and racks using CloudStack's country, location, and availability zone.

GoogleCloudSnitch: For Google Cloud deployments, this snitch automatically sets the region as the data center and the availability zone as the rack.

EC2Snitch: This is similar to GoogleCloudSnitch, but for single-region EC2 deployments. This snitch also sets the region as the data center and the availability zone as the rack.

EC2MultiRegionSnitch: This snitch assigns data centers and racks identically to EC2Snitch, with the difference being that it supports using public IP addresses for cross-data center communications.

Tip
For production installations, it is almost always best to choose GossipingPropertyFileSnitch in physical data center environments and the appropriate cloud snitch in cloud environments.

Since much of the configuration related to snitches pertains to the topology of our data center, we will save our detailed treatment of this topic for Chapter 4, Data Centers, which will cover Cassandra's multiple data center features in detail.

Maintaining the replication factor when a node fails

One key way in which Cassandra maintains fault tolerance even during node failure is through a mechanism called hinted handoff. If you have set hinted_handoff_enabled to true in cassandra.yaml (which is the default), and one of the replica nodes is unreachable during a write, then the system will store a hint on the coordinator node (the node that receives the write). This hint contains the data itself along with information about where it belongs in the cluster. Hints are replayed to the replica node once the coordinator learns via gossip that the replica node is back online. By default, Cassandra stores hints for up to three hours to avoid hint queues growing too long.
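Both the snitch and hinted handoff are controlled from cassandra.yaml. The following is a minimal sketch of the relevant settings; the property names are the real ones discussed here, and the values simply reflect the hint defaults described above along with the production snitch recommendation from the earlier tip:

endpoint_snitch: GossipingPropertyFileSnitch
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000  # 3 hours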
This time window can be configured using the max_hint_window_in_ms property in cassandra.yaml. After this time period, it is necessary to run a repair to restore consistency. Chapter 9, Failing Gracefully, will include more in-depth coverage of hinted handoff and how to ensure that your system recovers from longer node outages.

Now that we've covered the basics of replication, it's time to move on to the closely related topic of consistency. In most configurations, there will inevitably be occasions when not all replicas of a given bit of data are up to date. The specifics of how and when this occurs will be outlined later in this chapter. For now, let's find out how Cassandra handles those conflicts when they arise.

Consistency conflicts

In Chapter 1, Cassandra's Approach to High Availability, we discussed Cassandra's tunable consistency characteristics. For any given call, it is possible to achieve either strong consistency or eventual consistency. In the former case, we can know for certain that the copy of the data that Cassandra returns will be the latest. In the case of eventual consistency, the data returned may or may not be the latest, or there may be no data returned at all if the node is unaware of newly inserted data. Under eventual consistency, it is also possible to see deleted data if the node you're reading from has not yet received the delete request.

Depending on the read_repair_chance setting and the consistency level chosen for the read operation (more on this in the anti-entropy section later in this chapter), Cassandra might block the client and resolve the conflict immediately, or this might occur asynchronously. If data in conflict is never requested, the system will resolve the conflict the next time nodetool repair is run. How does Cassandra know there is a conflict? Every column has three parts: key, value, and timestamp. Cassandra follows last-write-wins semantics, which means that the column with the latest timestamp always takes precedence.

Now, let's discuss one of the most important knobs a developer can turn to determine the consistency characteristics of their reads and writes.

Consistency levels

On every read and write operation, the caller must specify a consistency level, which lets Cassandra know what level of consistency to guarantee for that one call. The following list details the various consistency levels and their effects on both read and write operations:

ANY
Reads: Not supported.
Writes: Data must be written to at least one node, but permits writes via hinted handoff. Effectively allows a write to any node, even if all nodes containing the replica are down. A subsequent read might be impossible if all replica nodes are down.

ONE
Reads: The replica from the closest node will be returned.
Writes: Data must be written to at least one replica node (both commit log and memtable). Unlike ANY, hinted handoff writes are not sufficient.

TWO
Reads: The replicas from the two closest nodes will be returned.
Writes: The same as ONE, except two replicas must be written.

THREE
Reads: The replicas from the three closest nodes will be returned.
Writes: The same as ONE, except three replicas must be written.

QUORUM
Reads: Replicas from a quorum of nodes will be compared, and the replica with the latest timestamp will be returned.
Writes: Data must be written to a quorum of replica nodes (both commit log and memtable) in the entire cluster, including all data centers.

SERIAL
Reads: Permits reading uncommitted data as long as it represents the current state. Any uncommitted transactions will be committed as part of the read.
Writes: Similar to QUORUM, except that writes are conditional based on the support for lightweight transactions.

LOCAL_ONE
Reads: Similar to ONE, except that the read will be returned by the closest replica in the local data center.
Writes: Similar to ONE, except that the write must be acknowledged by at least one node in the local data center.

LOCAL_QUORUM
Reads: Similar to QUORUM, except that only replicas in the local data center are compared.
Writes: Similar to QUORUM, except the quorum must only be met using the local data center.

LOCAL_SERIAL
Reads: Similar to SERIAL, except only local replicas are used.
Writes: Similar to SERIAL, except only writes to local replicas must be acknowledged.

EACH_QUORUM
Reads: The opposite of LOCAL_QUORUM; requires each data center to produce a quorum of replicas, then returns the replica with the latest timestamp.
Writes: The opposite of LOCAL_QUORUM; requires a quorum of replicas to be written in each data center.

ALL
Reads: Replicas from all nodes in the entire cluster (including all data centers) will be compared, and the replica with the latest timestamp will be returned.
Writes: Data must be written to all replica nodes (both commit log and memtable) in the entire cluster, including all data centers.

As you can see, there are numerous combinations of read and write consistency levels, all with different ultimate consistency guarantees. To illustrate this point, let's assume that you would like to guarantee absolute consistency for all read operations. On the surface, it might seem as if you would have to read with a consistency level of ALL, thus sacrificing availability in the case of node failure. But there are alternatives depending on your use case. There are actually two additional ways to achieve strong read consistency:

Write with a consistency level of ALL: This has the advantage of allowing the read operation to be performed using ONE, which lowers the latency for that operation. On the other hand, it means the write operation will result in UnavailableException if one of the replica nodes goes offline.

Read and write with QUORUM or LOCAL_QUORUM: Since QUORUM and LOCAL_QUORUM both require a majority of nodes, using this level for both the write and the read will result in a full consistency guarantee (in the same data center when using LOCAL_QUORUM), while still maintaining availability during a node failure.

You should carefully consider each use case to determine what guarantees you actually require. For example, there might be cases where a lost write is acceptable, or occasions where a read need not be absolutely current. At times, it might be sufficient to write with a level of QUORUM, then read with ONE to achieve maximum read performance, knowing you might occasionally and temporarily return stale data. Cassandra gives you this flexibility, but it's up to you to determine how to best employ it for your specific data requirements. A good rule of thumb to attain strong consistency is that the read consistency level plus the write consistency level should be greater than the replication factor.

Tip
If you are unsure about which consistency levels to use for your specific use case, it's typically safe to start with LOCAL_QUORUM (or QUORUM for a single data center) reads and writes. This configuration offers strong consistency guarantees and good performance while allowing for the inevitable replica failure.
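The consistency level is specified by the client on each operation rather than in the schema. As a small illustration of what this looks like in practice, in cqlsh you can set the level for subsequent statements with the CONSISTENCY command (the contacts table shown here is hypothetical, used only for demonstration):

CONSISTENCY LOCAL_QUORUM;
SELECT * FROM addressbook.contacts WHERE last_name = 'Smith';

Drivers expose the same setting programmatically, typically as a default for all statements or as an override on an individual statement; Chapter 6, High Availability Features in the Native Java Client, covers this for the native Java driver.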
It is important to understand that even if you choose levels that provide less stringent consistency guarantees, Cassandra will still perform anti-entropy operations asynchronously in an attempt to keep replicas up to date.

Repairing data

Cassandra employs a multifaceted anti-entropy mechanism that keeps replicas in sync. Data repair operations generally fall into three categories:

Synchronous read repair: When a read operation requires comparing multiple replicas, Cassandra will initially request a checksum from the other nodes. If the checksum doesn't match, the full replica is sent and compared with the local version. The replica with the latest timestamp will be returned and the old replica will be updated. This means that in normal operations, old data is repaired when it is requested.

Asynchronous read repair: Each table in Cassandra has a setting called read_repair_chance (as well as its related setting, dclocal_read_repair_chance), which determines how the system treats replicas that are not compared during a read. The default setting of 0.1 means that 10 percent of the time, Cassandra will also repair the remaining replicas during read operations.

Manually running repair: A full repair (using nodetool repair) should be run regularly to clean up any data that has been missed as part of the previous two operations. At a minimum, it should be run once every gc_grace_seconds, which is set in the table schema and defaults to 10 days.

One might ask what the consequence would be of failing to run a repair operation within the window specified by gc_grace_seconds. The answer relates to Cassandra's mechanism for handling deletes. As you might be aware, all modifications (or mutations) are immutable, so a delete is really just a marker telling the system not to return that record to any clients. This marker is called a tombstone. Cassandra garbage collects tombstoned data during compaction, once the tombstone is older than gc_grace_seconds. If you don't run a repair within that window, a replica that missed the original delete can still hold the old data after the tombstone has been collected, and you risk the deleted data reappearing unexpectedly. In general, deletes should be avoided when possible, as the unfettered buildup of tombstones can cause significant issues. For more information on this topic, refer to Chapter 8, Antipatterns.

Note
In the course of normal operations, Cassandra will repair old replicas when the records are requested. Thus, it can be said that read repair operations are lazy, such that they only occur when required.

With all these options for replication and consistency, it can seem daunting to choose the right combination for a given use case. Let's take a closer look at this balance to help bring some additional clarity to the topic.

Balancing the replication factor with consistency

There are many considerations when choosing a replication factor, including availability, performance, and consistency. Since our topic is high availability, let's presume your desire is to maintain data availability in the case of node failure. It's important to understand exactly what your failure tolerance is, and this will likely differ depending on the nature of the data. The definition of failure is probably going to vary among use cases as well, as one case might consider data loss a failure, whereas another accepts data loss as long as all queries return. Achieving the desired availability, consistency, and performance targets requires coordinating your replication factor with your application's consistency level configurations.
In order to assist you in your efforts to achieve this balance, let's consider a single data center cluster of 10 nodes and examine the impact of various combinations of replication factor (RF), write consistency level, and read consistency level:

RF 1, write ONE/QUORUM/ALL, read ONE/QUORUM/ALL: Consistent. Doesn't tolerate any replica loss. Use cases: data can be lost and availability is not critical, such as analysis clusters.

RF 2, write ONE, read ONE: Eventual. Tolerates loss of one replica. Use cases: maximum read performance and low write latencies are required, and sometimes returning stale data is acceptable.

RF 2, write QUORUM/ALL, read ONE: Consistent. Tolerates loss of one replica on reads, but none on writes. Use cases: read-heavy workloads where some downtime for data ingest is acceptable (improves read latencies).

RF 2, write ONE, read QUORUM/ALL: Consistent. Tolerates loss of one replica on writes, but none on reads. Use cases: write-heavy workloads where read consistency is more important than availability.

RF 3, write ONE, read ONE: Eventual. Tolerates loss of two replicas. Use cases: maximum read and write performance are required, and sometimes returning stale data is acceptable.

RF 3, write QUORUM, read ONE: Eventual. Tolerates loss of one replica on writes and two on reads. Use cases: read throughput and availability are paramount, write performance is less important, and sometimes returning stale data is acceptable.

RF 3, write ONE, read QUORUM: Eventual. Tolerates loss of two replicas on writes and one on reads. Use cases: low write latencies and availability are paramount, read performance is less important, and sometimes returning stale data is acceptable.

RF 3, write QUORUM, read QUORUM: Consistent. Tolerates loss of one replica. Use cases: consistency is paramount, while striking a balance between availability and read/write performance.

RF 3, write ALL, read ONE: Consistent. Tolerates loss of two replicas on reads, but none on writes. Use cases: additional fault tolerance and consistency on reads are paramount at the expense of write performance and availability.

RF 3, write ONE, read ALL: Consistent. Tolerates loss of two replicas on writes, but none on reads. Use cases: low write latencies and availability are paramount, but read consistency must be guaranteed at the expense of performance and availability.

RF 3, write ANY, read ONE: Eventual. Tolerates loss of all replicas on writes and two on reads. Use cases: maximum write and read performance and availability are paramount, and often returning stale data is acceptable (note that hinted writes are less reliable than the guarantees offered at CL ONE).

RF 3, write ANY, read QUORUM: Eventual. Tolerates loss of all replicas on writes and one on reads. Use cases: maximum write performance and availability are paramount, and sometimes returning stale data is acceptable.

RF 3, write ANY, read ALL: Consistent. Tolerates loss of all replicas on writes, but none on reads. Use cases: write throughput and availability are paramount, and clients must all see the same data, even though they might not see all writes immediately.

As you can see, there are numerous possibilities to consider when choosing these values, especially in a scenario involving multiple data centers. This discussion will give you greater confidence as you design your applications to achieve the desired balance.

Summary

In this chapter, we introduced the foundational concepts of replication and consistency. In our discussion, we outlined the importance of the relationship between replication factor and consistency level, and their impact on performance, data consistency, and availability. By now, you should be able to make sound decisions specific to your use cases. This chapter might serve as a handy reference in the future, as it can be challenging to keep all these details in mind.
In the previous two chapters, we’ve been gradually expanding from how Cassandra locates individual pieces of data to its strategy to replicate it and keep it consistent. In the next chapter, we’ll take things a step further and take a look at its multiple data center capabilities, as no highly available system is truly complete without the ability to distribute itself geographically. www.it-ebooks.info Chapter 4. Data Centers One of Cassandra’s most compelling high availability features is its support for multiple data centers. In fact, this feature gives it the capability to scale reliably with a level of ease that few other data stores can match. In this chapter, we’ll explore Cassandra’s data center support, covering the following topics: Use cases for multiple data centers Using a separate data center for online analytics Replication across data centers An in-depth look at configuring snitches Multiregion EC2 implementations Consistency levels for multiple data centers Database administrators have struggled for many years to reliably replicate data across multiple geographies—a task that is made especially difficult when the system attempts to maintain ACID guarantees. The best we could typically hope for was to keep a relatively recent backup for failover purposes. Distributed database designs have made this easier, but many of these still require complex configurations and have significant limitations while replicating across data centers. Cassandra allows you to maintain a complete set of replicas in more than one data center with relative ease. Let’s start by examining some of the reasons why users might want to deploy multiple data centers. As we look at each option, think about your own use cases and in which category they might fall. Doing so will help you to make the right deployment decisions to make the best use of your Cassandra investment. www.it-ebooks.info Use cases for multiple data centers There are several key use cases that involve deploying Cassandra across multiple data centers, including the obvious failover and load balancing scenarios. Let’s examine a few of these cases. www.it-ebooks.info Live backup Traditional database backups involve taking periodic snapshots of the data and storing them offsite in case the system fails. In such a case, there will be downtime as a new system is brought up and the data is restored. This strategy also inevitably leads to data loss for the time period between the last backup and the point of failure. Cassandra supports these types of backups, and we will discuss this in greater depth in Chapter 9, Failing Gracefully. While snapshot backups are still useful to protect against data corruption or accidental updates, Cassandra’s data center support can be used to provide a current backup for cases such as hardware failures. The basic idea involves setting up a second data center that maintains a current set of replicas that can be used to rebuild the primary cluster, should a catastrophic event cause the loss of an entire data center. For this use case, it is typically sufficient to maintain a smaller cluster with a replication factor of one, as the system will never be used to accept live reads or writes. The primary consideration in this case is the storage capacity to handle the same quantity of data as the live data center. www.it-ebooks.info Failover A failover scenario is very similar to the backup use case we just discussed, except that the backup data center is generally allocated similar resources as the primary cluster. 
Additionally, while a single replica might suffice for a backup data center, generally speaking, a failover data center should be configured with the same replication factor as the primary since it might take over responsibility for the full application load in the event of a failure. It’s also important to consider whether you expect your failover data center to handle a full production load. Presuming you do have this expectation, you will need to ensure that it has adequate capacity to handle this. Having a hot failover data center protects you from a common single point of failure—the power supply to your hosts. In EC2, you can choose to configure your hosts to run in multiple availability zones, as each is supplied with a separate power source. If you do this while using the EC2 snitch, be sure to allocate your nodes evenly across zones, as the snitch will place replicas across multiple zones. Failure to do this can lead to hotspots. Tip It would be ill-advised to assume that you can maintain a small failover data center, and then simply add multiple nodes if a failure occurs. The additional overhead of bootstrapping the new nodes will actually reduce capacity at a critical point when the capacity is needed most. www.it-ebooks.info Load balancing In some cases, applications might be configured to route traffic to any node in the cluster without taking into account a specific data center. This has the effect of load balancing the requests across multiple data centers, and can be useful in cases where the data centers share a high bandwidth connection. In this instance, the objective is to provide redundancy, so each data center must be able to handle the entire application load, similar to the failover scenario. However, there are a couple of important considerations when choosing this approach: Absolute consistency is expensive to guarantee in this scenario because doing so typically requires replicating the data across higher latency connections. If strong consistency is paramount for your use case, you should consider employing a geographic distribution model as described in the next section. This usage pattern is most appropriate for use cases where eventual consistency is acceptable, such as event capture, time-series data, and logging where the primary read case involves offline data analysis rather than real-time queries. www.it-ebooks.info Geographic distribution Often, application architects will find it necessary for latency reasons to send requests to a data center located near the originator or to mitigate the potential impact of natural disasters. This is particularly useful for systems that span the globe, where routing all requests to a central location is impractical. The ability to locate data centers in strategic global locations around the world can be an indispensable feature in these scenarios. This approach is often desirable for applications where both performance and strong consistency are important. The reason for this is that clients are guaranteed to make requests to a single data center, enabling the use of the LOCAL_QUORUM consistency level— which means they won’t suffer a performance penalty by waiting for a remote data center to acknowledge the write. The following diagram illustrates this configuration: www.it-ebooks.info A variation on this idea would be key distribution, where the data is partitioned using some other differentiator (such as last name). 
With this scheme, the data centers might be located near each other geographically, but the load is split between them based on something other than the client’s location. In either of these scenarios, the idea is that clients should detect the failure of a data center and fall back on one of the others. There is a possibility of reading old data if it was written with a local consistency level, but in many cases stale data is better than application downtime. This can be visualized as follows: In this scenario, the North American data center experiences a failure, which requires clients in North America to redirect to the European data center during the outage. Obviously, the European data center must have sufficient capacity to handle the additional load. It’s important to make sure that your application is capable of handling this scenario, as the latency will increase and reads might produce some stale data. A good strategy is to limit the interaction with the database to only those operations that are critical to the continuous functioning of the application. www.it-ebooks.info Online analysis So far, we’ve discussed use cases that might be obvious to experienced database users. But Cassandra supports an additional scenario that is particularly useful in the context of a NoSQL database that doesn’t provide a built-in ad hoc query mechanism. The use of a data center for analysis purposes has become commonplace among Cassandra users, as it provides the benefits of a scalable NoSQL solution with the power of modern data analysis tools. Traditional data analysis, referred to as Online Analytical Processing (OLAP), typically involves taking normalized data from the transactional relational database and moving it into a denormalized form for faster analysis. This process involves significant extract, transform, and load (ETL) overhead, which inherently results in a delay in analyzing the data. Cassandra’s support for multiple data centers, in combination with its robust integrations with the Hadoop and Spark frameworks, allows users to conduct sophisticated batch or real-time analysis using live data with no ETL overhead. This is accomplished by dedicating a separate data center for analysis, then isolating this data center from live traffic. For many use cases, a single replica is sufficient for an analysis data center, as short periods of downtime are frequently acceptable for batch analysis purposes. However, if you require 100 percent uptime for your analysis workloads, you might need to specify a higher replication factor. Additional replicas also mean that the analysis data center is less likely to drop writes, especially while heavy analysis jobs are running. Also, make sure to run repair regularly to keep data consistent. There are currently two popular open source analysis projects with excellent Cassandra integration: Hadoop: Cassandra has included support for Hadoop since the very early revisions, and the DataStax Enterprise offering even provides a replacement for Hadoop Distributed File System (HDFS) called CassandraFS. Having said that, while Hadoop was quite revolutionary at its introduction, it is beginning to show its age. Spark: The Spark project has gained significant traction in a very short period of time, primarily as an in-memory replacement for Hadoop. The excellent open source integration with Cassandra, supported by DataStax, allows much faster and more elegant analysis work to be performed against native Cassandra data. 
If you don't already have a significant Hadoop investment, the Spark integration is most likely the better choice. Regardless of which path you choose, it's important to realize that the old OLAP paradigms no longer apply. The key to successfully processing large amounts of distributed data is to bring the processing to the data, rather than the data to the processing. This was the key innovation of MapReduce. In this new world of large datasets, shipping data across the network using complex ETL processes is no longer a viable solution. We must co-locate the processing framework with the database. Let's explore how to do this using both Hadoop and Spark.

Analysis using Hadoop

Hadoop is actually an ecosystem comprising multiple projects, a full discussion of which would be too much for this chapter. For our purposes, we will simply point out the important processes and how they should be deployed with Cassandra. Under the covers, Hadoop makes extensive use of HDFS to write temporary data to disk. HDFS components include the NameNode and SecondaryNameNode (which live on a master node) and DataNodes (which hold the data itself). If you use DataStax Enterprise, these components are replaced by CassandraFS, which uses Cassandra as the underlying file system. The actual analysis work is performed by the MapReduce framework, which consists of a JobTracker (which you will install on the master) and TaskTrackers (which are co-located with DataNodes).

The canonical Cassandra-Hadoop integration places DataNodes and TaskTrackers on each Cassandra node in the analysis data center. This allows the data owned by each node to be processed locally, rather than having to be retrieved from across the network. This idea is fundamental to the ability to process large amounts of data in an efficient manner. In fact, shuffling data across the network is typically the most significant time sink in any analysis work. The following diagram shows how this configuration looks:

The canonical Hadoop-Cassandra topology involves co-locating TaskTrackers and DataNodes with the Cassandra instances.

If you have an existing Hadoop installation, you may be tempted to try to move data from Cassandra into that cluster. However, a better strategy is to install Cassandra on that cluster. Alternatively, you can use a separate cluster to process your Cassandra data, then move the results into your existing cluster. In any case, migration to Spark is worth considering, as it is a much more modern approach to distributed data processing.

Analysis using Spark

To use Spark to analyze Cassandra data, you will essentially be replacing the MapReduce component of your Hadoop installation with the Spark processes. The Spark Master process replaces the JobTracker, and the Slave processes take over the job of the TaskTrackers. While Spark appears to be rapidly gaining traction in the analysis space, many of the existing tools and frameworks are built around Hadoop and MapReduce. Additionally, a large number of users have existing investments in the Hadoop ecosystem, which might make a wholesale move to Spark impractical. The good news is that the two can live together in harmony. In fact, you can simply add Spark processes to your existing infrastructure, provided that you have sufficient resources to do so. You can also employ two analysis data centers: one for Hadoop jobs and one for Spark jobs. Cassandra offers tremendous flexibility in this manner.
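At the replication level, the isolation described above is simply a matter of how you define your keyspaces. A sketch, assuming a hypothetical events keyspace, a live data center named dc1, and a virtual analysis data center named analytics (the names must match whatever your snitch reports):

CREATE KEYSPACE events
WITH REPLICATION = {
  'class' : 'NetworkTopologyStrategy',
  'dc1' : 3,
  'analytics' : 1
};

The Hadoop or Spark workers co-located with the nodes in the analytics data center can then read their local replicas, keeping the heavy analysis traffic away from the live data center.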
Now that we've covered the basic scenarios where multiple data centers prove useful, let's deep dive into data center configuration.

Data center setup

The mechanism to define a data center depends on the snitch that you specify in cassandra.yaml. Take a look at the previous chapter if you need a refresher on the various types of snitches. You'll recall that the snitch's role is to tell Cassandra what your network topology looks like, so it knows how to place replicas across your cluster. While configuring a snitch, it's important to make sure that the data center names reported by the snitch match the names used in your keyspace definitions. With this in mind, let's take a closer look at what configuration looks like for each of the snitch options.

RackInferringSnitch

There really isn't any configuration to be performed for RackInferringSnitch, as long as your IP addressing scheme matches your topology. Specifically, it uses the second, third, and fourth octets of each node's IP address to define the data center, rack, and node, respectively. This strategy can work well for simple deployments in physical data centers where IP addresses can be predicted reliably. The problem is that this rarely works out well over the long term, as network requirements often change over time. Also, ensuring all network administrators abide by these rules can be difficult. In general, it's better to use one of the other, more explicit snitches.

Tip
As a general rule, it is preferable to deploy a single rack in each data center as opposed to using the rack awareness feature. This applies to any snitch that allows specifying racks. While the initial configuration might be straightforward, it can be difficult to scale the multiple rack strategy. Rack configurations have a tendency to change over time, and often the people who manage the hardware are not the same people who handle Cassandra configuration. In this case, simplicity is often the best strategy.

PropertyFileSnitch

The PropertyFileSnitch configuration allows an administrator to precisely configure the topology of the network by means of a properties file named cassandra-topology.properties. The following is an example configuration, representing a cluster with three data centers, where the first two have two racks each and the analysis cluster has a single rack:

# US East Data Center
50.11.22.33=DC1:RAC1
50.11.22.44=DC1:RAC1
50.11.22.55=DC1:RAC1
50.11.33.33=DC1:RAC2
50.11.33.44=DC1:RAC2
50.11.33.55=DC1:RAC2

# US West Data Center
172.11.22.33=DC2:RAC1
172.11.22.44=DC2:RAC1
172.11.22.55=DC2:RAC1
172.11.33.33=DC2:RAC2
172.11.33.44=DC2:RAC2
172.11.33.55=DC2:RAC2

# Analysis Cluster
172.11.44.11=DC3:RAC1
172.11.44.22=DC3:RAC1
172.11.44.33=DC3:RAC1

# Default for unspecified nodes
default=DC3:RAC1

The following diagram shows what this cluster would look like visually:

This example demonstrates a cluster with two physical data centers and one virtual data center used for analysis.

It is worth noting that in the specific case shown earlier, RackInferringSnitch would automatically choose essentially the same topology, since the IP addresses conform to its required scheme.

GossipingPropertyFileSnitch

One of the principal challenges when using the PropertyFileSnitch is that the configuration file must be kept in sync on all nodes, which can be tedious even though the file is reloaded automatically without a restart.
While modern cluster management tools certainly ease this burden, the GossipingPropertyFileSnitch solves the problem completely. Rather than using cassandra-topology.properties, you specify the data center and rack membership for each node in its own configuration file. In each node's $CASSANDRA_HOME/conf directory, you'll need to place a file called cassandra-rackdc.properties, which should conform to the following example:

dc=DC1
rack=RAC1

# Uncomment the following line to make this snitch prefer the internal ip when possible, as the Ec2MultiRegionSnitch does.
# prefer_local=true

Once this file is in place (and the GossipingPropertyFileSnitch is selected in cassandra.yaml), as the name implies, Cassandra will gossip the data center and rack information to the other nodes in the cluster. This eliminates the requirement for a centralized configuration, and in general conforms better to the principles behind Cassandra's peer-to-peer architecture.

Thus far, we've examined snitches that work well when you control the network configuration on your nodes, as is the case with physical, noncloud data centers. With the proliferation of cloud deployments on Amazon's EC2 infrastructure, this is not always the case.

Cloud snitches

Amazon EC2, Google Cloud, and CloudStack can be excellent places to run Cassandra, as much work has been put into getting it right. This section will focus on EC2 deployments, as they are currently the most common, but the general principles apply to all the cloud snitches. If you're planning on going this route, be sure to check out the plethora of fantastic open source tools available from Netflix, which has put significant time and energy into perfecting the art of deploying and running Cassandra on EC2. Their engineering blog also has loads of great content that's worth a look.

This book will avoid making any recommendations for specific instance types or configurations, as requirements are unique for different use cases. One exception is that running on ephemeral SSDs is highly recommended, as you will see tremendous performance gains from doing so.

When the time comes to configure Cassandra on EC2, the EC2MultiRegionSnitch will come in handy. If you already manage deployments on EC2, you will be aware of the frequently transient nature of its network configurations. This snitch is designed to ease the burden of managing this often troublesome issue. When using the EC2MultiRegionSnitch configuration, the data center and rack will be tied directly to the region and availability zone, respectively. Thus, a node in the US-East region, availability zone 1a, will be assigned to a data center named us-east and a rack named 1a.

Additionally, since many deployments involve virtual data centers that are logically separated but located in the same physical region, this snitch allows you to specify a suffix to be applied to the data center name. This involves setting the dc_suffix property in cassandra-rackdc.properties, as follows:

dc_suffix=_live

With this suffix in place, the data center will now be named us-east_live.

Note
When deploying Cassandra in EC2 with the multiregion snitch, make sure to set your broadcast_address to the external IP address, and your rpc_address and listen_address to the internal IP address. These values can be found in cassandra.yaml. This will allow your nodes to communicate across data centers while keeping your client traffic local to the data center in which it resides.
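Here is a hedged sketch of the cassandra.yaml address settings from the preceding note, for a single node in a multiregion EC2 deployment (the IP addresses are placeholders, and the snitch value follows the class name as it appears in the configuration file):

endpoint_snitch: Ec2MultiRegionSnitch
listen_address: 10.0.1.15         # internal (private) IP
rpc_address: 10.0.1.15            # internal (private) IP
broadcast_address: 54.200.10.15   # external (public) IP

Each node gets its own addresses; the snitch then derives the data center and rack from the node's region and availability zone as described above.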
In order to achieve the greatest amount of protection from failures in EC2, it is advisable to deploy your nodes across multiple availability zones in each region. Amazon's availability zones operate as isolated locations with high bandwidth network connections between them, and Cassandra's rack awareness features can guarantee replica placement in multiple zones. Keep in mind that you need to distribute nodes evenly across availability zones to achieve even replica distribution. The following diagram shows an example of an optimal configuration, with data centers in two regions in addition to an analysis cluster. This is similar to the diagram shown previously using PropertyFileSnitch.

When using a cloud snitch, data centers correlate to regions, while racks are assigned based on availability zones.

This topology mirrors the previous example, except that the naming convention uses AWS regions and availability zones. In the us-east data center, dc_suffix is defined as live for the nodes that accept live traffic, and analysis for the nodes isolated for read-heavy analytics workloads. You should now have a good understanding of how to configure your cluster for multiple data centers. Now, let's explore how Cassandra replicates data across these data centers, and how multiple data centers influence the balance between consistency, availability, and performance.

Replication across data centers

In the previous chapters, we touched on the idea that Cassandra can automatically replicate across multiple data centers. There are other systems that allow similar replication; however, the ease of configuration and general robustness set Cassandra apart. Let's take a detailed look at how this works.

Setting the replication factor

You will recall from Chapter 3, Replication, that the specifics of replication are configured via CQL at the keyspace level. Since we're on the topic of multiple data centers, it's important to understand that you'll always have to use the NetworkTopologyStrategy, since the SimpleStrategy does not allow you to set the replication factor for each data center. Using our example physical topology from the PropertyFileSnitch section, the following statement will create a keyspace, users, with three replicas in each of our two live data centers, as well as one in the analysis data center:

CREATE KEYSPACE users
WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'DC1': 3,
  'DC2': 3,
  'DC3': 1
};

Now, each column in the database will have seven replicas in total, dispersed across five distinct racks in three different data centers, all without any complex configuration.

Consistency in a multiple data center environment

In this section, we will take a look at how Cassandra moves data from one data center to another. It is easy to understand the concept of replication in a local context, but it might seem more difficult to grasp the idea that Cassandra can seamlessly transfer large amounts of data across high-latency connections in real time. As you might now suspect, the precise replication behavior depends on your chosen consistency level. In the previous chapter, we explored each consistency level in detail, as well as its impact on availability, consistency, and performance. In a multiple data center environment, it is extremely important to remember that using a nonlocal consistency level (ALL, ONE, TWO, THREE, QUORUM, SERIAL, or EACH_QUORUM) might have an impact on performance.
This is because these consistency levels do not always route requests to the local data center; they will generally prefer local nodes, but there is no locality guarantee. If you do this, you will end up with a scenario that resembles the following diagram (assuming clients in both data centers): When nonlocal consistency levels are used, requests can be routed anywhere in the cluster. Obviously, sending traffic across the Atlantic Ocean will have a serious impact on client performance, which is why it’s so critical that application architects and operations personnel work together to make sure consistency levels match the deployed data center configurations. You can imagine how the situation can become even less tenable with the addition of more data centers! As an alternative to the previous scenario, it is nearly always preferable to use a local consistency level (LOCAL_ONE, LOCAL_QUORUM, or LOCAL_SERIAL) to ensure you’re only working against the local data center, resulting in a far more performant configuration. www.it-ebooks.info When using local consistency levels, requests are sent only to nodes in the specified data center. Also, you must make sure your client is only aware of the local nodes. If you’re using the native Java driver, you can read about how to do this in Chapter 6, High Availability Features in the Native Java Client. Otherwise, consult the documentation for the driver you are using or consider moving to one of the newer native drivers. Note Note that it is not sufficient to simply provide your client with the local node list and then attempt to use a global consistency level (ALL, ONE, TWO, THREE, QUORUM, or SERIAL). This is because once the operation hits the database, Cassandra will not restrict fulfillment of the consistency requirements to the local data center. If you intend to satisfy the consistency guarantee locally, you must use a local consistency level (LOCAL_ONE, LOCAL_QUORUM, or LOCAL_SERIAL). Additionally, if your client connects to a remote node using a local consistency level, the consistency level will be fulfilled using nodes in the remote data center. This is because locality is measured relative to the coordinator node, not to the client. The anatomy of a replicated write It is important to fully grasp what’s going on when you perform a write in a multiple data center environment in order to avoid common pitfalls and make sure you achieve your desired consistency goals. To start with, we will assume your clients generally need to be aware of updates as soon as they are written. We have discussed the fact that it’s possible to achieve strong consistency using QUORUM reads and writes, but what happens in the case of LOCAL_QUORUM, which is typically the suggested default? Let’s examine this situation in detail. We will assume that we have two live data centers in a geographically distributed configuration: one in North America and the second in Europe. Each data center has a client application that’s responsible for performing reads and writes local to that data center using LOCAL_QUORUM for both. www.it-ebooks.info We have established that local reads and writes will be strongly consistent (refer to Chapter 3, Replication, for a review of the reasons behind this), so the question is what consistency guarantees do we have between data centers? With LOCAL_QUORUM reads and writes, data inside a data center is strongly consistent, but what happens to inter-data center consistency? 
To answer this question, let's examine the high-level path a write takes from the time the client sends it to Cassandra:

1. The client sends a write request using the LOCAL_QUORUM consistency level.
2. The node that receives the request is called the coordinator, and it is responsible for ensuring that the consistency level guarantees are met prior to acknowledging the write.
3. The coordinator determines the nodes that should own the replicas using consistent hashing (refer to Chapter 2, Data Distribution, for more details) and then sends the writes to those nodes, including one in each remote data center, which then acts as the coordinator inside that data center.
4. Since we're using LOCAL_QUORUM, the coordinator will only wait for the majority of replica-owning nodes in the local data center to acknowledge the write. This implies that there might be down hosts that have not yet received the write and are therefore inconsistent.

If you have paid close attention to the flow, you might have noticed that step 4 includes a guarantee that at least a majority of local nodes received the write, so we know that a LOCAL_QUORUM read will result in strong consistency. However, there was no guarantee that any remote writes succeeded. In fact, it's entirely possible that only the local data center was operational at the time of the request.

Tip
Based on the Cassandra write path, we must conclude that LOCAL_QUORUM writes inside a data center exhibit strong consistency when paired with LOCAL_QUORUM reads, whereas the same pattern results in eventual consistency between data centers.

Thus, we can complete our diagram as follows:

With LOCAL_QUORUM reads and writes, we get eventual consistency between data centers.

This level of guarantee is appropriate for many use cases, especially where users are being routed to a single data center for the vast majority of the time. In this instance, eventual consistency would be acceptable, since traveling across continents takes enough time that the second data center would have received the writes by the time the individual had completed their travels. But in some cases, you might want or need to guarantee consistency in a remote data center, and you cannot afford to pay the cost of using a global consistency level at write time.

Achieving stronger consistency between data centers

There are a number of reasons why you might want to know for sure that your remote data is consistent with the originating data center. For example, you might need to ensure that your analytics include the most up-to-date data, or you might be reconciling bank transactions that occurred in another data center. Either way, you want to know prior to running your analysis or reconciliation job that your data is as recent as possible. The solution to this dilemma is to run nodetool repair more frequently. Typically, it is advised that users run a repair at least once every gc_grace_seconds; if you need a remote data center to be as consistent as possible with the originating data center, running repair more often will keep its data closer to up to date.

Tip
Keep in mind that the repair process is quite intensive, so be sure to stagger the process such that only a subset of your nodes is involved in a repair at any given time. If you must maintain availability during repair, a higher replication factor might be needed to satisfy consistency guarantees.
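One common way to stagger the work, in line with the preceding tip, is to repair only each node's primary token ranges and move through the cluster one node at a time (a sketch; the keyspace name is just an example):

# Run on node 1, wait for it to complete, then node 2, and so on
nodetool repair -pr users

Because -pr restricts the repair to the ranges for which the node is the primary replica, running it sequentially on every node covers the entire ring exactly once without repairing the same data multiple times.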
With version 2.1, you can choose to run incremental repair, which can be run much more often as it is a much lighter-weight process. As we discussed in Chapter 1, Cassandra's Approach to High Availability, consistency in a distributed database is a complex and multifaceted problem. This is even more the case when nodes in the database are dispersed across multiple geographical regions. Fortunately, as we have demonstrated, Cassandra provides the tools needed to handle this job. The key to success in large-scale deployments of the sort we have covered in this chapter is to design your solution holistically. A common traditional approach to these problems has been to model the data independently of the infrastructure, then retrofit later to scale the solution. You've likely chosen Cassandra because you have outgrown this approach, so don't make the mistake of applying old ideas to the new technology. Consider how your replication factor, data center configuration, cluster size, consistency levels, and analytics approach all work together to produce your desired result.

Summary

After reading this chapter and the previous one, you should have a solid understanding of how Cassandra ensures that your data is available when required and protected from loss due to node or data center failure. By now, you should be able to set up and configure a cluster across multiple geographical regions, and be familiar enough with data centers to begin the journey to analyzing your live data without cumbersome and expensive ETL processes. So far we've focused on what it takes to get started with a solid Cassandra foundation for your application. In the next chapter, we will talk about what it looks like when your use case grows beyond your original plan and you need to scale out your cluster.

Chapter 5. Scaling Out

In the old days, a significant increase in system traffic would cause excitement for the sales organization and strike fear in the hearts of the operations team. Fortunately, Cassandra makes the process of scaling out a relatively pain-free affair, so both your sales and operations teams can enjoy the fruits of your success. This chapter will give you a complete rundown of the processes, tools, and design considerations when adding nodes or data centers to your topology. In this chapter, we'll cover the following topics:

Choosing the right hardware configuration
Scaling out versus scaling up
Adding nodes
The bootstrapping process
Adding a data center
Sizing your cluster correctly

It goes without saying that making proper choices regarding the underlying infrastructure is a key component in achieving good performance and high availability. Conversely, poor choices can lead to a host of issues, and recovery can sometimes be difficult. Let's begin the chapter with some guidance on choosing hardware that's compatible with Cassandra's design.

Choosing the right hardware configuration

There are a number of points to consider when deciding on a node configuration, including disk sizes, memory requirements, and the number of processor cores. The right choices depend quite a bit on your use case and whether you are on a physical or virtual infrastructure, but we will discuss some general guidelines here. Since Cassandra is designed to be deployed in large-scale clusters on commodity hardware, an important consideration is whether to use fewer large nodes or a greater number of smaller nodes.
Regardless of whether you use physical or virtual machines, there are a few key principles to keep in mind: More RAM equals faster reads, so the more you have, the better they will perform. This is because Cassandra can take advantage of its cache capabilities as well as larger memory tables. More space for memory tables means fewer scans to the on- disk SSTables. More memory also results in better file system caching, which reduces disk operations, but not if you allocate it to the JVM heap. Most of the time, the default JVM heap size is sufficient, as Cassandra stores its O(n) structures (those that grow with data set size) off-heap. In general, you should not use more than 8 GB of heap on the JVM. More processors equal faster writes. This is because Cassandra is able to efficiently utilize all available processors, and writes are generally CPU-bound. While this might seem counter-intuitive, it holds true because Cassandra’s highly efficient log- structured storage introduces very little overhead. Disk utilization is highly dependent on data volume and compaction strategy. Obviously, you will need more disk space if you intend to store more data. What might be less obvious is the dependence on your compaction strategy. In the worst case, SizeTieredCompactionStrategy can use up to 50 percent more disk space than the data itself. As an upper bound, try to limit the amount of data on each node to 1-2 TB. Solid-state drives (SSDs) are a good choice. For many use cases, simply moving to SSDs from spinning disks can be the most cost effective way to boost performance. In fact, SSDs should be the default choice since they provide tremendous benefit without any real drawbacks. Do not use shared storage because Cassandra is designed to use local storage. Shared storage configurations introduce unwanted bottlenecks and subvert Cassandra’s peer- to-peer design. They also introduce an unnecessary single point of failure. Cassandra needs at least two disks: one for the commit log and one for data directories. This is somewhat less important when using SSDs as they handle parallel writes better than spinning disks. Note For physical hardware, anything between 16 GB and 64 GB of RAM seems to be a good compromise between price and performance, whereas 16 GB should be www.it-ebooks.info considered ideal for virtual hardware. When choosing the right number of CPUs, eight-core processors are currently a good choice for dedicated machines. CPU performance varies among cloud vendors, so it’s a good idea to consult the vendor and/or perform your own benchmarks. These simple guidelines will help you to get the most out of your hardware or cloud infrastructure investment and form a solid foundation for a high performance and highly available cluster. www.it-ebooks.info Scaling out versus scaling up So you know it’s time to add more muscle to your cluster, but how do you know whether to scale up or out? If you’re not familiar with the difference, scaling up refers to converting existing infrastructure into better or more robust hardware (or instance types in cloud environments). This can mean adding storage capacity, increasing memory, moving to newer machines with more cores, and so on. Scaling out simply means adding more machines that roughly match the specifications of the existing machines. Since Cassandra scales linearly with its peer-to-peer architecture, scaling out is often more desirable. Note In general, it is better to replace physical hardware components incrementally rather than all at one time. 
This is because in large systems, failures tend to occur after hardware ages to a certain point, which is statistically likely to happen simultaneously for some subset of your nodes. For example, purchasing a large amount of drives from a single source at one time is likely to result in a sudden onslaught of drive failures as they near the end of their service life. How do you know which is the better strategy? To arrive at an answer, you can ask yourself a few questions about your existing infrastructure: Have there been significant advances in hardware (or cloud instance types, in the case of EC2, Rackspace, and so on), such that scaling up yields more benefit for the cost than adding nodes? Refer to the excellent article from Netflix at http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html, which discusses the benefits of moving to SSDs rather than adding nodes. Did you start with hardware that was too small because you were bound by the limitations of early Cassandra versions or a cloud provider’s offerings at the time? Do you have existing hardware to repurpose for use as a Cassandra cluster that is better than your current hardware? If the answer to any of the preceding questions is yes, then scaling up might be your best option. If the answer is no, it might still be better to scale up, depending on what extra resource you hope to gain by scaling up and the cost-benefit ratio. If, for example, you only require more storage but not more CPU or IOPS, then adding disks is probably cheaper. If you require a bit more memory for the cache, then add some memory if your nodes can take more. However, upgrading the motherboard to take more memory is unlikely to be cost- effective, so adding nodes is a better choice. Fortunately, Cassandra makes scaling out painless. Regardless of which path you choose, you will need to know how to add nodes to your cluster. www.it-ebooks.info Growing your cluster The process of adding a node to an existing Cassandra cluster ranges from simple when vnodes are used to somewhat tedious if you manually assign tokens. Let’s start with the manual case, as the vnodes process is a subset of this. www.it-ebooks.info Adding nodes without vnodes As previously mentioned, the procedure to add a node to a cluster without vnodes enabled is straightforward, if not a bit tedious. The first step is to determine the new total cluster size, then compute tokens for all nodes. To compute tokens, follow the DataStax documentation at http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configGenTokens_c.html There are also several useful online tools to help you, such as the ones that you will find at http://www.geroba.com/cassandra/cassandra-token-calculator/. Once you have the new tokens, complete the following steps to add your new nodes to the cluster: 1. Run repair to ensure that all nodes contain the most recent data. Failure to do this can result in data loss, as the new node might bootstrap data from a node that doesn’t contain the latest replicas. 2. Make sure Cassandra is installed, but do not start the process. If you use a package manager, be aware that Cassandra will start automatically. If so, you will need to stop the process before proceeding. 3. On new nodes, in cassandra.yaml, set the addresses to their proper values, along with the cluster name, seeds, and endpoint snitch. Then set the initial_token value to the node’s assigned token using the tokens calculated prior to beginning this process. 4. 
Start the Cassandra daemon on the new node. 5. Wait for at least two minutes before starting the bootstrap process on another node. A good practice is to watch the Cassandra log as it starts to make sure there are no errors. 6. Once all new nodes are up, run nodetool move on old nodes to assign new tokens on one node at a time. This is unnecessary if you are doubling the cluster size as the token assignments on old nodes will remain the same. 7. After this process has been completed on all new and existing nodes, run nodetool cleanup on old nodes to purge old data that now belongs to the new nodes. You should do this on one node at a time. www.it-ebooks.info Adding nodes with vnodes The primary difference when using vnodes is that you do not have to generate or set tokens as this happens automatically, and there is no need to run nodetool move. Instead of setting the initial_token property, you should set the num_tokens property in accordance with the desired data distribution. Larger values represent proportionally larger nodes in your cluster, with 256 being the default. If all your nodes are the same size, this default should be sufficient. Over time, your cluster might naturally become heterogeneous in terms of node size and capacity. In the past, when using manually assigned tokens, this presented a challenge as it was difficult to determine the proper tokens that would result in a balanced cluster. With vnodes, you can simply set the num_tokens property to a larger number for larger nodes. For example, if your typical node owns 256 tokens, when adding a node with twice the capacity, you should set its num_tokens property to 512. If you want to keep track of the bootstrapping process, you can run nodetool netstats to view the progress. Once the streaming has completed, the output of this command is as follows: Mode: NORMAL Nothing streaming to /x.x.x.x Nothing streaming from /x.x.x.x Read Repair Statistics: Attempted: 1 Mismatch (Blocking): 0 Mismatch (Background): 0 Pool Name Active Pending Completed Commands n/a 0 1 Responses n/a 0 12345 Once the Mode status reports as NORMAL, this indicates the node is ready to serve requests. Now that you know how to add a node, let’s examine the two paths to increase the capacity of your cluster, starting with scaling out. www.it-ebooks.info How to scale out Scaling out typically involves adding nodes to your current cluster, but might also mean adding an entire data center. If you simply need to add nodes to an existing data center, you might have guessed that you must only follow the steps to add a node, as described in the previous section on that topic. www.it-ebooks.info Adding a data center Adding a new data center to your cluster is similar to initializing a new multinode cluster. As this is not a basic tutorial on Cassandra, we will assume you already know how to do this. Before starting your nodes in the new data center, be sure to keep in mind the following additional details: You must use NetworkTopologyStrategy with an appropriate snitch. If you have not already chosen a data center-aware snitch, the recommendation is to use the GossipingPropertyFileSnitch configuration for non-cloud installations, or the appropriate cloud snitch for cloud-based installations. Refer to Chapter 4, Data Centers, for more information on configuring snitches. Set auto_bootstrap to false. This property is set to true by default, and if left as true will cause the node to immediately start transferring data from the existing data center. 
The correct procedure is to wait and run a rebuild after all nodes are online. Configure the seeds. It is a good idea to include at least a couple nodes from each data center as seeds in cassandra.yaml. Update the appropriate properties files. If you’re using the GossipingPropertyFileSnitch configuration, add the cassandra- rackdc.properties file on each new node. If you have chosen PropertyFileSnitch, you will need to update cassandra-topology.properties on all nodes (a restart is not required on existing nodes). Note Prior to changing your keyspace definition, make sure that you change the consistency levels on your clients so they reflect the desired guarantees. Failing to do this might result in slow response times and UnavailableExceptions, as Cassandra attempts to satisfy the target consistency level using your new data center. This is especially true when moving from a single data center environment (where your calls are likely, for example, to be QUORUM rather than LOCAL_QUORUM). When adding data centers beyond the second, it should be less of a concern. Refer to Chapter 6, High Availability Features in the Native Java Client, for more details if you’re using the native driver. Once your new nodes are online, you will need to change your keyspace properties to reflect your desired replication factor for each data center. For example, suppose you previously had a data center named DC1 and your new data center is called DC2, and you wanted both DC1 and DC2 to have three replicas, you would issue the following CQL statement: ALTER KEYSPACE [your_keyspace] WITH REPLICATION = { ‘class’ : ‘NetworkTopologyStrategy’, ‘DC1’ : 3, ‘DC2’ : 3 }; www.it-ebooks.info Note that you only need to do this on one node as your schema will be gossiped to all nodes in all data centers. After you set your desired replication factor, you will need to execute a rebuild operation on each node in the new data center: nodetool rebuild — [name of data center] The rebuild will ensure that nodes in the new data center receive up-to-date replicas from the existing data center. It’s important to include the data center name when issuing this command or the rebuild operation will not copy any data. You can safely run this on all nodes at once, provided your existing data center can handle the additional load. If you are in doubt about this, it might be wise to run the rebuild on one node at a time to avoid potential problems. www.it-ebooks.info How to scale up Properly scaling up your Cassandra cluster is not a difficult process, but it does require you to carefully follow established procedures to avoid undesirable side effects. There are two general approaches to consider: Upgrade in place: Upgrading in place involves taking each node out of the ring, one at a time, bringing its new replacement online, and allowing the new node to bootstrap. This choice makes the most sense if a subset of your cluster needs upgrading rather than an entire data center. To upgrade an entire data center, it might be preferable to allow replication to automatically build the new nodes. This assumes, of course, that your replication factor is greater than one. Using data center replication: Since Cassandra already supports bringing up another data center via replication, you can use this mechanism to populate your new hardware with existing data and then switch to the new data center when replication is complete. 
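Since the second approach relies on the data center addition procedure covered earlier, it can be helpful to see those commands pulled together before we examine each approach in detail. In this sketch, my_keyspace is a placeholder, DC1 and DC2 follow the earlier example, and the data center named in nodetool rebuild is the existing one to stream from. On a single node (the schema change is gossiped to the rest of the cluster):
ALTER KEYSPACE my_keyspace WITH REPLICATION =
  { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 3 };
Then on each node in the new data center (DC2):
nodetool rebuild -- DC1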
Upgrading in place
If you have determined that your best strategy is to upgrade a subset of your existing nodes, you will need to take the node offline so that the cluster sees its status as down, which can be confirmed using nodetool status:
Datacenter: dc1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      …
UN  10.10.10.1   …
UN  10.10.10.2   …
DN  10.10.10.3   …
UN  10.10.10.4   …
You can see in this excerpt of the output that the node at the 10.10.10.3 address is labeled DN, which indicates that Cassandra sees it as down. Once you have confirmed this, you should make a note of the address (and the token if you are using manually assigned tokens). You are now ready to begin the process of replacing the node, which simply involves following the previously outlined steps for adding a node with the following minor exceptions:
If using vnodes with a packaged installation, add the following line to /usr/share/cassandra/cassandra-env.sh prior to starting Cassandra:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=[old_address]"
If using vnodes with a tarball installation, when starting Cassandra, use the following option:
bin/cassandra -Dcassandra.replace_address=[old_address]
If you are manually assigning tokens, set initial_token to the old node’s token minus one, and run nodetool repair on each keyspace on the new node after bootstrapping is complete. You will also need to decommission the old node.
You will need to repeat this process for each node that you want to upgrade, and make sure you execute the procedure one node at a time. In addition, you should consider running a repair after each node replacement. If only two of three nodes contain the latest data for some particular token range and you’re replacing one of these nodes, Cassandra might end up copying the data from the node with the older data. Thus, you would only have the latest data on one node; if this node is replaced next, you would lose the data.
Scaling up using data center replication
If you have a large data center and intend to replace all the hardware in that data center, the simplest way to handle this is to use Cassandra’s replication mechanism to do the hard work for you. Once the new data center is ready to receive traffic, you can simply redirect client requests to it. At this point, you will be able to safely decommission the old data center. To accomplish this, you should follow the procedure to add a data center, which was outlined earlier in this chapter. Once your new data center is online, you should perform the following steps:
1. Validate that all new nodes are online using nodetool status.
2. Redirect all client traffic to the new data center and make sure that there are no remaining clients connected before proceeding.
3. Run nodetool repair on nodes in any other data centers (besides the one you’re decommissioning) to ensure that any data that was updated on the old data center is propagated to the rest of the cluster.
4. Use the ALTER KEYSPACE command to remove any references to the old data center, as described in the earlier section on adding data centers.
5. Run nodetool decommission on each of the old nodes to permanently remove them from the cluster.
Removing nodes
While the material in this chapter is primarily focused on adding capacity to your cluster, there might be times when reducing capacity is what you’re hoping to accomplish. There can be a number of valid reasons for doing this.
Perhaps you experience smaller transaction volumes than originally anticipated for a new application, or you might change your data retention plan. In some cases you might want to move to a smaller cluster with more capable nodes, especially in cloud environments where this transition is made easier. Regardless of your reasons for doing so, knowing how to remove nodes from your cluster will certainly come in handy at some point in your Cassandra experience. Let’s take a look at this process now. www.it-ebooks.info Removing nodes within a data center Fortunately, the process to remove a node is quite simple: 1. Run nodetool repair on all your keyspaces. This will ensure that any updates which might be present only on the node you’re removing will be preserved in the remaining nodes. 2. Presuming the node is online, run nodetool decommission on the node you’re retiring. This process will move the retiring node’s token ranges to other nodes in the ring and then copy replicas to their appropriate locations based on the new token assignments. As mentioned previously, you can use nodetool netstats to keep track of each node’s progress during this operation. 3. If you’re manually assigning tokens, you must reassign all your tokens so that your distribution is even. This procedure is outlined in an earlier section in this chapter. 4. Validate that the node has been removed using nodetool status. If the node has been properly removed, it should no longer appear in the list output of this command. www.it-ebooks.info Decommissioning a data center If you want to remove an entire data center, the process closely mirrors what we outlined earlier in the section on scaling up via data center replication. For clarity however, let’s repeat just the important steps here: 1. Run nodetool repair on nodes in any other data centers (besides the one that you’re decommissioning) to ensure any data that was updated on the old data center is propagated to the rest of the cluster. 2. Use the ALTER KEYSPACE command to remove any references to the old data center as described in the earlier section on adding data centers. 3. Run nodetool decommission on each of the old nodes to permanently remove it from the cluster. Note Given the coordination required between multiple teams to successfully execute major topology changes, it is often advisable to appoint a single knowledgeable person who can oversee this process to ensure that all the proper steps are taken. This simple step can help to avoid significant issues. Even better, automated cluster management tools such as Puppet, Chef, or Priam can make this process much easier. By now, you should be familiar with the various possible operations to add and remove nodes or data centers. As you can see, these processes require planning and coordination between application designers, DevOps team members, and your infrastructure team. The consequences of improper execution of any of these processes can be quite substantial. www.it-ebooks.info Other data migration scenarios At times, you might need to migrate large amounts of data from one cluster to another. A common reason for this is the need to transition data between networks that cannot see each other, or moving from classic Amazon EC2 to a newer virtual private cloud infrastructure. If you find yourself in this situation, you can use these steps to ensure a smooth transition to the new infrastructure: 1. Set up your new cluster. 
Using the information you learned in this chapter, configure your cluster and duplicate the schema from your existing cluster. 2. Change your application to write to both clusters. This is certainly the most significant change, as it likely requires code changes in your application. 3. Verify that you are receiving writes to both clusters to avoid potential data loss. 4. Create a snapshot of your old cluster using the nodetool snapshot command. 5. Load the snapshot data into your new cluster using the sstableloader command. This command actually streams the data into the cluster rather than performing a blind copy, which means that your configured replication strategy will be honored. 6. Switch your application to point only to the new cluster. 7. Decommission the old cluster by running nodetool decommission on each of the old nodes. It’s possible to skip the step that requires your application to direct traffic to both clusters, provided you can schedule sufficient downtime. The problem is that it’s difficult to accurately predict how long the load will take, and considering the subject matter of this book, it’s likely that your application cannot sustain this downtime. One final topic that’s worth covering when talking about increasing cluster capacity is the possibility that you might need to change snitches. Often users will start with SimpleSnitch, then find that they want to add a data center later, which requires one of the data center-aware snitches. If done incorrectly, snitch changes can be problematic, so let’s discuss the proper way to handle this scenario. www.it-ebooks.info Snitch changes As you will recall from Chapter 4, Data Centers, the snitch tells Cassandra what your network topology looks like, and therefore affects data placement in the cluster. If you haven’t inserted any data, or if the change doesn’t alter your topology, you can change the snitch without consequence. Otherwise, multiple steps are required as well as a full cluster restart, which will result in downtime. How do you know if your topology has changed? If you’re not adding or removing nodes while changing the snitch, your topology has not changed. Presuming no change, the following procedure should be used to change snitches: 1. Update your topology properties files, which means cassandra- topology.properties or cassandra-rackdc.properties, depending on which snitch you specify. In the case of the PropertyFileSnitch, make sure all nodes have the same file. For GossipingPropertyFileSnitch or EC2MultiRegionSnitch, each node should have a file indicating its place in the topology. 2. Update the snitch in cassandra.yaml. You’ll need to do this for every node in the cluster. 3. Restart all nodes, one at a time. Any time you make a change to cassandra.yaml, you must restart the node. If you need to change your topology, you have two options: You can go ahead and make the change all at once, then shut down the entire cluster at one time. When you restart the cluster, your new topology will take effect. You can change the snitch (by following the previous steps) prior to making any topology changes. Once you have finished the snitch change procedure, you can then change your topology without having to restart your nodes. Tip If you’re just starting out with Cassandra, it’s best to plan for cluster growth from the beginning. 
Go ahead and choose either GossipingPropertyFileSnitch or EC2MultiRegionSnitch (for EC2 deployments) as this will help to avoid complications later when you inevitably decide to expand your cluster. www.it-ebooks.info Summary This chapter covered quite a few procedures to handle a variety of cluster changes, from adding a single node to expanding with a new data center to migrating your entire cluster. While it is unreasonable to expect anyone to commit all these processes to memory, let this chapter serve as a reference for the times when these events occur. Most importantly, take note of these scenarios so you can know when it’s time to read the manual rather than just trying to figure it out on your own. Distributed databases can be wonderful when handled correctly, but quite unforgiving when misused. We spent the last five chapters looking at a variety of mostly administrative and design related concepts, but now it’s time to dig in and look at some application code. In the next chapter, we will take a look at the relatively new native client library (specifically, the Java variant, although there are also drivers for C# and Python that follow similar principles). The new driver has a number of interesting features related to high availability, so it’s time to put on your developer’s hat as we transition from the database to the application layer. As you likely know from past experience, a properly architected client application is every bit as important as a correctly configured database. www.it-ebooks.info Chapter 6. High Availability Features in the Native Java Client If you are relatively new to Cassandra, you may be unaware that the native client libraries from DataStax are a recent development. In fact, prior to their introduction, there were numerous libraries (and forks of those projects) just for the Java language. Throw in the other languages, each with their own idiosyncrasies, and you’d know that the situation was really quite dire. Complicating the scenario was the lack of any universally accepted query mechanism as CQL was initially poorly received. The only real common ground to describe queries and data models was the underlying Thrift protocol. While this worked reasonably well for early adopters, it made assimilation of newer users quite difficult. It is a testament to Cassandra’s extraordinary architecture, speed, and scalability that it was able to survive those early days. After several revisions of CQL, the introduction of a native binary protocol, and DataStax’s work on a modern CQL-based native driver, we are fortunately in a much better place now than we were just a couple of years ago. In fact, the modern implementation of CQL is roughly 50 times faster than the equivalent Thrift query. In this chapter, we will introduce the native Java driver and discuss its high availability features, covering the following topics: Thrift versus the native protocol Client basics Asynchronous requests Load balancing Failover policies Retries Note While the chapter will focus specifically on the Java implementation, there are also similar drivers for Python and C#. Though the specific implementation details may vary among languages, the basic concepts will prove useful no matter which driver you end up using. 
It’s also worth noting that in most cases, it will be worth transitioning to the native Java driver if you’re using another JVM-based language (such as Scala, Clojure, and Groovy), even though your language of choice may have another community-supported Thrift- based driver available. www.it-ebooks.info Thrift versus the native protocol Cassandra users fall into two general categories. The first category consists of those who have been using it for a while and have grown accustomed to working directly with storage rows via a Thrift-based client, and second, those who are relatively new to Cassandra and are confused by the role Thrift plays in the modern Cassandra world. Hopefully, we can clear up the confusion and set both groups on the right path. Thrift is an RPC mechanism combined with a code generator, and for several years it formed the underlying protocol layer for clients communicating with Cassandra. This allowed the early developers of Cassandra itself to focus on the database rather than the clients. But, as we hinted at in the introduction, there are numerous negative side effects of this strategy: There was no common language to describe data models and queries as each client implemented different abstractions on top of the underlying Thrift protocol. Thrift was limited to the lowest common denominator implementation for all the supported languages, which proved to be a significant handicap as more advanced APIs became desirable. All requests were executed synchronously as Thrift has no built-in support for asynchronous calls. All query results had to be materialized into memory on both the server and the client. This forced clients to implement cumbersome paging techniques when requesting large datasets to avoid exceeding available memory on either the client or the server. Limitations in the protocol itself also made optimization difficult. For these reasons, the Thrift protocol is essentially deprecated in the favor of the newer binary protocol, which supports more advanced features such as cursors, batches, prepared statements, and cluster awareness. If you’re still not convinced that you should migrate away from your favorite Thrift-based library, keep reading to learn about some of the great new features in the native driver. Even the popular Astyanax driver from Netflix now uses the native protocol under the hood. www.it-ebooks.info Setting up the environment To get the most out of this chapter, you should prepare your development environment with the following prerequisites: Java Development Kit (JDK) 1.7 for your platform, which can be obtained at http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads- 1880260.html. Integrated Development Environment (IDE) or any text editor of your choice. Either a local Cassandra installation, or the ability to connect to a remote cluster. The DataStax native Java driver for your Cassandra version. If you’re using Maven for dependency management, add the following lines of code to your pom.xml file: com.datastax.cassandra cassandra-driver-core [version_number] If you’re using the 1.x driver, you may notice that it has a significant number of dependencies (compared to only four with the 2.x version). For this reason, you should make use of a dependency management tool, such as Maven, Ivy, or SBT. Now that you’re set up for coding, you should get familiar with some of the basics of the driver. The first step is to establish a connection to your Cassandra cluster, so we will start by doing just that. 
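Before doing so, it is worth seeing the Maven coordinates listed above in context. A pom.xml entry would typically look like the following (the version is left as a placeholder; use the driver release that matches your cluster):
<dependency>
    <groupId>com.datastax.cassandra</groupId>
    <artifactId>cassandra-driver-core</artifactId>
    <version>[version_number]</version>
</dependency>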
Connecting to the cluster
To get connected, start by creating a Cluster reference, which you will construct using a builder pattern. You will specify each additional option by chaining method calls together to produce the desired configuration, then finally, calling the build() method to initialize the Cluster instance. Let’s build a cluster that’s initialized with a list of possible initial contact points:
private Cluster cluster; // defined at class level

// you should only build the cluster once per app
cluster = Cluster.builder()
    .addContactPoints("10.10.10.1", "10.10.10.2", "10.10.10.3")
    .build();
Note
You should only have one instance of Cluster in your application for each physical cluster as this class controls the list of contact points and key connection policies such as compression, failover, request routing, and retries.
While this basic example will suffice to play around with the driver locally, the Cluster builder supports a number of additional options that are relevant for maintaining a highly available application, which we will explore throughout this chapter.
Executing statements
While the Cluster acts as a central place to manage connection-level configuration options, you will need to establish a Session instance to perform actual work against the cluster. This is done by calling the connect() method on your Cluster instance. Here, we connect to the contacts keyspace:
private Session session; // defined at class level

session = cluster.connect("contacts");
Once you have created the Session, you will be able to execute CQL statements as follows:
String insert = "INSERT INTO contact (id, email) " +
    "VALUES (" +
    "bd297650-2885-11e4-8c21-0800200c9a66," +
    "'contact@example.com' " +
    ");";
session.execute(insert);
You can submit any valid CQL statement to the execute() method, including schema modifications.
Note
Unless you have a large number of keyspaces, you should create one Session instance for each keyspace in your application, because it provides connection pooling and controls the node selection policy (it uses a round-robin approach by default). The Session is thread-safe, so it can be shared among multiple clients.
Prepared statements
One key improvement provided by the native driver (and Cassandra 1.2+) is its support for prepared statements. Readers with a background in traditional relational databases will be familiar with the concept. Essentially, the statement is preparsed at the time it is prepared, with placeholders left for parameters to be bound at execution time. Using the driver’s PreparedStatement is straightforward:
String insert = "INSERT INTO contacts.contact (id, email) " +
    "VALUES (?,?);";
PreparedStatement stmt = session.prepare(insert);
BoundStatement boundInsert = stmt.bind(
    UUID.fromString("bd297650-2885-11e4-8c21-0800200c9a66"),
    "contact@example.com"
);
session.execute(boundInsert);
Use prepared statements whenever you need to execute the same statement repeatedly, as this will reduce parsing overhead on the server. However, do not create the same prepared statement multiple times, as this will actually degrade performance. You should prepare statements only once and reuse them for multiple executions.
Batched statements
If you are using the 2.x driver, you can also use prepared statements with batches. When statements are grouped into a batch, they are executed atomically and without multiple network calls.
This can be useful when you need either all or none of your statements to succeed. Here’s an example of preparing and executing a batch using the statement prepared in the previous code snippet: BatchStatement batch = new BatchStatement(); batch.add(stmt.bind( UUID.fromString(“bd297650-2885-11e4-8c21-0800200c9a66”), “contact@example.com” )); batch.add(stmt.bind( UUID.fromString(“a012a000-2899-11e4-8c21-0800200c9a66”), “othercontact@example.com” )); session.execute(batch); Caution with batches While batches can be quite useful when they’re needed, you should be aware of some pitfalls associated with them: They are atomic, but not isolated. This means clients will be able to see the incremental updates as they happen. The exception is updates to a single partition, which are isolated. They are slower. Specifically, the atomicity guarantee introduces approximately a 30 percent performance penalty across the batch. Sometimes this is worth it, but it means you shouldn’t automatically assume batching multiple requests is better than multiple single requests. To avoid this penalty, you can use unlogged batches, which turn off atomicity and provide increased performance over multiple statements executed against the same partition. They are all or nothing. In other words, either all statements fail or all succeed. This has the effect of increasing latency as you have to wait for responses for all the statements. They are unordered. Batching applies the same timestamp to all mutations in the batch, so statements don’t actually execute in the provided ordering. Don’t use them with prepared statements to update many sparse columns. It’s tempting to prepare a single statement with a number of parameters for use in a large batch. This works fine if you always supply all the parameters, but don’t assume you can insert nulls for missing columns, as inserting nulls creates tombstones. Refer to Chapter 8, Antipatterns, for details on why creating large numbers of tombstones is an antipattern. Now that you’re familiar with the basic client concepts, it’s time to delve into the more advanced features, beginning with the ability to execute requests asynchronously. www.it-ebooks.info Handling asynchronous requests Since Cassandra is designed for significant scale, it follows that most applications using it would be designed with similar scalability in mind. One principle characteristic of high performance applications is that they do not block threads unnecessarily, and instead attempt to maximize available resources. As previously discussed, one of the downsides to the older Thrift protocol was its lack of support for asynchronous requests. Fortunately, this situation has been remedied with the native driver, making the process of building scalable applications on top of Cassandra significantly easier. Tip Blocking on I/O, such as with calls to Cassandra, can cause significant bottlenecks in high-throughput applications. Since a slow application can be the same as a dead application, you should use the asynchronous API to avoid blocking whenever possible. If you are familiar with the java.util.concurrent package, and the Future class specifically, the asynchronous API will look familiar. 
Here’s a basic example: String query = “SELECT * FROM contact ” + “WHERE id = bd297650-2885-11e4-8c21-0800200c9a66;”; ResultSetFuture f = session.executeAsync(query); ResultSet rs = f.getUninterruptibly(); Obviously, this is a naïve example as it will simply block on the call to getUninterruptibly(), but it should give you a sense of the basic API. www.it-ebooks.info Running queries in parallel One common use case for the asynchronous API is to make multiple calls in parallel, then collect the results. This can be accomplished easily: String query = “SELECT * FROM contact WHERE id = ?;”; BoundStatement q1 = session.prepare(query).bind( UUID.fromString(“bd297650-2885-11e4-8c21-0800200c9a66”) ); BoundStatement q2 = session.prepare(query).bind( UUID.fromString(“a012a000-2899-11e4-8c21-0800200c9a66”) ); ResultSetFuture f1 = session.executeAsync(q1); ResultSetFuture f2 = session.executeAsync(q2); try { ResultSet rs1 = f1.getUninterruptibly(5, TimeUnit.SECONDS); ResultSet rs2 = f2.getUninterruptibly(5, TimeUnit.SECONDS); // do something with results } catch (Exception e) { // handle exception } A closer inspection of the ResultSetFuture class reveals that it inherits from both java.util.concurrent.Future and com.google.common.util.concurrent.ListenableFuture, which is from Google’s Guava library. The Guava Futures class provides a useful construct to collect multiple Future results into a single list of values, which can be helpful when aggregating queries. It can be used as follows: Future
<List<ResultSet>> future = Futures.allAsList(
    session.executeAsync(q1),
    session.executeAsync(q2)
);
try { List