Pages

Friday, August 22, 2014

Data Types - Cassandra Query Language

CQL supports a rich set of data types for columns defined in a table, including collection types. On top of those native and collection types, users can also provide custom types (through a JAVA class extendingAbstractType loadable by Cassandra). The syntax of types is thus:
<type> ::= <native-type>
         | <collection-type>
         | <string>       // Used for custom types. The fully-qualified name of a JAVA class

<native-type> ::= ascii
                | bigint
                | blob
                | boolean
                | counter
                | decimal
                | double
                | float
                | inet
                | int
                | text
                | timestamp
                | timeuuid
                | uuid
                | varchar
                | varint

<collection-type> ::= list '<' <native-type> '>'
                    | set  '<' <native-type> '>'
                    | map  '<' <native-type> ',' <native-type> '>'
Note that the native types are keywords and as such are case-insensitive. They are however not reserved ones.
The following table gives additional informations on the native data types, and on which kind of constants each type supports:
typeconstants supporteddescription
asciistringsASCII character string
bigintintegers64-bit signed long
blobblobsArbitrary bytes (no validation)
booleanbooleanstrue or false
counterintegersCounter column (64-bit signed value). See Counters for details
decimalintegers, floatsVariable-precision decimal
doubleintegers64-bit IEEE-754 floating point
floatintegers, floats32-bit IEEE-754 floating point
inetstringsAn IP address. It can be either 4 bytes long (IPv4) or 16 bytes long (IPv6). There is no inet constant, IP address should be inputed as strings
intintegers32-bit signed int
textstringsUTF8 encoded string
timestampintegers, stringsA timestamp. Strings constant are allow to input timestamps as dates, see Working with dates below for more information.
timeuuiduuidsType 1 UUID. This is generally used as a “conflict-free” timestamp. Also see the functions on Timeuuid
uuiduuidsType 1 or type 4 UUID
varcharstringsUTF8 encoded string
varintintegersArbitrary-precision integer
For more information on how to use the collection types, see the Working with collections section below.

Working with dates

Values of the timestamp type are encoded as 64-bit signed integers representing a number of milliseconds since the standard base time known as “the epoch”: January 1 1970 at 00:00:00 GMT.
Timestamp can be input in CQL as simple long integers, giving the number of milliseconds since the epoch, as defined above.
They can also be input as string literals in any of the following ISO 8601 formats, each representing the time and date Mar 2, 2011, at 04:05:00 AM, GMT.:
  • 2011-02-03 04:05+0000
  • 2011-02-03 04:05:00+0000
  • 2011-02-03 04:05:00.000+0000
  • 2011-02-03T04:05+0000
  • 2011-02-03T04:05:00+0000
  • 2011-02-03T04:05:00.000+0000
The +0000 above is an RFC 822 4-digit time zone specification; +0000 refers to GMT. US Pacific Standard Time is -0800. The time zone may be omitted if desired— the date will be interpreted as being in the time zone under which the coordinating Cassandra node is configured.
  • 2011-02-03 04:05
  • 2011-02-03 04:05:00
  • 2011-02-03 04:05:00.000
  • 2011-02-03T04:05
  • 2011-02-03T04:05:00
  • 2011-02-03T04:05:00.000
There are clear difficulties inherent in relying on the time zone configuration being as expected, though, so it is recommended that the time zone always be specified for timestamps when feasible.
The time of day may also be omitted, if the date is the only piece that matters:
  • 2011-02-03
  • 2011-02-03+0000
In that case, the time of day will default to 00:00:00, in the specified or default time zone.

Counters

The counter type is used to define counter columns. A counter column is a column whose value is a 64-bit signed integer and on which 2 operations are supported: incrementation and decrementation (see UPDATE for syntax). Note the value of a counter cannot be set. A counter doesn’t exist until first incremented/decremented, and the first incrementation/decrementation is made as if the previous value was 0. Deletion of counter columns is supported but have some limitations (see the Cassandra Wiki for more information).
The use of the counter type is limited in the following way:
  • It cannot be used for column that is part of the PRIMARY KEY of a table.
  • A table that contains a counter can only contain counters. In other words, either all the columns of a table outside the PRIMARY KEY have the counter type, or none of them have it.

Working with collections

Noteworthy characteristics

Collections are meant for storing/denormalizing relatively small amount of data. They work well for things like “the phone numbers of a given user”, “labels applied to an email”, etc. But when items are expected to grow unbounded (“all the messages sent by a given user”, “events registered by a sensor”, ...), then collections are not appropriate anymore and a specific table (with clustering columns) should be used. Concretely, collections have the following limitations:
  • Collections are always read in their entirety (and reading one is not paged internally).
  • Collections cannot have more than 65535 elements. More precisely, while it may be possible to insert more than 65535 elements, it is not possible to read more than the 65535 first elements (see CASSANDRA-5428 for details).
  • While insertion operations on sets and maps never incur a read-before-write internally, some operations on lists do (see the section on lists below for details). It is thus advised to prefer sets over lists when possible.
Please note that while some of those limitations may or may not be loosen in the future, the general rule that collections are for denormalizing small amount of data is meant to stay.

Maps

map is a typed set of key-value pairs, where keys are unique. Furthermore, note that the map are internally sorted by their keys and will thus always be returned in that order. To create a column of type map, use themap keyword suffixed with comma-separated key and value types, enclosed in angle brackets. For example:
CREATE TABLE users (
    id text PRIMARY KEY,
    given text,
    surname text,
    favs map<text, text>   // A map of text keys, and text values
)
Writing map data is accomplished with a JSON-inspired syntax. To write a record using INSERT, specify the entire map as a JSON-style associative array. Note: This form will always replace the entire map.
// Inserting (or Updating)
INSERT INTO users (id, given, surname, favs)
           VALUES ('jsmith', 'John', 'Smith', { 'fruit' : 'apple', 'band' : 'Beatles' })
Adding or updating key-values of a (potentially) existing map can be accomplished either by subscripting the map column in an UPDATE statement or by adding a new map literal:
// Updating (or inserting)
UPDATE users SET favs['author'] = 'Ed Poe' WHERE id = 'jsmith'
UPDATE users SET favs = favs +  { 'movie' : 'Cassablanca' } WHERE id = 'jsmith'
Note that TTLs are allowed for both INSERT and UPDATE, but in both case the TTL set only apply to the newly inserted/updated values. In other words,
// Updating (or inserting)
UPDATE users USING TTL 10 SET favs['color'] = 'green' WHERE id = 'jsmith'
will only apply the TTL to the { 'color' : 'green' } record, the rest of the map remaining unaffected.
Deleting a map record is done with:
DELETE favs['author'] FROM users WHERE id = 'jsmith'

Sets

set is a typed collection of unique values. Sets are ordered by their values. To create a column of type set, use the set keyword suffixed with the value type enclosed in angle brackets. For example:
CREATE TABLE images (
    name text PRIMARY KEY,
    owner text,
    date timestamp,
    tags set<text>
);
Writing a set is accomplished by comma separating the set values, and enclosing them in curly braces. Note: An INSERT will always replace the entire set.
INSERT INTO images (name, owner, date, tags)
            VALUES ('cat.jpg', 'jsmith', 'now', { 'kitten', 'cat', 'pet' });
Adding and removing values of a set can be accomplished with an UPDATE by adding/removing new set values to an existing set column.
UPDATE images SET tags = tags + { 'cute', 'cuddly' } WHERE name = 'cat.jpg';
UPDATE images SET tags = tags - { 'lame' } WHERE name = 'cat.jpg';
As with maps, TTLs if used only apply to the newly inserted/updated values.

Lists

list is a typed collection of non-unique values where elements are ordered by there position in the list. To create a column of type list, use the list keyword suffixed with the value type enclosed in angle brackets. For example:
CREATE TABLE plays (
    id text PRIMARY KEY,
    game text,
    players int,
    scores list<int>
)
Do note that as explained below, lists have some limitations and performance considerations to take into account, and it is advised to prefer sets over lists when this is possible.
Writing list data is accomplished with a JSON-style syntax. To write a record using INSERT, specify the entire list as a JSON array. Note: An INSERT will always replace the entire list.
INSERT INTO plays (id, game, players, scores)
           VALUES ('123-afde', 'quake', 3, [17, 4, 2]);
Adding (appending or prepending) values to a list can be accomplished by adding a new JSON-style array to an existing list column.
UPDATE plays SET players = 5, scores = scores + [ 14, 21 ] WHERE id = '123-afde';
UPDATE plays SET players = 5, scores = [ 12 ] + scores WHERE id = '123-afde';
It should be noted that append and prepend are not idempotent operations. This means that if during an append or a prepend the operation timeout, it is not always safe to retry the operation (as this could result in the record appended or prepended twice).
Lists also provides the following operation: setting an element by its position in the list, removing an element by its position in the list and remove all the occurrence of a given value in the list. However, and contrarily to all the other collection operations, these three operations induce an internal read before the update, and will thus typically have slower performance characteristics. Those operations have the following syntax:
UPDATE plays SET scores[1] = 7 WHERE id = '123-afde';                // sets the 2nd element of scores to 7 (raises an error is scores has less than 2 elements)
DELETE scores[1] FROM plays WHERE id = '123-afde';                   // deletes the 2nd element of scores (raises an error is scores has less than 2 elements)
UPDATE plays SET scores = scores - [ 12, 21 ] WHERE id = '123-afde'; // removes all occurrences of 12 and 21 from scores
As with maps, TTLs if used only apply to the newly inserted/updated values.

No comments:

Post a Comment