Pages

Friday, August 22, 2014

Queries - Cassandra Query Language

SELECT

Syntax:
<select-stmt> ::= SELECT <select-clause>
                  FROM <tablename>
                  ( WHERE <where-clause> )?
                  ( ORDER BY <order-by> )?
                  ( LIMIT <integer> )?
                  ( ALLOW FILTERING )?

<select-clause> ::= DISTINCT? <selection-list>
                  | COUNT '(' ( '*' | '1' ) ')' (AS <identifier>)?

<selection-list> ::= <selector> (AS <identifier>)? ( ',' <selector> (AS <identifier>)? )*
                   | '*'

<selector> ::= <identifier>
             | WRITETIME '(' <identifier> ')'
             | TTL '(' <identifier> ')'
             | <function> '(' (<selector> (',' <selector>)*)? ')'

<where-clause> ::= <relation> ( AND <relation> )*

<relation> ::= <identifier> <op> <term>
             | '(' <identifier> (',' <identifier>)* ')' <op> <term-tuple>
             | <identifier> IN '(' ( <term> ( ',' <term>)* )? ')'
             | '(' <identifier> (',' <identifier>)* ')' IN '(' ( <term-tuple> ( ',' <term-tuple>)* )? ')'
             | TOKEN '(' <identifier> ( ',' <identifer>)* ')' <op> <term>

<op> ::= '=' | '<' | '>' | '<=' | '>='
<order-by> ::= <ordering> ( ',' <odering> )*
<ordering> ::= <identifer> ( ASC | DESC )?
<term-tuple> ::= '(' <term> (',' <term>)* ')'

Sample:
SELECT name, occupation FROM users WHERE userid IN (199, 200, 207);

SELECT name AS user_name, occupation AS user_occupation FROM users;

SELECT time, value
FROM events
WHERE event_type = 'myEvent'
  AND time > '2011-02-03'
  AND time <= '2012-01-01'

SELECT COUNT(*) FROM users;

SELECT COUNT(*) AS user_count FROM users;


The SELECT statements reads one or more columns for one or more rows in a table. It returns a result-set of rows, where each row contains the collection of columns corresponding to the query.

<select-clause>

The <select-clause> determines which columns needs to be queried and returned in the result-set. It consists of either the comma-separated list of or the wildcard character (*) to select all the columns defined for the table.
<selector> is either a column name to retrieve, or a <function> of one or multiple column names. The functions allows are the same that for <term> and are describe in the function section. In addition to these generic functions, the WRITETIME (resp. TTL) function allows to select the timestamp of when the column was inserted (resp. the time to live (in seconds) for the column (or null if the column has no expiration set)).
Any <selector> can be aliased using AS keyword (see examples). Please note that <where-clause> and <order-by> clause should refer to the columns by their original names and not by their aliases.
The COUNT keyword can be used with parenthesis enclosing *. If so, the query will return a single result: the number of rows matching the query. Note that COUNT(1) is supported as an alias.

<where-clause>

The <where-clause> specifies which rows must be queried. It is composed of relations on the columns that are part of the PRIMARY KEY and/or have a secondary index defined on them.
Not all relations are allowed in a query. For instance, non-equal relations (where IN is considered as an equal relation) on a partition key are not supported (but see the use of the TOKEN method below to do non-equal queries on the partition key). Moreover, for a given partition key, the clustering columns induce an ordering of rows and relations on them is restricted to the relations that allow to select a contiguous (for the ordering) set of rows. For instance, given
CREATE TABLE posts (
    userid text,
    blog_title text,
    posted_at timestamp,
    entry_title text,
    content text,
    category int,
    PRIMARY KEY (userid, blog_title, posted_at)
)
The following query is allowed:
SELECT entry_title, content FROM posts WHERE userid='john doe' AND blog_title='John''s Blog' AND posted_at >= '2012-01-01' AND posted_at < '2012-01-31'
But the following one is not, as it does not select a contiguous set of rows (and we suppose no secondary indexes are set):
// Needs a blog_title to be set to select ranges of posted_at
SELECT entry_title, content FROM posts WHERE userid='john doe' AND posted_at >= '2012-01-01' AND posted_at < '2012-01-31'
When specifying relations, the TOKEN function can be used on the PARTITION KEY column to query. In that case, rows will be selected based on the token of their PARTITION_KEY rather than on the value. Note that the token of a key depends on the partitioner in use, and that in particular the RandomPartitioner won’t yield a meaningful order. Also note that ordering partitioners always order token values by bytes (so even if the partition key is of type int, token(-1) > token(0) in particular). Example:
SELECT * FROM posts WHERE token(userid) > token('tom') AND token(userid) < token('bob')
Moreover, the IN relation is only allowed on the last column of the partition key and on the last column of the full primary key.
It is also possible to “group” CLUSTERING COLUMNS together in a relation using the tuple notation. For instance:
SELECT * FROM posts WHERE userid='john doe' AND (blog_title, posted_at) > ('John''s Blog', '2012-01-01')
will request all rows that sorts after the one having “John's Blog” as blog_tile and ‘2012-01-01’ for posted_at in the clustering order. In particular, rows having a post_at <= '2012-01-01' will be returned as long as their blog_title > 'John''s Blog', which wouldn’t be the case for:
SELECT * FROM posts WHERE userid='john doe' AND blog_title > 'John''s Blog' AND posted_at > '2012-01-01'
The tuple notation may also be used for IN clauses on CLUSTERING COLUMNS:
SELECT * FROM posts WHERE userid='john doe' AND (blog_title, posted_at) IN (('John''s Blog', '2012-01-01), ('Extreme Chess', '2014-06-01'))

<order-by>

The ORDER BY option allows to select the order of the returned results. It takes as argument a list of column names along with the order for the column (ASC for ascendant and DESC for descendant, omitting the order being equivalent to ASC). Currently the possible orderings are limited (which depends on the table CLUSTERING ORDER):
  • if the table has been defined without any specific CLUSTERING ORDER, then then allowed orderings are the order induced by the clustering columns and the reverse of that one.
  • otherwise, the orderings allowed are the order of the CLUSTERING ORDER option and the reversed one.

LIMIT

The LIMIT option to a SELECT statement limits the number of rows returned by a query.

ALLOW FILTERING

By default, CQL only allows select queries that don’t involve “filtering” server side, i.e. queries where we know that all (live) record read will be returned (maybe partly) in the result set. The reasoning is that those “non filtering” queries have predictable performance in the sense that they will execute in a time that is proportional to the amount of data returned by the query (which can be controlled through LIMIT).
The ALLOW FILTERING option allows to explicitly allow (some) queries that require filtering. Please note that a query using ALLOW FILTERING may thus have unpredictable performance (for the definition above), i.e. even a query that selects a handful of records may exhibit performance that depends on the total amount of data stored in the cluster.
For instance, considering the following table holding user profiles with their year of birth (with a secondary index on it) and country of residence:
CREATE TABLE users (
    username text PRIMARY KEY,
    firstname text,
    lastname text,
    birth_year int,
    country text
)

CREATE INDEX ON users(birth_year);

Then the following queries are valid:
SELECT * FROM users;
SELECT firstname, lastname FROM users WHERE birth_year = 1981;
because in both case, Cassandra guarantees that these queries performance will be proportional to the amount of data returned. In particular, if no users are born in 1981, then the second query performance will not depend of the number of user profile stored in the database (not directly at least: due to secondary index implementation consideration, this query may still depend on the number of node in the cluster, which indirectly depends on the amount of data stored. Nevertheless, the number of nodes will always be multiple number of magnitude lower than the number of user profile stored). Of course, both query may return very large result set in practice, but the amount of data returned can always be controlled by adding a LIMIT.
However, the following query will be rejected:
SELECT firstname, lastname FROM users WHERE birth_year = 1981 AND country = 'FR';
because Cassandra cannot guarantee that it won’t have to scan large amount of data even if the result to those query is small. Typically, it will scan all the index entries for users born in 1981 even if only a handful are actually from France. However, if you “know what you are doing”, you can force the execution of this query by using ALLOW FILTERING and so the following query is valid:
SELECT firstname, lastname FROM users WHERE birth_year = 1981 AND country = 'FR' ALLOW FILTERING;

No comments:

Post a Comment