SELECT
Syntax:<select-stmt> ::= SELECT <select-clause> FROM <tablename> ( WHERE <where-clause> )? ( ORDER BY <order-by> )? ( LIMIT <integer> )? ( ALLOW FILTERING )? <select-clause> ::= DISTINCT? <selection-list> | COUNT '(' ( '*' | '1' ) ')' (AS <identifier>)? <selection-list> ::= <selector> (AS <identifier>)? ( ',' <selector> (AS <identifier>)? )* | '*' <selector> ::= <identifier> | WRITETIME '(' <identifier> ')' | TTL '(' <identifier> ')' | <function> '(' (<selector> (',' <selector>)*)? ')' <where-clause> ::= <relation> ( AND <relation> )* <relation> ::= <identifier> <op> <term> | '(' <identifier> (',' <identifier>)* ')' <op> <term-tuple> | <identifier> IN '(' ( <term> ( ',' <term>)* )? ')' | '(' <identifier> (',' <identifier>)* ')' IN '(' ( <term-tuple> ( ',' <term-tuple>)* )? ')' | TOKEN '(' <identifier> ( ',' <identifer>)* ')' <op> <term> <op> ::= '=' | '<' | '>' | '<=' | '>=' <order-by> ::= <ordering> ( ',' <odering> )* <ordering> ::= <identifer> ( ASC | DESC )? <term-tuple> ::= '(' <term> (',' <term>)* ')'
Sample:
SELECT name, occupation FROM users WHERE userid IN (199, 200, 207); SELECT name AS user_name, occupation AS user_occupation FROM users; SELECT time, value FROM events WHERE event_type = 'myEvent' AND time > '2011-02-03' AND time <= '2012-01-01' SELECT COUNT(*) FROM users; SELECT COUNT(*) AS user_count FROM users;
The
SELECT
statements reads one or more columns for one or more rows in a table. It returns a result-set of rows, where each row contains the collection of columns corresponding to the query.
<select-clause>
The <select-clause>
determines which columns needs to be queried and returned in the result-set. It consists of either the comma-separated list of *
) to select all the columns defined for the table.A
<selector>
is either a column name to retrieve, or a <function>
of one or multiple column names. The functions allows are the same that for <term>
and are describe in the function section. In addition to these generic functions, the WRITETIME
(resp. TTL
) function allows to select the timestamp of when the column was inserted (resp. the time to live (in seconds) for the column (or null if the column has no expiration set)).Any
<selector>
can be aliased using AS
keyword (see examples). Please note that <where-clause>
and <order-by>
clause should refer to the columns by their original names and not by their aliases.The
COUNT
keyword can be used with parenthesis enclosing *
. If so, the query will return a single result: the number of rows matching the query. Note that COUNT(1)
is supported as an alias.
<where-clause>
The <where-clause>
specifies which rows must be queried. It is composed of relations on the columns that are part of the PRIMARY KEY
and/or have a secondary index defined on them.Not all relations are allowed in a query. For instance, non-equal relations (where
IN
is considered as an equal relation) on a partition key are not supported (but see the use of the TOKEN
method below to do non-equal queries on the partition key). Moreover, for a given partition key, the clustering columns induce an ordering of rows and relations on them is restricted to the relations that allow to select a contiguous (for the ordering) set of rows. For instance, givenThe following query is allowed:CREATE TABLE posts ( userid text, blog_title text, posted_at timestamp, entry_title text, content text, category int, PRIMARY KEY (userid, blog_title, posted_at) )
But the following one is not, as it does not select a contiguous set of rows (and we suppose no secondary indexes are set):SELECT entry_title, content FROM posts WHERE userid='john doe' AND blog_title='John''s Blog' AND posted_at >= '2012-01-01' AND posted_at < '2012-01-31'
When specifying relations, the// Needs a blog_title to be set to select ranges of posted_at SELECT entry_title, content FROM posts WHERE userid='john doe' AND posted_at >= '2012-01-01' AND posted_at < '2012-01-31'
TOKEN
function can be used on the PARTITION KEY
column to query. In that case, rows will be selected based on the token of their PARTITION_KEY
rather than on the value. Note that the token of a key depends on the partitioner in use, and that in particular the RandomPartitioner won’t yield a meaningful order. Also note that ordering partitioners always order token values by bytes (so even if the partition key is of type int, token(-1) > token(0)
in particular). Example:Moreover, theSELECT * FROM posts WHERE token(userid) > token('tom') AND token(userid) < token('bob')
IN
relation is only allowed on the last column of the partition key and on the last column of the full primary key.It is also possible to “group”
CLUSTERING COLUMNS
together in a relation using the tuple notation. For instance:will request all rows that sorts after the one having “John's Blog” asSELECT * FROM posts WHERE userid='john doe' AND (blog_title, posted_at) > ('John''s Blog', '2012-01-01')
blog_tile
and ‘2012-01-01’ for posted_at
in the clustering order. In particular, rows having a post_at <= '2012-01-01'
will be returned as long as their blog_title > 'John''s Blog'
, which wouldn’t be the case for:The tuple notation may also be used forSELECT * FROM posts WHERE userid='john doe' AND blog_title > 'John''s Blog' AND posted_at > '2012-01-01'
IN
clauses on CLUSTERING COLUMNS
:SELECT * FROM posts WHERE userid='john doe' AND (blog_title, posted_at) IN (('John''s Blog', '2012-01-01), ('Extreme Chess', '2014-06-01'))
<order-by>
The ORDER BY
option allows to select the order of the returned results. It takes as argument a list of column names along with the order for the column (ASC
for ascendant and DESC
for descendant, omitting the order being equivalent to ASC
). Currently the possible orderings are limited (which depends on the table CLUSTERING ORDER
):- if the table has been defined without any specific
CLUSTERING ORDER
, then then allowed orderings are the order induced by the clustering columns and the reverse of that one. - otherwise, the orderings allowed are the order of the
CLUSTERING ORDER
option and the reversed one.
LIMIT
The LIMIT
option to a SELECT
statement limits the number of rows returned by a query.
ALLOW FILTERING
By default, CQL only allows select queries that don’t involve “filtering” server side, i.e. queries where we know that all (live) record read will be returned (maybe partly) in the result set. The reasoning is that those “non filtering” queries have predictable performance in the sense that they will execute in a time that is proportional to the amount of data returned by the query (which can be controlled through LIMIT
).The
ALLOW FILTERING
option allows to explicitly allow (some) queries that require filtering. Please note that a query using ALLOW FILTERING
may thus have unpredictable performance (for the definition above), i.e. even a query that selects a handful of records may exhibit performance that depends on the total amount of data stored in the cluster.For instance, considering the following table holding user profiles with their year of birth (with a secondary index on it) and country of residence:
CREATE TABLE users ( username text PRIMARY KEY, firstname text, lastname text, birth_year int, country text ) CREATE INDEX ON users(birth_year);
Then the following queries are valid:
because in both case, Cassandra guarantees that these queries performance will be proportional to the amount of data returned. In particular, if no users are born in 1981, then the second query performance will not depend of the number of user profile stored in the database (not directly at least: due to secondary index implementation consideration, this query may still depend on the number of node in the cluster, which indirectly depends on the amount of data stored. Nevertheless, the number of nodes will always be multiple number of magnitude lower than the number of user profile stored). Of course, both query may return very large result set in practice, but the amount of data returned can always be controlled by adding aSELECT * FROM users; SELECT firstname, lastname FROM users WHERE birth_year = 1981;
LIMIT
.However, the following query will be rejected:
because Cassandra cannot guarantee that it won’t have to scan large amount of data even if the result to those query is small. Typically, it will scan all the index entries for users born in 1981 even if only a handful are actually from France. However, if you “know what you are doing”, you can force the execution of this query by usingSELECT firstname, lastname FROM users WHERE birth_year = 1981 AND country = 'FR';
ALLOW FILTERING
and so the following query is valid:SELECT firstname, lastname FROM users WHERE birth_year = 1981 AND country = 'FR' ALLOW FILTERING;
No comments:
Post a Comment