Standard SQL Dialect

Overview

The underlying Database behind Cooladata is Google BigQuery. Cooladata has added extensions and processing methods in order to improve BigQuery’s performance, run time as well as enables easier behavioral analysis with it’s CQL functions.

Currently, BigQuery supports two SQL dialects: standard SQL and legacy SQL. There are a few differences between the two dialects, including syntax, functions, data types and semantics. BigQuery standard SQL is compliant with the SQL 2011 standard and also includes extensions that support querying nested and repeated data.

Up to August 2018, Cooladata enabled running SQL scripts using Legacy SQL only. Cooladata now enables running freehand queries over Cooladata using Standard SQL, as part of a strategic plan to migrate Cooladata to operate solely in Standard SQL, encompassing all the advantages of the language, including performance enhancements, query optimizations and superior functionality.

Note! Standard SQL in Cooladata is still in beta. Please contact us with any issues or questions.

Using Standard SQL in Queries

Standard SQL is currently only supported in freehand queries, in the following features:

  • CQL report
  • Aggregation tables
  • R/Python Reports
  • Models
  • CQL Segments
  • CQL alert builders

Note! Using the CQL behavioral extensions (such as the Cohort, Funnel and Path functions) is not yet supported when using Standard Dialect.

To use Standard SQL, turn on the Standard SQL toggle in the top right hand side of the query editor:

 

 

Make sure you use Standard SQL syntax when writing the query. We’ve gathered a few syntax differences for you to start with. For more information see BigQuery documentation or contact support@cooladata.com:

Casting and Conversion Functions

In Legacy SQL, conversion and casting was done using <data type> (<expression>)

For example (Legacy):

In Standard SQL, conversion and casting is done by calling the CAST or SAFE_CAST functions.

For example (Standard):

When using CAST, a query can fail if BigQuery is unable to perform the cast. For example, CAST('david' as FLOAT)returned an error.

When using SAFE_CAST, returns NULL instead of raising an error:

 

Date and Time Functions

All Cooladata date_range functions, such as last n days, current month, previous quarter, etc. are supported in Standard SQL as well.

The main difference is in the extraction and date functions:

EXTRACT

The EXTRACT functions returns the value corresponding to the specified date part of a date or timestamp expression.

In Legacy SQL, extracting a part of a date would work like so: part(timestamp/date expression) would return the month number of the timestamp.

For example (Legacy):

In Standard SQL, extracting a part of a date needs to use the EXTRACT function: EXTRACT(part FROM date_expression)

For example (Standard):

DATE_ADD

The DATE_ADD function adds a specified time interval to a DATE or TIMESTAMP.

In Legacy SQL, DATE_ADD worked in the following syntax: DATE_ADD(timestamp or date expression,interval, interval_units)

For example (Legacy):

In Standard SQL, DATE_ADD only works with a date_expression (see TIMESTAMP_ADD for timestamp expression), using in the following syntax: DATE_DIFF(date_expression, date_expression, date_part)

For example (Standard):

To add a time interval to a timestamp expression, use TIMESTAMP_ADD

For example (Standard):

See the BigQuery reference for the supported interval units.

DATE_DIFF

The DATE_DIFF function returns the difference between two specified dates or timestamps.

In Legacy SQL, the function is called: DATEDIFF (no underscore) with the following syntax:  DATEDIFF(<timestamp1>,<timestamp2>) and returns the number of days between two TIMESTAMP data types. The result is positive if the first TIMESTAMP data type comes after the second TIMESTAMP data type, and otherwise the result is negative.

For example (Legacy):

In Standard SQL, as with the DATE_ADD function, DATE_DIFF is only supported for DATE inputs. The function is used with the following syntax: DATE_DIFF(date_expression, date_expression, date_part). Returns the number of date_part boundaries between the two date_expressions. If the first date occurs before the second date, then the result is non-positive.

For example (Standard):

To return the difference between 2 timestamp expressions use TIMESTAMP_DIFF. Returns the number of whole specified date_part intervals between two timestamps. The first timestamp_expression represents the later date; if the first timestamp_expression is earlier than the second timestamp_expression, the output is negative. Throws an error if the computation overflows the result type, such as if the difference in microseconds between the two timestamps would overflow an INT64 value.

For example (Standard):

 

Count Functions

Both legacy SQL and standard SQL contain a COUNT function. However, each function behaves differently, depending on the SQL dialect you use.

In legacy SQL, COUNT(DISTINCT x) returns an approximate count. In standard SQL, it returns an exact count. For an approximate count of distinct values that runs faster and requires fewer resources, use APPROX_COUNT_DISTINCT.

NOT IN conditions and NULL Values

Legacy SQL does not comply with the SQL standard in its handling of NULL with NOT IN conditions, whereas standard SQL does.

 

Using Expressions in Standard SQL

Since the Virtual Properties of type expression use SQL syntax when called in a query, the expression needs to match the query dialect type. To support Standard SQL dialect when using expressions, we have provided a way to enter the Standard SQL expression in the expression configuration.

To do so, in the expression configuration click on the “Advanced” section – there you can configure the Standard SQL expression. Click on “Copy legacy to standard expression” to copy the Legacy expression to the Standard one.

When using a Standard SQL query, the system will use the Standard SQL expression and if empty, it will try to use the Legacy one (in most functions the syntax remains the same between the two dialects).

 

Using Standard SQL in Query API

When querying Cooladata using API requests or any other way outside the UI (like when using our JDBC connection) – you can turn on standard SQL using one of the following methods:

  1. Using the SQL annotation: #standardSQL in the beginning of the query:
  2. Adding the standard parameter as part of the queryAPI request:
Print Friendly, PDF & Email

CQL Overview

CoolaSQL, otherwise known as CQL, is at the heart of our platform’s flexibility. CQL is an extension of SQL, which means we support full SQL with proprietary extensions, as described below. The main difference between CoolaSQL and SQL is that CoolaSQL provides the ability to examine a sequence of events, whereas SQL is more set-based.

CoolaSQL supports multiple powerful proprietary clauses, functions, and special fields – allowing for easy analysis of typical behavioral patterns such as with path analysis, cohort analysis, and funnel analysis. The great value of CQL is when it comes to very detailed, tailor-made questions that need specific answers.

Best Practices

  • Included date_range in every query FROM CoolaData
    The date range function must be included in every query running from CoolaData at the same query level (in the WHERE clause directly after the FROM CoolaData).
  • Included slicers in queries FROM CoolaData
    Including the slicers function in the query will enable report and document slicers to affect the query. If omitted they will have no effect. Note that the slicers function must be located at the same query level running from CoolaData  (in the WHERE clause directly after the FROM CoolaData)
  • Use Limit to make queries run faster
    CoolaData limits the result set that can be viewed in the AD to 25,000 rows. For example, enter limit 10 to limit the number of results to 10. Limit should be the last part of the Select clause.  For larger result sets use the Query API or contact your Customer Success Manager for more options.
  • Use user_id in CQL when possible
    CoolaData generates a user_id for better performance. When applicable always use this user_id instead of customer_user_id or alternative_user_id.
Print Friendly, PDF & Email

Cohort Analysis

Cohort analysis breaks down data into groups which usually share common characteristics or actions within a defined time frame. Cohort analysis helps identify patterns in the life cycle of customers, adapt, and tailor the service to specific cohorts. A group of people who share a common characteristic or actions within a defined period form a cohort, represented by a row in the returned result set. The date range selected in the report will specify at which date the first cohort begins.

Cohort

The syntax for the cohort function is as follows:

SELECT

Mandatory fields in the SELECT clause:

  • cohort_name: a string describing the cohort:
    • For DAY & WEEK – the date (YYYY-MM-DD) of the first day in the cohort
    • For MONTH – the month (YYYY-MM) of the cohort
    • For Properties – the value of the property
  • cohort_id: the ordinal number of the cohort (starts with 0)
  • cohort_size: the number of users in the cohort
  • bucket_id: the ordinal number of the bucket (starts with 0)
  • Aggregate Function: can be the count of users in the bucket, the cohort size minus the count of users in the bucket (i.e. the rest of the users), or the sum/average/max/min/stddev of a measure (property).

All arguments but the aggregate function must appear in the GROUP BY clause.

CLUSTER COHORT BY

The grouping logic for each cohort. Can be defined by one of the following:

  • Time period defined by a number and a time resolution (days, weeks, months, quarters or years)
  • User scope properties

STARTS WITH

The first event performed:

  • event_name: performed a specific event
  • TRUE: performed any event

Any number of additional conditions for the event can be specified, i.e. where any of the event’s properties have specific values, using “AND <condition>”.

FOLLOWED BY

The concurrent event performed:

  • event_name: performed a specific event
  • NOT event_name: did not perform a specific event
  • TRUE: performed any event

Note: to find the number of users who did nothing use the TRUE argument and, in the select clause, count the number of users left in the bucket – i.e.:

Any number of additional conditions for the event can be specified, i.e. where any of the event’s properties have specific values, using “AND <condition>”.

BUCKET BY

The duration of each bucket and the rule by which to count users in each bucket:

  • FIRST: only counts the user once on the first occurrence of the event throughout his life cycle
  • ALL: the user is counted each time he performed the event
  • RECURRING: each cohort user is counted only if the user’s activity occurred consecutively in each bucket throughout the duration of the cohort.

HAVING

Used to limit the number of buckets and cohorts in the query. When clustering cohorts by property, limiting cohort_id isn’t necessary (will return all the property values, sorted by cohort size).

 

Example:

Which users bought something after registering?
Users that registered within 7-day periods, and then paid for something, by months, showing only the first instance.
This will show us on which month each user made his first payment.

 

Reverse Cohort

It is also possible to analyze a sequence of two events in the reverse order, i.e. what happened prior to a certain event.
Replace “starts with… followed by…” with “ending with… preceded by…”, as follows:

replace

with

The rest of the query would still use the same syntax.

 

Export users

You can also generate a list of users and their properties using the cohort query.
To do so, replace the SELECT clause with the user properties you wish to export and remove the GROUP BY clause.

replace

with

 

Example:

List of user ID’s and emails who bought something within the first week after registration.
Using the the same cohort query, select the user ID’s and emails instead of the cohorts and buckets. To limit the list to the first week, limit the results to buckets 0 and 1.
Note that only user scope properties can be exported (see more on property scopes here).

 

Print Friendly, PDF & Email

String Functions

Function: concat(‘str1’, ‘str2’)

This function concatenates two strings into one.

Function: left(‘str’, numeric_expr)

This function returns a substring of numeric_expr characters starting the count from the leftmost character of ‘str’. The full string will be returned if the numeric_exp is longer than ‘str’.

Function: length(‘str’)

This function returns the length of ‘str’.

Function: lower(‘str’)

This function returns ‘str’ converted to lowercase characters.

Function: lpad(‘str1’, numeric_expr, ‘str2’)

This function returns the string str2, left-padded with the string str1 to a length of numeric_expr characters. The return value is shortened to numeric_expr characters if the returned string would be longer than numeric_expr.

Function: right(‘str’, numeric_expr)

This function returns the rightmost numeric_expr characters of str. If the number is longer than the string, it will return the whole string.

Function: rpad(‘str1’, numeric_expr, ‘str2’)

This function returns the string str2, right-padded with the string str1 to a length of numeric_expr characters. The return value is shortened to numeric_expr characters if the returned string would be longer than numeric_expr.

Function: substr(‘str’, start_from, sub_len)

This function returns a substring of str that is up to sub_len characters long, and that begins from the point in str that is specified by start_from. Counting starts at 1, so the first character in the string is in position 1 (not zero). If start_from is 3, the substring begins with the 3rd character from the left in str. If start_from is -3, the substring begins with the 3rd character from the right in str.

Function: upper(‘str’)

This function returns the string str in uppercase letters (only Latin characters).

Note: All regular expression support using the re2 library, see that documentation for its regular expression syntax.

Function: regexp_match(‘str’, ‘reg_exp’)

This function returns true if ‘str’ matches the regular expression.

Function: regexp_extract(‘str’, ‘reg_exp’)

This function returns the portion of ‘str’ that matches the capturing group within the regular expression.

Function: regexp_replace(‘orig_str’, ‘reg_exp’, ‘replace_str’)

This function returns a string where any substring of orig_str that matches reg_exp is replaced with replace_str.

Note: Regular Expression processing is based on http://code.google.com/p/re2/.

Print Friendly, PDF & Email

Path Analysis

A path is a chain of consecutive events performed by a given user or cohort. For example, a user logs on to your e-commerce platform, browses through a few product pages, creates an account, and then logs off. This path is portrayals the events performed during the user’s session.

Path analysis is used in behavioral analytics to investigate and understand user behavior and gain actionable insights.

SELECT clause

The PATH query SELECT clause can contain any User/Session Properties and/or path functions.

CLUSTER PATH BY

The following is the syntax diagram for the CLUSTER PATH BY clause:

CLUSTER PATH BY is composed of the following elements:

  • CLUSTER PATH BY session: lists the events performed during a session. If you need to drill down to data at the events scope use other CQL functions.
  • REMOVE DUPLICATES: (optional) when this element is used, repeated events are treated as one event. Example: the following path:
    [login, viewItem, viewItem, viewItem, saveItem, purchaseItem]

    will remove all consecutive viewItem events and be translated to:
    [login, viewItem, saveItem, purchaseItem]
    .
  • CONTAINS: Path contains is a non-sequential expression resulting in all the paths that contain any the events, regardless of their order. “Contains” should be used instead of the match expression, and can include multiple conditions using AND or OR operators .
  • MATCH : Describes a sequential path, comprised of steps, each defined by a condition. All steps are optional – any combination of them can be applied.
    • Path steps:
      • STARTS WITH: The first event.
      • FOLLOWED BY: Any following event. Can be used multiple times, consecutively.
      • ENDING WITH: The last event.
    • Supported conditions:
      • any: matching any event. Would replace the step’s entire condition.
      • [NOT] event_name = <event_name>: matching a specific events. A preceding NOT will negate the condition.
      • OR: matching a set of event names linked by OR, to indicate that any of these events can be applied to this step.
      • DIRECTLY: Forces the events to be consecutive, i.e. no events occurred between the preceding step and the current one. Applicable to any step except the first.
  • HAVING PATH: supports aggregative path functions (such as path_to_string, path_duration, path_count) with standard mathematical operators: >, <, =, =<, =>, !=, <>.

Path analysis can be used in two ways – sequential path, and non-sequential path:

Examples

Non-Sequential Path (Contains)
The sequence of events of users who made a purchase, and then performed a social action: Like or Share.

Sequential Path

The top 5 paths by number of session of users who started by logging in, did a search, and made a purchase.

 

Path Functions

  • sub_path(path, event1, event2, …. eventn)
    This function will show only the paths that contain all the events in the list and in this order, and remove preceding and succeeding events. All events must be included at least once within the path. Events preceding the first event in the list or succeeding the last event in the list will not be shown. The function should include the property “path” and then a list of events.
    Example:

    Original Path: login, search, view_item, add_to_cart, share, checkout, search
    Result: view_item, add_to_cart, share, checkout
  • part_path (path, event1, event2, …eventn)
    This function will show only the events included in this list of events, from paths which include all the events in the list, regardless of their order in the path.  The function should include the property “path” and then a list of events.
    Example:

    Original Path: login, search, view_item, add_to_cart, share, checkout, search
    Result: add_to_cart, checkout
  • path_to_string(path | category_path)
    This function converts the path or category path (as indicated by the reserved parameters) into a comma delimited string.
    Example:
  • path_duration()
    This function returns an integer representing the duration (in milliseconds) of a session path.
    Example:
  • path_count()
    This function returns an integer representing the number of events in a path.
    Example:
  • session_start_time()
    This function returns a timestamp of the session’s first event.
  • session_end_time()
    This function returns a timestamp of the session’s last event.
  • session_event_counter()
    This function returns the number of times the desired event appears in the session path, for every session. This function does not need to run on a path query.

    This query will return the number of times the events “login”, “like”, and “create playlist” appear in the session path for every session (running on the full path column).
  • first(path_to_string(path),[number_of_events])
    This presents the first x events in a path. It enables the analyst to learn what the popular first steps are that users take in each session. The function is intended for use in SELECT clause of the query.
    Example:

    In the example above we are interested in the learning the first events users take in a session.
  • last(path_to_string(path),[number_of_events])
    This presents the last x events in a path. It enables the analyst to learn what the popular last steps are that users take in each session. The function is intended for use in the SELECT clause of the query.
    Example:

    In the example above we are interested in the learning the last events users take in a session.
  • time_between(first_event, second_event[, first_event_condition, second_event_condition])
    This function calculates the time elapsed between two events with the option to apply conditions on each event. The function is intended for use in SELECT clause of the query. Event names must be quoted. Conditions may be any valid SQL style condition.
    Example:

    In the example above we are interested in the time elapsed between Gold users registration and a purchase they make over 5 dollars.
    As any field in the SELECT clause it is allowed to apply additional functions on top of the time_between(…) function result. For example:
Print Friendly, PDF & Email