As you know, Cooladata lets you control which data to collect from numerous sources. You decide what you want track and Cooladata warehouses it for all your analytical purposes.
Once you decided which sources you want to collect, cooladata normalizes the data and makes it very easy to execute queries no matter how complex they are, to summarize or aggregate data for reporting and serving to other people or other platforms. Since accuracy is essential, Cooladata manages the data at the raw event granularity.
Cooladata ensures that your data is validated against our self learning schema to ensure it is correctly formatted and processed successfully. On top of this validation, you can control how your data is processed, transformed and enriched before it is stored, or manipulate it after the storage is done for easier querying and reporting.
Recently, we’ve added two very useful features to the data manipulation “tool set” Cooladata offers their customers as part of every project:
- Custom Manipulations as part of the ETL
This new feature allows you to customize manipulations on the data before it is stored and requires custom set-up with your Cooladata Customer Success Manager.
– Blocking or invalidating events: Every data scientist knows: “garbage in – garbage out”. To avoid entering messy data to your project – you could block or invalidate events that match a certain condition as part of your custom project set-up. A common use case is blocking out bots or internal QA users.
– Changing events before they are stored: manipulating events as part of the ETL can have several use case, such as: hashing personal information, changing data types, extracting specific parameters from long strings or arrays into designated columns, etc. The biggest advantage of this type of manipulation is that it’s done before the data is even stored so it doesn’t require any further manipulation and helps a lot with achieving consistency of data.
- Sending retroactive events back to the ETL
Cooladata allows sending events to the ETL to be stored in the “cooladata” events table with the rest of your events. This is a very common case for events that can be uploaded from an external DB, historical data storage, the invalids table or even a service API. This task can be scheduled to run automatically like any other task in Cooladata and can even be set up as a step in a Job. Notice these events are out of context so the automatic sessionization done by our ETL might be affected. To avoid that you can turn off sessionization for these events.
These above mentioned features are already deployed and are ready to be exploited. Just to serve as a reminder, here is the full list of data manipulation tools already available:
1. ETL – out of the box data manipulations:
- Server – side Sessionization: unifies sessions from all platforms into one single session, based on the user ID and the custom session definitions defined per project
- Enrichment : automatic breakdown of user agent and IP into device details and geo location.
- User identities matching: unifying between several identities of the same user. The most common use case are anonymous and registered users that login in the middle of the session. Cooladata unifies the 2 users into one identity and stores the data accordingly. You can then query the entire journey of the user, from before he registered to after, understanding the complete picture.
- Auto-population based on scope: each column is auto-populated based on the scope of the property – for instance: A FIRST USER scope property stores the first values received for this user in all of that user’s events, whereas LAST SESSION scope property stores the last non-null value in all the session’s events received after this value. This saves the analyst the effort of joining multiple events and tables. Each user also receives automatic property of create date, to easily extract the first timestamp the user was seen in the app and it’s first session is always marked as is_new=1.
2. Virtual Properties – Some properties require dynamic manipulations that are applied ad hoc while querying the data. This feature allows you to store SQL expressions as virtual properties, and select these properties in your reports or even filter by them in dashboards. The expression can contain a formula that hides the complex processing of a property. A common use case is total order amount calculations based on several columns or unifications of two columns using the IFNULL function.
3. Aggregation Tables – Aggregation Tables automatically run scheduled calculations and data aggregations and save them in a separate, permanent table. This can be used to compute and store aggregated data in order to enhance the performance of queries running on large scales of data. Most customer store aggregated tables based on several data sources and multiple joined tables. Querying these tables are easier then writing the complex joins and SQLs in each query.
4. Models- Models are designed to add R and Python capabilities to your Cooladata’s workflow. Models enable writing tasks based on R or Python scripts. This allows you to encompass the capabilities of these languages to manipulate your data and save the data frames created by those models into tables in your project.
Both the Aggregation Tables and Models, as well as the new retroactive event sender task, can be scheduled as steps in our Jobs to make up a repeating data manipulation workflow.