DeepDrills and Anomaly Detection for Varied Datasets, Task Status Monitoring & Other Fixes - Chaos Genius 0.2.0

✨ What's New?

Thank you all for the feedback on Chaos Genius 0.1.3. Our main focus for this upgrade was threefold: covering edge cases so that DeepDrills and Anomaly Detection work on as wide a variety of datasets as possible, adding Task Status monitoring so that users can detect when any analytics are failing, and fixing other bugs.

Key highlights:

  • Detailed status tracking for analytics for faster detection & debugging (cc: @bouke-nederstigt, @gxu-kangaroo, @davidhayter-karhoo, @mvaerle)
  • Configurations for edge cases such as older datasets, smaller datasets, and KPI definitions without dimensions (cc: @davidhayter-karhoo)
  • DeepDrills handling for missing data, NULL/NaN values (cc: @davidhayter-karhoo)
  • New Anomaly Detection model - EWMA
  • Config for enabling Sentry & PostHog for error handling & user analytics (cc: @coindcx-gh)
  • Improved Alerting logic
  • Bug Fixes
      • Data Sources not showing on installation (cc: @omriAl, @nsankar)
      • Other Bug Fixes

We're happy to inform you that we've reached a Community Size of 50 with teams from 10 different time zones in such a short period of time! We look forward to working closely with all of you to support your use cases before we open up to the Public.

🧮 New Model(s)

We added a new model for Anomaly Detection: EWMA. The Exponentially Weighted Moving Average (EWMA) is a statistic that averages the data so that observations receive progressively less weight the further back in time they are. EWMA is better suited to cases where the data is largely static and can then undergo a sudden state change.

  • feat(anomaly): add EWMA Model (#428)
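
To make the idea concrete, here is a minimal, illustrative sketch of EWMA-based anomaly flagging in plain Python. It is not the Chaos Genius implementation: the function name and the `alpha`, `threshold` and `warmup` parameters are assumptions chosen for the example. Each point is compared against the running exponentially weighted mean and standard deviation before those statistics are updated.

```python
from math import sqrt


def ewma_anomalies(values, alpha=0.3, threshold=3.0, warmup=5):
    """Return indices of points far from the running EWMA.

    alpha:     weight given to the newest observation (0 < alpha <= 1).
    threshold: number of exponentially weighted standard deviations
               a point must deviate by to be flagged.
    warmup:    number of initial points used only to seed the statistics
               (hypothetical parameter, added to avoid early false alarms).
    """
    if not values:
        return []

    mean = float(values[0])  # running exponentially weighted mean
    var = 0.0                # running exponentially weighted variance
    anomalies = []

    for i, x in enumerate(values[1:], start=1):
        diff = x - mean
        std = sqrt(var)
        # Compare against the statistics computed from earlier points only.
        if i >= warmup and std > 0 and abs(diff) > threshold * std:
            anomalies.append(i)
        # Incremental update: older points decay by a factor of (1 - alpha).
        incr = alpha * diff
        mean += incr
        var = (1 - alpha) * (var + diff * incr)

    return anomalies


# A mostly static series with one sudden jump.
series = [10, 10.1, 9.9, 10, 10.2, 9.8, 10, 25, 10]
print(ewma_anomalies(series))
```

In this sketch the jump to 25 (index 7) is the only point flagged. A larger `alpha` makes the average follow recent data more closely, while a smaller one keeps a longer memory.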

🎉 New Features

Task and status observability on your Analytics

Analytics can fail for a variety of reasons, e.g. database access/authorization errors, network errors, or incomplete data. While we are covering as many edge cases as possible, adding a Task Status is our first step towards faster incident detection. We are adding more features to it, including exact errors & diagnoses when analytics fail. On a local installation, the task status should be available at:

http://127.0.0.1:8080/api/status/
Task Monitor
  • Store task & subtask status and create a view for it for streamlined troubleshooting (#459)
  • Observable tasks deepdrills (#446)
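
If you would rather poll the status page from a script than open it in a browser, something like the sketch below could work. Only the URL comes from these notes; the response format is not described here, so the helper assumes nothing about it and simply returns parsed JSON when the body is JSON and the raw text otherwise. The function names are hypothetical.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

# Default address of the task status page on a local installation.
STATUS_URL = "http://127.0.0.1:8080/api/status/"


def parse_status_body(body):
    """Return decoded JSON if the body is JSON, else the raw text."""
    try:
        return json.loads(body)
    except ValueError:
        return body


def fetch_task_status(url=STATUS_URL, timeout=5):
    """Fetch the status page; return None if it cannot be reached."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (URLError, OSError) as exc:
        print(f"Could not read task status from {url}: {exc}")
        return None
    return parse_status_body(body)
```

Calling `fetch_task_status()` against a running local installation returns whatever the endpoint serves; inspect that value before relying on any particular field.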

Error handling and user analytics to give better support (Sentry, PostHog)

To identify errors sooner, you can now connect your Sentry account by setting the SENTRY_DSN parameter in docker-compose.yml. We can also provide you with our Sentry token so we can closely monitor any issues you might be facing.

We've also added PostHog, an open-source analytics tool, to capture user activity and help us better inform the product roadmap as we open up our repos for public access. We have added an option to anonymize the data before it is shared. It is also possible to disable PostHog.

  • Init the sentry integration (#357)
  • Posthog user identification & redirection (#462)

More dataset configurations/missing data support

In previous versions, analytics failed when there was no data for the past 5 days. We call this window the 'slack length'. We've made this value configurable (MAX_DEEPDRILLS_SLACK_DAYS and MAX_ANOMALY_SLACK_DAYS) in docker-compose.yml and updated the default to 14 days. This parameter helps us perform anomaly detection on the latest data for the most accurate results.

In our previous versions, we also required users to select dimensions as a mandatory field. We've now made this optional. You need to specify dimensions only if you need sub-dimensional insights.

  • Make slack configurable for DeepDrills and Anomaly (#434)
  • Remove the mandatory option for the dimension (#445)
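
As a rough illustration of how a slack window can be applied, the sketch below reads the two settings from the environment (falling back to the 14-day default) and checks whether the newest data point is recent enough. The variable names come from these notes; the helper functions and the exact comparison are assumptions for illustration, not the logic Chaos Genius itself uses.

```python
import os
from datetime import date, timedelta

DEFAULT_SLACK_DAYS = 14  # default stated in these release notes


def slack_days(var_name, default=DEFAULT_SLACK_DAYS):
    """Read a slack setting from the environment, falling back to the default."""
    raw = os.environ.get(var_name, "")
    try:
        value = int(raw)
    except ValueError:
        return default  # unset or not a number
    return value if value >= 0 else default


def has_recent_data(latest_data_date, today, max_slack_days):
    """True if the newest data point is at most max_slack_days old."""
    return (today - latest_data_date) <= timedelta(days=max_slack_days)


deepdrills_slack = slack_days("MAX_DEEPDRILLS_SLACK_DAYS")
anomaly_slack = slack_days("MAX_ANOMALY_SLACK_DAYS")

# Hypothetical dates: the newest row is 9 days old.
latest = date(2021, 12, 1)
today = date(2021, 12, 10)
print(has_recent_data(latest, today, 5))   # outside the old 5-day window
print(has_recent_data(latest, today, 14))  # inside the new 14-day default
```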

Robust DeepDrills for missing data & errors

Our first implementation of DeepDrills required complete datasets with the last 60 days of data to run successfully. We've enhanced DeepDrills to be more granular in order to work with incomplete data sets & handle missing data.

  • Handle DeepDrills analytics failures gracefully with partial analytics in case of subtask errors (#458)
  • Account for NaN & NULL values in DeepDrill analysis (#437)
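
The general pattern behind both changes can be sketched as follows: clean missing values before aggregating, and run each subtask independently so that one failure does not discard the results of the others. This is an illustrative outline only; the function names and the subtask structure are hypothetical and do not mirror the DeepDrills code.

```python
import math


def clean_values(values):
    """Drop NULL (None) and NaN entries before aggregating."""
    return [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]


def run_subtasks(subtasks):
    """Run each subtask independently and keep whatever succeeds.

    subtasks maps a name to a zero-argument callable. Returns a pair
    (results, errors) so that partial analytics can still be reported.
    """
    results, errors = {}, {}
    for name, func in subtasks.items():
        try:
            results[name] = func()
        except Exception as exc:  # record the failure and keep going
            errors[name] = repr(exc)
    return results, errors


raw = [10.0, None, float("nan"), 20.0]
cleaned = clean_values(raw)

results, errors = run_subtasks({
    "mean": lambda: sum(cleaned) / len(cleaned),
    "mean_of_empty": lambda: sum([]) / len([]),  # fails: division by zero
})
print(results)
print(errors)
```

Here the "mean" subtask still produces a result even though "mean_of_empty" raises, which is the behavior the partial-analytics change aims for.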

Improved alerting logic

We've enhanced our alert logic to instantly trigger alerts once an anomaly is detected. We've also made a few improvements in the alert format. We'll continue to build out the alerting functionality in our future releases.

  • Update the anomaly alert implementation (#467)
  • Change the email format for more clarity (#477)

Improved analytics indexing

We have optimized our indexes to provide faster drill-downs for large KPIs & dimensions.

  • Add the analytics data index (#461)

🐛 Bug Fixes

  • Handle KPI queries with trailing semicolon for KPI validation & analytics (#429)
  • Validate the duplicate column in the result dataset of a query defined KPI (#441)
  • Snowflake connector mentions setting up with a hostname, where the hostname is actually not required (#438) (cc: @joshuataylor)
  • Metric columns with NaNs in the first 10 or more rows fail KPI validation (#444)
  • Validation for the dimension column in the add KPI screen (#450)
  • DeepDrills fails for KPI with no dimensions defined (#468)
  • Handle empty data in comparison data frame for mean aggregation in DeepDrills (#494)

The Contributors

We have 15+ contributors across 10 different time zones who have made commits to our GitHub repo to make Chaos Genius better than it was when they found it.

We are thankful to each one of you, and we're very excited about what the future holds for Chaos Genius in the open-source ecosystem.

Chaos Genius is an open-source business observability platform democratizing access to AI-powered Anomaly Detection for businesses around the world. Check out our GitHub repository and give it a spin!