This feature is pre-release, and is made available to you as a Beta-quality feature. We welcome your feedback on what you find most effective. Please use the Intercom button or talk to us in Pollinators Community Slack to get in touch and let us know what you think.
BubbleUp helps explain how some data points differ from the other points returned by a query. It compares a subset of the data against the rest and surfaces potential places to look for signal within your data.
For example, consider the graph below, which shows the statistical distribution of roundtrip_dur of an application’s requests over the selected time period.
In this set of points, for example, the analyst might want to distinguish the strange group of events that have a surprisingly-high
In this screenshot:
mysql_replset was active during that section), but in the other fields, like C hostname, they are fairly similar.
mysql_dur. (It is also very different with regard to roundtrip_dur, but that was the initial selection.)
This can help your analysis by showing which fields are the most likely next starting points. In this case, it seems clear that one particular mysql_replset went through a transient period in which it slowed down requests.
Currently, BubbleUp mode is only supported for heatmaps.
To access BubbleUp mode:
BubbleUp mode works based on a selection you make within a heatmap.
Click within the heatmap to select one corner, and drag to cover the opposite corner. Ensure your selection covers some or all of the points that you want to investigate.
The selected area is called the selection; the entire area of the shown heatmap is the baseline.
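The relationship between the selection and the baseline can be sketched in a few lines of Python. This is only an illustration, not how BubbleUp is implemented: the `Event` type, the `split_selection` function, and the choice of `timestamp` and `roundtrip_dur` as the heatmap axes are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical event with the two values plotted on the heatmap."""
    timestamp: float      # x-axis position (seconds)
    roundtrip_dur: float  # y-axis position (milliseconds)


def split_selection(events, t_min, t_max, dur_min, dur_max):
    """Return (selection, baseline) for a rectangle dragged on the heatmap.

    The selection holds the events inside the rectangle. The baseline is
    every event shown in the heatmap, so it includes the selected events.
    """
    selection = [
        e for e in events
        if t_min <= e.timestamp <= t_max
        and dur_min <= e.roundtrip_dur <= dur_max
    ]
    baseline = list(events)
    return selection, baseline
```

Note that the baseline is not "everything except the selection": following the definition above, it covers the entire area of the shown heatmap.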
The BubbleUp charts are displayed below the heatmap.
A BubbleUp is based on a selection of points queried from the dataset. It shows every (non-empty) column in the dataset. For each column, it shows a histogram of values within the baseline in blue, and those from the selection in green. The histogram shows the distribution of different values for the dataset. The height of each bar is proportional to the number of times the value occurs in the results of the query.
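As a rough sketch of the histograms described above, the following Python counts how often each value occurs per column, once for the baseline and once for the selection. It assumes events are plain dictionaries and that an empty value is represented as `None`; the function name `column_histograms` is hypothetical and does not correspond to any BubbleUp API.

```python
from collections import Counter


def column_histograms(baseline_rows, selection_rows):
    """Count value occurrences per column for the baseline and the selection.

    Rows are dicts mapping column name to value. Columns that are empty
    (None or missing) in every baseline row are skipped.
    """
    columns = sorted(
        {col for row in baseline_rows for col, value in row.items() if value is not None}
    )
    histograms = {}
    for col in columns:
        histograms[col] = {
            # Bar height is proportional to the number of times a value occurs.
            "baseline": Counter(
                row[col] for row in baseline_rows if row.get(col) is not None
            ),
            "selection": Counter(
                row[col] for row in selection_rows if row.get(col) is not None
            ),
        }
    return histograms
```

A value that never occurs in one of the two groups simply has a count of zero there, which corresponds to a missing bar.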
A BubbleUp shows a series of miniature histograms, one for each column in the dataset. The columns are divided into two groups, for categorical dimensions and continuous measures.
A dimension is a column that can be used to group, separate, or filter data items. In BubbleUp, categorical and ordinal data are visualized together. Categorical columns are those in which the values do not fall in a meaningful order. Examples of categorical columns include
A low-cardinality, categorical dimension. In BubbleUp, categorical dimensions are shown captioned with the relevant value. The field platform has five distinct values; in both the baseline and selection, there are more “android” and “ios” values than “js” and “rest”.
A high-cardinality, categorical dimension.
When there are many columns, only the top fifty are shown, including some from each of baseline and selection. In hostname, the baseline set is truncated. In endpoint, the one bar of the selection stands out as a visible outlier. It can be interpreted to mean “there is only one value for endpoint within the selection.”
An ordinal dimension is one that has a meaningful order. In status_code, the values are numeric, and so are arranged in ascending order. The value 200 occurs frequently in both baseline and selection. Code 500 occurs less frequently in the selection, but almost never occurs in the baseline. Conversely, code 400 is rare in the baseline, but never appears at all in the selection.
Very different bar heights between the baseline and selection can indicate that a column is worth investigating. For example, it could be valuable to learn how status_code differs, or what happens with the one specific endpoint.
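One way to put a number on "very different bar heights" is to compare the share of each value in the two groups. The sketch below uses total variation distance, chosen here purely for illustration; the documentation does not say how BubbleUp itself compares columns, and `distribution_gap` is a hypothetical name.

```python
from collections import Counter


def distribution_gap(baseline_values, selection_values):
    """Total variation distance between two lists of categorical values.

    Returns 0.0 when both groups have identical value proportions and 1.0
    when they share no values at all.
    """
    if not baseline_values or not selection_values:
        raise ValueError("both groups need at least one value")
    base = Counter(baseline_values)
    sel = Counter(selection_values)
    n_base = len(baseline_values)
    n_sel = len(selection_values)
    # Compare proportions rather than raw counts, because the selection is
    # usually much smaller than the baseline.
    return 0.5 * sum(
        abs(base[value] / n_base - sel[value] / n_sel)
        for value in set(base) | set(sel)
    )
```

Sorting columns by a score like this would put fields such as status_code or endpoint near the top when their selection looks unlike the baseline.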
A tooltip is displayed when you hover your mouse over a pair of histogram bars, displaying the field value they represent.
Continuous, numerical dimensions are those where individual values are not as important as the overall distribution. In the screenshot below, the baseline and selection are very different for roundtrip_dur; they seem very similar for fraud_dur. This can help validate hypotheses: for example, the fact that mysql_dur is as different as roundtrip_dur might suggest that roundtrip time is being driven by mysql time.
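For continuous measures, the comparison is between whole distributions rather than individual values. As an illustration only, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distributions. This is a standard statistic picked for the example, not a description of what BubbleUp computes, and `ks_distance` is a hypothetical helper.

```python
from bisect import bisect_right


def ks_distance(baseline, selection):
    """Largest gap between the empirical CDFs of two numeric samples.

    Returns a value between 0.0 (identical distributions) and 1.0
    (no overlap at all).
    """
    a = sorted(baseline)
    b = sorted(selection)
    if not a or not b:
        raise ValueError("both samples need at least one value")
    # Both CDFs are step functions that only change at observed values,
    # so checking those points is enough to find the maximum gap.
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )
```

Under this measure, a field like fraud_dur that looks similar in both groups would score close to 0, while fields like roundtrip_dur and mysql_dur would score much higher.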
mysql_replset above, it might be valuable to filter on db_shard_1. Other times, it can be valuable to pursue a breakdown to understand better how fields vary from each other.
user_id are stored as numbers, and so are shown as continuous measures. The easiest way to fix this is to go to the Dataset Settings page and adjust the data type to “string.”
REG_VALUE('$field', '*([0-9]+).*')’ will not show correctly. To fix this, coerce the regular expression to a float with the function
string-typed columns, representing an empty string.