Table of Contents
- Introduction
- Detecting Differences in Distributions
Introduction
In data analysis, comparing columns of data across various splits can be likened to ensuring different sections of a pie have the same flavor and texture. Just as a chef wants each slice of a pie to taste the same, analysts aim for consistent feature distributions across dataset splits. This consistency is crucial for reliable inferences and predictions. In machine learning, datasets are commonly split into training, validation, and test sets, a practice that assumes feature distributions across these splits are similar. As a result, understanding these distributions (along with strategies to discern differences between them) is highly useful. This blog post shares strategies for determining whether a feature's distribution differs significantly across splits.
What are Feature Distributions?
Feature distributions describe the statistical patterns and spread of values within individual variables in your dataset. For example, in a customer dataset, age might follow a normal distribution, while income could show a right-skewed distribution. These patterns provide insights into the underlying structure of your data and help validate assumptions about your population.
Detecting Differences in Distributions
Setup
Before we showcase the functionality we’ll need to set things up.
Dummy Dataset
For demo purposes we'll generate some dummy data containing a handful of features alongside a split column. You can substitute your own dataset here.
Note: Be aware that depending on the source of your data, you may violate some of the tests' assumptions. Make sure you understand these assumptions and your data so you can decide which tests are most applicable for your use case. The data shown here is just a placeholder to showcase the approach.
Here is a sample of what it looks like:
split age income height gender occupation owns_car registration_date
0 train 20 34996 1.512022 Male Teacher True 2020-04-09
1 train 29 22473 1.623053 Male Artist False 2021-03-17
2 train 26 49354 1.715669 Female Engineer True 2020-04-18
3 test 31 98455 1.823319 Female Doctor False 2022-06-28
4 train 30 25759 1.554452 Female Teacher False 2020-06-08
As you can see we have a number of different features relating to a person. These features also vary widely in terms of datatype.
dummy dataset creation
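The generation code is collapsed above; a minimal sketch of what it might look like is below. Note this sketch is an assumption: the post's actual generator evidently injects distributional shift between the splits (which is why the later tests flag significant differences), whereas this uniform version only illustrates the shape of the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Hypothetical generator: columns mirror the sample shown above
data = pd.DataFrame({
    'split': rng.choice(['train', 'validation', 'test'], size=n, p=[0.6, 0.2, 0.2]),
    'age': rng.integers(18, 65, size=n),
    'income': rng.integers(20_000, 100_000, size=n),
    'height': rng.normal(1.7, 0.1, size=n),
    'gender': rng.choice(['Male', 'Female'], size=n),
    'occupation': rng.choice(['Teacher', 'Artist', 'Engineer', 'Doctor'], size=n),
    'owns_car': rng.choice([True, False], size=n),
    'registration_date': pd.Timestamp('2020-01-01')
        + pd.to_timedelta(rng.integers(0, 1_000, size=n), unit='D'),
})
```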
Defining & Loading Config
For our configuration we’ll want to track the features and their corresponding datatypes so we can ensure alignment. The naming of these datatypes is somewhat arbitrary and based on the preprocessing function we’ll encounter in a minute.
data_types:
features:
discrete:
gender: string
occupation: string
owns_car: bool
registration_date: date
continuous:
age: numeric
income: numeric
height: numeric
With our config in hand, we can load it using the simple snippet below:
load_yaml
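The helper itself is presumably a thin wrapper around PyYAML; a minimal sketch, assuming that's all it does:

```python
import yaml

def load_yaml(path: str) -> dict:
    """Load a YAML file into a plain dict."""
    with open(path) as f:
        return yaml.safe_load(f)
```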
config = load_yaml('config.yaml')
config
which should yield our config:
{'data_types': {'features': {'discrete': {'gender': 'string',
'occupation': 'string',
'owns_car': 'bool',
'registration_date': 'date'},
'continuous': {'age': 'numeric',
'income': 'numeric',
'height': 'numeric'}}}
}
Preprocessing Dataset
- Get your data into whatever format is required to perform the desired comparisons; the specifics depend on your context.
- If you have data quality issues, you'd likely want to address these first, as they could skew your results and lead you to incorrect conclusions.
- Since we've generated a dataset in an already compatible format for demo purposes, we don't do much here, but you may want to consider more steps for your own data.
- Time-based features should be considered continuous for the most part. This also allows powerful non-parametric tests to be applied.
- Time is an inherently continuous variable that can be measured to an exact amount, rather than just counted in discrete units.
- Even though time is often measured in discrete units like days, the underlying time scale is continuous and can be broken down into infinitely small increments like seconds, milliseconds, etc.
- Treating time as a continuous variable allows for more precise modeling and analysis, as the distance between any two time points is known. This is preferable to treating time as a discrete, ordinal variable.
- While time can be discretized into units like days, this is often done for convenience or due to data limitations, not because time is inherently discrete. The discretization is an approximation of the underlying continuous nature of time.
The way we defined our config lends itself to quickly performing datatype checks and transforming columns to the appropriate types.
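If you need to roll your own, one plausible convert_dtypes is sketched below; the datatype names match the config above, but the exact casts are assumptions:

```python
import pandas as pd

def convert_dtypes(df: pd.DataFrame, datatypes: dict) -> pd.DataFrame:
    """Cast each column to the type named in the config
    (string/bool/date/numeric, per the post's own naming)."""
    df = df.copy()
    casts = {
        'string': lambda s: s.astype('string'),
        'bool': lambda s: s.astype(bool),
        'date': pd.to_datetime,
        'numeric': pd.to_numeric,
    }
    for col, dtype in datatypes.items():
        df[col] = casts[dtype](df[col])
    return df
```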
datatypes_config = {**config['data_types']['features']['discrete'], **config['data_types']['features']['continuous']}
data_processed = convert_dtypes(data, datatypes_config)
data_processed.head()
which gives:
split age income height gender occupation owns_car registration_date
0 train 20 34996 1.512022 Male Teacher True 2020-04-09
1 train 29 22473 1.623053 Male Artist False 2021-03-17
2 train 26 49354 1.715669 Female Engineer True 2020-04-18
3 test 31 98455 1.823319 Female Doctor False 2022-06-28
4 train 30 25759 1.554452 Female Teacher False 2020-06-08
Visualising Dataset
Now that we have things set up, the first step in spotting a difference is visualisation. The idea here is that by inspecting graphs and their associated descriptive statistics, you can get an indication of whether things look similar or not.
Generating Automated Reports
- SweetViz is a nice package for generating all the standard plots and descriptive statistics for our data.
- For the purpose of viewing basic plots of the data, there is little point in creating lots of custom functionality.
- Custom functionality can be considered later, once you know exactly what you need.
In a few lines of code we can generate our plots
import sweetviz as sv
report_train = sv.analyze(
data_processed[data_processed['split'] == 'train'],
pairwise_analysis='off'
)
report_validation = sv.analyze(
data_processed[data_processed['split'] == 'validation'],
pairwise_analysis='off'
)
report_test = sv.analyze(
data_processed[data_processed['split'] == 'test'],
pairwise_analysis='off'
)
report_train.show_html('./data/train_report.html')
report_validation.show_html('./data/validation_report.html')
report_test.show_html('./data/test_report.html')
All things going well you should get output shown below:
Feature: registration_date |██████████| [100%] 00:00 -> (00:00 left)
Feature: registration_date |██████████| [100%] 00:00 -> (00:00 left)
Feature: registration_date |██████████| [100%] 00:00 -> (00:00 left)
Report ./data/train_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
Report ./data/validation_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
Report ./data/test_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
Performing Statistical Tests and Calculating Custom Metrics
Approach to Perform This:
- Perform statistical testing and identify significant features after correcting for multiple tests.
- For the significant results, visualize the distributions using density plots.
- Assess the practical significance by examining the degree of separation between the distributions and the effect size (e.g., Cohen’s d).
- Based on this combined evidence, conclude whether the distributions are genuinely different.
- Prioritize results that are both statistically significant and practically meaningful based on these factors.
- For results that are statistically significant but lack practical significance, interpret them cautiously or consider them less important.
Test Selection:
- Various different tests exist, so it’s worthwhile investigating their use based on data types and assumptions.
- Non-parametric tests might be good candidates due to their minimal assumptions about the underlying distributions. These work well with continuous features.
- For example, a good candidate for direct distribution comparison would be the Kolmogorov-Smirnov test.
Possible Tests with Assumptions:
- Kruskal-Wallis
- Assumptions: Independent samples, observations are ordinal or continuous, distributions of the feature across splits have the same shape (but not necessarily normal).
- Interpretations: Tests whether the medians of the groups are significantly different. A significant result indicates that at least one group has a significantly different median compared to the others.
- Mann-Whitney U
- Assumptions: Independent samples, observations in groups are ordinal or continuous, distributions of the feature across splits have the same shape (but not necessarily normal).
- Interpretations: Tests whether the medians of the two groups are significantly different. If the result is significant, the medians differ significantly.
- Kolmogorov-Smirnov
- Assumptions: Independent samples, distributions in groups are continuous.
- Interpretations: Compares the cumulative distribution functions (CDFs) of two groups and tests whether they are significantly different. If results are significant, the cumulative distributions (and hence the underlying distributions) differ significantly.
- Permutation
- Assumptions: Independent samples, observations in groups are ordinal or continuous.
- Interpretations: Tests the difference in means between groups by randomly permuting the group labels and calculating the test statistic for each permutation. If results are significant (within the extreme tails of the permutation distribution), it indicates that the difference in means between the two groups is significant.
- Chi-squared
- Assumptions: Independent samples, observations in groups are nominal or ordinal, frequency/count data, expected counts of 5 or more in at least 80% of cells, and no cell with an expected count of less than one. Some empirical analyses suggest these rules of thumb may be overly conservative anyway.
- Interpretations: Tests if two categorical variables are independent or related by calculating the chi-square statistic on data. If results are significant (small p-value), there is a statistically significant relationship between the two variables; otherwise, not sufficient evidence to conclude the variables are related.
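To make these concrete, here's how the tests above map onto scipy.stats (a quick sketch on synthetic inputs, not the post's own helper):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 500)  # a continuous feature in one split
b = rng.normal(0.2, 1.0, 500)  # the same feature in another split

print(stats.kruskal(a, b))       # Kruskal-Wallis (works for 2+ groups)
print(stats.mannwhitneyu(a, b))  # Mann-Whitney U
print(stats.ks_2samp(a, b))      # Kolmogorov-Smirnov

# Permutation test on the difference in means
res = stats.permutation_test(
    (a, b),
    lambda x, y: np.mean(x) - np.mean(y),
    n_resamples=9999,
    alternative='two-sided',
)
print(res.pvalue)

# Chi-squared needs a contingency table of counts for a discrete feature
counts = pd.crosstab(
    pd.Series(rng.choice(['train', 'test'], 200), name='split'),
    pd.Series(rng.choice(['Male', 'Female'], 200), name='gender'),
)
print(stats.chi2_contingency(counts))
```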
- Alongside performing testing, calculating an effect size measure can be important to help quantify the magnitude of the difference between distributions, rather than just whether there is a difference or not, as spoken about here.
- A number of standard measures exist that people tend to use. Cohen's d is a good example that works with continuous features that follow a normal distribution.
- Due to this normality constraint (which a number of our features seem to violate), alongside susceptibility to outliers, I decided to implement a new measure based on the following article, which proposes a more robust measure that remains broadly consistent with Cohen's d.
- Separate to hypothesis testing, other metrics can help measure divergence between probability distributions.
- A good example of this is Jensen-Shannon Divergence, which provides a normalized measure (0 implying identical and 1 implying maximally different) for difference.
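For continuous features, one way to estimate this is to histogram both samples onto shared bins and use scipy's jensenshannon, which returns the JS distance (the square root of the divergence):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)
b = rng.normal(0.5, 1.0, 1000)

# Discretise both samples onto a shared set of bins
bins = np.histogram_bin_edges(np.concatenate([a, b]), bins=30)
p, _ = np.histogram(a, bins=bins)
q, _ = np.histogram(b, bins=bins)

# jensenshannon normalises the inputs and returns the JS distance;
# square it for the divergence, base=2 bounds it within [0, 1]
jsd = jensenshannon(p, q, base=2) ** 2
print(jsd)
```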
- I didn't come across any single clear rule as to whether you should perform multiple-test corrections or not; it depends on the scenario.
- In our case it might not be strictly needed, but it is desirable, since we are ultimately drawing conclusions from the results of multiple tests (whether each feature shows a significant difference across splits), which is a scenario where correction is advisable.
- Some detailed discussion on whether it’s necessary and how to go about it can be found here.
- A breakdown of when to use which correction method is outlined here
- Multiple-test correction helps control the family-wise error rate (FWER), but there is a trade-off against power. For example, Bonferroni allows us to control the FWER; however, it is known for being conservative, and results may vary depending on which tests are considered part of the family.
- Even defining the family of tests can be subjective, but it can be guided by a few key factors:
- The purpose of the tests: Tests that address a common research question or hypothesis should be grouped together as a family.
- The content of the tests: Tests that are similar in terms of the variables, hypotheses, and populations involved are more likely to be considered part of the same family.
- Current Implementation:
- For now, I have added corrections at a family = feature-test level:
- The reason for this is that we want to conclude whether, for a given feature-test, that test suggests a significant difference between the various splits, so we adjust across the pairwise comparisons between those splits (with three splits, 3 choose 2 = 3 comparisons).
- This choice also seems to serve the research question (which features show differences, hence does a feature show a notable difference?), since by correcting within each feature-test we can be more confident as to whether a particular test shows a difference for a feature between groups.
- Future Consideration:
- Perhaps making family = feature level might be good too?
- Since the tests being performed are similar in terms of hypotheses and variables involved, they could arguably be considered part of the same family.
- The only issue is that different features could have different numbers of tests performed, so some would receive heavier corrections than others. This would adjust some features more than others, but perhaps that is okay.
- In general, correction is most useful when you are performing rigorous planned testing rather than exploratory analysis.
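For the corrections themselves, statsmodels does the heavy lifting. As a sketch, feeding in the three raw gender p-values from the pairwise results shown later reproduces the table's Bonferroni-adjusted values:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# p-values for one family: the three pairwise comparisons
# (train/test, train/validation, test/validation) of one feature-test
pvals = np.array([4.742486e-06, 0.7776825, 0.01037853])

reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')
print(reject)      # [ True False  True]
print(p_adjusted)  # [1.4227458e-05 1.0 3.1135590e-02]
```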
General Resources
- https://towardsdatascience.com/how-to-compare-two-or-more-distributions-9b06ee4d430b
- https://www.statisticshowto.com/probability-and-statistics/definitions/parametric-and-non-parametric-data/
- https://influentialpoints.com/Training/kolmogorov-smirnov-test-principles-properties-assumptions.htm
- https://influentialpoints.com/Training/wilcoxon-mann-whitney_principles-properties-assumptions.htm
- https://influentialpoints.com/Training/Kruskal-Wallis_ANOVA_use_and_misuse.htm
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900058/
- https://www.questionpro.com/blog/nominal-ordinal-interval-ratio-data/
- https://www.geo.fu-berlin.de/en/v/soga/ivy/Basics-of-Statistical-Hypothesis-Tests/Chi-Square-Tests/Chi-Square-Independence-Test/index.html
Below is some functionality built to perform the majority of these tests; all you need to do is pass in the desired tests for your use case. If you don't pass any specific tests in, sensible defaults kick in (a chi-squared test for discrete features, and Kolmogorov-Smirnov and permutation tests for continuous features).
perform_pairwise_tests and dependencies
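The full implementation is collapsed above. As a rough illustration of its core loop, a trimmed-down sketch might look like the following; the real function also computes JSD, Cohen's d, and the gamma effect size, and lets you choose the tests:

```python
from itertools import combinations

import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

def perform_pairwise_tests(df, discrete_cols, continuous_cols, group_col='split'):
    """Run a two-sample test for every feature across every pair of splits,
    then Bonferroni-correct within each family = feature-test."""
    rows = []
    for feature in list(discrete_cols) + list(continuous_cols):
        for g1, g2 in combinations(df[group_col].unique(), 2):
            s1 = df.loc[df[group_col] == g1, feature].dropna()
            s2 = df.loc[df[group_col] == g2, feature].dropna()
            if feature in continuous_cols:
                stat, p = stats.ks_2samp(s1, s2)
                test_name, ftype = 'kolmogorov_smirnov_test', 'continuous'
            else:
                # Contingency table of category counts per split
                counts = pd.DataFrame(
                    {g1: s1.value_counts(), g2: s2.value_counts()}
                ).fillna(0)
                stat, p, _, _ = stats.chi2_contingency(counts)
                test_name, ftype = 'chi_squared_test', 'discrete'
            rows.append({
                'feature': feature, 'group1': g1, 'group2': g2,
                'test_name': test_name, 'test_statistic': stat,
                'p_value': p, 'type': ftype,
            })
    results = pd.DataFrame(rows)
    corrected = []
    for _, family in results.groupby(['feature', 'test_name']):
        reject, p_adj, _, _ = multipletests(family['p_value'], method='bonferroni')
        corrected.append(family.assign(
            significant_bonferroni=reject,
            bonferroni_corrected_p_value=p_adj,
        ))
    return pd.concat(corrected).sort_index()
```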
hypothesis_testing_results = perform_pairwise_tests(
data_processed,
discrete_cols = config['data_types']['features']['discrete'],
continuous_cols = config['data_types']['features']['continuous'],
group_col = 'split'
)
hypothesis_testing_results
Which gives you something like:
feature group1 group2 test_name test_statistic p_value type js_divergence cohens_d aakinshin_gamma aakinshin_gamma_dist effect_size significant_bonferroni bonferroni_corrected_p_value
0 gender train test chi_squared_test 24.517898 4.742486e-06 discrete 0.019733 NaN NaN NaN None True 1.422746e-05
1 gender train validation chi_squared_test 0.502874 7.776825e-01 discrete 0.000676 NaN NaN NaN None False 1.000000e+00
2 gender test validation chi_squared_test 9.136032 1.037853e-02 discrete 0.016459 NaN NaN NaN None True 3.113559e-02
12 age train test kolmogorov_smirnov_test 0.760000 4.981798e-90 continuous 0.444449 -2.701849 2.066277 {'quantiles': [0.2, 0.23157894736842108, 0.263...]} large True 1.494539e-89
13 age train test permutation_test NaN 0.000000e+00 continuous 0.444449 -2.701849 2.066277 {'quantiles': [0.2, 0.23157894736842108, 0.263...]} large True 0.000000e+00
14 age train validation kolmogorov_smirnov_test 1.000000 6.254039e-135 continuous 0.692191 -6.079607 5.024814 {'quantiles': [0.2, 0.23157894736842108, 0.263...]} large True 1.876212e-134
- The tests just performed act in a pairwise fashion between data splits.
- We can also perform tests at a higher level: not pairwise between all combinations, but feeding the feature across every split at once.
- I suspect this may be the better way to run the chi-squared test (across all splits at once rather than pairwise), as it aligns with the level at which we ultimately want to draw conclusions (is the feature different across splits, rather than between a specific pair of splits?).
- These don't account for multiple tests, but that should be okay since this is exploratory and the chi-squared test is conservative as it is.
- Usage:
- These should be used in conjunction with the pairwise tests, plots, and other factors to gauge whether things are lining up.
Non-Pairwise tests: Continuous Features
Here we make use of “one-way ANOVA” to test the null hypothesis that our data splits have the same population mean.
perform_continuous_feature_test
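The collapsed function presumably wraps scipy's one-way ANOVA; a minimal sketch, with alpha as an assumed parameter:

```python
import pandas as pd
from scipy import stats

def perform_continuous_feature_test(df, features, split_column='split', alpha=0.05):
    """One-way ANOVA per feature across all splits at once."""
    rows = []
    for feature in features:
        samples = [grp[feature].dropna() for _, grp in df.groupby(split_column)]
        f_stat, p = stats.f_oneway(*samples)
        rows.append({'feature': feature, 'f_statistic': f_stat,
                     'p_value': p, 'significant': p < alpha})
    return pd.DataFrame(rows)
```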
perform_continuous_feature_test(
data_processed,
list(config["data_types"]["features"]["continuous"].keys()),
split_column='split'
)
This gives output like the following:
feature f_statistic p_value significant
0 age 1888.160614 0.000000e+00 True
1 income 3625.717099 0.000000e+00 True
2 height 509.976783 2.874950e-153 True
Non-Pairwise tests: Discrete Features
Likewise we can use the chi-squared test for discrete features. Again this is operating at an aggregated level.
perform_categorical_feature_test
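Again, a minimal sketch of what this might look like under the hood, using a contingency table of split × category counts:

```python
import pandas as pd
from scipy import stats

def perform_categorical_feature_test(df, features, split_column='split', alpha=0.05):
    """Chi-squared test of independence per feature, across all splits at once."""
    rows = []
    for feature in features:
        table = pd.crosstab(df[split_column], df[feature])
        chi2, p, _, _ = stats.chi2_contingency(table)
        rows.append({'feature': feature, 'chi2_statistic': chi2,
                     'p_value': p, 'significant': p < alpha})
    return pd.DataFrame(rows)
```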
perform_categorical_feature_test(
data_processed,
list(config["data_types"]["features"]["discrete"].keys()),
split_column='split'
)
Which gives:
feature chi2_statistic p_value significant
0 gender 25.201433 4.582897e-05 True
1 occupation 74.951330 5.044603e-13 True
2 owns_car 90.128987 2.683733e-20 True
3 registration_date 2000.000000 2.344813e-24 True
Custom Visualisations for Significant Features
In a prior section we already produced visualisations, so you might be wondering what the point of this section is. The way I see it, the initial visualisations were more generic, with the purpose of getting a feel for the features by inspection. Now that we have explored and have a grip on which features are significant versus not, we can generate more custom plots to confirm this for those we deem significant.
Plotting Distributions of “Statistically Significant” Features
- Purpose:
- Plotting those significant feature distributions to help grasp their practical significance.
Continuous Features
- Density Plots:
- Density plots can help assess practical significance by visualizing the distributions across the various pairs of splits.
- If the distributions overlap substantially, even if they are statistically significant, the practical difference may be negligible. Conversely, if the distributions are well separated, it may indicate a practical difference even if the statistical difference is marginal due to small sample size.
- A red color has been added to regions in the plot which signifies no overlap between the two distributions. This can help visually flesh out the amount of difference between them.
Discrete Features
- Bar Charts:
- Grouped bar charts are perhaps the best visualizations to use for discrete data and the recommended go-to.
- The bars for each group are positioned side-by-side, allowing for easy comparison of the values between the two groups across the different categories.
- Designed for when you want to compare values between two or more groups across multiple categories.
- Stacked bar charts can be challenging to interpret precisely, especially for the lower segments. They are best used when the main goal is to show the overall trend and composition, not to make exact comparisons between subcategories.
plot_significant_features
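The plotting implementation is collapsed above. As a flavour of the continuous half, here is a minimal sketch of the density overlay with the red shading between the curves (plot_density_pair is a hypothetical helper, not the post's API):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_density_pair(s1, s2, labels=('group1', 'group2')):
    """Overlay two KDEs and shade the region between them in red."""
    grid = np.linspace(min(s1.min(), s2.min()), max(s1.max(), s2.max()), 200)
    d1, d2 = gaussian_kde(s1)(grid), gaussian_kde(s2)(grid)
    plt.plot(grid, d1, label=labels[0])
    plt.plot(grid, d2, label=labels[1])
    plt.fill_between(grid, d1, d2, color='red', alpha=0.3)  # area between curves
    plt.legend()
    plt.show()
```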
plot_significant_features(
data_processed,
hypothesis_testing_results,
significance_col="significant_bonferroni"
)
The output plots look something like below, where both continuous and discrete features are displayed within the same figure. The red portion in the continuous plots represents the area in-between the curves, which helps highlight the magnitude of the differences, giving us a better immediate sense of “how” different the features are.
Plotting Effect Sizes for Continuous “Statistically Significant” Features
- A single value for effect size might not summarize things well, especially for complex distributions, and thus it’s useful to view plots where appropriate. The following are key considerations:
Observations
- Edge Cases:
- Edge cases are possible where the median value is 0 for features, which can cause the pooled estimates to be 0, leading to zero-division errors.
- This, in general, can occur if at least 50% of your data equals zero, which makes the 2nd quartile (the 0.5 quantile, i.e. the median) zero.
- For now, these have been avoided, and the corresponding plots have been left blank, but more techniques can handle this which haven’t been implemented yet.
- Interpretation Notes:
- The order in which the two groups are compared matters; if it were reversed, the graph would be reflected about the x-axis.
- The magnitude of the effect size is important as it can tell you for a given quantile of the distributions what the scale of difference was.
- If values fluctuate close to zero across all quantiles, it suggests the actual differences are likely very small and practically insignificant, meaning differences flagged as statistically significant likely arose due to large sample sizes.
- With a large enough sample, even trivial differences can become statistically significant.
- On the contrary, very large effect size differences across quantiles could indicate practical differences.
plot_significant_effect_sizes
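The collapsed implementation plots the gamma effect size across quantiles. As a simplified stand-in (plain quantile differences rather than the post's gamma measure), the sketch below uses the same 0.2–0.8 quantile grid that appears in the results table:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_quantile_shift(s1, s2, quantiles=np.linspace(0.2, 0.8, 20)):
    """Plot the quantile-wise shift between two samples as a rough
    stand-in for a quantile effect-size curve."""
    diffs = np.quantile(s2, quantiles) - np.quantile(s1, quantiles)
    plt.plot(quantiles, diffs, marker='o')
    plt.axhline(0, color='grey', linestyle='--')  # zero = no shift at that quantile
    plt.xlabel('quantile')
    plt.ylabel('group2 − group1 quantile difference')
    plt.show()
```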
plot_significant_effect_sizes(
hypothesis_testing_results,
significance_col="significant_bonferroni"
)
The results look something like:
Making Conclusions
- Summary of Findings:
- Taking into account our various datapoints (summary statistics, distribution plots, statistical tests, effect size measures & other metrics), it seems like the distributions of features amongst the splits for our data are showing signs of significant difference.
- Large differences are visible and most, if not all, tests show a significant result; the magnitude of the difference across the various tests seems large enough to suggest a true, practically significant difference.
- Ranking of Features:
- Below is a ranking of the features based on which seem to show the most difference.
- A lower rank number (e.g., rank = 1) means more difference; a higher rank (e.g., rank = 10) means less.
- This is done by aggregating the testing results/measures each feature shows across the various pairwise split comparisons, then using these to create a ranking of which shows the most difference.
- In case you want to stratify or investigate certain features in more detail, this can provide some guidance.
rank_features
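The ranking logic is collapsed above; one way it might be assembled is sketched below. The scoring here is an assumption, and the dense ranking lets equally different features share a rank (as height and registration_date do in the output):

```python
import pandas as pd

def rank_features(results, significance_column='significant_bonferroni',
                  continuous_test='kolmogorov_smirnov'):
    """Aggregate the pairwise results per feature and dense-rank by how
    different the feature looks across splits (a sketch, not the post's code)."""
    # For continuous features, keep only the chosen test so every feature
    # contributes a comparable number of rows
    mask = (results['type'] == 'discrete') | \
           results['test_name'].str.contains(continuous_test)
    agg = (results[mask]
           .groupby('feature')
           .agg(n_significant=(significance_column, 'sum'),
                mean_js=('js_divergence', 'mean')))
    # Blend significance counts and divergence into a single score
    score = agg['n_significant'] + agg['mean_js']
    agg['feature_rank'] = score.rank(ascending=False, method='dense').astype(int)
    return agg.sort_values('feature_rank').reset_index()[['feature', 'feature_rank']]
```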
ranked_feature_differences = utils.rank_features(
hypothesis_testing_results,
significance_column="significant_bonferroni",
continuous_test="kolmogorov_smirnov",
)
ranked_feature_differences
This gives a view like:
feature feature_rank
0 income 1
1 age 2
2 height 3
3 registration_date 3
4 owns_car 4
5 occupation 5
6 gender 6