Datasets and Scenarios (Updating)
We support 3 types of scenarios with various datasets. The common dataset-splitting code lives in ./dataset/utils for easy extension.
If you need another dataset, just write another script to download it and then reuse these utils.
Label Skew Scenario
For the label skew scenario, we introduce 16 famous datasets:
- MNIST (see examples here)
- EMNIST
- FEMNIST
- Fashion-MNIST
- Cifar10
- Cifar100
- AG News
- Sogou News
- Tiny-ImageNet
- Country211
- Flowers102
- GTSRB
- Shakespeare
- Stanford Cars
- COVIDx (chest X-ray images for COVID-19)
- Kvasir (endoscopic images for gastrointestinal disease detection)
The datasets can be easily split into IID and non-IID versions. In the non-IID scenario, we distinguish between two types of distribution:
- Pathological non-IID: In this case, each client only holds a subset of the labels, for example, just 2 out of 10 labels from the MNIST dataset, even though the overall dataset contains all 10 labels. This leads to a highly skewed distribution of data across clients.
- Practical non-IID: Here, we model the data distribution using a Dirichlet distribution, which results in a more realistic and less extreme imbalance. For more details on this, refer to this paper.
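The two non-IID strategies above can be sketched with NumPy alone. This is a minimal illustration, not the code in ./dataset/utils: the function names `pathological_split` and `dirichlet_split`, the round-robin label assignment, and the `alpha` concentration parameter are all assumptions made for the example.

```python
import numpy as np

def pathological_split(labels, num_clients, classes_per_client, rng):
    """Hypothetical helper: each client only receives `classes_per_client` labels."""
    classes = np.unique(labels)
    # Assumption: label subsets are handed out round-robin.
    client_classes = [
        [classes[(c * classes_per_client + k) % len(classes)]
         for k in range(classes_per_client)]
        for c in range(num_clients)
    ]
    client_idx = [[] for _ in range(num_clients)]
    for cls in classes:
        owners = [c for c in range(num_clients) if cls in client_classes[c]]
        if not owners:
            continue  # no client was assigned this label
        idx = rng.permutation(np.where(labels == cls)[0])
        # Share this label's samples among the clients that own it.
        for owner, chunk in zip(owners, np.array_split(idx, len(owners))):
            client_idx[owner].extend(chunk.tolist())
    return client_idx

def dirichlet_split(labels, num_clients, alpha, rng):
    """Hypothetical helper: per-label client shares drawn from Dirichlet(alpha)."""
    client_idx = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.where(labels == cls)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn the sampled proportions into cut points over this label's samples.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for c, chunk in enumerate(np.split(idx, cuts)):
            client_idx[c].extend(chunk.tolist())
    return client_idx

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)  # toy data: 10 labels, 100 samples each
pat = pathological_split(labels, num_clients=5, classes_per_client=2, rng=rng)
dir_ = dirichlet_split(labels, num_clients=5, alpha=0.1, rng=rng)
```

In this sketch a smaller `alpha` concentrates each label on fewer clients, while a larger `alpha` moves the split toward IID.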
Additionally, we offer a balance option, where the amount of data is evenly distributed across all clients.
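As a rough illustration of what balancing means, the sketch below splits shuffled sample indices either evenly or at random cut points. The function name `iid_split` and the way unbalanced sizes are drawn are assumptions for this example; the source does not say how the unbalanced sizes are chosen.

```python
import numpy as np

def iid_split(num_samples, num_clients, balance, rng):
    """Hypothetical helper: IID split with or without equal client sizes."""
    idx = rng.permutation(num_samples)
    if balance:
        # Equal shares; sizes differ by at most one sample.
        parts = np.array_split(idx, num_clients)
    else:
        # Assumption: unbalanced sizes come from random, distinct cut points.
        cuts = np.sort(rng.choice(np.arange(1, num_samples),
                                  size=num_clients - 1, replace=False))
        parts = np.split(idx, cuts)
    return [p.tolist() for p in parts]

rng = np.random.default_rng(0)
balanced = iid_split(70000, 20, balance=True, rng=rng)
unbalanced = iid_split(70000, 20, balance=False, rng=rng)
print({len(c) for c in balanced})  # {3500}
```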
Feature Shift Scenario
For the feature shift scenario, we utilize 3 widely used datasets in Domain Adaptation:
- Amazon Review (raw data can be fetched from this link, see examples here)
- Digit5 (raw data available at this link)
- DomainNet
Real-World Scenario
For the real-world scenario, we introduce 5 naturally separated datasets:
- Camelyon17 (tumor tissue patches extracted from breast cancer metastases in lymph node sections, 5 hospitals, 2 labels)
- iWildCam (194 camera traps, 158 labels)
- Omniglot (20 clients, 50 labels)
- HAR (Human Activity Recognition) (30 clients, 6 labels, see examples here)
- PAMAP2 (9 clients, 12 labels)
For more details on datasets and FL algorithms in IoT, please refer to FL-IoT.
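For naturally separated data, no sampling is needed: each data source (a hospital, a camera trap, a subject) simply becomes one client. A minimal sketch of that grouping, assuming every sample carries a source ID; the function name `split_by_source` is hypothetical and not part of the repository:

```python
import numpy as np

def split_by_source(source_ids):
    """Hypothetical helper: map each distinct source ID to one client."""
    source_ids = np.asarray(source_ids)
    return {
        client: np.where(source_ids == src)[0].tolist()
        for client, src in enumerate(np.unique(source_ids))
    }

# Toy example: five samples recorded by three sources (IDs 1, 2, 3).
clients = split_by_source([3, 1, 3, 2, 1])
print(clients)  # {0: [1, 4], 1: [3], 2: [0, 2]}
```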
Examples for MNIST in the label skew scenario
# In ./dataset
# python generate_MNIST.py iid - - # for iid and unbalanced scenario
# python generate_MNIST.py iid balance - # for iid and balanced scenario
# python generate_MNIST.py noniid - pat # for pathological noniid and unbalanced scenario
python generate_MNIST.py noniid - dir # for practical noniid and unbalanced scenario
# python generate_MNIST.py noniid - exdir # for Extended Dirichlet strategy
The command line output of running python generate_MNIST.py noniid - dir
Number of classes: 10
Client 0 Size of data: 2630 Labels: [0 1 4 5 7 8 9]
Samples of labels: [(0, 140), (1, 890), (4, 1), (5, 319), (7, 29), (8, 1067), (9, 184)]
--------------------------------------------------
Client 1 Size of data: 499 Labels: [0 2 5 6 8 9]
Samples of labels: [(0, 5), (2, 27), (5, 19), (6, 335), (8, 6), (9, 107)]
--------------------------------------------------
Client 2 Size of data: 1630 Labels: [0 3 6 9]
Samples of labels: [(0, 3), (3, 143), (6, 1461), (9, 23)]
--------------------------------------------------
Client 3 Size of data: 2541 Labels: [0 4 7 8]
Samples of labels: [(0, 155), (4, 1), (7, 2381), (8, 4)]
--------------------------------------------------
Client 4 Size of data: 1917 Labels: [0 1 3 5 6 8 9]
Samples of labels: [(0, 71), (1, 13), (3, 207), (5, 1129), (6, 6), (8, 40), (9, 451)]
--------------------------------------------------
Client 5 Size of data: 6189 Labels: [1 3 4 8 9]
Samples of labels: [(1, 38), (3, 1), (4, 39), (8, 25), (9, 6086)]
--------------------------------------------------
Client 6 Size of data: 1256 Labels: [1 2 3 6 8 9]
Samples of labels: [(1, 873), (2, 176), (3, 46), (6, 42), (8, 13), (9, 106)]
--------------------------------------------------
Client 7 Size of data: 1269 Labels: [1 2 3 5 7 8]
Samples of labels: [(1, 21), (2, 5), (3, 11), (5, 787), (7, 4), (8, 441)]
--------------------------------------------------
Client 8 Size of data: 3600 Labels: [0 1]
Samples of labels: [(0, 1), (1, 3599)]
--------------------------------------------------
Client 9 Size of data: 4006 Labels: [0 1 2 4 6]
Samples of labels: [(0, 633), (1, 1997), (2, 89), (4, 519), (6, 768)]
--------------------------------------------------
Client 10 Size of data: 3116 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 920), (1, 2), (2, 1450), (3, 513), (4, 134), (5, 97)]
--------------------------------------------------
Client 11 Size of data: 3772 Labels: [2 3 5]
Samples of labels: [(2, 159), (3, 3055), (5, 558)]
--------------------------------------------------
Client 12 Size of data: 3613 Labels: [0 1 2 5]
Samples of labels: [(0, 8), (1, 180), (2, 3277), (5, 148)]
--------------------------------------------------
Client 13 Size of data: 2134 Labels: [1 2 4 5 7]
Samples of labels: [(1, 237), (2, 343), (4, 6), (5, 453), (7, 1095)]
--------------------------------------------------
Client 14 Size of data: 5730 Labels: [5 7]
Samples of labels: [(5, 2719), (7, 3011)]
--------------------------------------------------
Client 15 Size of data: 5448 Labels: [0 3 5 6 7 8]
Samples of labels: [(0, 31), (3, 1785), (5, 16), (6, 4), (7, 756), (8, 2856)]
--------------------------------------------------
Client 16 Size of data: 3628 Labels: [0]
Samples of labels: [(0, 3628)]
--------------------------------------------------
Client 17 Size of data: 5653 Labels: [1 2 3 4 5 7 8]
Samples of labels: [(1, 26), (2, 1463), (3, 1379), (4, 335), (5, 60), (7, 17), (8, 2373)]
--------------------------------------------------
Client 18 Size of data: 5266 Labels: [0 5 6]
Samples of labels: [(0, 998), (5, 8), (6, 4260)]
--------------------------------------------------
Client 19 Size of data: 6103 Labels: [0 1 2 3 4 9]
Samples of labels: [(0, 310), (1, 1), (2, 1), (3, 1), (4, 5789), (9, 1)]
--------------------------------------------------
Total number of samples: 70000
The number of train samples: [1972, 374, 1222, 1905, 1437, 4641, 942, 951, 2700, 3004, 2337, 2829, 2709, 1600, 4297, 4086, 2721, 4239, 3949, 4577]
The number of test samples: [658, 125, 408, 636, 480, 1548, 314, 318, 900, 1002, 779, 943, 904, 534, 1433, 1362, 907, 1414, 1317, 1526]
Saving to disk.
Finish generating dataset.
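The train and test counts printed above are consistent with a per-client split that keeps roughly 75% of each client's data for training. That ratio and the flooring rule are inferred from the printed numbers, not stated in the source, and `train_test_sizes` is a hypothetical name. A small sketch that reproduces the counts for the first four clients under those assumptions:

```python
import numpy as np

def train_test_sizes(client_sizes, train_ratio=0.75):
    """Hypothetical helper: per-client train/test counts, rounding train down."""
    sizes = np.asarray(client_sizes)
    n_train = np.floor(train_ratio * sizes).astype(int)
    return n_train.tolist(), (sizes - n_train).tolist()

# Data sizes of clients 0 to 3 from the output above.
train, test = train_test_sizes([2630, 499, 1630, 2541])
print(train)  # [1972, 374, 1222, 1905]
print(test)   # [658, 125, 408, 636]
```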
Examples for Amazon Review in the feature shift scenario
# In ./dataset
python generate_AmazonReview.py
The command line output of running python generate_AmazonReview.py
Number of labels: [2, 2, 2, 2]
Number of clients: 4
Client 0 Size of data: 6465 Labels: [0 1]
Samples of labels: [(0, 3201), (1, 3264)]
--------------------------------------------------
Client 1 Size of data: 5586 Labels: [0 1]
Samples of labels: [(0, 2779), (1, 2807)]
--------------------------------------------------
Client 2 Size of data: 7681 Labels: [0 1]
Samples of labels: [(0, 3824), (1, 3857)]
--------------------------------------------------
Client 3 Size of data: 7945 Labels: [0 1]
Samples of labels: [(0, 3991), (1, 3954)]
--------------------------------------------------
Total number of samples: 27677
The number of train samples: [4848, 4189, 5760, 5958]
The number of test samples: [1617, 1397, 1921, 1987]
Saving to disk.
Finish generating dataset.
Examples for HAR in the real-world scenario
# In ./dataset
python generate_HAR.py
The command line output of running python generate_HAR.py
Client 0 Size of data: 347 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 95), (1, 53), (2, 49), (3, 47), (4, 53), (5, 50)]
--------------------------------------------------
Client 1 Size of data: 302 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 59), (1, 48), (2, 47), (3, 46), (4, 54), (5, 48)]
--------------------------------------------------
Client 2 Size of data: 341 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 58), (1, 59), (2, 49), (3, 52), (4, 61), (5, 62)]
--------------------------------------------------
Client 3 Size of data: 317 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 60), (1, 52), (2, 45), (3, 50), (4, 56), (5, 54)]
--------------------------------------------------
Client 4 Size of data: 302 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 56), (1, 47), (2, 47), (3, 44), (4, 56), (5, 52)]
--------------------------------------------------
Client 5 Size of data: 325 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 57), (1, 51), (2, 48), (3, 55), (4, 57), (5, 57)]
--------------------------------------------------
Client 6 Size of data: 308 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 57), (1, 51), (2, 47), (3, 48), (4, 53), (5, 52)]
--------------------------------------------------
Client 7 Size of data: 281 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 48), (1, 41), (2, 38), (3, 46), (4, 54), (5, 54)]
--------------------------------------------------
Client 8 Size of data: 288 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 52), (1, 49), (2, 42), (3, 50), (4, 45), (5, 50)]
--------------------------------------------------
Client 9 Size of data: 294 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 53), (1, 47), (2, 38), (3, 54), (4, 44), (5, 58)]
--------------------------------------------------
Client 10 Size of data: 316 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 59), (1, 54), (2, 46), (3, 53), (4, 47), (5, 57)]
--------------------------------------------------
Client 11 Size of data: 320 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 50), (1, 52), (2, 46), (3, 51), (4, 61), (5, 60)]
--------------------------------------------------
Client 12 Size of data: 327 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 57), (1, 55), (2, 47), (3, 49), (4, 57), (5, 62)]
--------------------------------------------------
Client 13 Size of data: 323 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 59), (1, 54), (2, 45), (3, 54), (4, 60), (5, 51)]
--------------------------------------------------
Client 14 Size of data: 328 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 54), (1, 48), (2, 42), (3, 59), (4, 53), (5, 72)]
--------------------------------------------------
Client 15 Size of data: 366 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 51), (1, 51), (2, 47), (3, 69), (4, 78), (5, 70)]
--------------------------------------------------
Client 16 Size of data: 368 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 61), (1, 48), (2, 46), (3, 64), (4, 78), (5, 71)]
--------------------------------------------------
Client 17 Size of data: 364 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 56), (1, 58), (2, 55), (3, 57), (4, 73), (5, 65)]
--------------------------------------------------
Client 18 Size of data: 360 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 52), (1, 40), (2, 39), (3, 73), (4, 73), (5, 83)]
--------------------------------------------------
Client 19 Size of data: 354 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 51), (1, 51), (2, 45), (3, 66), (4, 73), (5, 68)]
--------------------------------------------------
Client 20 Size of data: 408 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 52), (1, 47), (2, 45), (3, 85), (4, 89), (5, 90)]
--------------------------------------------------
Client 21 Size of data: 321 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 46), (1, 42), (2, 36), (3, 62), (4, 63), (5, 72)]
--------------------------------------------------
Client 22 Size of data: 372 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 59), (1, 51), (2, 54), (3, 68), (4, 68), (5, 72)]
--------------------------------------------------
Client 23 Size of data: 381 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 58), (1, 59), (2, 55), (3, 68), (4, 69), (5, 72)]
--------------------------------------------------
Client 24 Size of data: 409 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 74), (1, 65), (2, 58), (3, 65), (4, 74), (5, 73)]
--------------------------------------------------
Client 25 Size of data: 392 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 59), (1, 55), (2, 50), (3, 78), (4, 74), (5, 76)]
--------------------------------------------------
Client 26 Size of data: 376 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 57), (1, 51), (2, 44), (3, 70), (4, 80), (5, 74)]
--------------------------------------------------
Client 27 Size of data: 382 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 54), (1, 51), (2, 46), (3, 72), (4, 79), (5, 80)]
--------------------------------------------------
Client 28 Size of data: 344 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 53), (1, 49), (2, 48), (3, 60), (4, 65), (5, 69)]
--------------------------------------------------
Client 29 Size of data: 383 Labels: [0 1 2 3 4 5]
Samples of labels: [(0, 65), (1, 65), (2, 62), (3, 62), (4, 59), (5, 70)]
--------------------------------------------------
Total number of samples: 10299
The number of train samples: [260, 226, 255, 237, 226, 243, 231, 210, 216, 220, 237, 240, 245, 242, 246, 274, 276, 273, 270, 265, 306, 240, 279, 285, 306, 294, 282, 286, 258, 287]
The number of test samples: [87, 76, 86, 80, 76, 82, 77, 71, 72, 74, 79, 80, 82, 81, 82, 92, 92, 91, 90, 89, 102, 81, 93, 96, 103, 98, 94, 96, 86, 96]
Saving to disk.
Finish generating dataset.