Datasets and Scenarios (Updating)

We support 3 types of scenarios with various datasets and keep the common dataset splitting code in ./dataset/utils for easy extension. If you need another dataset, write a script to download it and then reuse the utils to split it.

Label Skew Scenario

For the label skew scenario, we introduce 16 famous datasets:

  • MNIST (see examples here)
  • EMNIST
  • FEMNIST
  • Fashion-MNIST
  • Cifar10
  • Cifar100
  • AG News
  • Sogou News
  • Tiny-ImageNet
  • Country211
  • Flowers102
  • GTSRB
  • Shakespeare
  • Stanford Cars
  • COVIDx (chest X-ray images for covid-19)
  • kvasir (endoscopic images for gastrointestinal disease detection)

The datasets can be easily split into IID and non-IID versions. In the non-IID scenario, we distinguish between two types of distribution:

  1. Pathological non-IID: In this case, each client only holds a subset of the labels, for example, just 2 out of 10 labels from the MNIST dataset, even though the overall dataset contains all 10 labels. This leads to a highly skewed distribution of data across clients.
  2. Practical non-IID: Here, we model the data distribution using a Dirichlet distribution, which results in a more realistic and less extreme imbalance. For more details on this, refer to this paper.
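The two non-IID strategies above can be sketched as follows. This is a minimal illustration on a toy label vector, not the code in ./dataset/utils; the function names `pathological_split` and `dirichlet_split`, the round-robin label assignment, and the `alpha` value are all assumptions made for the example.

```python
import numpy as np

def pathological_split(labels, num_clients, classes_per_client, rng):
    """Give each client samples from only `classes_per_client` labels."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Assumption: label subsets are assigned round-robin over the class list.
    client_classes = [
        [classes[(c * classes_per_client + j) % len(classes)]
         for j in range(classes_per_client)]
        for c in range(num_clients)
    ]
    client_idx = [[] for _ in range(num_clients)]
    for k in classes:
        owners = [c for c in range(num_clients) if k in client_classes[c]]
        if not owners:
            continue  # no client was assigned this label
        idx = rng.permutation(np.where(labels == k)[0])
        # Share the samples of label k evenly among the clients that own it.
        for owner, part in zip(owners, np.array_split(idx, len(owners))):
            client_idx[owner].extend(part.tolist())
    return client_idx

def dirichlet_split(labels, num_clients, alpha, rng):
    """Split each label across clients with Dirichlet-distributed proportions."""
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(num_clients)]
    for k in np.unique(labels):
        idx = rng.permutation(np.where(labels == k)[0])
        # Fraction of label k that each client receives; smaller alpha = more skew.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for c, part in enumerate(np.split(idx, cuts)):
            client_idx[c].extend(part.tolist())
    return client_idx

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)  # toy data: 10 labels, 100 samples each
pat = pathological_split(labels, num_clients=5, classes_per_client=2, rng=rng)
dir_ = dirichlet_split(labels, num_clients=5, alpha=0.1, rng=rng)
```

In the pathological split every client ends up with exactly the labels assigned to it, while in the Dirichlet split a client may hold a few samples of many labels and a large number of one or two, similar in spirit to the MNIST output shown later in this document.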

Additionally, we offer a balance option, where the amount of data is distributed evenly across all clients.
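To illustrate the difference between balanced and unbalanced client sizes, here is a small sketch. The helper `client_sizes` and the use of a uniform Dirichlet draw for the unbalanced case are assumptions for illustration only; the repository may choose sizes differently.

```python
import numpy as np

def client_sizes(total, num_clients, balance, rng):
    """Return how many samples each client receives."""
    if balance:
        # Even split; the first `total % num_clients` clients get one extra sample.
        base, extra = divmod(total, num_clients)
        return [base + (1 if c < extra else 0) for c in range(num_clients)]
    # Unbalanced (assumed strategy): random proportions, rounded down,
    # with the rounding remainder given to the last client.
    proportions = rng.dirichlet(np.ones(num_clients))
    sizes = np.floor(proportions * total).astype(int)
    sizes[-1] += total - sizes.sum()
    return sizes.tolist()

rng = np.random.default_rng(0)
balanced = client_sizes(70000, 20, balance=True, rng=rng)
unbalanced = client_sizes(70000, 20, balance=False, rng=rng)
```

Both lists sum to the same total; only the spread between the smallest and largest client differs.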

Feature Shift Scenario

For the feature shift scenario, we use 3 datasets that are widely used in domain adaptation:

  • Amazon Review (raw data can be fetched from this link, see examples here)
  • Digit5 (raw data available at this link)
  • DomainNet

Real-World Scenario

For the real-world scenario, we introduce 5 naturally separated datasets:

  • Camelyon17 (tumor tissue patches extracted from breast cancer metastases in lymph node sections, 5 hospitals, 2 labels)
  • iWildCam (194 camera traps, 158 labels)
  • Omniglot (20 clients, 50 labels)
  • HAR (Human Activity Recognition) (30 clients, 6 labels, see examples here)
  • PAMAP2 (9 clients, 12 labels)
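In these datasets the client boundaries come from the data itself (a hospital, a camera trap, a subject), so no artificial splitting strategy is needed. A minimal sketch of that idea, assuming each sample carries a group ID; `split_by_group` and the toy data are hypothetical and not part of the repository:

```python
from collections import defaultdict

def split_by_group(samples, group_ids):
    """Map each distinct group ID (e.g. hospital or subject) to one client."""
    clients = defaultdict(list)
    for sample, gid in zip(samples, group_ids):
        clients[gid].append(sample)
    # Sort group IDs so that client numbering is deterministic.
    return [clients[gid] for gid in sorted(clients)]

# Toy data: six readings recorded by three hypothetical subjects.
samples = ["a", "b", "c", "d", "e", "f"]
subject_ids = [2, 1, 2, 3, 1, 2]
clients = split_by_group(samples, subject_ids)
```

Each entry of `clients` holds the samples of exactly one subject, which mirrors how HAR yields 30 clients from 30 participants.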

For more details on datasets and FL algorithms in IoT, please refer to FL-IoT.

Examples for MNIST in the label skew scenario

# In ./dataset
# python generate_MNIST.py iid - - # for iid and unbalanced scenario
# python generate_MNIST.py iid balance - # for iid and balanced scenario
# python generate_MNIST.py noniid - pat # for pathological noniid and unbalanced scenario
python generate_MNIST.py noniid - dir # for practical noniid and unbalanced scenario
# python generate_MNIST.py noniid - exdir # for Extended Dirichlet strategy
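Judging from the commands above, the three positional arguments select IID vs. non-IID, the balance option, and the partition strategy, with "-" meaning "not set". The sketch below shows one way such arguments could be interpreted; `parse_args` is a hypothetical helper written for this explanation, not the script's actual parsing code.

```python
def parse_args(argv):
    """Interpret `<iid|noniid> <balance|-> <pat|dir|exdir|->` style arguments."""
    niid = argv[0] == "noniid"          # first argument: iid or noniid
    balance = argv[1] == "balance"      # second argument: "balance" or "-"
    partition = None if argv[2] == "-" else argv[2]  # third: pat, dir, exdir or "-"
    return niid, balance, partition

practical = parse_args("noniid - dir".split())
iid_balanced = parse_args("iid balance -".split())
```

Here `practical` corresponds to the practical non-IID, unbalanced command, and `iid_balanced` to the IID, balanced one.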

The command line output of running python generate_MNIST.py noniid - dir

Number of classes: 10
Client 0         Size of data: 2630      Labels:  [0 1 4 5 7 8 9]
                Samples of labels:  [(0, 140), (1, 890), (4, 1), (5, 319), (7, 29), (8, 1067), (9, 184)]
--------------------------------------------------
Client 1         Size of data: 499       Labels:  [0 2 5 6 8 9]
                Samples of labels:  [(0, 5), (2, 27), (5, 19), (6, 335), (8, 6), (9, 107)]
--------------------------------------------------
Client 2         Size of data: 1630      Labels:  [0 3 6 9]
                Samples of labels:  [(0, 3), (3, 143), (6, 1461), (9, 23)]
--------------------------------------------------
Client 3         Size of data: 2541      Labels:  [0 4 7 8]
                Samples of labels:  [(0, 155), (4, 1), (7, 2381), (8, 4)]
--------------------------------------------------
Client 4         Size of data: 1917      Labels:  [0 1 3 5 6 8 9]
                Samples of labels:  [(0, 71), (1, 13), (3, 207), (5, 1129), (6, 6), (8, 40), (9, 451)]
--------------------------------------------------
Client 5         Size of data: 6189      Labels:  [1 3 4 8 9]
                Samples of labels:  [(1, 38), (3, 1), (4, 39), (8, 25), (9, 6086)]
--------------------------------------------------
Client 6         Size of data: 1256      Labels:  [1 2 3 6 8 9]
                Samples of labels:  [(1, 873), (2, 176), (3, 46), (6, 42), (8, 13), (9, 106)]
--------------------------------------------------
Client 7         Size of data: 1269      Labels:  [1 2 3 5 7 8]
                Samples of labels:  [(1, 21), (2, 5), (3, 11), (5, 787), (7, 4), (8, 441)]
--------------------------------------------------
Client 8         Size of data: 3600      Labels:  [0 1]
                Samples of labels:  [(0, 1), (1, 3599)]
--------------------------------------------------
Client 9         Size of data: 4006      Labels:  [0 1 2 4 6]
                Samples of labels:  [(0, 633), (1, 1997), (2, 89), (4, 519), (6, 768)]
--------------------------------------------------
Client 10        Size of data: 3116      Labels:  [0 1 2 3 4 5]
                Samples of labels:  [(0, 920), (1, 2), (2, 1450), (3, 513), (4, 134), (5, 97)]
--------------------------------------------------
Client 11        Size of data: 3772      Labels:  [2 3 5]
                Samples of labels:  [(2, 159), (3, 3055), (5, 558)]
--------------------------------------------------
Client 12        Size of data: 3613      Labels:  [0 1 2 5]
                Samples of labels:  [(0, 8), (1, 180), (2, 3277), (5, 148)]
--------------------------------------------------
Client 13        Size of data: 2134      Labels:  [1 2 4 5 7]
                Samples of labels:  [(1, 237), (2, 343), (4, 6), (5, 453), (7, 1095)]
--------------------------------------------------
Client 14        Size of data: 5730      Labels:  [5 7]
                Samples of labels:  [(5, 2719), (7, 3011)]
--------------------------------------------------
Client 15        Size of data: 5448      Labels:  [0 3 5 6 7 8]
                Samples of labels:  [(0, 31), (3, 1785), (5, 16), (6, 4), (7, 756), (8, 2856)]
--------------------------------------------------
Client 16        Size of data: 3628      Labels:  [0]
                Samples of labels:  [(0, 3628)]
--------------------------------------------------
Client 17        Size of data: 5653      Labels:  [1 2 3 4 5 7 8]
                Samples of labels:  [(1, 26), (2, 1463), (3, 1379), (4, 335), (5, 60), (7, 17), (8, 2373)]
--------------------------------------------------
Client 18        Size of data: 5266      Labels:  [0 5 6]
                Samples of labels:  [(0, 998), (5, 8), (6, 4260)]
--------------------------------------------------
Client 19        Size of data: 6103      Labels:  [0 1 2 3 4 9]
                Samples of labels:  [(0, 310), (1, 1), (2, 1), (3, 1), (4, 5789), (9, 1)]
--------------------------------------------------
Total number of samples: 70000
The number of train samples: [1972, 374, 1222, 1905, 1437, 4641, 942, 951, 2700, 3004, 2337, 2829, 2709, 1600, 4297, 4086, 2721, 4239, 3949, 4577]
The number of test samples: [658, 125, 408, 636, 480, 1548, 314, 318, 900, 1002, 779, 943, 904, 534, 1433, 1362, 907, 1414, 1317, 1526]

Saving to disk.

Finish generating dataset.
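The numbers in this output are internally consistent: each client's train and test counts add up to its data size, and the client sizes add up to the reported total. The short check below uses only the values printed above; the observation that about 75% of each client's data goes to the train split is inferred from these numbers, not taken from the script's configuration.

```python
sizes = [2630, 499, 1630, 2541, 1917, 6189, 1256, 1269, 3600, 4006,
         3116, 3772, 3613, 2134, 5730, 5448, 3628, 5653, 5266, 6103]
train = [1972, 374, 1222, 1905, 1437, 4641, 942, 951, 2700, 3004,
         2337, 2829, 2709, 1600, 4297, 4086, 2721, 4239, 3949, 4577]
test = [658, 125, 408, 636, 480, 1548, 314, 318, 900, 1002,
        779, 943, 904, 534, 1433, 1362, 907, 1414, 1317, 1526]

# Every client's data is fully divided between its train and test splits.
splits_match = all(tr + te == s for s, tr, te in zip(sizes, train, test))

# The client sizes account for the whole dataset.
total = sum(sizes)

# Share of each client's data that lands in the train split, rounded to 2 decimals.
train_share = [round(tr / s, 2) for s, tr in zip(sizes, train)]
```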

Examples for Amazon Review in the feature shift scenario

# In ./dataset
python generate_AmazonReview.py

The command line output of running python generate_AmazonReview.py

Number of labels: [2, 2, 2, 2]
Number of clients: 4
Client 0         Size of data: 6465      Labels:  [0 1]
                    Samples of labels:  [(0, 3201), (1, 3264)]
--------------------------------------------------
Client 1         Size of data: 5586      Labels:  [0 1]
                    Samples of labels:  [(0, 2779), (1, 2807)]
--------------------------------------------------
Client 2         Size of data: 7681      Labels:  [0 1]
                    Samples of labels:  [(0, 3824), (1, 3857)]
--------------------------------------------------
Client 3         Size of data: 7945      Labels:  [0 1]
                    Samples of labels:  [(0, 3991), (1, 3954)]
--------------------------------------------------
Total number of samples: 27677
The number of train samples: [4848, 4189, 5760, 5958]
The number of test samples: [1617, 1397, 1921, 1987]

Saving to disk.

Finish generating dataset.

Examples for HAR in the real-world scenario

# In ./dataset
python generate_HAR.py

The command line output of running python generate_HAR.py

Client 0         Size of data: 347       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 95), (1, 53), (2, 49), (3, 47), (4, 53), (5, 50)]
--------------------------------------------------
Client 1         Size of data: 302       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 59), (1, 48), (2, 47), (3, 46), (4, 54), (5, 48)]
--------------------------------------------------
Client 2         Size of data: 341       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 58), (1, 59), (2, 49), (3, 52), (4, 61), (5, 62)]
--------------------------------------------------
Client 3         Size of data: 317       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 60), (1, 52), (2, 45), (3, 50), (4, 56), (5, 54)]
--------------------------------------------------
Client 4         Size of data: 302       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 56), (1, 47), (2, 47), (3, 44), (4, 56), (5, 52)]
--------------------------------------------------
Client 5         Size of data: 325       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 57), (1, 51), (2, 48), (3, 55), (4, 57), (5, 57)]
--------------------------------------------------
Client 6         Size of data: 308       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 57), (1, 51), (2, 47), (3, 48), (4, 53), (5, 52)]
--------------------------------------------------
Client 7         Size of data: 281       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 48), (1, 41), (2, 38), (3, 46), (4, 54), (5, 54)]
--------------------------------------------------
Client 8         Size of data: 288       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 52), (1, 49), (2, 42), (3, 50), (4, 45), (5, 50)]
--------------------------------------------------
Client 9         Size of data: 294       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 53), (1, 47), (2, 38), (3, 54), (4, 44), (5, 58)]
--------------------------------------------------
Client 10        Size of data: 316       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 59), (1, 54), (2, 46), (3, 53), (4, 47), (5, 57)]
--------------------------------------------------
Client 11        Size of data: 320       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 50), (1, 52), (2, 46), (3, 51), (4, 61), (5, 60)]
--------------------------------------------------
Client 12        Size of data: 327       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 57), (1, 55), (2, 47), (3, 49), (4, 57), (5, 62)]
--------------------------------------------------
Client 13        Size of data: 323       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 59), (1, 54), (2, 45), (3, 54), (4, 60), (5, 51)]
--------------------------------------------------
Client 14        Size of data: 328       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 54), (1, 48), (2, 42), (3, 59), (4, 53), (5, 72)]
--------------------------------------------------
Client 15        Size of data: 366       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 51), (1, 51), (2, 47), (3, 69), (4, 78), (5, 70)]
--------------------------------------------------
Client 16        Size of data: 368       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 61), (1, 48), (2, 46), (3, 64), (4, 78), (5, 71)]
--------------------------------------------------
Client 17        Size of data: 364       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 56), (1, 58), (2, 55), (3, 57), (4, 73), (5, 65)]
--------------------------------------------------
Client 18        Size of data: 360       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 52), (1, 40), (2, 39), (3, 73), (4, 73), (5, 83)]
--------------------------------------------------
Client 19        Size of data: 354       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 51), (1, 51), (2, 45), (3, 66), (4, 73), (5, 68)]
--------------------------------------------------
Client 20        Size of data: 408       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 52), (1, 47), (2, 45), (3, 85), (4, 89), (5, 90)]
--------------------------------------------------
Client 21        Size of data: 321       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 46), (1, 42), (2, 36), (3, 62), (4, 63), (5, 72)]
--------------------------------------------------
Client 22        Size of data: 372       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 59), (1, 51), (2, 54), (3, 68), (4, 68), (5, 72)]
--------------------------------------------------
Client 23        Size of data: 381       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 58), (1, 59), (2, 55), (3, 68), (4, 69), (5, 72)]
--------------------------------------------------
Client 24        Size of data: 409       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 74), (1, 65), (2, 58), (3, 65), (4, 74), (5, 73)]
--------------------------------------------------
Client 25        Size of data: 392       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 59), (1, 55), (2, 50), (3, 78), (4, 74), (5, 76)]
--------------------------------------------------
Client 26        Size of data: 376       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 57), (1, 51), (2, 44), (3, 70), (4, 80), (5, 74)]
--------------------------------------------------
Client 27        Size of data: 382       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 54), (1, 51), (2, 46), (3, 72), (4, 79), (5, 80)]
--------------------------------------------------
Client 28        Size of data: 344       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 53), (1, 49), (2, 48), (3, 60), (4, 65), (5, 69)]
--------------------------------------------------
Client 29        Size of data: 383       Labels:  [0 1 2 3 4 5]
    Samples of labels:  [(0, 65), (1, 65), (2, 62), (3, 62), (4, 59), (5, 70)]
--------------------------------------------------
Total number of samples: 10299
The number of train samples: [260, 226, 255, 237, 226, 243, 231, 210, 216, 220, 237, 240, 245, 242, 246, 274, 276, 273, 270, 265, 306, 240, 279, 285, 306, 294, 282, 286, 258, 287]
The number of test samples: [87, 76, 86, 80, 76, 82, 77, 71, 72, 74, 79, 80, 82, 81, 82, 92, 92, 91, 90, 89, 102, 81, 93, 96, 103, 98, 94, 96, 86, 96]

Saving to disk.

Finish generating dataset.