CloudSEN12 - a global dataset for semantic understanding of cloud and cloud shadow in Sentinel-2

  1. Luis, Cesar 1
  2. Ysuhuaylas, Luis 2
  3. Loja, Jhomira 2
  4. Gonzales, Karen 2
  5. Herrera, Fernando 2
  6. Bautista, Lesly 2
  7. Yali, Roy 3
  8. Flores, Angie 2
  9. Diaz, Lissette 2
  10. Cuenca, Nicole 2
  11. Espinoza, Wendy 2
  12. Prudencio, Fernando 4
  13. Inga, Joselyn 2
  14. Llactayo, Valeria 2
  15. Montero, David 5
  16. Sudmanns, Martin 6
  17. Tiede, Dirk 6
  18. Mateo-García, Gonzalo 1
  19. Gómez-Chova, Luis 1
  1. 1 Universitat de València, Valencia, Spain. ROR https://ror.org/043nxc105
  2. 2 Universidad Nacional Mayor de San Marcos, Lima, Peru. ROR https://ror.org/006vs7897
  3. 3 Pontificia Universidad Católica del Perú, Lima, Peru. ROR https://ror.org/00013q465
  4. 4 Instituto Geofísico del Perú
  5. 5 University of Leipzig, Leipzig, Germany. ROR https://ror.org/03s7gtk40
  6. 6 University of Salzburg, Salzburg, Austria. ROR https://ror.org/05gs8cd61

Publisher: Science Data Bank

Year of publication: 2022

Type: Dataset

License: CC BY-NC 4.0

Abstract

CloudSEN12 is a large dataset for cloud semantic understanding that consists of 9880 regions of interest (ROIs). Each ROI has five image patches (IPs) of 5090 x 5090 m collected on different dates. We manually chose the images to guarantee that each IP inside an ROI matches one of the following cloud cover groups:

- clear (0%)
- low-cloudy (1% - 25%)
- almost clear (25% - 45%)
- mid-cloudy (45% - 65%)
- cloudy (> 65%)

An IP is the core unit in CloudSEN12. Each IP contains data from Sentinel-2 optical levels 1C and 2A, Sentinel-1 Synthetic Aperture Radar (SAR), a digital elevation model, surface water occurrence, land cover classes, and cloud mask results from eight cutting-edge cloud detection algorithms. To support standard, weakly supervised, and self-/semi-supervised learning procedures, CloudSEN12 includes three distinct forms of hand-crafted labelling data: high-quality, scribble, and no annotation. Consequently, each ROI is randomly assigned to one of three annotation groups:

- 2000 ROIs with pixel-level annotation, where the average annotation time is 150 minutes (high-quality group).
- 2000 ROIs with scribble-level annotation, where the annotation time is 15 minutes (scribble group).
- 5880 ROIs with annotation only in the cloud-free (0%) image (no annotation group).

For high-quality labels, we use the Intelligence foR Image Segmentation (IRIS) active learning technology, combining human photo-interpretation and machine learning. For scribble labels, ground truth pixels were drawn using IRIS but without ML support. Finally, the no-annotation dataset is generated automatically, with manual annotation only in the clear image patch. A backup of the dataset in STAC format is available at https://shorturl.at/cgjtz. See our website https://cloudsen12.github.io/ for examples.
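
The cloud cover grouping and the dataset counts given in the abstract can be checked with a short Python sketch. It is illustrative only and is not part of the CloudSEN12 tooling: the function name `cloud_cover_group` and all constants are hypothetical. The abstract does not say which group owns a boundary value (25%, 45%, 65%) or a value between 0% and 1%, so the sketch assumes each interval is open on the left and closed on the right. The 10 m pixel size is also an assumption, based on the finest Sentinel-2 band resolution.

```python
# Cloud cover groups as listed in the abstract: (label, lower %, upper %).
# ASSUMPTION: intervals are (lower, upper], so 25 % falls in "low-cloudy"
# and anything above 0 % but below 1 % is also treated as "low-cloudy".
CLOUD_COVER_GROUPS = [
    ("low-cloudy", 0.0, 25.0),
    ("almost clear", 25.0, 45.0),
    ("mid-cloudy", 45.0, 65.0),
    ("cloudy", 65.0, 100.0),
]

# ROIs per annotation group, as stated in the abstract.
ANNOTATION_GROUPS = {
    "high-quality": 2000,
    "scribble": 2000,
    "no annotation": 5880,
}

PATCHES_PER_ROI = 5        # five image patches (IPs) per ROI
PATCH_SIDE_M = 5090        # patch side length in meters
ASSUMED_PIXEL_SIZE_M = 10  # ASSUMPTION: 10 m Sentinel-2 bands


def cloud_cover_group(percent: float) -> str:
    """Return the abstract's group label for a cloud cover percentage."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"cloud cover must be within [0, 100], got {percent}")
    if percent == 0.0:
        return "clear"
    for label, lower, upper in CLOUD_COVER_GROUPS:
        if lower < percent <= upper:
            return label
    raise AssertionError("unreachable for inputs within [0, 100]")


total_rois = sum(ANNOTATION_GROUPS.values())
total_patches = total_rois * PATCHES_PER_ROI
patch_side_px = PATCH_SIDE_M // ASSUMED_PIXEL_SIZE_M

if __name__ == "__main__":
    print("ROIs:", total_rois)
    print("Image patches:", total_patches)
    print("Patch side in pixels (assuming 10 m):", patch_side_px)
    for value in (0, 12.5, 30, 50, 80):
        print(value, "->", cloud_cover_group(value))
```

The three annotation groups sum to the 9880 ROIs quoted at the start of the abstract, which gives 49400 image patches at five patches per ROI.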