CloudSEN12 - a global dataset for semantic understanding of cloud and cloud shadow in Sentinel-2
-
Luis, Cesar
1
-
Ysuhuaylas, Luis
2
-
Loja, Jhomira
2
-
Gonzales, Karen
2
-
Herrera, Fernando
2
-
Bautista, Lesly
2
-
Yali, Roy
3
-
Flores, Angie
2
-
Diaz, Lissette
2
-
Cuenca, Nicole
2
-
Espinoza, Wendy
2
-
Prudencio, Fernando
4
-
Inga, Joselyn
2
-
Llactayo, Valeria
2
-
Montero, David
5
-
Sudmanns, Martin
6
-
Tiede, Dirk
6
-
Mateo-García, Gonzalo
1
-
Gómez-Chova, Luis
1
-
1
Universitat de València
-
2
Universidad Nacional Mayor de San Marcos
-
3
Pontificia Universidad Católica del Perú
-
4
Instituto Geofísico del Perú
-
5
University of Leipzig
-
6
University of Salzburg
Publisher: Science Data Bank
Year of publication: 2022
Type: Dataset
Abstract
CloudSEN12 is a large dataset for cloud semantic understanding that consists of 9880 regions of interest (ROIs). Each ROI has five 5090 x 5090 m image patches (IPs) collected on different dates; we manually chose the images to guarantee that each IP within an ROI matches one of the following cloud cover groups:
- clear (0%)
- low-cloudy (1% - 25%)
- almost clear (25% - 45%)
- mid-cloudy (45% - 65%)
- cloudy (> 65%)
The IP is the core unit of CloudSEN12. Each IP contains Sentinel-2 optical data at levels 1C and 2A, Sentinel-1 Synthetic Aperture Radar (SAR) data, a digital elevation model, surface water occurrence, land cover classes, and cloud masks from eight cutting-edge cloud detection algorithms. To support standard, weakly, and self-/semi-supervised learning procedures, CloudSEN12 also includes three distinct forms of hand-crafted labelling data: high-quality, scribble, and no annotation. Each ROI is randomly assigned to one of the annotation groups:
- 2000 ROIs with pixel-level annotation, where the average annotation time is 150 minutes (high-quality group).
- 2000 ROIs with scribble-level annotation, where the annotation time is 15 minutes (scribble group).
- 5880 ROIs with annotation only in the cloud-free (0%) image (no annotation group).
For high-quality labels, we use the Intelligently Reinforced Image Segmentation (IRIS) \cite{iris2019} active learning technology, which combines human photo-interpretation and machine learning. For the scribble group, ground truth pixels were drawn using IRIS but without ML support. Finally, the no-annotation dataset is generated automatically, with manual annotation only in the clear image patch. A backup of the dataset in STAC format is available here: https://shorturl.at/cgjtz. Check out our website https://cloudsen12.github.io/ for examples.
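The cloud cover groups and ROI counts given in the abstract can be illustrated with a minimal Python sketch. The function name `cloud_cover_group` and the constant names are hypothetical and are not part of any CloudSEN12 tooling. The abstract does not say how boundary values (for example exactly 25%) or values between 0% and 1% are assigned, so the sketch assumes each upper bound is inclusive and treats anything above 0% and up to 25% as low-cloudy.

```python
from bisect import bisect_left

# Group names in the order listed in the abstract.
GROUPS = ["clear", "low-cloudy", "almost clear", "mid-cloudy", "cloudy"]

# Upper bound (percent cloud cover) of every group except the last.
# ASSUMPTION: upper bounds are inclusive; the abstract leaves this open.
UPPER_BOUNDS = [0.0, 25.0, 45.0, 65.0]

# ROI counts per annotation group, taken from the abstract.
ROIS_PER_ANNOTATION_GROUP = {
    "high-quality": 2000,
    "scribble": 2000,
    "no annotation": 5880,
}
IPS_PER_ROI = 5  # five image patches per ROI


def cloud_cover_group(percent: float) -> str:
    """Map a cloud cover percentage to one of the five groups."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("cloud cover must be between 0 and 100 percent")
    # bisect_left returns the index of the first bound >= percent,
    # which is the index of the matching group under inclusive bounds.
    return GROUPS[bisect_left(UPPER_BOUNDS, percent)]


total_rois = sum(ROIS_PER_ANNOTATION_GROUP.values())
total_ips = total_rois * IPS_PER_ROI
```

Under these assumptions the three annotation groups add up to the 9880 ROIs stated in the abstract, which at five IPs per ROI corresponds to 49400 image patches.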