Home | LIRIS's datasets library

Perfect Pet: synthetic relational database polluted by artificial unicity - version1

Version

version1

Licence

Our current research project focuses on the cleaning for analytical use of relational databases implemented with surrogate keys and surrogate foreign keys but without natural keys. Despite their numerous benefits, surrogate keys can not only induce the presence of data quality issues within a database but also act as a major obstacle to any regular data cleaning technique, due to the artificial unicity they carry and propagate. We developed RED2Hunt, a framework dedicated to cleaning such databases, described in https://arxiv.org/abs/2503.20593.

Because of their private nature, none of the operational databases our team members worked on could be made available to the research/academic community. Thus, we decided to generate Perfect Pet, a synthetic relational database, in order to 1) facilitate the diffusion of our work on artificial unicity (the phenomenon commonly found in operational databases whose resolution motivated this research project), 2) test and demonstrate RED2Hunt and enable the reproducibility of our experiments and results, and 3) make it available to the community for their own use.

For these purposes, the database had to satisfy the following requirements:

Resemble real operational data: mimic data that would be collected in a real operational setting in its nature, structure, and distributions.
Have a very simple, straightforward schema to be easily understandable by any audience.
Include the following data quality issues as observed in real operational databases:
Comprise high levels of artificial unicity (above 60% in each relation suffering from the phenomenon) due to:
- Exclusive reliance on surrogate keys and no encoding of any natural key,
- Schema denormalization,
- Failure to respect the cardinalities observed in real life (separation between the description of entities and their historization).
Include data inconsistencies within duplicate groups due to:
- Denormalization,
- Data entry errors,
- Evolution of data collection practices.
Include format inconsistencies, due to a modification of a data type after the data collection system was deployed and data collection had started.
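
As a rough illustration of how surrogate keys can hide duplicates, the minimal sketch below builds a tiny in-memory SQLite table and compares row counts with and without the surrogate key. The `owner` table, its columns, and the sample rows are hypothetical and are not taken from the Perfect Pet schema. The ratio computed at the end is only an illustrative measure, not necessarily the definition of artificial unicity used by RED2Hunt.

```python
import sqlite3

# Hypothetical, simplified "owner" relation: owner_id is a surrogate key,
# and no natural key (e.g. name + phone) is declared.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE owner (owner_id INTEGER PRIMARY KEY, name TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO owner (name, phone) VALUES (?, ?)",
    [
        ("Alice Martin", "0601"),
        ("Alice Martin", "0601"),  # same real-world owner, new surrogate key
        ("Alice Martin", "0601"),
        ("Bob Durand", "0702"),
        ("Bob Durand", "0702"),
    ],
)

# Every row looks unique when the surrogate key is taken into account...
total_rows = conn.execute("SELECT COUNT(*) FROM owner").fetchone()[0]
distinct_keys = conn.execute(
    "SELECT COUNT(DISTINCT owner_id) FROM owner"
).fetchone()[0]

# ...but fewer distinct entities remain once the surrogate key is ignored.
distinct_entities = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT name, phone FROM owner)"
).fetchone()[0]

# Illustrative ratio (an assumption, not the RED2Hunt definition):
# share of rows that are redundant copies of another row.
redundancy_ratio = (total_rows - distinct_entities) / total_rows
print(total_rows, distinct_keys, distinct_entities, redundancy_ratio)
```

Because the primary key is unique by construction, a uniqueness check on `owner_id` alone would report no duplicates here, which is the obstacle to regular cleaning techniques described above.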

Perfect Pet data should suffer from the following data quality issues:

- Redundancy,
- Inconsistencies,
- Inaccuracy,
- Incompleteness.

The database was designed as a relational database supporting the appointment management application of the fictitious veterinary clinic Perfect Pet. It includes information describing the pets visiting the clinic, their owners (the clinic’s clients), the medical appointments, and the doctors working at the clinic.

Although team members never worked on a data research project related to the animal health and welfare sectors, this topic was selected to generate the synthetic data for two reasons: 1) to guarantee the anonymity of the original operational databases by preventing any possible connection with the synthetic data, and 2) to simplify the generation process by leveraging one team member's domain knowledge of the animal welfare sector and access to an operational database in that sector.

We thank The Jordanian Society for Animal Protection (JSAP) for allowing us to use part of their operational database as a starting point to generate our synthetic Perfect Pet database. The JSAP database itself does not suffer from any of the data quality issues mentioned above.

Publication date

27/03/2025

Author(s)

Mathilde MARCY, Jean-Marc PETIT

Keyword(s)

relational database, redundancy, artificial unicity, data quality

Total size and number of files

0.4 MB - 2 files

URL to view the dataset

Download the dataset (by clicking this link, you accept the licence associated with this dataset)

FTP

HTTPS

GenVegeFruits3D - version1

Version

version1

Licence

Creative Commons

This dataset includes 1,000 meshes for each of 100 categories of fruits and vegetables. A sub-sample is currently available, and the full dataset will be released in the coming weeks.

Publication date

20/01/2025

Author(s)

Guillaume Duret, Liming Chen, et al.

Keyword(s)

3D generations, 3D meshes, Category 6D pose, Datasets and Benchmarks, Fruits and Vegetables

Total size and number of files

~200 GB for 10K meshes

URL to view the dataset

Download the dataset (by clicking this link, you accept the licence associated with this dataset)

Deteriorated IMDB PostgreSQL databases - version1

Version

version1

Licence

Creative Commons

Set of deteriorated versions of the publicly available non-commercial IMDB database, comprising different amounts of duplicates.

The datasets were extracted from PostgreSQL databases including the relations titles, name_basics, title_episode, title_ratings, and title_principals from the https://datasets.imdbws.com/ IMDB database version downloaded on April 7th, 2024. These databases were deteriorated on purpose to experiment with the RED2Hunt method, which generates a redundancy-free database from any relational operational database comprising surrogate keys and duplicates.

Publication date

03/06/2024

Author(s)

Mathilde MARCY, Jean-Marc PETIT

Keyword(s)

relational database, redundancy, surrogate keys

Total size and number of files

70 GB - 7 dump files

URL to view the dataset

Download the dataset (by clicking this link, you accept the licence associated with this dataset)

FTP

HTTPS

synthetic-cave-and-tunnel-systems - version1

Version

version1

Licence

Creative Commons

A set of computer-generated cave systems and tunnel systems

available at various resolutions,
as watertight triangulations and/or point-clouds,
under the PLY and 3DTiles file formats.
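
To tell a triangulation from a point cloud before loading a large PLY file, one option is to read only its header, which is plain ASCII text even when the body is binary. The sketch below is a minimal illustration: the function name `read_ply_header` and the tiny sample file are hypothetical and not part of the dataset tooling.

```python
def read_ply_header(data: bytes) -> dict:
    """Return the format name and element counts declared in a PLY header."""
    header, sep, _ = data.partition(b"end_header")
    if not data.startswith(b"ply") or not sep:
        raise ValueError("not a PLY file")
    info = {"format": None, "elements": {}}
    for line in header.decode("ascii").splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "format":
            # e.g. "format ascii 1.0" or "format binary_little_endian 1.0"
            info["format"] = parts[1]
        elif parts[0] == "element":
            # e.g. "element vertex 3" -> {"vertex": 3}
            info["elements"][parts[1]] = int(parts[2])
    return info


# Hypothetical minimal PLY content: one triangle with three vertices.
sample = (
    b"ply\n"
    b"format ascii 1.0\n"
    b"element vertex 3\n"
    b"property float x\n"
    b"property float y\n"
    b"property float z\n"
    b"element face 1\n"
    b"property list uchar int vertex_indices\n"
    b"end_header\n"
    b"0 0 0\n1 0 0\n0 1 0\n"
    b"3 0 1 2\n"
)

info = read_ply_header(sample)
# A file with no face element (or zero faces) is treated as a point cloud.
is_point_cloud = info["elements"].get("face", 0) == 0
print(info, is_point_cloud)
```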

Publication date

16/04/2024

Author(s)

TeaTime and LIRIS (VCity team)

Keyword(s)

cave, tunnel, point cloud, PLY, 3DTiles

See all versions of the dataset

synthetic-cave-and-tunnel-systems

Total size and number of files

3 GB

URL to view the dataset

https://github.com/VCityTeam/TT-Ribs

Download the dataset (by clicking this link, you accept the licence associated with this dataset)

FTP

HTTPS

MEP-seg dataset : Synthetic images generated from Building Information Modeling (BIM) - v1

Version

Licence

Creative Commons

This dataset has been generated from 3 construction models, transferred from Autodesk Revit to NVIDIA Isaac Sim. It contains 8751 samples of RGB images with their associated semantic segmentation masks and label files for 13 classes (rectangular_sheath, circular_sheath, pipe, air_vent, fan_coil, stair, wall, floor, pipe_accessory, framework, radiant_panel, climate_engineering_equipment, ceiling, handrail, roof, cable_tray, pole).

Publication date

20/12/2023

Author(s)

Mathis Baubriaud

Keyword(s)

Building Information Modeling, Semantic segmentation, Indoor Construction, Mechanical Electrical and Plumbing, Synthetic, Isaac Sim

Total size and number of files

14.8 GB

URL to view the dataset

Download the dataset (by clicking this link, you accept the licence associated with this dataset)

FTP

HTTPS

PyPI projects backends - version2

Version

version2

Licence

Creative Commons

Backends declared in the pyproject.toml files

This dataset contains CSV and SQLite files with data about project backends, extracted from "Metadata about every file uploaded to PyPI":

extract-pyproject-all-versions.csv, extract-pyproject-all-versions.db: for the projects that have a pyproject.toml file and were uploaded after 2018, get the project_name, max project_version, max uploaded_on, list of distinct project_version, list of distinct uploaded_on, list of distinct path, ...
extract-pyproject-latest.csv, extract-pyproject-latest.db: for each project found in extract-pyproject-all-versions, get the data for the latest uploaded_on date (1)
pyproject_backends.csv, pyproject_backends.db: the build backend found in extract-pyproject-latest.db for each project, considering only the pyproject.toml file at the root of the project (2)

Source code for the data extraction.

(1) Some projects (e.g. poetry) have several pyproject.toml files, often in test folders
(2) The test is quite basic, but few projects have several pyproject.toml files matching it

PyPI metadata further analysis

After the publication of the first charts, I wanted to know how many projects had no source package and how many had no pyproject.toml, in order to complete the first statistics.

This dataset contains CSV and SQLite files extracted from the same source (parquet files from "Metadata about every file uploaded to PyPI"):

extract-project-releases-2018-and-later.csv, extract-project-releases-2018-and-later.db: the metadata of the projects uploaded to PyPI since 2018: project_name, project_version, project_release, release type (source or wheel), ...

These files weigh 1.1 and 1.3 GB respectively.
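
The question "how many projects had no source package" can be sketched as follows. This is a minimal illustration under stated assumptions: the release type is guessed here from the file extension, which may differ from how the dataset records it, and the helper names and sample rows are hypothetical.

```python
from collections import defaultdict


def release_type(filename: str) -> str:
    """Classify an uploaded file by its extension (a simplifying assumption)."""
    if filename.endswith(".whl"):
        return "wheel"
    if filename.endswith((".tar.gz", ".zip")):
        return "source"
    return "other"


def projects_without_source(files):
    """Return the sorted names of projects that have no source package."""
    types_by_project = defaultdict(set)
    for project, filename in files:
        types_by_project[project].add(release_type(filename))
    return sorted(
        project
        for project, types in types_by_project.items()
        if "source" not in types
    )


# Hypothetical (project_name, filename) rows, for illustration only.
sample_files = [
    ("demo-a", "demo_a-1.0.tar.gz"),
    ("demo-a", "demo_a-1.0-py3-none-any.whl"),
    ("demo-b", "demo_b-2.1-py3-none-any.whl"),
]

wheel_only = projects_without_source(sample_files)
print(wheel_only)
```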

Source code for the second data extraction.

Publication date

30/12/2023

Author(s)

Françoise CONIL

Keyword(s)

python, packaging, build-backend, pyproject.toml

See all versions of the dataset

pypi-projects-backends

Total size and number of files

2.8 GB, 8 files

URL to view the dataset

https://gitlab.liris.cnrs.fr/fconil-small-programs/packaging/get-pypi-packages-…

Download the dataset (by clicking this link, you accept the licence associated with this dataset)

FTP

HTTPS

elaphes-cave - version1

Version

version1

Licence

Creative Commons

This dataset holds

a 3D lasergrammetric dataset of the so-called "Creux des Elaphes" cave, as authored by EDYTEM / USMB / CNRS (in the LAS laser point cloud format)
its conversion to the 3DTiles format

Publication date

24/11/2023

Author(s)

EDYTEM / USMB / CNRS and LIRIS (VCity team)

Keyword(s)

cave, laser point cloud, LAS, 3DTiles

Total size and number of files

781 MB for the original LAZ file (94,465,067 RGB points)

URL to view the dataset

Download the dataset (by clicking this link, you accept the licence associated with this dataset)

FTP

HTTPS

AMETHYST Dataset : PDFs of Epoxy/Amine-related Publications and Properties Tables Images - version1

Version

version1

Licence

Creative Commons

This dataset originates from the AMETHYST project. It comprises a collection of PDFs and images that undergo Machine Learning and NLP processing to extract tables containing information about Epoxy/Amine (EA) compounds and their properties.

Publication date

20/09/2023

Author(s)

Aymar TCHAGOUE, Véronique EGLIN, Jean-Marc PETIT, Jannick DUCHET, Sébastien PRUVOST, Jean-Francois GERARD

Keyword(s)

Epoxy/Amine, Polymer, Chemistry, NLP, Machine Learning, OCR, Resin

Total size and number of files

4.3 GB, 4 folders, 1612 files

URL to view the dataset

Download the dataset (by clicking this link, you accept the licence associated with this dataset)

FTP

HTTPS

FruitBin - version1

Version

version1

Licence

Creative Commons

FruitBin contains more than 1M images and 40M instance-level 6D pose annotations over both symmetric and asymmetric fruits with or without texture. Rich annotations and metadata (including 6D pose, segmentation mask, point cloud, 2D and 3D bounding boxes, occlusion rate) allow the tuning of the proposed dataset for benchmarking the robustness of object instance segmentation and 6D pose estimation models (with respect to variations in terms of lighting, texture, occlusion, camera pose and scenes). We further propose three scenarios presenting significant challenges for 6D pose estimation models: new scene generalization; new camera viewpoint generalization; and occlusion robustness. We show the results of these three scenarios for two 6D pose estimation baselines making use of RGB or RGBD images. To the best of our knowledge, FruitBin is the first dataset for the challenging task of fruit bin picking and the biggest large-scale dataset for 6D pose estimation with the most comprehensive challenges, tunable over scenes, camera poses and occlusions.

License: CC BY-NC-SA

Publication date

08/06/2023

Author(s)

Guillaume Duret, Mahmoud Ali, Nicolas Cazin, Alexandre Chapin, Florence Zara, Emmanuel Dellandrea, Jan Peters, Liming Chen

Keyword(s)

6D pose estimation, Occlusion, Bin picking, Fruits, Computer vision

See all versions of the dataset

FruitsBin

Total size and number of files

Raw data ~7x250 GB, Benchmarks ~7x20 GB

URL to view the dataset

Download the dataset (by clicking this link, you accept the licence associated with this dataset)

FTP

HTTPS

Eagle - version1

Version

version1

Licence

Creative Commons

Estimating fluid dynamics is classically done through the simulation and integration of numerical models solving the Navier-Stokes equations, which is computationally complex and time-consuming even on high-end hardware. This is a notoriously hard problem to solve, which has recently been addressed with machine learning, in particular graph neural networks (GNN) and variants trained and evaluated on datasets of static objects in static scenes with fixed geometry. We attempt to go beyond existing work in complexity and introduce a new model, method and benchmark. We propose EAGLE, a large-scale dataset of ∼1.1 million 2D meshes resulting from simulations of unsteady fluid dynamics caused by a moving flow source interacting with nonlinear scene structure, comprised of 600 different scenes of three different types. To perform future forecasting of pressure and velocity on the challenging EAGLE dataset, we introduce a new mesh transformer. It leverages node clustering, graph pooling and global attention to learn long-range dependencies between spatially distant data points without needing a large number of iterations, as existing GNN methods do. We show that our transformer outperforms the state of the art both on existing synthetic and real datasets and on EAGLE. Finally, we highlight that our approach learns to attend to airflow, integrating complex information in a single iteration.

Publication date

30/01/2023

Author(s)

Steeven Janny, Aurélien Bénéteau, Madiha Nadri, Julie Digne, Nicolas Thome, Christian Wolf

Keyword(s)

deep learning, fluid mechanics

See all versions of the dataset

eagle-dataset

URL to view the dataset

https://eagle-dataset.github.io/

Download the dataset (by clicking this link, you accept the licence associated with this dataset)

FTP

HTTPS