The CIMPLO project aims to develop a cross-industry predictive maintenance optimization platform that addresses the real-world requirements of dynamic, scalable, multiple-criteria maintenance scheduling. The platform consists of three components: Data Infrastructure, Modeling Engine, and Optimization Engine.
The Modeling Engine uses techniques from Predictive Maintenance to predict the Remaining Useful Life (RUL) of vehicle components (brakes, engines, wheels, etc.), while the Optimization Engine uses Dynamic Schedule Optimization to (re-)create maintenance schedules based on the RUL predictions and the resources (personnel, tools, materials) available in the workshops.
For a more detailed description of the project, see https://cimplo.nl/project/.
The Data Infrastructure acts as the back-end of both Engines: it ingests data from maintenance procedures, from sensors in the vehicles, and from company resources, and it supports all operations the Engines require, e.g., statistical analysis, machine learning, SQL queries, and data exploration.
The Infrastructure tackles the following problems: data can come in different formats (CSV, JSON, plain text, …), be stored in different locations, contain incorrect or missing values, and need to be managed in a time- and/or memory-efficient manner. To address these problems, the Data Infrastructure is divided into three steps:
- Data Preparation and Analysis: This step collects the data from each source, translates it into a common structure, and loads it into a database management system (DBMS), in our case the open-source analytical DBMS MonetDB, for analysis and exploration;
- Data Cleaning: Dirty data can sabotage the results of some algorithms. This step cleans the data, correcting detected missing values and outliers or warning about them;
- Modeling Algorithms Integration: The final step integrates the Engines' algorithms into the database system so that they execute inside the warehouse, close to the data, as opposed to the classical approach of moving the data from the warehouse to the program.
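To give a flavour of that last idea: MonetDB supports embedded Python UDFs (MonetDB/Python), which let a function run inside the database, directly on the stored columns. The sketch below is only illustrative, not the project's actual integration; it assumes a server with embedded Python enabled, and the table, column, and function names (predictions, rul, rul_margin) are invented for the example.

```python
# Hedged sketch: run a computation inside MonetDB via an embedded
# Python UDF (MonetDB/Python) instead of exporting the data first.
# Table, column, and function names are invented for illustration,
# and the server must have embedded Python support enabled.
import pymonetdb

conn = pymonetdb.connect(username="monetdb", password="monetdb",
                         hostname="localhost", database="cimplo")
cur = conn.cursor()

# MonetDB/Python hands each argument to the function body as a NumPy
# array, so the expression below runs vectorized, next to the data.
cur.execute("""
    CREATE OR REPLACE FUNCTION rul_margin(rul REAL, threshold REAL)
    RETURNS REAL
    LANGUAGE PYTHON {
        return rul - threshold
    };
""")

# Only the (small) result set leaves the warehouse.
cur.execute("SELECT component, rul_margin(rul, 100.0) FROM predictions;")
print(cur.fetchall())
conn.commit()
```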
This blog post describes the first step, Data Preparation and Analysis. In later posts, the other two steps will be explored.
Data Preparation and Analysis
The Data Preparation and Analysis step is the first stage of data ingestion. This is where data is collected from the different sources, pre-processed, and ingested into our warehouse so that other researchers in the project can work with it.
So far, the only data format we have received is CSV (comma-separated values): tabular ASCII data with no information about column types. The data is also centralized, as it comes from only one of our partners. However, the pipeline will be expanded to work with any kind of file, independent of its location.
Figure 1 presents the general idea of the data pipeline developed for these initial challenges. Data passes through two pre-processing stages, Schema Discovery followed by Compression and Loading, before reaching the Warehouse.
Figure 1: Data Preparation and Analysis pipeline.
Stage 1: Schema Discovery. To leverage the analytical warehouse's full capabilities, the type of each column needs to be determined beforehand. The first stage therefore takes each file, reads its contents, and identifies the best schema: the one that represents the file with the smallest possible types without losing information. It then outputs the CSV file plus its metadata.
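As a rough illustration of the idea (not the project's actual implementation), the sketch below scans each column of a CSV file and picks the narrowest SQL type that still holds every value; the type names and ranges loosely follow MonetDB's integer types.

```python
# Illustrative sketch of schema discovery: for every CSV column, pick
# the narrowest SQL type that represents all values without loss.
import csv

INT_TYPES = [("TINYINT", 2**7), ("SMALLINT", 2**15),
             ("INT", 2**31), ("BIGINT", 2**63)]

def infer_type(values):
    """Return the smallest SQL type that holds every value in the column."""
    try:
        ints = [int(v) for v in values]
        for name, bound in INT_TYPES:
            if all(-bound <= i < bound for i in ints):
                return name
    except ValueError:
        pass
    try:
        [float(v) for v in values]
        return "DOUBLE"
    except ValueError:
        # Fall back to text, sized to the longest value seen.
        return f"VARCHAR({max(len(v) for v in values)})"

def discover_schema(path):
    """Map each column name in the CSV header to an inferred type."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = list(zip(*reader))  # column-wise view of the rows
    return {name: infer_type(col) for name, col in zip(header, columns)}
```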
Stage 2: Compression and Loading. As we expect considerable amounts of data, it is necessary to compress all files as much as possible while keeping them searchable at reasonable speed. This stage analyzes the schema and the CSV file to find the best compressed yet searchable representation. Then the data is loaded into the Data Warehouse so that every other CIMPLO researcher can work with it.
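For the loading part, a minimal sketch using MonetDB's bulk loader, COPY INTO, could look as follows; the table name, file path, and schema dictionary are hypothetical, and the path must be readable by the database server.

```python
# Hedged sketch of the loading stage: create a table from the schema
# discovered in stage 1, then bulk-load the CSV with MonetDB's
# COPY INTO. Table name, file path, and schema are hypothetical.
import pymonetdb

schema = {"component": "VARCHAR(32)", "rul": "DOUBLE"}  # from stage 1

conn = pymonetdb.connect(username="monetdb", password="monetdb",
                         hostname="localhost", database="cimplo")
cur = conn.cursor()

cols = ", ".join(f"{name} {sqltype}" for name, sqltype in schema.items())
cur.execute(f"CREATE TABLE predictions ({cols});")

# COPY INTO parses the file server-side, far faster than row-by-row
# INSERTs; OFFSET 2 skips the CSV header line, and the path is
# resolved on the server machine.
cur.execute("""
    COPY OFFSET 2 INTO predictions
    FROM '/data/predictions.csv'
    USING DELIMITERS ',', '\\n', '"';
""")
conn.commit()
```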
That concludes this post; we hope it gave you a first look at how CIMPLO works internally. Stay tuned for our next entries to learn even more about the other parts of the project.