Python SQL Cross-platform Docker Continuous Integration PostgreSQL TimescaleDB Neovim Linux Windows

SIDERA’s Data Warehouse (technology lead, research, software analysis)

This is one of the most significant projects I’ve had the opportunity to lead: a data warehouse built for SIDERA, the company’s core product. From research to design, implementation, and ongoing maintenance, I managed the entire life cycle of the project. The experience pushed me to dig deeper into advanced data management techniques, explore new tools, and apply best practices such as SOLID design principles, alongside Docker, to ensure scalability and maintainability.

Investigation & Research

Before jumping into the development phase, I focused on thorough research. I wanted to understand the core principles of designing a data warehouse and the tools that can help manage vast amounts of data efficiently. This research led me to discover a variety of tools and technologies I hadn’t worked with before, including:

- TimescaleDB, a PostgreSQL extension optimized for time-series data
- Docker, for containerized development and deployment
- Grafana and Power BI, for dashboards and reporting

I spent a significant amount of time learning how to apply these tools and technologies effectively, ensuring the solution I built was both reliable and efficient.

Data Model

I took a careful, methodical approach when designing the data model. The goal was to store historical data in an efficient and scalable way, while also ensuring easy integration with BI tools for reporting and dashboards.

I decided to store historical data in a single table with a flexible schema that could be easily integrated with data visualization tools like Grafana. The table holds one row per reading, with four columns: timestamp, device, variable, and value. This format provides a clear structure for time-series data.
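As a minimal sketch (with illustrative names, not the actual production schema), such a table could look like this:

```sql
-- One narrow row per reading; table and column names are illustrative.
CREATE TABLE measurements (
    ts        TIMESTAMPTZ      NOT NULL,  -- when the reading was taken
    device_id INTEGER          NOT NULL,  -- which device produced it
    var_id    INTEGER          NOT NULL,  -- which variable was measured
    value     DOUBLE PRECISION NOT NULL   -- the measured value
);
```

The narrow layout means new devices or variables require no schema changes, only new rows.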

For the database engine, I chose TimescaleDB, an extension of PostgreSQL that is optimized for time-series data. TimescaleDB allows us to define hypertables, which automatically partition data by time, ensuring that even large datasets are handled efficiently. To keep everything organized, I used a 3NF relational schema for the definition tables, which map device types, devices, and variables. All the tables were optimized with primary keys (PK), foreign keys (FK), and indexes to ensure fast query performance.
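To make the design concrete, here is a sketch of how the definition tables and the hypertable conversion could fit together, continuing with the illustrative names from the previous snippet:

```sql
-- 3NF definition tables mapping device types, devices, and variables.
CREATE TABLE device_types (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);

CREATE TABLE devices (
    id             SERIAL PRIMARY KEY,
    name           TEXT NOT NULL,
    device_type_id INTEGER NOT NULL REFERENCES device_types(id)
);

CREATE TABLE variables (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);

-- Foreign keys from the historical table to the definition tables.
ALTER TABLE measurements
    ADD FOREIGN KEY (device_id) REFERENCES devices(id),
    ADD FOREIGN KEY (var_id)    REFERENCES variables(id);

-- Convert the historical table into a hypertable partitioned by time.
SELECT create_hypertable('measurements', 'ts');

-- Index for fast per-device, per-variable range queries.
CREATE INDEX ON measurements (device_id, var_id, ts DESC);
```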

In addition to the base tables, I created several views for ease of querying, including one that aggregates everything into a single data table by joining the historical data table with the definition tables. I also introduced a materialized view that stores a cached version of the definition tables, which speeds up queries and ensures data consistency. Triggers refresh this materialized view whenever the definition tables change; since those tables change rarely, the refresh overhead is negligible.
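A sketch of how the joined view and the trigger-refreshed materialized view could be wired up, again with illustrative names:

```sql
-- Convenience view joining readings with their definitions.
CREATE VIEW measurements_full AS
SELECT m.ts, d.name AS device, dt.name AS device_type,
       v.name AS variable, m.value
FROM measurements m
JOIN devices      d  ON d.id  = m.device_id
JOIN device_types dt ON dt.id = d.device_type_id
JOIN variables    v  ON v.id  = m.var_id;

-- Cached copy of the definition tables.
CREATE MATERIALIZED VIEW definitions_cache AS
SELECT d.id AS device_id, d.name AS device, dt.name AS device_type
FROM devices d
JOIN device_types dt ON dt.id = d.device_type_id;

-- Refresh the cache whenever a definition table changes.
CREATE FUNCTION refresh_definitions_cache() RETURNS trigger AS $$
BEGIN
    REFRESH MATERIALIZED VIEW definitions_cache;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- A real setup would mirror this trigger on each definition table.
CREATE TRIGGER devices_changed
AFTER INSERT OR UPDATE OR DELETE ON devices
FOR EACH STATEMENT EXECUTE FUNCTION refresh_definitions_cache();
```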

ETL: Building a Custom Solution

As I explored existing ETL (Extract, Transform, Load) solutions, I realized that the complexity of our needs required a more tailored approach, so I decided to build a custom ETL solution to connect directly to SIDERA’s API and stream data in real time. This custom solution would:

- Connect directly to SIDERA’s API as the data source
- Stream incoming data in real time
- Apply the transformations needed to match the warehouse’s data model
- Load the results into TimescaleDB

This approach allowed for greater flexibility and customization compared to pre-built solutions, and it aligned with the unique needs of our system.
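The ETL code itself is not reproduced here, but one detail worth sketching is how the load step can stay idempotent if a stream batch is ever replayed. This assumes a unique index the real schema may or may not have (on a hypertable, a unique index must include the time column):

```sql
-- Hypothetical uniqueness guarantee for replay-safe loading.
CREATE UNIQUE INDEX ON measurements (ts, device_id, var_id);

-- The insert issued per streamed reading; duplicates are silently skipped.
INSERT INTO measurements (ts, device_id, var_id, value)
VALUES ('2024-01-01T00:00:00Z', 1, 1, 42.0)
ON CONFLICT DO NOTHING;
```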

Implementation

With all the research and design work in place, I proceeded to build an MVP (Minimum Viable Product) for the system. To test the data import and ETL pipeline, I first exported historical data from SIDERA as CSV files, applying necessary transformations to match the data model of the warehouse.

Instead of performing a direct bulk load, I decided to develop a tool that allowed for smoother CSV imports into the PostgreSQL database, ensuring that the data was properly structured and cleaned before being stored. This tool was an excellent way to polish the code and ensure that the system was functioning correctly before scaling it further.
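The underlying pattern such a tool can build on is a staged load: copy the raw CSV into a temporary table as text, then cast and filter while moving rows into the real table. A sketch with illustrative names and paths:

```sql
-- Stage raw rows as text so malformed values don't abort the whole load.
CREATE TEMP TABLE measurements_staging (
    ts        TEXT,
    device_id TEXT,
    var_id    TEXT,
    value     TEXT
);

-- Server-side COPY; from psql, \copy works without server file access.
COPY measurements_staging FROM '/tmp/sidera_export.csv'
    WITH (FORMAT csv, HEADER true);

-- Cast and clean while inserting into the target table.
INSERT INTO measurements (ts, device_id, var_id, value)
SELECT ts::timestamptz, device_id::int, var_id::int,
       value::double precision
FROM measurements_staging
WHERE value ~ '^-?[0-9]+(\.[0-9]+)?$';  -- skip rows with non-numeric values
```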

Once the data import was successful, I tested the data stream reader connected to SIDERA’s API. Finally, I turned my attention to integrating the system with Power BI and Grafana to visualize the data and generate reports.
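As an example of the payoff, a Grafana panel backed by the PostgreSQL data source can query the joined view directly ($__timeFilter is a Grafana macro that expands to the dashboard’s selected time range; the names are illustrative):

```sql
SELECT ts AS "time", value
FROM measurements_full
WHERE device = 'device-01'       -- hypothetical device name
  AND variable = 'temperature'   -- hypothetical variable name
  AND $__timeFilter(ts)          -- Grafana substitutes the selected range
ORDER BY ts;
```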

Docker Integration

To improve the scalability and portability of the solution, I decided to introduce Docker into the development and deployment process. Using Docker allowed me to:

- Containerize the database and the ETL services
- Reproduce identical environments for development, testing, and production
- Roll the warehouse out to new projects quickly

Docker helped significantly reduce deployment time and ensured consistency across multiple environments, from development to production.
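For illustration only (these are not the actual deployment files), a Docker Compose setup along these lines pairs the official TimescaleDB image with a hypothetical ETL container:

```yaml
# Minimal sketch; service names, image tag, and paths are assumptions.
services:
  db:
    image: timescale/timescaledb:latest-pg16
    environment:
      POSTGRES_PASSWORD: example          # use secrets in real deployments
    volumes:
      - pgdata:/var/lib/postgresql/data   # persist data across restarts
    ports:
      - "5432:5432"
  etl:
    build: ./etl                          # hypothetical ETL service image
    depends_on:
      - db
volumes:
  pgdata:
```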

Deployment and Maintenance

As of now, my primary responsibility is deploying the data warehouse across various projects within the company and maintaining its operational health. This involves:

- Deploying and configuring the warehouse for new projects
- Monitoring performance and overall operational health
- Applying fixes and schema updates as requirements evolve

It’s immensely gratifying to see the data warehouse integrate seamlessly into the company’s product, providing valuable insights to all stakeholders. Knowing that something I built now supports key operations across multiple teams is a genuine point of pride.

Conclusion: Key Skills and Takeaways

Throughout this project, I’ve learned a great deal not only about data management but also about scalable architecture and system design. Some of the key skills and principles I applied include:

- Researching and evaluating new tools and technologies
- Time-series data modeling with PostgreSQL and TimescaleDB
- Custom ETL design and implementation in Python
- SOLID principles and maintainable code
- Docker-based deployment and continuous integration

This project, without a doubt, reflects my curiosity and commitment to continuous learning. It’s been a fulfilling experience that I’m excited to continue building on in the future.

