# Data Workflow Assignment — HoGent Datalinux Labs

---

## Context

This project, made for the "Data workflow" assignment of the course "Linux for Data Scientists (2024-2025)", aims to show my skill in extracting live data from public APIs and translating it into a coherent database and report.

I chose to do this assignment entirely in English and focused on Guernsey, as I will (hopefully) be doing an internship there soon. This makes the project not only a great addition to my portfolio but also an extra challenge, since relevant, live APIs for Guernsey are hard to find.

---

## Overview

This project contains a full data workflow for analyzing and reporting on datasets related to the weather and to boat and plane arrivals in Guernsey.
The workflow extracts, processes, and visualizes this data daily, produces reports in multiple formats (Markdown, HTML, PDF, and Word), and keeps a backup of every PDF before the reports are overwritten the next day.

Live, relevant, and sufficiently sizeable Guernsey data was hard to find. I ended up using WeatherAPI.com (https://www.weatherapi.com/weather/q/guernsey) for the weather and data.gg (https://data.gg/Developers) for the air and sea traffic.

---

## Project Structure

```
data_workflow/
├── data/                   # Raw input JSON data, organized by timestamped folders
│   └── (timestamp)/
│       ├── flights.json    # Raw JSON data of all ARRIVING flights
│       ├── sailings.json   # Raw JSON data of ALL boats
│       └── weather.json    # Raw JSON weather data
├── output/
│   ├── old_reports/        # Old PDF reports
│   ├── plots/              # PNG plots generated from the data
│   ├── combined.csv        # Combined CSV data from JSON sources
│   └── report.*            # Report in all formats
├── scripts/
│   ├── analyze_data.py     # Analyses and plots the data, saves the plots as PNGs
│   ├── data_collect.sh     # Extracts the data from the APIs in raw JSON
│   ├── data_transform.sh   # Transforms the data from raw JSONs into a CSV
│   └── generate_report.sh  # Generates a report in multiple formats and moves the old report
├── run_workflow.sh         # Executes the scripts in the correct order; use for automation
└── README.md               # <── You're here!
```
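
Because every run lands in its own timestamped folder under `data/`, later steps need a way to pick the newest one. Below is a minimal Python sketch of that lookup; it is not the code from `data_transform.sh`. It assumes the folder names sort chronologically as plain strings (for example `20250101-060000`, a made-up format), and `latest_run_dir` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(data_root: Path) -> Optional[Path]:
    """Return the newest timestamped subfolder of data_root, or None if there is none."""
    # Assumption: folder names sort chronologically as plain strings,
    # e.g. "20250101-060000" (hypothetical format, not taken from the scripts).
    run_dirs = sorted(
        (p for p in data_root.iterdir() if p.is_dir()),
        key=lambda p: p.name,
    )
    return run_dirs[-1] if run_dirs else None
```

Calling `latest_run_dir(Path("data"))` from the project root would then return the folder of the most recent run, or `None` if nothing has been collected yet.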

---

## Workflow Description

1. **Data Extraction (`data_collect.sh`)**

    - Calls the Weather API and data.gg API for raw JSON data.
    - Creates a directory under `data/` named after the current timestamp.
    - Puts the raw JSON data in separate JSON files (`weather.json`, `sailings.json`, `flights.json`) in this directory.

2. **Data Transformation (`data_transform.sh`)**

    - Identifies the latest directory by timestamp in the data directory.
    - Extracts the relevant JSON files (`weather.json`, `sailings.json`, `flights.json`) containing the most recent data.
    - Uses `jq` to parse JSON files and extract key fields such as temperature, humidity, wind speed, and arrival information.
    - Organizes the extracted data with clear timestamps.
    - Processes the raw JSON data into a unified CSV format (`combined.csv`), combining the weather and arrival data.
    - Converts timestamps into date and time fields.
    - Calculates arrival counts for boats and planes, and determines the most popular origins for each mode of transport.
    - Appends new data rows to the existing CSV, maintaining a running record.
    - Validates data integrity and handles missing or incomplete data.

3. **Data Analysis (`analyze_data.py`)**

    - Loads the combined CSV file into a Pandas DataFrame for processing.
    - Computes rolling averages for arrivals to smooth fluctuations and support predictions.
    - Generates multiple informative visualizations, including:
        - Temperature and humidity trends over time with dual y-axes.
        - Boat and plane arrivals over time with actual counts and predicted next-day values.
        - Regional distribution of the most popular origins (by day) for boats and planes shown in bar charts.
    - Saves plots in PNG format in an output directory.

4. **Report Generation (`generate_report.sh`)**

    - Compiles a Markdown report with the analyses and placeholder text (Lorem Ipsum).
    - Moves the existing PDF report to `output/old_reports/` before overwriting, preserving a history of previous reports.
    - Uses Pandoc to convert the Markdown report into multiple formats: HTML, PDF, and Word.

5. **Automation (`run_workflow.sh`)**

    - Acts as the main script that executes all stages of the workflow in the correct order.
    - Enables scheduling (e.g., via cron) for automatic (daily) execution.
    - Daily runs matter because the data.gg API has no historical records, so the history has to be built up run by run.
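
To make step 2 more concrete, here is a rough Python sketch of flattening one run into a CSV row and appending it to the running record. The actual script does this with `jq` in the shell, so treat it as an illustration only. The column names, the `current.temp_c`-style weather keys, and the `origin` key on each flight are assumptions and may not match the real API responses or `combined.csv`.

```python
import csv
from collections import Counter
from pathlib import Path

# Hypothetical column names; the real combined.csv may use different ones.
FIELDS = ["date", "time", "temp_c", "humidity", "wind_kph",
          "plane_arrivals", "top_plane_origin"]


def build_row(timestamp: str, weather: dict, flights: list) -> dict:
    """Flatten one run's parsed JSON into a single CSV row."""
    # Split "YYYY-MM-DD HH:MM" (assumed format) into date and time fields.
    date, _, time = timestamp.partition(" ")
    # A missing "current" block yields empty cells instead of a crash.
    current = weather.get("current", {})
    # Count origins, skipping flights that lack the (assumed) "origin" key.
    origins = Counter(f["origin"] for f in flights if f.get("origin"))
    top_origin = origins.most_common(1)[0][0] if origins else ""
    return {
        "date": date,
        "time": time,
        "temp_c": current.get("temp_c", ""),
        "humidity": current.get("humidity", ""),
        "wind_kph": current.get("wind_kph", ""),
        "plane_arrivals": len(flights),
        "top_plane_origin": top_origin,
    }


def append_row(csv_path: Path, row: dict) -> None:
    """Append a row, writing the header only when the file is new or empty."""
    is_new = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Appending rather than rewriting keeps earlier days in the file, which is what allows the analysis step to look at trends over time.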

---

## Tools and Techniques

* **Shell scripting:** Extraction, workflow orchestration, backup management
* **jq:** JSON parsing and data extraction from raw data files
* **Python & Pandas:** Data processing, rolling averages, and complex visualizations
* **Matplotlib:** Plot generation
* **Pandoc:** Report conversion (Markdown -> HTML, PDF, DOCX)
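
The rolling averages mentioned above can be illustrated without any dependencies. The sketch below computes a trailing mean in plain Python and uses the latest value as a naive next-day estimate. Both the window size of 3 and the idea of using the last rolling mean as the prediction are assumptions; `analyze_data.py` may use a different window or method. With Pandas, the same trailing mean corresponds to `Series.rolling(window, min_periods=1).mean()`.

```python
from collections import deque
from typing import Iterable, List


def rolling_mean(values: Iterable[float], window: int = 3) -> List[float]:
    """Trailing mean over at most the `window` most recent values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # drops the oldest value automatically
    means = []
    for value in values:
        recent.append(value)
        means.append(sum(recent) / len(recent))
    return means


def predict_next(values: Iterable[float], window: int = 3) -> float:
    """Naive next-day estimate: the latest rolling mean.

    This is an assumed stand-in, not necessarily the method in analyze_data.py.
    """
    means = rolling_mean(values, window)
    if not means:
        raise ValueError("need at least one value to predict from")
    return means[-1]
```

Early entries are averaged over fewer than `window` values, so the smoothed series has the same length as the input.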

---

## Usage

1. Make sure all necessary dependencies are installed (see Tools and Techniques)
2. Make sure you are in the `data_workflow/` directory
3. Make sure you have added your API key to `data_collect.sh`
4. Within a Linux environment (Ubuntu, WSL, Git Bash, ...) run the workflow script:

```bash
./run_workflow.sh
```
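
The dependency check in step 1 can be done before the first run. The following is a small Python sketch that looks for the command-line tools on the `PATH`. The tool list is an assumption: `jq` and `pandoc` are named in this README, while `curl` and `python3` are guesses about what the scripts call. It does not check Python packages such as Pandas or Matplotlib.

```python
import shutil
from typing import List, Sequence

# Assumed tool list: jq and pandoc appear in this README; curl and python3 are guesses.
REQUIRED_TOOLS = ("jq", "pandoc", "curl", "python3")


def missing_tools(tools: Sequence[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```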

---

## Conclusion

This project demonstrates the full cycle of a data workflow, from collecting real-time data through public APIs, to transforming it into a structured format, performing analysis, generating visual reports, and automating the process for daily execution.

While the scope was limited to Guernsey’s weather and transport data, which might not be the most relevant, especially given the time scale on which this project will run, the principles and architecture can serve as a blueprint for similar projects in which more relevant data is compared over time.

For example:

- Will fewer boats arrive when there is a lot of wind?
- Will people in the Netherlands cycle less when it rains?
- Does an increase in ice cream sales cause higher rates of pickpocketing?

These types of questions can be explored with a project that uses the same structure as this one, provided it draws on the relevant APIs.
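
As a rough illustration of how the first question could be approached, the sketch below computes a Pearson correlation between two equally long series, for instance daily wind speed and daily boat arrivals. The function name `pearson` and the idea of feeding it columns from `combined.csv` are assumptions; nothing like this is part of the current scripts.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A result close to -1 would suggest that windier days tend to see fewer arrivals, and a result near 0 would suggest no linear relationship. A strong correlation on its own does not show that one variable causes the other, which is exactly the catch in the ice cream question.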
