DigiScan360: End-to-End Data Analysis for Competitive Intelligence

Team

Md Kamrul Islam, Hareem Raza, Muhammad Qasim Khan, and Narmina Mahmudova

Overview

DigiScan360 is an end-to-end intelligence system designed to address the challenges of competitive analysis in modern digital markets for headphone companies. Market-relevant information is inherently fragmented across e-commerce platforms, expert review websites, and social media, spanning structured records, semi-structured logs, and unstructured text. Traditional analytics pipelines, largely centered on relational models, are insufficient to integrate these sources holistically or to capture complex relationships between products, brands, and consumer behavior.

This project unifies large-scale data ingestion, distributed processing, analytical modeling, and graph-based semantic representations within a single architecture. By combining data warehouse analytics with knowledge graph modeling and metadata-aware integration, DigiScan360 enables both quantitative analysis and relationship-driven reasoning over heterogeneous market data.

Design Objectives

Integrate heterogeneous data — Structured and unstructured data from multiple external sources
Preserve raw data fidelity — Enable scalable, distributed transformations while maintaining data integrity
Enrich with semantic representations — Large-scale datasets augmented with graph-based and semantic models
Support multi-mode analytics — Descriptive, predictive, and relational analytics
Enable extensibility — Traceability and schema evolution through knowledge graph-based metadata

Key Contributions

Designed an end-to-end, reproducible big data and semantic intelligence pipeline
Integrated data warehouse analytics with knowledge graphs and metadata-aware modeling
Formalized sentiment, strength, and weakness analysis as LLM-based feature engineering
Enabled multi-modal analytics and communicated the outcomes through Power BI dashboards

System Architecture

DigiScan360 follows a layered, end-to-end architecture that integrates heterogeneous data ingestion, scalable storage, distributed processing, semantic modeling, and analytical exploitation. Data flows through progressively refined stages from raw source data to curated analytical tables and semantic graphs while preserving lineage and traceability.

The architecture combines a data warehouse-based analytical backbone with graph-based semantic representations. Relational models support scalable quantitative analysis, while knowledge graphs capture higher-order relationships between products, brands, consumers, and content.

DigiScan360 system architecture — **Figure 1.** DigiScan360 end-to-end pipeline from data sources through ingestion, storage, processing, semantic modeling, and analytical exploitation.

Figure 1 illustrates the complete DigiScan360 pipeline from data sources to analytical and semantic exploitation. The following sections describe each stage in detail.

Data Collection and Storage

Data collection

Data is collected from e-commerce platforms (Amazon, MediaMarkt), expert review websites (CNET), and social media platforms (Facebook, Twitter). Source-specific Python collectors encapsulate extraction logic while emitting standardized outputs, isolating downstream processing from source-specific changes and enabling extensibility.
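
The collector pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `RawRecord` envelope, the `Collector` base class, and the `StaticReviewCollector` stand-in are all hypothetical names, and the stand-in reads from an in-memory list rather than a real website.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Iterable, Iterator


@dataclass(frozen=True)
class RawRecord:
    """Standardized envelope emitted by every collector (hypothetical shape)."""
    source: str              # e.g. "amazon", "cnet"
    record_type: str         # e.g. "product", "review", "post"
    collected_at: str        # ISO-8601 UTC timestamp of the collection run
    payload: dict[str, Any]  # source-specific fields, passed through untouched


class Collector(ABC):
    """Base class: subclasses hide extraction details behind fetch()."""
    source: str = "unknown"

    @abstractmethod
    def fetch(self) -> Iterable[dict[str, Any]]:
        """Return raw items in whatever shape the source provides."""

    def record_type(self) -> str:
        return "record"

    def collect(self) -> Iterator[RawRecord]:
        # Wrap every raw item in the same envelope so downstream code
        # never depends on source-specific structure.
        now = datetime.now(timezone.utc).isoformat()
        for item in self.fetch():
            yield RawRecord(self.source, self.record_type(), now, item)


class StaticReviewCollector(Collector):
    """Stand-in collector backed by an in-memory list (illustration only)."""
    source = "cnet"

    def __init__(self, items):
        self._items = list(items)

    def record_type(self) -> str:
        return "review"

    def fetch(self):
        return self._items


records = list(
    StaticReviewCollector([{"title": "Headphone X review", "score": 8.5}]).collect()
)
```

Because only `fetch()` knows about the source, a change in a website's layout or an API response would, under this design, be absorbed inside a single subclass.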

For Facebook and Twitter, where API access is restricted or costly, synthetic datasets are generated from the official API specifications and augmented with noise and missing values to approximate real-world conditions. All collection processes are logged and versioned to support reproducibility and auditing.
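
One way such a synthetic generator might look is sketched below. The field names, brand placeholders, and the `missing_rate` and `noise_rate` parameters are assumptions made for illustration; they are not taken from the Facebook or Twitter API specifications or from the project's generator.

```python
import random


def make_synthetic_posts(n, missing_rate=0.1, noise_rate=0.05, seed=42):
    """Generate n fake social posts with injected noise and missing values."""
    rng = random.Random(seed)  # seeded so every run is reproducible
    brands = ["BrandA", "BrandB", "BrandC"]  # placeholder brand names
    posts = []
    for i in range(n):
        post = {
            "id": f"post_{i}",
            "brand": rng.choice(brands),
            "message": "Sample post text",
            "likes": rng.randint(0, 5000),
            "shares": rng.randint(0, 500),
        }
        # Noise: occasionally distort casing and whitespace in a text field.
        if rng.random() < noise_rate:
            post["brand"] = f"  {post['brand'].upper()} "
        # Missing values: null out optional fields at the configured rate.
        for name in ("message", "likes", "shares"):
            if rng.random() < missing_rate:
                post[name] = None
        posts.append(post)
    return posts


sample = make_synthetic_posts(100)
```

Fixing the seed keeps a generated dataset identical across runs, which fits the reproducibility goal stated above.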

Storage and landing design

The architecture employs a two-stage landing strategy:

Temporal Landing Zone: Raw CSV and JSON files are stored in Azure Blob Storage in their original form via automated upload scripts. This preserves data fidelity and supports schema-on-read processing, auditing, and reprocessing.
Persistent Lakehouse: Azure Data Factory orchestrates ingestion pipelines that transfer validated data into a Microsoft Fabric lakehouse, where data is stored as versioned tables with transactional guarantees.

This separation decouples ingestion from downstream processing, enabling scalable transformations, fault isolation, and long-term maintainability.
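
The two-stage idea can be illustrated on a local filesystem. The sketch below is only an analogy: it uses temporary directories in place of Azure Blob Storage and the Fabric lakehouse, the helper names `land_raw` and `promote` are hypothetical, and numbered JSON files stand in for versioned tables without providing any transactional guarantees.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def land_raw(src: Path, landing_root: Path, source: str, run_id: str) -> Path:
    """Stage 1: copy the raw file byte-for-byte into the landing zone."""
    dest = landing_root / source / run_id / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return dest


def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def promote(landed: Path, persistent_root: Path, required: set[str]) -> Path:
    """Stage 2: keep only valid records and write them as a new version."""
    rows = json.loads(landed.read_text())
    valid = [r for r in rows if required.issubset(r.keys())]
    table_dir = persistent_root / landed.parent.parent.name  # one dir per source
    table_dir.mkdir(parents=True, exist_ok=True)
    version = len(list(table_dir.glob("v*.json"))) + 1
    out = table_dir / f"v{version}.json"
    out.write_text(json.dumps(valid))
    return out


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw = root / "reviews.json"
    raw.write_text(json.dumps([{"id": 1, "text": "great bass"}, {"text": "no id"}]))

    landed = land_raw(raw, root / "landing", "amazon", "run_001")
    fidelity_ok = sha256(raw) == sha256(landed)  # raw copy is unchanged

    table = promote(landed, root / "lakehouse", {"id", "text"})
    promoted_rows = json.loads(table.read_text())
    table_name = table.name
```

The landed copy is never modified, so a failed or revised validation step can be rerun against the same raw input.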

Data Processing and Feature Engineering

After ingestion into the persistent lakehouse, data is processed using distributed Spark-based workflows designed to clean, enrich, and align heterogeneous datasets for both analytical and semantic use.

Data cleaning and integration

Raw datasets undergo schema normalization, duplicate removal, missing-value handling, and noise reduction in textual fields. Multiple e-commerce sources are merged into unified product- and brand-level representations, enabling consistent downstream analysis across platforms.
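
The project performs these steps in Spark. The plain-Python sketch below only illustrates the logic on a handful of rows. The column aliases, the deduplication key, and the `defaults` argument are assumptions chosen for the example.

```python
import re

# Hypothetical mapping from source-specific column names to unified ones.
COLUMN_ALIASES = {
    "product_name": "name",
    "title": "name",
    "brand_name": "brand",
}


def normalize_row(row):
    """Unify column names and tidy text values."""
    out = {}
    for key, value in row.items():
        key = key.strip().lower()
        key = COLUMN_ALIASES.get(key, key)
        if isinstance(value, str):
            value = re.sub(r"\s+", " ", value).strip()  # collapse whitespace
            if value == "":
                value = None  # treat empty strings as missing
        out[key] = value
    return out


def clean(rows, key_fields=("brand", "name"), defaults=None):
    """Normalize, fill missing values, and drop duplicates (first one wins)."""
    defaults = defaults or {}
    seen, result = set(), []
    for row in map(normalize_row, rows):
        for name, default in defaults.items():
            if row.get(name) is None:
                row[name] = default
        # Case-insensitive key so "ACME" and "Acme" count as the same brand.
        key = tuple((row.get(f) or "").lower() for f in key_fields)
        if key in seen:
            continue
        seen.add(key)
        result.append(row)
    return result


rows = [
    {"Title": "  Model  X ", "Brand_Name": "Acme", "price": 99.0},
    {"product_name": "model x", "brand": "ACME", "price": None},
    {"name": "Model Y", "brand": "Acme", "price": None},
]
cleaned = clean(rows, defaults={"category": "unknown"})
```

Here the second row is dropped as a duplicate of the first, and both surviving rows receive the default category. A keep-first policy is the simplest choice; a production pipeline might instead prefer the most complete or most recent record.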

Feature engineering and representation learning

To enrich analytical value beyond basic aggregates, the system derives higher-level features from unstructured text:

Sentiment features extracted from reviews and social media content
Text embeddings generated for product descriptions to support similarity analysis
Product and brand similarity metrics computed using vector-based nearest-neighbor search

Engineered features are materialized as versioned tables in the lakehouse and selectively propagated to the data warehouse for analytical querying.
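
As an illustration of the similarity step, the sketch below runs an exact, brute-force nearest-neighbor search with cosine similarity. It is a stand-in for a vector index, and the three-dimensional vectors and product identifiers are made up. Real text embeddings have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def nearest(query_id, embeddings, k=2):
    """Return the k products most similar to query_id, best first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings: p1 and p2 point in nearly the same direction.
embeddings = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.9, 0.1, 0.0],
    "p3": [0.0, 1.0, 0.0],
}
top = nearest("p1", embeddings, k=2)
```

In this toy data `p2` ranks ahead of `p3` for the query `p1`. Brute-force search scans every vector, which is fine for small catalogs; a dedicated vector index trades some exactness for speed at scale.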

LLM-based semantic feature generation

In addition to statistical and embedding-based features, DigiScan360 integrates a controlled LLM-based enrichment workflow to extract qualitative semantic insights such as product and brand strengths and weaknesses. LLM outputs are treated as structured analytical features rather than free-form text.

LLM enrichment workflow for semantic feature generation — **Figure 2.** LLM-based enrichment workflow for semantic feature generation.
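
Treating LLM output as a structured feature usually means validating it before it reaches a table. The sketch below assumes the model is prompted to answer in JSON with `sentiment`, `strengths`, and `weaknesses` keys. That response shape, the allowed sentiment labels, and the `ReviewInsight` type are assumptions for illustration, not the project's actual schema.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}  # assumed label set


@dataclass
class ReviewInsight:
    sentiment: str
    strengths: List[str] = field(default_factory=list)
    weaknesses: List[str] = field(default_factory=list)


def _as_clean_list(value) -> List[str]:
    """Coerce a JSON value into a list of trimmed, lower-cased strings."""
    if not isinstance(value, list):
        return []
    return [str(item).strip().lower() for item in value if str(item).strip()]


def parse_llm_output(raw: str) -> Optional[ReviewInsight]:
    """Return a validated insight, or None if the response is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # free-form text instead of JSON
    if not isinstance(data, dict):
        return None
    sentiment = str(data.get("sentiment", "")).strip().lower()
    if sentiment not in ALLOWED_SENTIMENTS:
        return None  # reject labels outside the controlled vocabulary
    return ReviewInsight(
        sentiment,
        _as_clean_list(data.get("strengths")),
        _as_clean_list(data.get("weaknesses")),
    )


insight = parse_llm_output(
    '{"sentiment": "Positive", "strengths": ["Battery life "], "weaknesses": []}'
)
```

Rejected responses return `None` instead of raising, so a batch job can count failures and retry or skip them without stopping.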

Formatting and Exploitation

The formatting and exploitation layer bridges large-scale data processing with downstream analytical and semantic consumption. PySpark-based formatters standardize and enrich datasets before materializing them into structures optimized for analytics, graph modeling, and visualization.

Knowledge Graph and Property Graph Modeling

DigiScan360 employs complementary graph paradigms:

A Knowledge Graph implemented in GraphDB using RDF and SPARQL to support semantic interoperability, metadata management, and ontology-driven reasoning
A Property Graph implemented using Neo4j to enable efficient traversal-based analysis of product and brand relationships

Local schemas are mapped to global schemas using a Local-as-View (LAV) approach, enabling controlled integration and schema evolution.
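
A heavily simplified view of the Local-as-View idea is sketched below: each source is declared as a view over one global concept, with its local columns renamed to global attributes. Full LAV mappings are queries over the global schema, and answering global queries requires query rewriting, which this sketch does not attempt. The source names, column names, and global attribute names are hypothetical.

```python
# Each local source is described in terms of the global schema.
LAV_MAPPINGS = {
    "amazon_products": {
        "concept": "Product",
        "attrs": {"asin": "productId", "title": "name", "brand": "brandName"},
    },
    "mediamarkt_products": {
        "concept": "Product",
        "attrs": {"sku": "productId", "name": "name", "manufacturer": "brandName"},
    },
}


def to_global(source, row):
    """Rename a local row's columns to global attributes, keeping provenance."""
    mapping = LAV_MAPPINGS[source]
    out = {"_concept": mapping["concept"], "_source": source}
    for local_name, global_name in mapping["attrs"].items():
        if local_name in row:
            out[global_name] = row[local_name]
    return out


def sources_for(concept, wanted_attrs):
    """List the sources whose view exposes every requested global attribute."""
    wanted = set(wanted_attrs)
    return [
        source
        for source, mapping in LAV_MAPPINGS.items()
        if mapping["concept"] == concept
        and wanted.issubset(mapping["attrs"].values())
    ]


unified = to_global("mediamarkt_products", {"sku": "MM-1", "name": "Model X", "manufacturer": "Acme"})
candidates = sources_for("Product", ["name", "brandName"])
```

In this arrangement a new source is integrated by adding one mapping entry while the global schema stays untouched.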

Vector Search and LLM Integration

Vector embeddings are indexed in Pinecone to support efficient similarity-based product comparison
LLaMA-3, accessed via the Groq API, is used for sentiment analysis and semantic insight extraction
A fault-tolerant execution strategy prioritizes external inference
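
The source does not detail the fault-tolerance strategy, so the following is one plausible reading rather than the project's implementation: try the external model first, retry with exponential backoff, and only then fall back to a local substitute. The function names, retry counts, and the keyword-based fallback are all assumptions.

```python
import time


def run_with_fallback(text, primary, fallback, retries=2, base_delay=0.5):
    """Prefer the primary (external) model; use the fallback if it keeps failing.

    Returns a (result, origin) pair so downstream tables can record
    which path produced each value.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return primary(text), "primary"
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < retries:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return fallback(text), f"fallback ({type(last_error).__name__})"


def always_down(text):
    """Simulates an unreachable external inference API."""
    raise TimeoutError("external API unreachable")


def keyword_fallback(text):
    """Crude local substitute used only when the external call fails."""
    return "negative" if "bad" in text.lower() else "neutral"


result = run_with_fallback(
    "Bad battery", always_down, keyword_fallback, retries=1, base_delay=0.0
)
```

Recording the origin next to each result keeps lower-quality fallback values distinguishable in later analysis.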

Semantic and Graph-Based Modeling

To overcome the limitations of purely relational models in representing complex, evolving relationships, DigiScan360 incorporates a semantic and graph-based modeling layer. This layer provides a unified, machine-interpretable representation of domain entities and relationships, while explicitly modeling metadata such as provenance, schema mappings, and transformation logic. Treating metadata as a first-class graph entity enables traceability, controlled schema evolution, and reproducible integration.

End-to-end semantic integration BPMN workflow — **Figure 3.** End-to-end semantic integration and graph construction workflow.

Key aspects include:

Global semantic schema design
Local-to-global schema mappings
Graph-based metadata and provenance management
BPMN-modeled semantic integration
Relationship-centric analytical queries
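
To make the idea of provenance as a graph entity concrete, the sketch below serializes one product as RDF triples in N-Triples syntax, including a link back to the file it came from. The `example.org` namespace, the `Product`, `name`, and `hasBrand` terms, and the file name are hypothetical. `prov:wasDerivedFrom` is a term from the W3C PROV-O vocabulary.

```python
from urllib.parse import quote

EX = "http://example.org/digiscan360/"  # hypothetical project namespace
PROV = "http://www.w3.org/ns/prov#"     # W3C PROV-O namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value):
    return f"<{value}>"


def lit(text):
    """Serialize a plain string literal with N-Triples escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def product_triples(product_id, name, brand, source_file):
    """Describe a product, its brand, and the raw file it was derived from."""
    product = iri(EX + "product/" + quote(product_id, safe=""))
    brand_node = iri(EX + "brand/" + quote(brand, safe=""))
    source = iri(EX + "source/" + quote(source_file, safe=""))
    return [
        (product, iri(RDF_TYPE), iri(EX + "Product")),
        (product, iri(EX + "name"), lit(name)),
        (product, iri(EX + "hasBrand"), brand_node),
        (product, iri(PROV + "wasDerivedFrom"), source),  # provenance edge
    ]


def to_ntriples(triples):
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


nt = to_ntriples(product_triples("B0-123", 'Model "X"', "Acme Audio", "amazon_2024.csv"))
```

Because the source file is itself a node, a SPARQL query can filter or group products by origin in the same way it handles any other relationship.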

Analytical Exploitation and Results

The analytical exploitation layer operationalizes the DigiScan360 pipeline by exposing curated analytical and semantic data through interactive views. Engineered features are materialized in a star-schema data warehouse and accessed through analytical queries optimized for descriptive and comparative analysis.
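
A star schema of this kind can be illustrated with an in-memory SQLite database. The table names, columns, and figures below are invented for the example and do not reflect the project's warehouse schema or results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables hold descriptive attributes.
    CREATE TABLE dim_brand (brand_id INTEGER PRIMARY KEY, brand_name TEXT NOT NULL);
    CREATE TABLE dim_platform (platform_id INTEGER PRIMARY KEY, platform_name TEXT NOT NULL);
    -- The fact table holds measures keyed by the dimensions.
    CREATE TABLE fact_engagement (
        brand_id INTEGER REFERENCES dim_brand(brand_id),
        platform_id INTEGER REFERENCES dim_platform(platform_id),
        likes INTEGER,
        shares INTEGER,
        sentiment_score REAL
    );
    """
)
conn.executemany("INSERT INTO dim_brand VALUES (?, ?)", [(1, "BrandA"), (2, "BrandB")])
conn.executemany("INSERT INTO dim_platform VALUES (?, ?)", [(1, "Twitter"), (2, "Facebook")])
conn.executemany(
    "INSERT INTO fact_engagement VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 120, 10, 0.6), (1, 2, 80, 5, 0.2), (2, 1, 200, 40, -0.1), (2, 2, 50, 2, 0.4)],
)

# Cross-platform brand comparison: total engagement and mean sentiment.
rows = conn.execute(
    """
    SELECT b.brand_name,
           SUM(f.likes + f.shares) AS engagement,
           AVG(f.sentiment_score)  AS avg_sentiment
    FROM fact_engagement AS f
    JOIN dim_brand AS b ON b.brand_id = f.brand_id
    GROUP BY b.brand_name
    ORDER BY engagement DESC
    """
).fetchall()
```

Queries of this shape, which join a central fact table to small dimension tables and aggregate, are what a dashboard tool typically issues against a star schema.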

Power BI dashboards serve as both an exploitation layer and a validation layer: they enable systematic exploration of market dynamics, brand performance, and consumer engagement, and they help validate upstream ingestion, processing, and modeling decisions.

Twitter competitor analysis dashboard; Facebook page analysis dashboard — **Figure 4.** Analytical exploitation through Power BI dashboards built on curated warehouse tables.

Key outcomes include:

Descriptive analysis of engagement and reach
Cross-platform brand comparison
Aggregation of semantic and sentiment features
Validation of end-to-end data pipelines
Complementary graph-based exploration

Technologies

Python, PySpark, LLMs (LLaMA-3 via the Groq API), Microsoft Fabric, Azure Blob Storage, Azure Data Factory, Power BI, GraphDB, SPARQL, Neo4j, Pinecone