The Yelp Big Data Dive: Part 1 — Working on the Menu
“Without big data, you are blind and deaf and in the middle of a freeway.” — Geoffrey Moore
In today’s digital era, data-driven decision-making is paramount for businesses seeking to optimize their revenue streams. Yelp, a popular platform for business reviews and user-generated content, is no exception. Yelp primarily generates revenue through a Cost-Per-Click (CPC) model, which charges businesses for every click their ad receives on the platform. To enhance this revenue model, we can leverage Yelp’s extensive dataset, which includes detailed information about businesses, user reviews, and user interactions.
Understanding Yelp’s Revenue Model: CPC
Cost-per-click (CPC) is a digital advertising model where advertisers pay a fee each time their ad is clicked. This model benefits businesses by ensuring they only pay for actual engagement rather than mere ad views. For Yelp, this translates into revenue generated from local businesses eager to attract potential customers.
The Yelp Dataset: A Treasure Trove of Information
The Yelp dataset is a comprehensive collection of over 9GB of user reviews, business details, and user information. This dataset is valuable for deriving insights that can drive targeted advertising strategies.

Enhancing Revenue: The Big Data Strategy
Developing a Dashboard for Business Category Analysis
To utilize the Yelp dataset effectively, I developed a Streamlit-based dashboard that focuses on analyzing business categories (a minimal sketch of the idea follows the list below). This dashboard enables us to:
- Identify the most popular and trending business categories.
- Understand user demographics and preferences related to these categories.
- Track user engagement and interaction patterns with different businesses.
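As a minimal sketch of the idea (not the actual dashboard code), a Streamlit page could read a small per-category summary exported by the PySpark job; the Parquet path and column names below are assumptions for illustration.

import pandas as pd
import streamlit as st

st.title("Yelp Business Categories")

# Assumption: the PySpark job exported per-category business counts to a small
# Parquet file with columns "category" and "business_count" (hypothetical path).
categories = pd.read_parquet("data/category_counts.parquet")

top_n = st.slider("Categories to show", min_value=5, max_value=50, value=20)
top = categories.sort_values("business_count", ascending=False).head(top_n)

st.bar_chart(top.set_index("category")["business_count"])
st.dataframe(top)

Running it with streamlit run on port 8501 fits neatly with the port the spark-notebook service exposes later in the compose file.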
Creating a Big Data Ecosystem
Given the volume of the Yelp dataset, a robust big data ecosystem is essential. My first step was to create a Docker Compose setup for the Big Data project. This setup includes:
- Hadoop HDFS: A scalable and fault-tolerant storage system for handling large volumes of data.
- Jupyter Notebook with PySpark: An interactive environment for data processing and analysis using PySpark, which offers the capability to handle big data in a distributed manner.
- Hive: A data warehousing solution that provides a SQL-like interface for querying large datasets.
- Hue: A web-based interface for easier interaction with the Hadoop ecosystem.

Preparing the Docker Compose ingredients
To prepare the docker-compose file, I followed my last post on creating a docker-compose for PostgreSQL and went a step further, building one for the whole Big Data project.
This docker-compose defines the services listed above: the Hadoop components (namenode, datanode, and resourcemanager), the Hive components (hive-server, hive-metastore, and hive-metastore-postgresql), the Hue UI, and a Jupyter notebook for Spark.
Namenode (Hadoop)
- Image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
- Purpose: Acts as the master server, managing the file system namespace and controlling access to files by clients.
- Configuration: Uses volumes for storage and an environment file for configuration.
- Ports: Exposes port 50070 for web UI.
namenode:
  image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
  volumes:
    - type: volume
      source: namenode
      target: /hadoop/dfs/name
    - type: bind
      source: ./hadoop
      target: /home/hadoop
  environment:
    - CLUSTER_NAME=test
  env_file:
    - ./hadoop-hive.env
  ports:
    - "50070:50070"
Datanode (Hadoop)
- Image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
- Purpose: Stores data in the Hadoop cluster; it serves read and write requests from the file system’s clients.
- Ports: Exposes port 50075.
datanode:
  image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
  volumes:
    - datanode:/hadoop/dfs/data
  env_file:
    - ./hadoop-hive.env
  environment:
    SERVICE_PRECONDITION: "namenode:50070"
  ports:
    - "50075:50075"
Resourcemanager (Hadoop)
- Image: bde2020/hadoop-resourcemanager:2.0.0-hadoop2.7.4-java8
- Purpose: Manages the resources and scheduling of user applications.
resourcemanager:
  image: bde2020/hadoop-resourcemanager:2.0.0-hadoop2.7.4-java8
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075"
  env_file:
    - ./hadoop-hive.env
Hive Server
- Image: bde2020/hive:2.3.2-postgresql-metastore
- Purpose: Provides a JDBC interface for querying data stored in Hadoop.
- Ports: Exposes port 10000 for JDBC connections.
hive-server:
  image: bde2020/hive:2.3.2-postgresql-metastore
  env_file:
    - ./hadoop-hive.env
  environment:
    HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
    SERVICE_PRECONDITION: "hive-metastore:9083"
  ports:
    - "10000:10000"
Hive Metastore
- Image: bde2020/hive:2.3.2-postgresql-metastore
- Purpose: Stores metadata for Hive tables (like schema and location).
- Ports: Exposes port 9083 for metastore service.
hive-metastore:
  image: bde2020/hive:2.3.2-postgresql-metastore
  env_file:
    - ./hadoop-hive.env
  command: /opt/hive/bin/hive --service metastore
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432 resourcemanager:8088"
  ports:
    - "9083:9083"
Hive Metastore Postgresql
- Image: bde2020/hive-metastore-postgresql:2.3.0
- Purpose: Backend database for storing Hive metadata.
- Ports: Exposes PostgreSQL default port 5432.
- Description: Exposing port 5432 also keeps this PostgreSQL instance available in case a general-purpose database is needed.
hive-metastore-postgresql:
  image: bde2020/hive-metastore-postgresql:2.3.0
  ports:
    - "5432:5432"
Hue Database (huedb)
- Image: postgres:12.1-alpine
- Purpose: A dedicated PostgreSQL database that backs the Hue interface.
- Ports: Exposes PostgreSQL's port 5432 inside the Compose network (mapped to a random host port).
huedb:
  image: postgres:12.1-alpine
  volumes:
    - pg_data:/var/lib/postgresql/data/
  ports:
    - "5432"
  env_file:
    - ./hadoop-hive.env
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432 resourcemanager:8088 hive-metastore:9083"
Hue
- Image: gethue/hue:4.6.0
- Purpose: Web-based interactive query editor for the Hadoop ecosystem that lets you run SQL-like queries against the Hive warehouse.
- Ports: Maps Hue's container port 8888 to port 8000 on the host.
hue:
  image: gethue/hue:4.6.0
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432 resourcemanager:8088 hive-metastore:9083 huedb:5000"
  ports:
    - "8000:8888"
  volumes:
    - ./hue-overrides.ini:/usr/share/hue/desktop/conf/hue-overrides.ini
  links:
    - huedb
Spark Notebook
- Image: jupyter/pyspark-notebook
- Purpose: Provides a Jupyter notebook environment with Spark integration. This is where I did all the data wrangling with PySpark and built the DataFrames used in the dashboard (a sketch of that wrangling follows the service definition below).
- Ports: Exposes ports for the Jupyter web UI (8888), the Spark web UI (4040), and Streamlit (8501).
spark-notebook:
  image: jupyter/pyspark-notebook
  user: root
  ports:
    - 8888:8888 # Jupyter web UI
    - 4040:4040 # Spark web UI when a local session is started
    - 8501:8501 # Streamlit port
  volumes:
    - ./:/home/jovyan/work
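To give a flavour of the wrangling done in this notebook, here is a sketch that reads the business file from HDFS and counts businesses per category. The HDFS path and port (8020) are assumptions that depend on hadoop-hive.env and on where the raw files were uploaded, and spark is the Hive-enabled session sketched earlier.

# Sketch of the category wrangling (assumed HDFS path and port).
from pyspark.sql import functions as F

business = spark.read.json("hdfs://namenode:8020/yelp/business.json")

# "categories" is a comma-separated string in the Yelp business file; split it
# into one row per category and count businesses per category.
category_counts = (
    business
    .where(F.col("categories").isNotNull())
    .select(F.explode(F.split("categories", ",\\s*")).alias("category"))
    .groupBy("category")
    .count()
    .withColumnRenamed("count", "business_count")
    .orderBy(F.desc("business_count"))
)

category_counts.show(20, truncate=False)

A small export of this DataFrame is exactly the kind of summary the Streamlit dashboard sketched earlier could read.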
You can see the whole docker-compose on my GitHub account.
Putting it all together, the whole data ecosystem looks like this:

Finally!
After firing up the stack with docker-compose up, I had the perfect Big Data ecosystem for starting my project!
To be continued…
PS: check out my Docker Compose repository for more container setups for Data Science projects. I hope this helps your journey, too!