The Yelp Big Data Dive: Part 1 — Working on the Menu
“Without big data, you are blind and deaf and in the middle of a freeway.” — Geoffrey Moore
In today’s digital era, data-driven decision-making is paramount for businesses seeking to optimize their revenue streams. Yelp, a popular platform for business reviews and user-generated content, is no exception. Yelp primarily generates revenue through a Cost-Per-Click (CPC) model, which charges businesses for every click their ad receives on the platform. To enhance this revenue model, we can leverage Yelp’s extensive dataset, which includes detailed information about businesses, user reviews, and user interactions.
Understanding Yelp’s Revenue Model: CPC
Cost-per-click (CPC) is a digital advertising model where advertisers pay a fee each time their ad is clicked. This model benefits businesses by ensuring they only pay for actual engagement rather than mere ad views. For Yelp, this translates into revenue generated from local businesses eager to attract potential customers.
The Yelp Dataset: A Treasure Trove of Information
The Yelp dataset is a comprehensive collection of over 9GB of user reviews, business details, and user information. This dataset is valuable for deriving insights that can drive targeted advertising strategies.

Enhancing Revenue: The Big Data Strategy
Developing a Dashboard for Business Category Analysis
To utilize the Yelp dataset effectively, I developed a Streamlit-based dashboard that focuses on analyzing business categories (a minimal sketch of the idea follows the list below). This dashboard enables us to:
- Identify the most popular and trending business categories.
- Understand user demographics and preferences related to these categories.
- Track user engagement and interaction patterns with different businesses.
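As a minimal sketch of the idea (not the actual dashboard code), a Streamlit page could read a small per-category summary exported by the PySpark job; the Parquet path and column names below are assumptions for illustration.

import pandas as pd
import streamlit as st

st.title("Yelp Business Categories")

# Assumption: the PySpark job exported per-category business counts to a small
# Parquet file with columns "category" and "business_count" (hypothetical path).
categories = pd.read_parquet("data/category_counts.parquet")

top_n = st.slider("Categories to show", min_value=5, max_value=50, value=20)
top = categories.sort_values("business_count", ascending=False).head(top_n)

st.bar_chart(top.set_index("category")["business_count"])
st.dataframe(top)

Running it with streamlit run on port 8501 fits neatly with the port the spark-notebook service exposes later in the compose file.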
Creating a Big Data Ecosystem
Given the volume of the Yelp dataset, a robust big data ecosystem is essential. My first step was to create a Docker Compose setup for the Big Data project. This setup includes:
- Hadoop HDFS: A scalable and fault-tolerant storage system for handling large volumes of data.
- Jupyter Notebook with PySpark: An interactive environment for data processing and analysis using PySpark, which offers the capability to handle big data in a distributed manner.
- Hive: A data warehousing solution that provides a SQL-like interface for querying large datasets.
- Hue: A web-based interface for easier interaction with the Hadoop ecosystem.

Preparing the Docker Compose ingredients
To prepare the docker-compose file, I followed my last post on creating a docker-compose for PostgreSQL and went a step further, building one for the whole Big Data project.
This docker-compose defines the services listed above: the Hadoop components (namenode, datanode, and resourcemanager), the Hive components (hive-server, hive-metastore, and hive-metastore-postgresql), the Hue UI, and a Jupyter notebook for Spark.
Namenode (Hadoop)
- Image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
- Purpose: Acts as the master server, managing the file system namespace and controlling access to files by clients.
- Configuration: Uses volumes for storage and an environment file for configuration.
- Ports: Exposes port 50070 for web UI.
namenode:
  image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
  volumes:
    - type: volume
      source: namenode
      target: /hadoop/dfs/name
    - type: bind
      source: ./hadoop
      target: /home/hadoop
  environment:
    - CLUSTER_NAME=test
  env_file:
    - ./hadoop-hive.env
  ports:
    - "50070:50070"
Datanode (Hadoop)
- Image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
- Purpose: Stores data in the Hadoop cluster; it serves read and write requests from the file system’s clients.
- Ports: Exposes port 50075.
datanode:
  image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
  volumes:
    - datanode:/hadoop/dfs/data
  env_file:
    - ./hadoop-hive.env
  environment:
    SERVICE_PRECONDITION: "namenode:50070"
  ports:
    - "50075:50075"
Resourcemanager (Hadoop)
- Image: bde2020/hadoop-resourcemanager:2.0.0-hadoop2.7.4-java8
- Purpose: Manages the resources and scheduling of user applications.
resourcemanager:
  image: bde2020/hadoop-resourcemanager:2.0.0-hadoop2.7.4-java8
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075"
  env_file:
    - ./hadoop-hive.env
Hive Server
- Image: bde2020/hive:2.3.2-postgresql-metastore
- Purpose: Provides a JDBC interface for querying data stored in Hadoop.
- Ports: Exposes port 10000 for JDBC connections.
hive-server:
  image: bde2020/hive:2.3.2-postgresql-metastore
  env_file:
    - ./hadoop-hive.env
  environment:
    HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
    SERVICE_PRECONDITION: "hive-metastore:9083"
  ports:
    - "10000:10000"
Hive Metastore
- Image: bde2020/hive:2.3.2-postgresql-metastore
- Purpose: Stores metadata for Hive tables (like schema and location).
- Ports: Exposes port 9083 for metastore service.
hive-metastore:
  image: bde2020/hive:2.3.2-postgresql-metastore
  env_file:
    - ./hadoop-hive.env
  command: /opt/hive/bin/hive --service metastore
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432 resourcemanager:8088"
  ports:
    - "9083:9083"
Hive Metastore Postgresql
- Image: bde2020/hive-metastore-postgresql:2.3.0
- Purpose: Backend database for storing Hive metadata.
- Ports: Exposes PostgreSQL default port 5432.
- Description: Exposing port 5432 also keeps this PostgreSQL instance available in case a general-purpose database is needed.
hive-metastore-postgresql:
  image: bde2020/hive-metastore-postgresql:2.3.0
  ports:
    - "5432:5432"
Hue Database (huedb)
- Image: postgres:12.1-alpine
- Purpose: A dedicated PostgreSQL database that backs the Hue interface.
- Ports: Exposes PostgreSQL's port 5432 inside the Compose network (mapped to a random host port).
huedb:
  image: postgres:12.1-alpine
  volumes:
    - pg_data:/var/lib/postgresql/data/
  ports:
    - "5432"
  env_file:
    - ./hadoop-hive.env
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432 resourcemanager:8088 hive-metastore:9083"
Hue
- Image: gethue/hue:4.6.0
- Purpose: Web-based interactive query editor for the Hadoop ecosystem that lets you run SQL-like queries against the Hive warehouse.
- Ports: Maps Hue's container port 8888 to port 8000 on the host.
hue:
  image: gethue/hue:4.6.0
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432 resourcemanager:8088 hive-metastore:9083 huedb:5000"
  ports:
    - "8000:8888"
  volumes:
    - ./hue-overrides.ini:/usr/share/hue/desktop/conf/hue-overrides.ini
  links:
    - huedb
Spark Notebook
- Image: jupyter/pyspark-notebook
- Purpose: Provides a Jupyter notebook environment with Spark integration. This is where I did all the data wrangling with PySpark and built the DataFrames used in the dashboard (a sketch of that wrangling follows the service definition below).
- Ports: Exposes ports for the Jupyter web UI (8888), the Spark web UI (4040), and Streamlit (8501).
spark-notebook:
  image: jupyter/pyspark-notebook
  user: root
  ports:
    - 8888:8888 # Jupyter web UI
    - 4040:4040 # Spark web UI when a local session is started
    - 8501:8501 # Streamlit port
  volumes:
    - ./:/home/jovyan/work
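To give a flavour of the wrangling done in this notebook, here is a sketch that reads the business file from HDFS and counts businesses per category. The HDFS path and port (8020) are assumptions that depend on hadoop-hive.env and on where the raw files were uploaded, and spark is the Hive-enabled session sketched earlier.

# Sketch of the category wrangling (assumed HDFS path and port).
from pyspark.sql import functions as F

business = spark.read.json("hdfs://namenode:8020/yelp/business.json")

# "categories" is a comma-separated string in the Yelp business file; split it
# into one row per category and count businesses per category.
category_counts = (
    business
    .where(F.col("categories").isNotNull())
    .select(F.explode(F.split("categories", ",\\s*")).alias("category"))
    .groupBy("category")
    .count()
    .withColumnRenamed("count", "business_count")
    .orderBy(F.desc("business_count"))
)

category_counts.show(20, truncate=False)

A small export of this DataFrame is exactly the kind of summary the Streamlit dashboard sketched earlier could read.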
You can see the whole docker-compose on my GitHub account.
Putting it all together, the whole data ecosystem looks like this:

Finally!
After firing up the stack with docker-compose up, I had the perfect Big Data ecosystem for starting my project!
To be continued…
PS: check out my Docker Compose repository for more container setups for Data Science projects. I hope this helps your journey, too!