The Yelp Big Data Dive: Part 1 — Working on the Menu

Sebastian Carmona A.
5 min read · Dec 27, 2023


“Without big data, you are blind and deaf and in the middle of a freeway.” — Geoffrey Moore

In today’s digital era, data-driven decision-making is paramount for businesses seeking to optimize their revenue streams. Yelp, a popular platform for business reviews and user-generated content, is no exception. Yelp primarily generates revenue through a Cost-Per-Click (CPC) model, which charges businesses for every click their ad receives on the platform. To enhance this revenue model, we can leverage Yelp’s extensive dataset, which includes detailed information about businesses, user reviews, and user interactions.

Photo by Eaters Collective on Unsplash

Understanding Yelp’s Revenue Model: CPC

Cost-per-click (CPC) is a digital advertising model in which advertisers pay a fee each time their ad is clicked. This benefits businesses by ensuring they pay only for actual engagement rather than mere ad views; for example, an ad that draws 1,000 clicks at a $1.50 bid costs $1,500, no matter how many times it was merely displayed. For Yelp, this translates into revenue from local businesses eager to attract potential customers.

The Yelp Dataset: A Treasure Trove of Information

The Yelp dataset is a comprehensive collection of over 9GB of user reviews, business details, and user information. This dataset is valuable for deriving insights that can drive targeted advertising strategies.

Data sources description image by Author

Enhancing Revenue: The Big Data Strategy

Developing a Dashboard for Business Category Analysis

To utilize the Yelp dataset effectively, I developed a Streamlit-based dashboard that focuses on analyzing business categories (a minimal sketch follows the list below). This dashboard enables us to:

  • Identify the most popular and trending business categories.
  • Understand user demographics and preferences related to these categories.
  • Track user engagement and interaction patterns with different businesses.
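As a rough illustration of the dashboard's shape, here is a minimal Streamlit sketch. The CSV path and column names are assumptions for illustration (a small pre-aggregated extract of the kind the PySpark pipeline could produce), not the project's actual code:

# Minimal sketch of a category-analysis dashboard in Streamlit.
# Assumes a pre-aggregated CSV with columns: category, business_count,
# avg_stars, review_count (illustrative names, not the real pipeline output).
import pandas as pd
import streamlit as st

st.title("Yelp Business Category Analysis")

@st.cache_data
def load_categories(path: str = "data/category_stats.csv") -> pd.DataFrame:
    # Cache the load so Streamlit reruns stay fast
    return pd.read_csv(path)

df = load_categories()

top_n = st.slider("Top N categories", min_value=5, max_value=50, value=20)
top = df.sort_values("review_count", ascending=False).head(top_n)

st.subheader("Most-reviewed categories")
st.bar_chart(top.set_index("category")["review_count"])

st.subheader("Raw numbers")
st.dataframe(top)

Running it with streamlit run dashboard.py serves it on port 8501, which is exactly why that port is exposed on the spark-notebook service further down.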

Creating a Big Data Ecosystem

Given the volume of the Yelp dataset, a robust big data ecosystem is essential. My first step was to create a Docker Compose setup for the Big Data project. This setup includes:

  • Hadoop HDFS: A scalable and fault-tolerant storage system for handling large volumes of data.
  • Jupyter Notebook with PySpark: An interactive environment for data processing and analysis using PySpark, which offers the capability to handle big data in a distributed manner.
  • Hive: A data warehousing solution that provides a SQL-like interface for querying large datasets.
  • Hue: A web-based interface for easier interaction with the Hadoop ecosystem.
docker-compose services, image by author

Preparing the Docker Compose ingredients

To prepare the docker-compose file, I followed my previous post on creating a docker-compose for PostgreSQL. This time I went even further and built one for my whole Big Data project.

This docker-compose file defines the services listed above: the Hadoop components (namenode, datanode, and resourcemanager), the Hive components (hive-server, hive-metastore, and hive-metastore-postgresql), the Hue UI with its database, and a Jupyter notebook for Spark.

Namenode (Hadoop)

  • Image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
  • Purpose: Acts as the master server, managing the file system namespace and controlling access to files by clients.
  • Configuration: Uses volumes for storage and an environment file for configuration.
  • Ports: Exposes port 50070 for web UI.
namenode:
  image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
  volumes:
    - type: volume
      source: namenode
      target: /hadoop/dfs/name
    - type: bind
      source: ./hadoop
      target: /home/hadoop
  environment:
    - CLUSTER_NAME=test
  env_file:
    - ./hadoop-hive.env
  ports:
    - "50070:50070"

Datanode (Hadoop)

  • Image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
  • Purpose: Stores data in the Hadoop cluster; it serves read and write requests from the file system’s clients.
  • Ports: Exposes port 50075.
datanode:
  image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
  volumes:
    - datanode:/hadoop/dfs/data
  env_file:
    - ./hadoop-hive.env
  environment:
    SERVICE_PRECONDITION: "namenode:50070"
  ports:
    - "50075:50075"

Resourcemanager (Hadoop)

  • Image: bde2020/hadoop-resourcemanager:2.0.0-hadoop2.7.4-java8
  • Purpose: Manages the resources and scheduling of user applications.
resourcemanager:
  image: bde2020/hadoop-resourcemanager:2.0.0-hadoop2.7.4-java8
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075"
  env_file:
    - ./hadoop-hive.env

Hive Server

  • Image: bde2020/hive:2.3.2-postgresql-metastore
  • Purpose: Provides a JDBC interface for querying data stored in Hadoop.
  • Ports: Exposes port 10000 for JDBC connections.
hive-server:
  image: bde2020/hive:2.3.2-postgresql-metastore
  env_file:
    - ./hadoop-hive.env
  environment:
    HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
    SERVICE_PRECONDITION: "hive-metastore:9083"
  ports:
    - "10000:10000"

Hive Metastore

  • Image: bde2020/hive:2.3.2-postgresql-metastore
  • Purpose: Stores metadata for Hive tables (like schema and location).
  • Ports: Exposes port 9083 for metastore service.
hive-metastore:
  image: bde2020/hive:2.3.2-postgresql-metastore
  env_file:
    - ./hadoop-hive.env
  command: /opt/hive/bin/hive --service metastore
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432 resourcemanager:8088"
  ports:
    - "9083:9083"

Hive Metastore Postgresql

  • Image: bde2020/hive-metastore-postgresql:2.3.0
  • Purpose: Backend database for storing Hive metadata.
  • Ports: Exposes PostgreSQL default port 5432.
  • Description: The port is exposed in case direct access to a PostgreSQL database is needed.
hive-metastore-postgresql:
  image: bde2020/hive-metastore-postgresql:2.3.0
  ports:
    - "5432:5432"

Hue Database (huedb)

  • Image: postgres:12.1-alpine
  • Purpose: PostgreSQL database backing the Hue interface.
  • Ports: Exposes PostgreSQL's default port 5432 (mapped to an ephemeral host port, since no host port is given).
huedb:
  image: postgres:12.1-alpine
  volumes:
    - pg_data:/var/lib/postgresql/data/
  ports:
    - "5432"
  env_file:
    - ./hadoop-hive.env
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432 resourcemanager:8088 hive-metastore:9083"

Hue

  • Image: gethue/hue:4.6.0
  • Purpose: Web-based interactive query editor for the Hadoop ecosystem; it also lets you run SQL-like queries against the Hive warehouse.
  • Ports: Maps host port 8000 to Hue's container port 8888.
hue:
  image: gethue/hue:4.6.0
  environment:
    SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432 resourcemanager:8088 hive-metastore:9083 huedb:5000"
  ports:
    - "8000:8888"
  volumes:
    - ./hue-overrides.ini:/usr/share/hue/desktop/conf/hue-overrides.ini
  links:
    - huedb

Spark Notebook

  • Image: jupyter/pyspark-notebook
  • Purpose: Provides a Jupyter notebook environment with Spark integration. This is where I did all the data wrangling with PySpark and built the PySpark dataframes used in the dashboard (a quick sketch follows the service definition below).
  • Ports: Exposes ports for the Jupyter web UI, the Spark web UI, and Streamlit.
spark-notebook:
  image: jupyter/pyspark-notebook
  user: root
  ports:
    - "8888:8888" # Jupyter web UI
    - "4040:4040" # Spark web UI when a local session is started
    - "8501:8501" # Streamlit port
  volumes:
    - ./:/home/jovyan/work
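
To sanity-check the stack from the notebook, a minimal PySpark sketch could look like the following. The HDFS URI (namenode:8020) and the /yelp upload path are assumptions for illustration, not the project's actual paths:

# Minimal sketch: load one Yelp file from HDFS inside the spark-notebook
# container. Assumes the newline-delimited JSON files were uploaded to
# HDFS under /yelp beforehand; adjust the URI to your fs.defaultFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yelp-eda").getOrCreate()

business = spark.read.json(
    "hdfs://namenode:8020/yelp/yelp_academic_dataset_business.json"
)

business.printSchema()
business.select("name", "city", "stars", "categories").show(5, truncate=False)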

You can see the whole docker-compose file on my GitHub account.

Once everything is wired together, the whole data ecosystem looks like this:

Big Data Ecosystem. Image by Author

Finally!

After firing up the stack with docker-compose up -d, I had the perfect Big Data ecosystem to start my project!

To be continued…

Photo by Kelly Sikkema on Unsplash

PS: check my Docker Compose repository for different docker containers for Data Science projects! I hope this helps your journey, too!
