Using the Hadoop Ecosystem to Process Big Data

Harshit Sharma
5 min read · Apr 25, 2023

Hadoop is an open-source software framework that allows for the distributed storage and processing of large datasets across clusters of computers using simple programming models, most notably MapReduce. It is designed to scale from a single computer to thousands of clustered machines, with each machine offering local computation and storage.

Hadoop as a big data ecosystem provides a platform for solving data problems. It includes Apache projects and other commercial tools that work together to provide services like ingestion, storage, analysis, and maintenance of your data. Some of the well-known Hadoop big data tools include HDFS, MapReduce, Pig, and Spark.

Key Use Cases for Hadoop:

  1. Batch processing: Hadoop is ideal for processing large volumes of data in batch mode. For example, a company may want to analyze millions of customer transactions to identify patterns or anomalies. Hadoop can process these transactions in parallel and provide insights that would be difficult or impossible to obtain using traditional relational databases (see the sketch after this list).
  2. Data warehousing: Hadoop can be used to store and manage large volumes of structured and unstructured data. This can include data from a variety of sources, such as social media, sensor data, or log files. Hadoop’s distributed file system can provide a cost-effective alternative to traditional data warehousing solutions.
  3. Data analysis: Hadoop’s MapReduce framework can be used to perform complex data analysis tasks, such as data mining, text analytics, and machine learning. Hadoop’s distributed processing capabilities enable parallel processing of large datasets, enabling faster and more accurate analysis.
  4. Stream processing: Hadoop can also play a role in real-time data processing, such as processing data from sensors or social media feeds. Technologies like Apache Storm or Apache Flink can run alongside Hadoop (for example, on YARN) to process data streams in near real time.
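
As a concrete illustration of the batch-processing use case above, here is a minimal Hadoop Streaming sketch in Python that counts the number of outgoing edges per user in a space-separated edge list (the same "from_node to_node" format used later in this post). The script names, input/output paths, and jar location are illustrative assumptions rather than part of the original project.

# degree_mapper.py: emit "user<TAB>1" for every outgoing edge
import sys

for line in sys.stdin:
    line = line.strip()
    if not line or line.startswith("#"):
        continue
    from_node, to_node = line.split()
    print(f"{from_node}\t1")

# degree_reducer.py: sum the counts for each user (Hadoop sorts mapper output by key)
import sys

current_user, count = None, 0
for line in sys.stdin:
    user, value = line.rstrip("\n").split("\t")
    if user == current_user:
        count += int(value)
    else:
        if current_user is not None:
            print(f"{current_user}\t{count}")
        current_user, count = user, int(value)
if current_user is not None:
    print(f"{current_user}\t{count}")

The two scripts would then be submitted as a streaming job, for example:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files degree_mapper.py,degree_reducer.py \
    -mapper "python3 degree_mapper.py" -reducer "python3 degree_reducer.py" \
    -input /data/edges.txt -output /data/out_degrees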

For our project, we used Hadoop specifically for preprocessing and data analysis: we preprocessed a large dataset with Hadoop and then applied the k-means clustering algorithm to identify groups or communities within the data. This is a common use case for Hadoop, as it is well suited to processing and analyzing large volumes of data to gain insights and inform business decisions.

Dataset Used:

For this blog post, we will use the Twitter dataset from Stanford SNAP, which contains the social network of a Twitter user and their followers. The dataset can be downloaded from the following link: https://snap.stanford.edu/data/ego-Twitter.html

The “ego-Twitter” dataset is a collection of social network data from Twitter. It is made up of ego networks, where each ego network consists of a single user (the “ego”) and all of their direct connections (friends and followers) on Twitter. The dataset contains information on the users, their friends and followers, and the relationships between them.

Network data like this is typically represented in a few standard formats: edge lists, node lists, and adjacency matrices. Edge lists describe the relationships between users, while node lists describe the users themselves. An adjacency matrix represents the network as a matrix in which each row and column corresponds to a user, and each entry indicates whether a relationship exists between the two users.
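
To make these formats concrete, here is a quick sketch using NetworkX, assuming the combined edge list from the SNAP page has been downloaded and unpacked as twitter_combined.txt (the file name is an assumption):

import networkx as nx

# Read the space-separated edge list into an undirected graph,
# matching how the rest of this post treats the network.
G = nx.read_edgelist("twitter_combined.txt", nodetype=int)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# The same network as a sparse adjacency matrix: rows and columns are users,
# and a non-zero entry means a relationship exists between the two users.
A = nx.adjacency_matrix(G)
print(A.shape)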

Parsing and Storing Facebook Network Data using Python:

Social networks are all around us, from Facebook and Twitter to professional networking sites like LinkedIn. These networks allow us to connect with friends, family, colleagues, and people with similar interests. But have you ever wondered how these networks are structured and how you can analyze them? In this blog post, we’ll explore how to use Python and NetworkX to analyze social networks.

Before we can analyze our social network, we need to load and preprocess our data. In this example, we’ll use a dataset that contains information about the Facebook friendships of a group of users. We’ll start by loading the data from a text file using the parse_txt_file function:

def parse_txt_file(file_path):
    vertices = set()
    edges = []

    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()

            # Skip blank lines and comment lines
            if not line or line.startswith("#"):
                continue

            # Each remaining line is an edge: "from_node to_node"
            from_node, to_node = line.split()

            vertices.add(from_node)
            vertices.add(to_node)

            edges.append({"_from": "vertices/" + from_node, "_to": "vertices/" + to_node})

    return [{"_key": v} for v in vertices], edges


vertex_data, edge_data = parse_txt_file("facebook_combined.txt")

In this code snippet, we define the parse_txt_file function which reads in a text file and extracts the vertices and edges. We then call this function to get the vertex and edge data for our Facebook friendship dataset.
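
The heading promises storing as well as parsing: the parsed vertices and edges can be written to ArangoDB, which is where the community-detection step later in this post reads them back from. Below is a minimal sketch using the python-arango driver; the host, database name, and credentials are placeholders, and the collection names vertices and edges match the _from/_to references built above.

from arango import ArangoClient

# Connect to ArangoDB (host, database name, and credentials are placeholders)
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("social_network", username="root", password="password")

# Create the document and edge collections if they do not already exist
if not db.has_collection("vertices"):
    db.create_collection("vertices")
if not db.has_collection("edges"):
    db.create_collection("edges", edge=True)

# Bulk-insert the parsed vertex and edge documents
db.collection("vertices").insert_many(vertex_data)
db.collection("edges").insert_many(edge_data)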

Identifying Communities through Labeling and Clustering Techniques:

Community detection is an essential task in network analysis, where the goal is to identify groups of nodes that are more densely connected within the group than outside. In this blog post, we will explore the process of community detection using the Louvain method in Python.

To start, we load vertices and edges from ArangoDB and create a graph using igraph. We then apply the Louvain community detection algorithm to the graph to identify communities within it. We print the number of communities identified, and get the membership list for each node.
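
The code below also relies on an execute_query helper that the original post does not show. A minimal sketch of what it might look like with the python-arango driver (host, database name, and credentials are again placeholders):

from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("social_network", username="root", password="password")

def execute_query(aql):
    # Run an AQL query and return the results as a plain list
    return list(db.aql.execute(aql))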

import igraph as ig

# Load vertices and edges from ArangoDB
vertex_data = execute_query("FOR v IN vertices RETURN v")
edge_data = execute_query("FOR e IN edges RETURN e")

# Create the igraph graph from the (_from, _to) pair of each edge document
g = ig.Graph.TupleList([(e["_from"], e["_to"]) for e in edge_data], directed=False)

# Perform community detection using the Louvain method
louvain_communities = g.community_multilevel()
print(f"Number of communities: {len(louvain_communities)}")
# Get the membership list
membership = louvain_communities.membership

Next, we count the number of members in each community and get the four largest communities by member count. We print out the ID and size of each.

# Count the number of members in each community
from collections import Counter
community_counts = Counter(membership)

# Get the 4 largest communities based on member count
top_communities = community_counts.most_common(4)

# Print the top 4 communities
for i, (community_id, count) in enumerate(top_communities, start=1):
    print(f"Community {i}: ID: {community_id}, Size: {count}")

After identifying the communities, we visualize the graph using the NetworkX package in Python. We first generate a list of colors equal to the number of communities and then plot the graph with each node colored by its community.
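
The visualization snippet below assumes a NetworkX graph called nx_graph whose nodes carry a dict-valued 'community' attribute, and the original post does not show how that graph is built. One possible bridge from the igraph results above, sketched here as an assumption (the community id is wrapped in a small dict so the checks below still apply):

import networkx as nx

# Rebuild the network as a NetworkX graph from the same edge documents
nx_graph = nx.Graph()
nx_graph.add_edges_from((e["_from"], e["_to"]) for e in edge_data)

# Attach each node's Louvain community id; igraph's TupleList keeps the
# original node ids in the "name" vertex attribute
for v in g.vs:
    nx_graph.nodes[v["name"]]["community"] = {"id": membership[v.index]}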

import networkx as nx
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming you already have a networkx graph called nx_graph with communities assigned as node attributes

# Collect the distinct communities present on the nodes
communities = set()
for node in nx_graph.nodes:
    node_attr = nx_graph.nodes[node]
    if 'community' in node_attr and isinstance(node_attr['community'], dict):
        communities.add(tuple(node_attr['community'].items()))

# Generate a list of colors equal to the number of communities
colors = sns.color_palette("hls", len(communities))

# Plot the graph with each node colored by its community;
# nodes without a community attribute are drawn in black
community_list = list(communities)
node_colors = [
    colors[community_list.index(tuple(nx_graph.nodes[node]['community'].items()))]
    if 'community' in nx_graph.nodes[node] and isinstance(nx_graph.nodes[node]['community'], dict)
    else 'black'
    for node in nx_graph.nodes
]

# You can change the layout algorithm here
pos = nx.spring_layout(nx_graph)
plt.figure(figsize=(30, 30))
nx.draw(nx_graph, pos, node_color=node_colors)
plt.show()
Communities Detected

Conclusion:

In this blog post, we preprocessed a large Twitter dataset and applied the k-means clustering algorithm to identify groups or communities within the data. Additionally, we explored how to use Python and NetworkX to analyze social networks. We loaded and preprocessed a Facebook friendship dataset and identified communities within the network using the Louvain method in Python.

Overall, Hadoop is an essential tool for big data processing and analysis, providing a scalable and cost-effective solution for managing large datasets. With its robust ecosystem of tools and frameworks, Hadoop can handle a variety of use cases, from batch processing to real-time data processing.
