What is Big Data?

Tanja Adžić
Mar 17, 2024


Introduction

In this post, I explore what Big Data really means and why it's so important. I'll cover the basics, like parallel processing and scaling, and showcase some of the tools and ecosystems that make it all possible. Plus, I'll cut through the buzzwords to see how Big Data is actually used in the real world, with a couple of example use cases.



Seriously, what is Big Data?

Bernard Marr, a renowned figure in analytics, KPIs, AI, and Big Data, provides a concise definition of Big Data as the digital footprint generated in our modern digital era.

To understand the concept fully, it’s essential to compare Big Data with Small Data. While Small Data is easily interpretable by humans and stored in structured formats, Big Data presents a stark contrast. It’s characterized by massive volumes, little to no structure, and continuous generation across various formats like text, images, audio, and videos.

Big Data Life Cycle

Big Data’s life cycle includes collection, storage, processing, analysis, and visualization. Initially spurred by business needs, data is collected and stored using frameworks like Hadoop HDFS. Then, through processes like MapReduce and scripting, it’s modeled and organized into databases for further analysis. Tools like Apache Spark are used to extract meaningful insights, which are then visualized to facilitate informed decision-making, thus perpetuating a continuous cycle of value creation.
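
To make the cycle concrete, here's a minimal PySpark sketch of the processing and analysis stages. The HDFS path, column names, and aggregation are hypothetical, a sketch of the flow rather than a recipe:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lifecycle-demo").getOrCreate()

# Collection and storage have already happened: raw events sit in HDFS.
events = spark.read.json("hdfs:///data/raw/events/")  # hypothetical path

# Processing: model raw records into an analysis-ready table.
daily_views = (
    events
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("day", "product")
    .agg(F.count("*").alias("views"))
)

# Analysis: extract an insight, e.g. the most viewed products,
# ready to hand off to a visualization tool.
daily_views.orderBy(F.desc("views")).show(10)
```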

Gartner defines Big Data as a high-volume, high-velocity, and high-variety information asset requiring innovative processing tools. To grasp the enormity of Big Data, consider the astronomical storage capacities measured in exabytes, zettabytes, and yottabytes.
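
For a sense of scale, these units follow the standard SI prefixes, each a factor of a thousand larger than the last. A quick back-of-the-envelope in Python:

```python
# Decimal (SI) storage units, each 1,000x the previous.
units = {
    "exabyte (EB)": 10**18,
    "zettabyte (ZB)": 10**21,
    "yottabyte (YB)": 10**24,
}
for name, size_bytes in units.items():
    print(f"1 {name} = {size_bytes:.0e} bytes "
          f"= {size_bytes // 10**9:,} gigabytes")
```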

The Four Vs

Traditionally, Big Data is characterized by the four Vs: Velocity, Volume, Variety, and Veracity.

  • Velocity signifies the speed of data generation and the need for real-time processing.
  • Volume denotes the sheer scale of data, driven by an increase in data sources and advanced hardware infrastructure.
  • Variety refers to the diversity of data formats, ranging from structured to unstructured.
  • Veracity addresses the accuracy and reliability of data, vital for making informed decisions.

Moreover, there’s a fifth V of Big Data: Value. It represents the ultimate goal of leveraging Big Data, which is to drive intelligent business decisions, optimize resource utilization, and uncover new opportunities for innovation and growth.

The Impact of Big Data

Big Data is a crucial aspect of the modern era, yet we all interact with it unknowingly, generating personal data through everyday activities like sharing photos, videos, and messages. This data forms a significant portion of what consumer goods companies collect. Platforms like Amazon, Netflix, and Spotify use Big Data algorithms to tailor recommendations based on user behavior and preferences, showcasing the widespread application of Big Data in creating better user experiences.

Also, virtual assistants such as Siri and Alexa leverage Big Data to provide intelligent responses to user queries, using advanced neural networks for speech recognition and complex task execution. External factors like location and calendar events further enrich the data collected, enabling services like Google Now to anticipate user needs and preferences, illustrating how Big Data is utilized to forecast future behavior and requirements.

Data is the New Oil

Big Data fundamentally alters the landscape of business operations. Often referred to as the “New Oil,” data fuels business decisions through complex machine learning algorithms, driving effectiveness and competitiveness. The demand for data scientists and Big Data engineers continues to rise, underscoring the enduring value of Big Data skills in the business world.

In the world of the Internet of Things (IoT), connected devices generate massive volumes of data, which are then analyzed to derive insights for enhancing productivity and improving user experiences. The proliferation of IoT devices is expected to skyrocket in the coming years, further highlighting the critical role of Big Data analytics in unlocking the potential of IoT ecosystems.

Parallel Processing, Scaling, and Data Parallelism

Data at any scale opens up possibilities, but size changes the mechanics. In a standard analytics cycle, a single computer stores, computes on, and retrieves data; Big Data, however, exceeds what a single computer can handle. Parallel processing, unlike linear processing, breaks a task down into instructions that are distributed across multiple nodes and executed simultaneously.

Linear Processing vs Parallel Processing

Linear processing executes instructions sequentially, which is fine for small tasks but inefficient for Big Data. In contrast, parallel processing distributes instructions across multiple nodes so they run at the same time, cutting processing time and per-node memory requirements, and offering flexibility: execution nodes can be added or removed as needed.
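
Here's a toy illustration of the difference in Python, using one machine's CPU cores as stand-ins for cluster nodes; the workload itself is synthetic:

```python
from multiprocessing import Pool
import time

def work(n: int) -> int:
    # Stand-in for a CPU-heavy instruction.
    return sum(i * i for i in range(n))

tasks = [2_000_000] * 8

if __name__ == "__main__":
    start = time.perf_counter()
    linear = [work(t) for t in tasks]   # one instruction at a time
    print(f"linear:   {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    with Pool() as pool:                # fan out across CPU cores
        parallel = pool.map(work, tasks)
    print(f"parallel: {time.perf_counter() - start:.2f}s")
```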

Data Scaling

Data scaling is how a system copes when data outgrows a single machine, and the preferred approach is horizontal scaling: adding nodes to increase capacity. Computing clusters excel at “embarrassingly parallel” calculations, where each node works independently; if one process fails, it can be re-run without affecting the others, which enhances fault tolerance.

Complexity arises when computations need to coordinate with one another, which requires network communication or a shared file system. Fortunately, most enterprise calculations are “embarrassingly parallel,” and that fact has shaped framework design: in the Hadoop ecosystem, a compute-to-data design moves the computation to the nodes where the data already lives, reducing network traffic and improving fault tolerance.
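
The pattern is easy to sketch with Python's standard library: each chunk is processed independently, so a failed chunk can simply be retried without touching the others. The chunking scheme and per-chunk task here are purely illustrative:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_chunk(chunk):
    # Stand-in for any independent per-chunk computation.
    return sum(chunk)

data = list(range(1_000))
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]

if __name__ == "__main__":
    results = []
    with ProcessPoolExecutor() as pool:
        futures = {pool.submit(process_chunk, c): c for c in chunks}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception:
                # Fault tolerance: only the failed chunk is re-run;
                # no other computation is affected.
                results.append(process_chunk(futures[future]))
    print(sum(results))  # 499500
```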

Fault Tolerance

Fault tolerance means the system keeps operating even when parts of it fail. Storage systems like HDFS achieve this largely by replicating data across multiple nodes: maintaining the replicas is complex, but the resulting frameworks are remarkably robust, with reliability of up to 99.999%.

Big Data Tools and Ecosystem

Big Data tools fall into six distinct categories: Data Technologies, Analytics and Visualization, Business Intelligence, Cloud Service Providers, NoSQL Databases, and Programming Tools.

Data Technologies: These tools analyze, process, and extract insights from Big Data, beyond traditional processing capabilities. Open-source projects like Apache Hadoop, Apache HDFS, and Apache Spark dominate this field, with commercial vendors like Cloudera and Databricks offering support.

Analytics and Visualization: Analytical tools uncover patterns and trends within large datasets, while visualization tools present these findings graphically. Leading examples include Tableau, Palantir, SAS, Pentaho, and Teradata.

Business Intelligence: BI tools transform raw data into actionable insights, leveraging mathematical concepts like probability and statistics. Cognos, Oracle, PowerBI, Business Objects, and Hyperion are notable examples.

Cloud Service Providers: Cloud providers offer infrastructure and software services for storing, processing, and visualizing data. AWS, IBM, GCP, and Oracle are prominent players in this space.

NoSQL Databases: These databases are tailored for Big Data processing, storing data in flexible formats, such as documents, key-value pairs, or wide columns, rather than relational tables. MongoDB, CouchDB, Cassandra, and Redis are popular NoSQL options.
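
For a taste of the document model, here's a small sketch using MongoDB's Python driver, assuming a local instance on the default port; the database and field names are made up:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["shop"]  # hypothetical database

# Documents need no predefined schema: fields can differ per record.
db.orders.insert_one({"user": "ana", "items": ["book", "pen"], "total": 12.5})
db.orders.insert_one({"user": "ben", "total": 3.0, "coupon": "WELCOME"})

# Query with a filter document instead of SQL.
for order in db.orders.find({"total": {"$gt": 5}}):
    print(order)
```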

Programming Tools: Programming languages like R, Python, SQL, Scala, and Julia facilitate large-scale data analysis and operationalization.

Open Source and Big Data

Open source software, often referred to as OSS, is not only free to use but also grants access to its complete source code, enabling users to view, modify, and redistribute it as needed. Beyond the license, truly open projects adopt an open-governance model, allowing contributions from any organization to steer the project toward community needs. Note that not all open source software operates under the same license, so it's important to check the specific license type before use.

The open source model is widely adopted for Big Data due to its collaborative nature. Projects like the Linux Kernel illustrate how open source efforts persist beyond individual organizations, laying the groundwork for modern Big Data infrastructure. This transparent development model mirrors democracy in government, serving the will of the participating community and often proving more profitable in the long run.

Major open source projects typically have formal contribution processes, distinguishing between committers, contributors, user groups, and users. While committers have direct code-modification access, contributors submit code for review. Open source foundations establish best practices for development and governance, ensuring transparency and democracy within projects.

Big Data Projects

In the world of Big Data, the Hadoop ecosystem reigns supreme, with components like MapReduce, the Hadoop Distributed File System (HDFS), and Yet Another Resource Negotiator (YARN) forming its backbone. While HDFS remains a cornerstone for storing and managing large datasets, modern alternatives like Amazon S3 and other object stores are gaining ground. YARN serves as the default resource manager for many Big Data applications, although container-based alternatives like Kubernetes are gaining traction.
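
To see the MapReduce idea itself, here's a self-contained word count in the spirit of Hadoop Streaming, where the mapper and reducer are plain scripts reading standard input. In a real cluster the two phases run as separate processes on different nodes; here they're chained locally for illustration:

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word seen.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop sorts mapper output by key between the phases;
    # the reducer then sums the counts for each word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Try: echo "big data big ideas" | python wordcount.py
    for word, total in reducer(mapper(sys.stdin)):
        print(word, total)
```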

The Hadoop ecosystem powers a vast array of Big Data applications, from ETL tasks to computation. Systems like Apache HBase integrate closely with Hadoop, offering extensive NoSQL data storage capabilities. Platforms like the Hortonworks Data Platform (HDP) streamline Big Data toolsets, providing pre-configured environments with essential open source packages.

Beyond the Hype

The surge in Big Data discussions stems from the exponential growth in data creation, with more data generated in the last two years than in all of prior human history. Projections show a roughly 40% annual increase in global data, with the total volume, measured in zettabytes, expected to double by 2026.

Where does it come from?

Big Data originates from three main sources: social data, machine-generated data, and transactional data. Social data consists of content shared on social media platforms, while machine-generated data comes from IoT sensors and web logs. Transactional data includes daily transactions both online and offline, such as invoices and delivery receipts.

Types of Big Data

Big Data is classified into structured, semi-structured, and unstructured data.

  • Structured data follows a tabular format with predefined models, and often originates from relational databases and spreadsheets, facilitating easy querying with SQL.
  • Unstructured data lacks a predetermined format and includes images, videos, and text messages.
  • Semi-structured data combines aspects of both, containing some metadata alongside unstructured content. Semi-structured sources, like XML and JSON files, use tags to organize records hierarchically (see the sketch below).

With the proliferation of Internet activity, particularly video production and social media, data volumes continue to escalate.
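
To make the semi-structured category concrete, here's a small Python sketch with a made-up JSON record: the field names travel with the data, and records can nest and vary:

```python
import json

# A made-up social-media post record; the keys act as tags.
record = json.loads("""
{
  "user": "tanja",
  "posted": "2024-03-17",
  "media": [
    {"type": "image", "size_kb": 512},
    {"type": "video", "length_s": 34}
  ]
}
""")

# No fixed relational schema is required:
# the two media items carry different fields.
for item in record["media"]:
    print(item["type"], item)
```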

Cloud computing has played a pivotal role in the Big Data era by providing scalable computing and storage resources via the Internet. Companies leverage cloud computing to access server capacity as needed, allowing rapid scalability for processing large datasets and complex models. Additionally, cloud computing reduces the cost of analyzing Big Data by sharing resources across users, who only pay for the capacity they utilize.

Big Data Use Cases

Unprecedented data generation offers competitive advantages to data-driven companies: effective aggregation and analysis lead to breakthrough insights. Industries across the board use data insights to sharpen decision-making, expand into new markets, and improve customer experiences.

Predominantly, financial services, technology, and telecommunications sectors lead in Big Data usage, driving innovation. Retail, government, healthcare, advertising, entertainment, and gaming industries follow suit, alongside data services, energy, utilities, system integration consulting, shipping, and transportation.

Industry applications of Big Data analytics vary. Retailers employ price analytics and sentiment analysis for market segmentation and effective marketing strategies. Insurance companies utilize Big Data for fraud detection and risk assessments, while telecoms leverage it for network security and targeted promotions.

In manufacturing, predictive maintenance optimizes equipment usage patterns and production lines, while automotive industries rely on Big Data for predictive maintenance, supply chain analytics, and real-time adjustments in self-driving vehicles.

Similarly, financial companies use Big Data for fraud detection, risk assessment, customer segmentation, and algorithmic trading, leveraging machine learning for quicker and continuously improving decisions.

DISCLAIMER

The notes and information presented in this blog post were compiled during the course “Introduction to Big Data with Spark and Hadoop” and are intended to provide an educational overview of the subject matter for personal use.
