Siddharth Chandra · codekaro.hashnode.net · Dec 2, 2021 · Big Data Open Source Frameworks · Big Data is a term used to describe large-scale data sets that are too complex to be manipulated with a basic DBMS. Handling Big Data requires sophisticated hardware and software technologies. Just as open-source has been the primary reason for the Big D... · 77 likes · 369 reads · Scala, big data
mani nekkalapudi · maninekkalapudi.hashnode.net · Sep 14, 2020 · Apache Spark Of House Big Data · Hello! It's been 6 months since I started my new role as a Data Engineer and I'm excited to work with the best tools to build ETL pipelines at work. One of them is Apache Spark. I'll share my knowledge of Spark, and in this post we'll discuss The Ori... · 40 likes · 760 reads · spark
Seema Saharan · seemasaharan.hashnode.net · Jan 1, 2021 · Make your first Instagram filter | #christmashackathon · Final Product. Introduction: We all use social media like Instagram, Snapchat, and we must have used many filters too. And if you're in the tech industry, then you must have thought of making a filter by yourself, and you will be wondering how can you... · 30 likes · 518 reads · AR
Antony Prince J · antoprince001.hashnode.net · Apr 19, 2023 · Test Driven Development in PySpark · Test Driven Development is a software development practice in which a failing test is written first and the code is then written to make it pass. It enables the development of automated tests before actual development of the application. Py... · PySpark
Renjitha K · renjithak.hashnode.net · Apr 7, 2023 · Setting up Apache Spark · In this blog, I will be focusing on setting up the workspace for Windows so that we can get started with Apache Spark and do some hands-on work in my upcoming Apache Kafka series. If you haven't taken a look at it and wish to, here is the link https://... · 1 like · 73 reads · spark
padmanabha reddy · padmanabha.hashnode.net · Mar 31, 2023 · Apache Spark - Core · Apache Spark is a general-purpose, in-memory compute engine. It is a plug-and-play compute engine: we can plug Spark into any storage system (S3, local storage, HDFS, etc.) and any resource manager (YARN, Kubernetes, Mesos, etc.). Spark on top of Hadoo... · data-engineering
Renjitha K · renjithak.hashnode.net · Mar 30, 2023 · Demystifying Big Data Analytics with Apache Spark : Part-1 · Posted by Renjitha K in Renjitha K's Blog on Mar 25, 2023 2:27:13 PM. As the amount of data generated by individuals and businesses continues to grow exponentially, the need for technologies like Apache Spark that can process and analyze large dataset... · 2 likes · 101 reads · spark
Stephen David-Williams · spiritman7.hashnode.net · Mar 19, 2023 · Process JSON data using Spark in Databricks · Preface: One of the most popular file formats for flat files in data engineering is the JSON (JavaScript Object Notation) format. A typical JSON file is made up of key-value pairs, where the key is the name of a value or object and the value is the ob... · 55 reads · json
Stephen David-Williams · spiritman7.hashnode.net · Mar 10, 2023 · Writing Data Quality Tests in Databricks using Pytest · Disclaimer: This post assumes you have a fundamental knowledge of PySpark (the Python API for using Spark), but if you’re comfortable with the Pytest framework with vanilla Python, then you should be comfortable navigating through this too. I will inv... · 107 reads · pytest
Vijay Urade · shuffleandsort.hashnode.net · Feb 24, 2023 · Dataset repartition in Apache Spark · As we know, Apache Spark is one of the fastest big data computational frameworks, and it gives the best performance when the data is distributed evenly across nodes or executors. But we cannot guarantee the partitions in intermediate stages of applicat... · 1 like · 36 reads · spark
George Githiri · ggithiri.hashnode.net · Jan 31, 2023 · Spark Streaming · Spark Streaming: Processing Big Data in Real-Time. Big data processing has become an essential aspect of modern data management and analysis. With the growth of connected devices and the Internet of Things (IoT), organizations are faced with the chall... · 46 reads · spark
Yash Srivastava · yash722.hashnode.net · Jan 24, 2023 · Spark Stages, Tasks, and Jobs · There are mainly 3 components in the Spark UI. Jobs: A Spark application can have multiple jobs based on the number of actions (#jobs = #actions) in the application. These jobs can have a common RDD somewhere in the execution map or they can be separate al... · spark
Sandip Nikale · sandiip.hashnode.net · Jan 23, 2022 · How to create a custom parcels in cloudera cluster. · Many of you might have come across a scenario where you wanted to install a package with a consistent version across all nodes in a Cloudera cluster. For example, if you want a higher Python version on all nodes in the cluster, what would you do... It's... · 41 reads · Cloudera