GeorgiaFreelance Big Data Developer at Toptal Since September 4, 2019
George is a seasoned systems engineer with great breadth and depth knowledge of building and automating complex systems. As an early adopter of cloud technology, he led a team to design and build an on-premise cloud. His 12 years of teaching developed the skill of coaching and communicating complex concepts. George is fluent in C, Go, and Python languages, with a keen interest in data science and AI, focused on delivering the highest quality results. He is eager to work with complex problems.
United StatesFreelance Big Data Developer at Toptal Since October 17, 2016
James is a results-driven, can-do, and entrepreneurial engineer with eight years of C-level experience (15+ years of professional engineering)—consistently delivering successful bleeding-edge products to support business goals. He's an architect in innovative tech initiatives that add to and accelerate business revenue streams. He's also the CTO and lead developer of Toon Goggles—an SVOD/AVOD kids' entertainment service with 8 million users.
GermanyFreelance Big Data Developer at Toptal Since March 6, 2018
Reza has worked as a CTO for various startups and a German government institution with a one billion revenue that needed restructuring. Ever since he dismantled his first VCR over 30 years ago, Reza has been passionate about building meaningful hardware and software that people use and love. His obsession has grown to include a love for solving complex problems across a broad spectrum of technologies.
AustraliaFreelance Big Data Developer at Toptal Since June 18, 2020
As a highly effective technical leader with over 20 years of experience, Andrew specializes in data: integration, conversion, engineering, analytics, visualization, science, ETL, big data architecture, analytics platforms, and cloud architecture. He has an array of skills in building data platforms, analytic consulting, trend monitoring, data modeling, data governance, and machine learning.
BrazilFreelance Big Data Developer at Toptal Since September 27, 2018
For over the past decade, Bruno's been working with databases in various fields. He also has an Oracle SQL Expert certification and specializes in optimizing SQL queries and PLSQL procedures, but he’s also developed with PostgreSQL and MySQL. Bruno likes to keep himself up to date, and that's why he’s undertaking a Ph.D. degree in computer science.
NetherlandsFreelance Big Data Developer at Toptal Since September 8, 2014
Pieter has 39 years of programming experience, including time spent as a software product manager. He is a challenger, an independent worker, and a team player as circumstances demand, and boasts expertise and skill in a range of topics including big data, cryptography, and machine learning.
United StatesFreelance Big Data Developer at Toptal Since June 23, 2014
With almost 20 years working as an engineer, architect, director, vice president, and CTO, Bryce brings a deep understanding of enterprise software, management, and technical strategy to any project. His specialties include Amazon Web Services, real-time systems, business intelligence, big data, enterprise web apps, scalability, education, and open-source software.
United StatesFreelance Big Data Developer at Toptal Since October 8, 2013
Brandon has 13+ years identifying business objectives and defining technical strategies and processes to achieve them. He demonstrates an extraordinary aptitude for leveraging technology to efficiently and concisely solve complex problems. His expertise includes thought leadership, technical strategy, enterprise architecture, cloud computing, and big data.
Big Data is an extremely broad domain, typically addressed by a hybrid team of data scientists, software engineers, and statisticians. Real expertise in big data therefore requires far more than learning the ins and outs of a particular technology. This guide offers a sampling of effective questions to help evaluate the breadth and depth of a candidate's mastery of this complex domain.
... allows corporations to quickly assemble teams that have the right skills for specific projects.
Despite accelerating demand for coders, Toptal prides itself on almost Ivy League-level vetting.
Building a cross-platform app to be used worldwide
Tripcents wouldn't exist without Toptal. Toptal Projects enabled us to rapidly develop our foundation with a product manager, lead developer, and senior designer. In just over 60 days we went from concept to Alpha. The speed, knowledge, expertise, and flexibility is second to none. The Toptal team were as part of tripcents as any in-house team member of tripcents. They contributed and took ownership of the development just like everyone else. We will continue to use Toptal. As a start up, they are our secret weapon.
Brantley Pace, CEO & Co-Founder
I am more than pleased with our experience with Toptal. The professional I got to work with was on the phone with me within a couple of hours. I knew after discussing my project with him that he was the candidate I wanted. I hired him immediately and he wasted no time in getting to my project, even going the extra mile by adding some great design elements that enhanced our overall look.
Paul Fenley, Director
K Dunn & Associates
The developers I was paired with were incredible -- smart, driven, and responsive. It used to be hard to find quality engineers and consultants. Now it isn't.
Ryan Rockefeller, CEO
Toptal understood our project needs immediately. We were matched with an exceptional freelancer from Argentina who, from Day 1, immersed himself in our industry, blended seamlessly with our team, understood our vision, and produced top-notch results. Toptal makes connecting with superior developers and programmers very easy.
Jason Kulik, Co-Founder
As a small company with limited resources we can't afford to make expensive mistakes. Toptal provided us with an experienced programmer who was able to hit the ground running and begin contributing immediately. It has been a great experience and one we'd repeat again in a heartbeat.
Stuart Pocknee , Principal
Site Specific Software Solutions
We used Toptal to hire a developer with extensive Amazon Web Services experience. We interviewed four candidates, one of which turned out to be a great fit for our requirements. The process was quick and effective.
Abner Guzmán Rivera, CTO and Chief Scientist
Sergio was an awesome developer to work with. Top notch, responsive, and got the work done efficiently.
Dennis Baldwin, Chief Technologist and Co-Founder
Working with Marcin is a joy. He is competent, professional, flexible, and extremely quick to understand what is required and how to implement it.
André Fischer, CTO
We needed a expert engineer who could start on our project immediately. Simanas exceeded our expectations with his work. Not having to interview and chase down an expert developer was an excellent time-saver and made everyone feel more comfortable with our choice to switch platforms to utilize a more robust language. Toptal made the process easy and convenient. Toptal is now the first place we look for expert-level help.
Derek Minor, Senior VP of Web Development
Networld Media Group
Toptal's developers and architects have been both very professional and easy to work with. The solution they produced was fairly priced and top quality, reducing our time to launch. Thanks again, Toptal.
Jeremy Wessels, CEO
We had a great experience with Toptal. They paired us with the perfect developer for our application and made the process very easy. It was also easy to extend beyond the initial time frame, and we were able to keep the same contractor throughout our project. We definitely recommend Toptal for finding high quality talent quickly and seamlessly.
Ryan Morrissey, CTO
Applied Business Technologies, LLC
I'm incredibly impressed with Toptal. Our developer communicates with me every day, and is a very powerful coder. He's a true professional and his work is just excellent. 5 stars for Toptal.
Pietro Casoar, CEO
Ronin Play Pty Ltd
Working with Toptal has been a great experience. Prior to using them, I had spent quite some time interviewing other freelancers and wasn't finding what I needed. After engaging with Toptal, they matched me up with the perfect developer in a matter of days. The developer I'm working with not only delivers quality code, but he also makes suggestions on things that I hadn't thought of. It's clear to me that Amaury knows what he is doing. Highly recommended!
George Cheng, CEO
As a Toptal qualified front-end developer, I also run my own consulting practice. When clients come to me for help filling key roles on their team, Toptal is the only place I feel comfortable recommending. Toptal's entire candidate pool is the best of the best. Toptal is the best value for money I've found in nearly half a decade of professional online work.
Ethan Brooks, CTO
Langlotz Patent & Trademark Works, Inc.
In Higgle's early days, we needed the best-in-class developers, at affordable rates, in a timely fashion. Toptal delivered!
Lara Aldag, CEO
Toptal makes finding a candidate extremely easy and gives you peace-of-mind that they have the skills to deliver. I would definitely recommend their services to anyone looking for highly-skilled developers.
Michael Gluckman, Data Manager
Toptal’s ability to rapidly match our project with the best developers was just superb. The developers have become part of our team, and I’m amazed at the level of professional commitment each of them has demonstrated. For those looking to work remotely with the best engineers, look no further than Toptal.
Laurent Alis, Founder
Toptal makes finding qualified engineers a breeze. We needed an experienced ASP.NET MVC architect to guide the development of our start-up app, and Toptal had three great candidates for us in less than a week. After making our selection, the engineer was online immediately and hit the ground running. It was so much faster and easier than having to discover and vet candidates ourselves.
Jeff Kelly, Co-Founder
We needed some short-term work in Scala, and Toptal found us a great developer within 24 hours. This simply would not have been possible via any other platform.
Franco Arda, Co-Founder
Toptal offers a no-compromise solution to businesses undergoing rapid development and scale. Every engineer we've contracted through Toptal has quickly integrated into our team and held their work to the highest standard of quality while maintaining blazing development speed.
Greg Kimball, Co-Founder
How to Hire Big Data Architects through Toptal
Talk to One of Our Industry Experts
A Toptal director of engineering will work with you to understand your goals, technical needs, and team dynamics.
Work With Hand-Selected Talent
Within days, we'll introduce you to the right big data architect for your project. Average time to match is under 24 hours.
The Right Fit, Guaranteed
Work with your new big data architect for a trial period (pay only if satisfied), ensuring they're the right fit before starting the engagement.
Find Experts With Related Skills
Access a vast pool of skilled developers in our talent network and hire the top 3% within just 48 hours.
At Toptal, we thoroughly screen our big data architects to ensure we only match you with talent of the highest caliber. Of the more than 200,000 people who apply to join the Toptal network each year, fewer than 3% make the cut. You'll work with engineering experts (never generalized recruiters or HR reps) to understand your goals, technical needs, and team dynamics. The end result: expert vetted talent from our network, custom matched to fit your business needs. Start now.
Can I hire big data architects in less than 48 hours through Toptal?
Depending on availability and how fast you can progress, you could start working with a big data architect within 48 hours of signing up. Start now.
What is the no-risk trial period for Toptal big data architects?
We make sure that each engagement between you and your big data architect begins with a trial period of up to two weeks. This means that you have time to confirm the engagement will be successful. If you're completely satisfied with the results, we'll bill you for the time and continue the engagement for as long as you'd like. If you're not completely satisfied, you won't be billed. From there, we can either part ways, or we can provide you with another expert who may be a better fit and with whom we will begin a second, no-risk trial. Start now.
How to Hire an Excellent Big Data Architect
Big Data is an extremely broad domain, typically addressed by a hybrid team of data scientists, data analysts, software engineers, and statisticians. Finding a single individual knowledgeable in the entire breadth of this domain versus say a Microsoft Azure expert is therefore extremely unlikely and rare. Rather, one will most likely be searching for multiple individuals with specific sub-areas of expertise. This guide is therefore divided at a high level into two sections:
This guide highlights questions related to key concepts, paradigms, and technologies in which a big data expert can be expected to have proficiency. Bear in mind, though, that not every “A” candidate will be able to answer them all, nor does answering them all guarantee an “A” candidate. Ultimately, effective interviewing and hiring is as much of an art as it is a science.
Big Data Algorithms, Techniques, and Approaches
When it comes to big data and data management, fundamental knowledge of and work experience with relevant algorithms, techniques, big data platforms, and approaches is essential. Generally speaking, mastering these areas requires more time and skill than becoming an expert with a specific set of software languages or tools. As such, software engineers who do have expertise in these areas are both hard to find and extremely valuable to your team. The questions that follow can be helpful in gauging such expertise.
Q: Given a stream of data of unknown length, and a requirement to create a sample of a fixed size, how might you perform a simple random sample across the entire dataset? (i.e., given N elements in a data stream, how can you produce a sample of k elements, where N > k, whereby every element has a 1/N chance of being included in the sample?
Clusters are represented by a central vector, which is not necessarily a member of the set.
When the number of clusters is fixed to K, K-means clustering gives a formal definition as an optimization problem: find the K cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.
With large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small)
K-Means may produce tighter clusters than hierarchical clustering, especially if clusters are globular
Requires number of clusters (K) to be specified in advance
Prefers clusters of approximately similar size, which often leads to incorrectly set borders between clusters
Clusters are defined as areas of higher density than the remainder of the dataset. Objects in sparse areas are usually considered to be noise and/or border points.
Connects points based on distance thresholds (similar to linkage-based clustering), but only connects those that satisfy a specified density criterion.
Most popular density-based clustering method is DBSCAN, which features a well-defined cluster model called "density-reachability".
Doesn't require specifying number of clusters a priori
Can find arbitrarily-shaped clusters; can even find a cluster completely surrounded by (but not connected to) a different cluster
Mostly insensitive to the ordering of the points in the database
Expects a density "drop" or "cliff" to detect cluster borders
DBSCAN is unable to detect intrinsic cluster structures that are prevalent in much of real-world data
On datasets consisting of mixtures of Gaussians, almost always outperformed by methods such as EM clustering that are able to precisely model such data
Q: Define and discuss ACID, BASE, and the CAP theorem.
ACID refers to the following set of properties that collectively guarantee reliable processing of database transactions, with the goal being “immediate consistency”:
Atomicity. Requires each transaction to be “all or nothing”; i.e., if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged.
Consistency. Requires every transaction to bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including (but not limited to) constraints, cascades, triggers, and any combination thereof.
Isolation. Requires concurrent execution of transactions to yield a system state identical to that which would be obtained if those same transactions were executed sequentially.
Durability. Requires that, once a transaction has been committed, it will remain so even in the event of power loss, crashes, or errors.
Many databases rely upon locking to provide ACID capabilities. Locking means that the transaction marks the data that it accesses so that the DBMS knows not to allow other transactions to modify it until the first transaction succeeds or fails. An alternative to locking is multiversion concurrency control in which the database provides each reading transaction the prior, unmodified version of data that is being modified by another active transaction.
Guaranteeing ACID properties in a distributed transaction across a distributed database where no single node is responsible for all data affecting a transaction presents additional complications. Network connections might fail, or one node might successfully complete its part of the transaction and then be required to roll back its changes, because of a failure on another node. The two-phase commit protocol (not to be confused with two-phase locking) provides atomicity for distributed transactions to ensure that each participant in the transaction agrees on whether the transaction should be committed or not.
BASE was developed as an alternative for producing more scalable and affordable data architectures. Allowing less constantly updated data gives developers the freedom to build other efficiencies into the overall system. In BASE, engineers embrace the idea that data has the flexibility to be “eventually” updated, resolved or made consistent, rather than instantly resolved.
The Eventual Consistency model employed by BASE informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence. Eventual consistency is sometimes criticized as increasing the complexity of distributed software applications.
In nearly all models, elements like consistency and availability often are viewed as resource competitors, where adjusting one can impact another. Accordingly, the CAP theorem (a.k.a. Brewer’s theorem) states that it’s impossible for a distributed computer system to provide more than two of the following three guarantees concurrently:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
In the decade since its introduction, designers and researchers have used the CAP theorem as a reason to explore a wide variety of novel distributed systems. The CAP Theorem has therefore certainly proven useful, fostering much discussion, debate, and creative approaches to addressing tradeoffs, some of which have even yielded new systems and technologies.
Yet at the same time, the “2 out of 3” constraint does somewhat oversimplify the tensions between the three properties. By explicitly handling partitions, for example, designers can optimize consistency and availability, thereby achieving some trade-off of all three. Although designers do need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them.
Aspects of the CAP theorem are often misunderstood, particularly the scope of availability and consistency, which can lead to undesirable results. If users cannot reach the service at all, there is no choice between C and A except when part of the service runs on the client. This exception, commonly known as disconnected operation or offline mode, is becoming increasingly important. Some HTML5 feature make disconnected operation easier going forward. These systems normally choose A over C and thus must recover from long partitions.
Q: What is dimensionality reduction and how is it relevant to processing big data? Name some techniques commonly employed for dimensionality reduction.
Dimensionality reduction is the process of converting data of very high dimensionality into data of lower dimensionality, typically for purposes such as visualization (i.e, projection onto a 2D or 3D space for visualization purposes), compression (for efficient storage and retrieval), or noise removal.
Some of the more common techniques for dimensionality reduction include:
Note: Each of the techniques listed above is itself a complex topic, so each is provided as a hyperlink to further information for those interested in learning more.
Q: Discuss some common statistical sampling techniques, including their strengths and weaknesses.
When analyzing big data, processing the entire dataset would often be operationally untenable. Exploring a representative sample is easier, more efficient, and can in many cases be nearly as accurate as exploring the entire dataset. The table below describes some of the statistical sampling techniques that are more commonly used with big data.
COMMON SAMPLING TECHNIQUES
Simple random sampling
Every element has the same chance of selection (as does any pair of elements, triple of elements, etc.)
Simplifies analysis of results
Vulnerable to sampling error
Orders data and selects elements at regular intervals through the ordered dataset
Easy to implement
Can be efficient (depending on ordering scheme)
Vulnerable to periodicities in the ordered data
Theoretical properties make it difficult to quantify accuracy
Divides data into separate strata (i.e., categories) and then samples each stratum separately
Able to draw inferences about specific subgroups
Focuses on important subgroups; ignores irrelevant ones
Improves accuracy/efficiency of estimation
Different sampling techniques can be applied to different subgroups
Can increase complexity of sample selection
Selection of stratification variables can be difficult
Not useful when there are no homogeneous subgroups
Can sometimes require a larger sample than other methods
Q: Explain the term “cache oblivious”. Discuss some of its advantages and disadvantages in the context of processing big data.
A cache oblivious (a.k.a., cache transcendent) algorithm is designed to take advantage of a CPU cache without knowing its size. Its goal is to perform well – without modification or tuning – on machines with different cache sizes, or for a memory hierarchy whose levels are of different cache sizes.
Typically, a cache oblivious algorithm employs a recursive divide and conquer approach, whereby the problem is divided into smaller and smaller sub-problems, until a sub-problem size is reached that fits into the available cache. For example, an optimal cache oblivious matrix multiplication is obtained by recursively dividing each matrix into four sub-matrices to be multiplied.
In tuning for a specific machine, one may use a hybrid algorithm which uses blocking tuned for the specific cache sizes at the bottom level, but otherwise uses the cache-oblivious algorithm.
The ability to perform well, independent of cache size and without cache-size-specific tuning, is the primary advantage of the cache oblivious approach. However, it is important to acknowledge that this lack of any cache-size-specific tuning also means that a cache oblivious algorithm may not perform as well as a cache-aware algorithm (i.e., an algorithm tuned to a specific cache size). Another disadvantage of the cache oblivious approach is that it typically increases the memory footprint of the data structure, which may further degrade performance. Interestingly though, in practice, performance of cache oblivious algorithms are often surprisingly comparable to that of cache aware algorithms, making them that much more interesting and relevant for big data processing.
Big Data Technologies
These days, it’s not only about finding a single tool to get the job done; rather, it’s about building a scalable architecture to effectively collect, process, and query enormous volumes of data. Armed with a strong foundational knowledge of big data algorithms, techniques, and approaches, a big data expert will be able to employ tools from a growing landscape of technologies that can be used to exploit big data to extract actionable information. The questions that follow can help evaluate this dimension of a candidate’s expertise.
Q: What is Hadoop? What are its key features? What modules does it consist of?
Apache Hadoop is a software framework that allows for the distributed processing of large datasets across clusters of computers. It is designed to effectively scale from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer.
The project includes these modules:
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A system for parallel processing of large datasets.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop Common: The common utilities that support the other Hadoop modules.
Q: Provide an overview of HDFS, including a description of an HDFS cluster and its components.
HDFS is a highly fault-tolerant distributed filesystem, commonly used as a source and output of data in Hadoop jobs. It builds on a simple coherence model of write-once-read-many (with append possible) access. It supports a traditional hierarchical organization of directories and files.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
An HDFS cluster also contains what is referred to as a Secondary NameNode that periodically downloads the current NameNode image, edits log files, joins them into a new image. and uploads the new image back to the NameNode. (Despite its name, though, the Secondary NameNode does not serve as a backup to the primary NameNode in case of failure.)
It’s important to note that HDFS is meant to handle large files, with the default block size being 128 MB. This means that, for example, a 1 GB file will be split into just 8 blocks. On the other hand, a file that’s just 1 KB will still take a full 128 MB block. Selecting a block size that will optimize performance for a cluster can be a challenge and there is no “one-size-fits-all” answer in this regard. Setting block size to too small value might increase network traffic and put huge overhead on the NameNode, which processes each request and locates each block. Setting it to too large a value, on the other hand, might result in a great deal of wasted space. Since the focus of HDFS is on large files, one strategy could be to combine small files into larger ones, if possible.
Q: Provide an overview of MapReduce, including a discussion of its key components, features, and benefits.
MapReduce is a programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary (i.e., data reduction) operation. A MapReduce system orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
MapReduce processes parallelizable problems across huge datasets using a large number of nodes in a cluster or grid as follows:
Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
Reduce step: The master node collects the answers from its worker nodes and combines them to form its output (i.e., the answer to the problem it was tasked with solving).
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel (though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source). Similarly, a set of ‘reducers’ can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than high performance servers can usually handle (e.g., a large server farm of “commodity” machines can use MapReduce to sort a petabyte of data in only a few hours).
The parallelism in MapReduce also helps facilitate recovery from a partial failure of servers or storage during the operation (i.e., if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available). This is particularly important, since the failure of single nodes are fairly common in multi-machine clusters. When there are thousands of disks spinning for a couple of hours, it’s not unlikely that one or two of them will fail at some point.
The key contributions of the MapReduce framework are not the map and reduce functions per se, but the scalability and fault-tolerance achieved for a variety of applications. It should be noted that a single-threaded implementation of MapReduce will usually not be faster than a traditional implementation. Its benefits are typically only realized when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerant features of the framework come into play.
As a side note, the name MapReduce originally referred to a proprietary Google technology but has since been genericized.
Q: Describe the MapReduce process in detail.
While not all big data software engineers will be familiar with the the internals of the MapReduce process, those who are will be that much better equipped to leverage its capabilities and to thereby architect and design an optimal solution.
MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation is Apache Hadoop. We therefore provide a Hadoop-centric description of the MapReduce process, described here both for Apache Hadoop MapReduce 1 (MR1) as well as for Apache Hadoop MapReduce 2 (MR2, a.k.a. YARN). For simplicity, we use an example that takes input from an HDFS and stores it back to an HDFS. Such a process would operate as follows:
The local Job Client prepares the job for the submission and sends it to Resource Manager (MR2) or Job Tracker (MR1).
Resource Manager (MR2) / Job Tracker (MR1) schedules the job across the cluster. In the case of MR2, the Application Master is started, which will take care of managing the job and terminating it after of completion. The job is sent to Node Managers (MR2) or Task Trackers (MR1). An InputSplitter is used to distribute the map tasks across the cluster in the most efficient way.
Each of the Note Managers (MR2) / Task Trackers (MR1) spawns a map task, sending progress updates to an Application Master (MR2) / Job Tracker (MR1). The output is partitioned, according to the reducers it needs to go to. An optional Combiner might be used as well to join values for the same local key.
If a map fails, it is retried (up to a defined maximum number of retries).
After a map succeeds, the local partitions with the output can be copied to respective reduce nodes (depending on the partition).
After a reduce task receives all relevant files, it starts a sort phase, in which it merges the map outputs, maintaining the natural order of keys.
For each of the keys in the sorted output, a reduce function is invoked and the output is stored to the desired sink (in our case, HDFS). Since the node where it’s being executed is usually also a DataNode, the first replica is written to the local disk.
If a reducer fails, it is run again.
After a successful chain of operations, temporary files are removed and the Application Master closes gracefully.
Q: What are major types of NoSQL databases commonly used for big data?
In recent years, the growing prevalence and relevance of big data has dramatically increased the need to store and process huge quantities of data. This has caused a shift away from more traditional relational database systems to other approaches and technologies, commonly known as NoSQL databases. NoSQL databases are typically simpler than traditional SQL databases and typically lack ACID transactions capabilities, thereby making them more efficient and scalable.
The taxonomy of such databases might be created using a couple of different spaces. Although the table below groups them by data model, in the context of big data, consistency model would be another pivotal feature to consider when evaluating options for these types of datastores.
Inspired by Google's BigTable publication, stores data in records that are able to store anvery large number of dynamic columns.
In general, describe data using somewhat loose schema - defining tables and the main columns (e.g. ColumnFamilies in case of HBase) and being flexible regarding the data stored inside them, which are key-values.
Very fast writes
Easy to scale
Organizes data into tables and columns using a flexible schema
Easy to store history of operations with timestamps
Some versions (e.g., HBase) require the whole (usually complex) Hadoop environment as a prerequisite
Often requires effort to tune and optimize performance parameters
Designed around the concept of a "document"; i.e., a single object encapsulating data in some standard format (such as JSON, XML, etc.).
Instead of tables with rows, operates on collections of documents.
Simplicity gives significant gains in scalability and performance
Many real-word objects are in fact documents
Multiple implementations support indexing out-of-the-box
Tracking any kind of relation is difficult
No implied schema puts more effort on the client
Q: What are Pig, Hive, and Impala? What are the differences between them?
Pig, Hive, and Impala are examples of big data frameworks for querying and processing data in the Apache Hadoop ecosystem. While a number of other systems have recently been introduced (notable mentions include Facebook’s Presto or Spark SQL), these are still considered “the big 3” when dealing with big data challenges. The table below provides a basic description and comparison of the three.
Comparitive Overview: Hive, Pig, and Impala
Introduced in 2006 by Yahoo Research. Ad-hoc way of creating and executing map-reduce jobs on a very large datasets
Introduced in 2007 by Facebook. Peta-byte scale data warehousing framework.
Introduced in 2012 by Cloudera. Massively parallel processing SQL query engine on HDFS.
PigLatin, procedural, data flow oriented
HiveQL, SQL-like syntax, declarative
Runs MapReduce jobs
Runs MapReduce jobs
Custom execution engine; does most of its operation in-memory
Batch processing or ETL
Batch processing, ad-hoc queries
Relatively easy to extend with user-defined functions (UDF)
Relatively harder to extend with UDF
Supports native (C++) and Hive UDFs
High fault-tolerance of queries
High fault-tolerance of queries
If a node fails, the query is aborted
Q: What are Apache Tez and Apache Spark? What kind of problems do they solve (and how)?
One issue with MapReduce is that it’s not suited for interactive processing. Moreover, stacking one MapReduce job over another (which is a common real-world use case when running, for example, Pig or Hive queries) is typically quite ineffective. After each stage, the intermediate result is stored to HDFS, only to be picked up as an input by another job, and so on. It also requires a lot of shuffling of data during the reduce phases. While sometimes it is indeed necessary, in many cases the actual flow could be optimized, as some of the jobs could be joined together or could use some cache that would reduce the need of joins and in effect increase the speed significantly.
Addressing these issues was one of important reasons behind creating Apache Tez and Apache Spark. They both generalize the MapReduce paradigm and execute the overall job by first defining the flow using a Direct Acyclic Graph (DAG). At a high level of abstraction, each vertex is a user operation (i.e., performing some form of processing on the input data), while each edge defines the relation with other steps of the overall job (such as grouping, filtering, counting, and so on). Based on the specified DAG, the scheduler can decide which steps can be executed together (and when) and which require pushing the data over the cluster. Additionally, both Tez and Spark offer forms of caching, minimizing the need to push huge datasets between the nodes.
There are some significant differences between the two technologies though. Apache Tez was created more as an additional element of Hadoop ecosystem, taking advantage of YARN and allowing to easily include it in existing MapReduce solutions. It can even be easily run with regular MapReduce jobs. Apache Spark, on the other hand, was built more as a new approach to processing big data. It adds some new concepts, such as RDDs (Resilient Distributed Datasets), provides a well-thought-out idiomatic API for defining the steps of the job (which has many more operations than just map or reduce, such as joins or co-groups), and has multiple cache-related features. The code is lazily evaluated and the Direct Acyclic Graph is created and optimized automatically (in contrast, in the case of Tez, the graph must be defined manually). Taking all of this into account, the code in Spark tends to be very concise. Spark is also a significant part of the Berkeley Data Analytics Stack (BDAS).
Q: What is a column-oriented database? When and why would you use one?
Column-oriented databases arrange data storage on disk by column, rather than by row, which allows more efficient disk seeks for particular operations. This has a number of benefits when working with large datasets, including faster aggregation related queries, efficient compression of data, and optimized updating of values in a specific column across all (or many) rows.
Whether or not a column-oriented database will be more efficient in operation varies in practice. It would appear that operations that retrieve data for objects would be slower, requiring numerous disk operations to collect data from multiple columns to build up the record. However, these whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. In a rolodex application, for instance, operations collecting the first and last names from many rows in order to build a list of contacts are far more common than operations reading all data for a single entity in the rolodex. This is even more true for writing data into the database, especially if the data tends to be “sparse” with many optional columns.
It is also the case that data organized by columns, being coherent, tends to compress very well.
For these reasons, column stores have demonstrated excellent real-world performance in spite of any theoretical disadvantages.
Q: What is a Lambda Architecture and how might it be used to perform analytics on streams with real-time updates?
One of the common use cases for big data is real-time processing of huge volumes of incoming data streams. Depending on the actual problem and the nature of the data, multiple solutions might be proposed. In general, a Lambda Architecture approach is often an effective way to address this type of challenge.
The core concept is that the result is always a function of input data (lambda). The final solution should work as such a lambda function, irrelevant of the amount of data it has to process. To make this possible, the data is split into two parts; namely, raw data (which never changes and might be only appended) and pre-computed data. The pre-computed data is further subdivided into two other groups; namely, the old data and the recent data (where “old” and “recent” are relative terms, the specifics of which depend on the operational context).
Having this distinction, we can now build a system based on the lambda architecture consisting of the following three major building blocks:
Batch layer. Manages the master dataset (an immutable, append-only set of raw data) and pre-computes arbitrary query functions (called batch views).
Serving layer. Indexes the batch views to support ad-hoc queries with low latency.
Speed layer. Accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.
There are many tools that can be applied to each of these architectural layers. For instance, the raw results could simply be stored in HDFS and the batch views might be pre-computed with help of Pig jobs. Similarly, to make it possible to query batch views effectively, they might be indexed using technologies such as Apache Drill, Impala, ElasticSearch or many others.
The speed layer, though, often requires a non-trivial amount of effort. Fortunately, there are many new technologies that can help with that as well. One of the most commonly chosen ones is Apache Storm which uses an abstraction of spouts and bolts to define data sources and manipulations of them, allowing distributed processing of streaming data.
The final element is to combine the results of the Speed layer and the Serving layer when assembling the query response.
This architecture combines the batch processing power of Hadoop with real-time availability of results. If the underlying algorithm ever changes, it’s very easy to just recalculate all batch views as a batch process and then update the speed layer to take the new version into account.
Without a doubt, big data represents one of the most formidable and pervasive challenges facing the software industry today. From social networking, to marketing, to security and law enforcement, the need for large scale big data solutions that can effectively handle and process big data is becoming increasingly important and is rapidly on the rise.
Real expertise and proficiency in this domain requires far more than learning the ins and outs of a particular technology. It requires an understanding, at the first principles level, of the manifold technical challenges and complexities involved as well as the most effective methodologies and problem-solving approaches to address these hurdles.
We hope you find the questions presented in this article to be a useful foundation for “separating the wheat from the chaff” in your quest for the elite few among Big Data engineers, whether you need them full-time or part-time.