From Vendor to In-House: How eBay reimagined its analytics landscape — Part Two
The first part of this series provided the background on why eBay decided to move from vendor to in-house for analytics and what the shift meant for the company. The second part will explore the new ecosystem and how the migration was executed.
By Ishita Majumdar, Medha Samant, Naveen Dhanpal
Building the Hadoop-Spark Ecosystem
The new ecosystem needed to provide the same capabilities as the existing one. End users’ tolerance for any degradation in experience was very low, and the transition had to be seamless. A detailed analysis yielded the following categories as crucial to providing a seamless transition to the new Hadoop-Spark system:
- Executing interactive queries — Match industry standards for SQL execution speed at scale. In other words, irrespective of the number of users accessing the system at any given time, every user would expect their queries to be executed in a matter of seconds. To offer this kind of performance on Hadoop meant building a dedicated SQL-on-Hadoop engine.
- Feature parity — The eBay analytics community had a massive inventory of customized SQL scripts, reports, applications, and complex processes leveraging several features and functions of the vendor system that weren’t readily provided by Hadoop and Spark. This inventory needed to be migrated and fully supported by the new ecosystem.
- Connectivity patterns — The Hadoop-Spark environment was expected to support established connectivity patterns that had evolved over the years while adhering to new, more stringent security standards.
- Tools integration — The new solution needed to integrate with tooling where users could write and execute SQL and Python code, as well as connect to vendor Business Intelligence and Data Science applications.
In summary, two components were crucial to building the new ecosystem: a dedicated SQL-on-Hadoop engine and the end-user facing SQL authoring application.
eBay’s SQL-on-Hadoop Engine
eBay’s SQL-on-Hadoop engine is based on open source and built for high availability, security, and reliability. During its inception, the primary goal was to replace the commercial data warehouse that specializes in speed, stability, and scalability. Consequently, the team developing the engine needed to address two key questions right away:
- Is it fast and stable enough to handle the large volume of workloads running on the vendor warehouse?
- Is there parity with the functionality provided by the vendor warehouse?
eBay’s SQL-on-Hadoop engine offered high scalability and flexibility, but the question of performance still lingered. Its core component is a customized SparkSQL engine built on Apache Spark 2.3.1 with rich security features fully compliant with eBay’s standards. Closing the performance gap required significant optimizations at both the software and hardware levels. Let’s take a closer look at some of these optimization strategies:
- Custom Spark Drivers: By introducing a custom Spark Driver that functions as a long-running service, the engine was able to support a high volume of concurrent Spark sessions, increasing the elasticity of the system and providing isolated session management. Compared to the traditional Spark launch mode, connectivity and initialization time dropped from 20 seconds to 1 second. Furthermore, by leveraging YARN Dynamic Allocation, the engine allocates executor resources based on need, improving overall cluster compute utilization.
- Transparent Data Cache Layer: Scanning the vast Hadoop Distributed File System (HDFS) cluster directly would introduce instability and degrade performance. To tackle this, a transparent data cache layer with well-defined cache life cycle management was introduced in eBay’s SQL-on-Hadoop engine. The layer automatically caches the most-accessed datasets in the local SQL-on-Hadoop cluster; as soon as the Spark runtime discovers that the upstream data has been refreshed, the cached copy expires and is rebuilt, and once the rebuild completes the runtime reactivates the local data scan. This data cache layer increased scan speed by 4x while significantly improving the stability of the system.
- Re-bucketing: Most of eBay’s data tables have a bucketed layout and are well suited to “sort merge joins,” which eliminate the need for additional shuffle and sort operations. But what happens if two tables have different bucket counts, or the join key differs from the bucket key? eBay’s SQL-on-Hadoop engine handles this scenario with its “MergeSort” and “Re-bucketing” optimization features. Consider a join of table A with table B, where table A has 100 buckets and table B has 500. Ordinarily, both tables would need to be shuffled before they could be joined. The “MergeSort” feature identifies that the bucket-count ratio of tables A and B is 1:5 and merges every five buckets of table B into one, bringing its bucket count down to 100 to match table A. This avoids shuffling all the data and executes the join much faster. Conversely, the “Re-bucketing” feature takes the table with the smaller bucket count (table A) and further divides each of its buckets into five, raising its bucket count to 500 to match table B before executing the join.
- Bloom Filter Index: This feature enables data pruning on columns that are not bucket or partition keys, for faster scanning. Bloom Filter Indexes are independent of the data files, so they can be applied and removed as needed.
- Original Design Manufacturer (ODM) Hardware: The full effect of software optimizations can be realized only if the hardware has the capacity to support them. eBay designs its own ODM hardware and was able to leverage a custom-designed SKU with high-performance CPU and memory specs tailored for the SQL-on-Hadoop Spark engine, providing maximum computing capability.
- Update and Delete Operations: Traditional commercial databases with ACID (Atomicity, Consistency, Isolation, Durability) properties provide full CRUD (Create, Read, Update, Delete) operations. The current open-source Hadoop framework lacks ACID properties and supports only Create and Read. Without Update and Delete operations, thousands of analysts and engineers would have had to learn and adopt heavyweight Hadoop ETL technology to perform their day-to-day functions. This was a deal-breaker for eBay. Using Delta Lake, Apache Spark was enhanced to fully support Update and Delete operations, including use-cases with complex joins.
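The transparent cache layer's life cycle described above can be sketched in plain Python. This is a conceptual illustration only, not eBay's implementation; the upstream version check stands in for what the Spark runtime does when it detects an upstream refresh.

```python
class TransparentCache:
    """Conceptual cache layer: keeps a local copy of hot datasets and
    rebuilds it whenever the upstream version moves past the cached one."""

    def __init__(self, read_upstream, upstream_version):
        self._read_upstream = read_upstream        # loads data from upstream HDFS
        self._upstream_version = upstream_version  # reports latest upstream version
        self._cache = {}                           # table -> (version, data)

    def scan(self, table):
        latest = self._upstream_version(table)
        cached = self._cache.get(table)
        if cached is None or cached[0] != latest:
            # First access or upstream refresh: expire and rebuild the cache
            self._cache[table] = (latest, self._read_upstream(table))
        return self._cache[table][1]

# Illustrative upstream: a version counter and a data source
versions = {"orders": 1}
data = {"orders": ["row-a", "row-b"]}
cache = TransparentCache(lambda t: list(data[t]), lambda t: versions[t])

first = cache.scan("orders")   # populates the local cache
versions["orders"] = 2         # simulate an upstream refresh
data["orders"].append("row-c")
second = cache.scan("orders")  # stale version detected; cache rebuilt
```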
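The MergeSort bucket arithmetic can likewise be illustrated with a toy sketch. It assumes Spark-style hash bucketing (bucket = hash(key) % n); because 100 divides 500, each old bucket folds cleanly into one target bucket without a shuffle. Names and data are illustrative.

```python
def bucket_rows(rows, n):
    """Assign rows to n buckets by hash(key) % n, as Spark bucketing does."""
    buckets = [[] for _ in range(n)]
    for r in rows:
        buckets[hash(r) % n].append(r)
    return buckets

def merge_buckets(buckets, target):
    """Fold a table down to `target` buckets. Since target divides the
    original bucket count, every row in old bucket j already belongs in
    new bucket j % target, so no shuffle of the data is required."""
    assert len(buckets) % target == 0
    merged = [[] for _ in range(target)]
    for j, bucket in enumerate(buckets):
        merged[j % target].extend(bucket)
    return merged

table_b = bucket_rows(range(1000), 500)    # table B: 500 buckets
table_b_100 = merge_buckets(table_b, 100)  # now co-bucketed with table A (100)
```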
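For the Update and Delete support, Delta Lake's general approach is copy-on-write: matching rows are rewritten into new files, and a transaction log records which version is live. A toy sketch of that idea (not Delta Lake's actual API):

```python
class ToyDeltaTable:
    """Copy-on-write table: immutable snapshots plus a log of versions,
    mimicking how Delta Lake layers UPDATE/DELETE on immutable files."""

    def __init__(self, rows):
        self.versions = [list(rows)]  # transaction log: version -> snapshot

    def _commit(self, rows):
        self.versions.append(rows)

    def rows(self):
        return self.versions[-1]

    def update(self, predicate, transform):
        # Rewrite matching rows into a new snapshot; nothing mutates in place
        self._commit([transform(r) if predicate(r) else r for r in self.rows()])

    def delete(self, predicate):
        self._commit([r for r in self.rows() if not predicate(r)])

t = ToyDeltaTable([{"id": 1, "qty": 5}, {"id": 2, "qty": 7}])
t.update(lambda r: r["id"] == 2, lambda r: {**r, "qty": 9})
t.delete(lambda r: r["id"] == 1)
```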
These optimization strategies helped achieve the industry standards for SQL execution speed at scale. They enabled eBay data analysts and engineers to migrate to the new Hadoop-Spark environment without any performance degradation.
eBay’s SQL Authoring Tool
SQL authoring tools are the interface between end users (data analysts and data engineers) and data warehouses. eBay built a solution to serve end users with SQL authoring capabilities and much more — metadata management, advanced analytics and toolkits for efficient data operations. The first version of the tool was designed to provide SQL development capability, and it leveraged Apache Livy for connectivity to the underlying Hadoop data platform and for a two-way transfer of data. This version also provided a centralized toolkit to support the development lifecycle for engineers.
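Apache Livy exposes a REST API: a client opens an interactive session with POST /sessions, then runs code with POST /sessions/{id}/statements. A sketch of the request bodies such a tool would send (the endpoint host is hypothetical, and no request is actually made here):

```python
import json

LIVY_URL = "http://livy.example.com:8998"  # hypothetical Livy endpoint

def create_session_payload(kind="spark", executor_memory="4g"):
    """Request body for POST /sessions, which opens an interactive session."""
    return {"kind": kind, "executorMemory": executor_memory}

def run_statement_payload(code):
    """Request body for POST /sessions/{id}/statements, which runs a snippet."""
    return {"code": code}

# The authoring tool would POST these bodies to LIVY_URL and poll for results
session_body = json.dumps(create_session_payload())
statement_body = json.dumps(run_statement_payload("spark.sql('SELECT 1').show()"))
```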
Hadoop data processing capabilities were maturing with great speed, and the tool had to adapt at the same rate. A broader study of data analytics and its various stages yielded data exploration, interactive query analytics, and data visualization as the necessary capabilities for a powerful SQL authoring solution.
With these capabilities serving as the north star, the subsequent versions of eBay’s SQL authoring tool featured the following key components:
Data Exploration — Finding the right data is a precursor to any type of research or analysis, especially when dealing with hundreds of petabytes of data and thousands of tables. The solution was to integrate existing eBay metadata repositories with the Hadoop metastore.
Advanced Analytics — In addition to data querying capabilities, the tool was enhanced to provide quicker insights through interactive plotting and visualization of query results. It also supported multiple interpreters, such as PySpark and SparkR, to cover more use-cases for advanced users, and it leveraged Hadoop’s native features to visualize query execution flows for easier SQL optimization.
Automation — eBay’s SQL authoring tool gave analysts and engineers the ability to automate data models. Traditionally, users needed dedicated virtual machines to run data modeling scripts, which created maintenance overhead. The tool eliminated this with end-to-end functionality that lets users schedule their data modeling scripts directly. This feature brought significant relief to end users, who could now manage their interactive queries and data models in one place.
Making the Move
As the Hadoop-Spark platform was being built out, the migration effort was already underway. In the first quarter of 2020, production jobs from one of the two vendor systems began migrating to the new Hadoop-Spark system. With over 30,000 production tables, the first task was to determine the critical tables and establish a clear scope for migration. This provided an opportunity to clean up several legacy and unused tables; the final set came to 982 production tables to be migrated to Hadoop. All other tables were either retired or marked “End of Life,” to be removed when the vendor system is shut down after the migration completes.
Personal Databases and Interactive queries
The vendor solution provided a custom feature that allowed users to create personal databases, which they could use as sandbox environments for testing or temporary use-cases. Over time, a significant portion of analysts and end users came to rely on this feature for their day-to-day analyses, reports, and dashboards, so it became a critical component to migrate to Hadoop-Spark without any loss of data or functionality. Migrating these databases posed a couple of challenges:
- From a platform perspective, there were thousands of such databases, each serving unique use-cases that the platform team had no visibility into. The team had to rely on individual users to determine the criticality of their databases and decide whether they needed to be migrated to the new environment.
- Given the nature of these databases, many were created for one-time use, and the users who owned or created them may have left the company, making communication and outreach a challenge.
To address these challenges, the platform team first built a self-service tool that let users migrate their personal database tables from the vendor system to the new Hadoop-Spark system. To drive the effort to completion, the platform and migration teams analyzed the full list of databases, eliminated all tables unused for at least 365 days, and then took a deeper look at each database’s usage to arrive at a smaller set of databases to track for migration.
Training and Support
This effort involved moving users across a wide range of roles, responsibilities, and skills. Most users were accustomed to the vendor-provided ecosystem and found it challenging to reimagine their day-to-day tasks in a new environment. It was prudent to address any skill gaps and help users develop familiarity with the new system before encouraging them to make the move. In addition to all the engineering and design efforts that went into ensuring end users had a smooth transition, a solid foundation of training and support was necessary.
As a result, the migration team established a dedicated track to develop training material for various levels of user experience and technical complexity not only through wiki documents and training videos but through in-person classes and training drives with full-fledged course offerings tailored for users across the globe.
Several other learning and development avenues were established: custom office hours for each topic of concern to users at large, dedicated Slack channels, and 24x7 level 1 and level 2 support with clearly defined SLAs for ticket acknowledgement and resolution. By the end of the project, close to two thousand migration-related Jira tickets had been resolved.
In some special cases where teams needed dedicated migration support, temporary working groups (a.k.a. “tiger teams”) of engineers from all levels of the stack worked closely with end users to navigate the deepest parts of their processes and dependencies on the vendor platform and rebuild them in Hadoop-Spark, offering similar if not better performance and experience.
Driving the change
eBay’s analytics community is distributed across the globe. Executing a migration of this scale required tight collaboration and partnership despite logistical challenges with varying time-zones and a global lockdown due to the coronavirus pandemic. However, the willingness of eBay employees to collaborate and embrace the associated challenges enabled a seamless execution of the transition.
Velocity and agility
Another key aspect of making the move was the speed at which eBay engineering teams could roll out features and upgrades while users were transitioning. eBay is a highly agile organization, and this migration stayed within its planned timelines largely because of the velocity at which the product matured. For example, as users made the transition and explored the new system, they discovered several features essential to their activities. The SQL-on-Hadoop engine team gathered these requirements periodically, designed and developed changes, and rolled them out to production in one to two sprints. This not only gave end users confidence that their requirements were being addressed but also helped the product mature rapidly.
What does this mean for eBay?
By eliminating vendor dependency, this migration effort puts eBay in full control of its innovation, getting its users ready for the future of analytics with Hadoop and Spark. It not only results in significant cost savings but helps drive eBay’s renewed strategy of tech-led re-imagination. Most of all, it exemplifies the strength of collaboration and the technical expertise at eBay required for any significant undertaking of this size and scale in the future.