Alluxio to Present at ApacheCon about Data Orchestration for Machine Learning


DISCLAIMER: This blog is fully automated, unmonitored, and does not reflect the views of Garett MacGowan. The ML model may produce content that is offensive to some readers.


This blog post was generated with a (potentially) real article title as the only prompt. A link to the original article is below.

Generated: 9/15/2021
Alluxio to Present at ApacheCon about Data Orchestration for Machine Learning

Following last week's announcement from Alluxio and Red Hat of a strategic alliance to work together on open source data distribution and orchestration solutions, we received a couple of press inquiries about the news. I answered a few questions on the Alluxio mailing list about what Alluxio is planning to do, its strategy, and its overall impact on the open source world.

The original questions came from:

@fritsch: Does this mean that Alluxio now has a distribution and orchestration product based on the Mesos foundation?

No, it doesn't. The Alluxio project has always focused on being an abstraction layer on top of HDFS, and we have been making progress on a similar abstraction layer on top of Apache Mesos. When we announced the strategic alliance, the plan was to build the solution on that abstraction layer rather than integrating directly with Mesos, so that we could still leverage the rest of Mesos's capabilities. We are still working on this today, and we have several more releases to go before we formally announce it and open source the project.
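To make the "abstraction layer on top of HDFS" idea concrete, here is a minimal sketch (my own addition, not from the original announcement) of how an application written against the Hadoop FileSystem API can read data through Alluxio instead of talking to HDFS directly. The master hostname, port, and file path are placeholders, and it assumes the Alluxio client jar is on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AlluxioReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Register the alluxio:// scheme with Hadoop's FileSystem loader.
        // Requires the Alluxio client jar on the classpath.
        conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");

        // Hypothetical master host and path; 19998 is Alluxio's default RPC port.
        FileSystem fs = FileSystem.get(
                URI.create("alluxio://alluxio-master:19998/"), conf);

        // Same FileSystem API an HDFS application would use; only the URI scheme changed.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/datasets/train/part-00000"))))) {
            System.out.println(reader.readLine());
        }
    }
}
```

Because only the URI scheme changes, existing Hadoop-style applications can be pointed at the abstraction layer without rewriting their data-access code.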

I don't know if this will answer everything, but I wanted to give my best answers to some of the most frequently asked questions about Alluxio and Red Hat.

“What does that mean for Alluxio?”

I’m going to start off by talking about Alluxio — the project — and explain why it’s important.

I've been working in the Apache community since I joined Oracle to be part of the Apache Incubator project. You can read about those early days, from 2009 to 2012, through my history of contributions on the Apache Project Blog. But for the last four to five years, the Alluxio project has become one of the top contributions to the Apache Incubator by far; it has become the Incubator's biggest component as a number of other projects have grown in popularity around it.

We have been talking to a lot of people about the Alluxio project. Around 100 companies are using it at this point, and numerous companies have reached out to join Alluxio because they needed something for large-scale data, their own data lakes, and so on. We've built an ecosystem around Alluxio that is unlike how we have built other projects in the Incubator, with many companies using it and many users around the world.

Our maintainers have been very receptive to community contributions because it lets everyone work together rather than us doing all the development. And unlike OpenStack, we also have open source community support around the project, which is very important to us. The project started in 2011 as a fork of HDFS because Hadoop's developers weren't providing features we needed in order to build a data distribution platform. We had been using our own solutions to distribute data before, but we wanted a platform with better data distribution.

When we filed an RFE to merge the project back into HDFS, development of Alluxio continued because Hadoop's maintainers were somewhat apprehensive about the integration from the beginning. Some of this was just the normal bureaucratic process of getting a new project merged into a more established one. I started contributing soon after, writing much of the new implementation and fixing many of the bugs we had introduced while developing the project.

At some point in 2015 we became a company. I was at Oracle and had been planning to leave. When our company ran into funding issues and wasn't going to have funds available any more, I decided to move to the Apache Foundation to become part of the Alluxio project. To do that, we were required to form a new company called Alluxio, Inc. (I'm still CEO).

Since joining the community, I've worked on the Alluxio code base for over four years. On the mailing lists and at conferences, we've been talking to many people and companies about the project since we first started using it ourselves; now we are really the only project in the Incubator with large-scale users. We have one of the highest contribution rates and largest contributor bases in the Incubator, and our user count has been growing faster than any other project's.

Now it is the turn of the data center community to look up from day-to-day operations and begin using the data they've been creating and accumulating as part of their work.

In other news, we are about to open source our abstraction layer for Mesos as the alluxio-core project in a few weeks. We are still working on the project in the Incubator, getting ready to release a formal product to the community.

“Do you think Alluxio will influence Hadoop?”

Yes, and we have done this in a couple of different ways. First, I believe that as the data lake concept catches on and software developers start building datastores for their companies' data, this will influence the Hadoop community.

As the data lake concept grows, many of the questions about how to store and distribute large amounts of data are similar to the problems Hadoop solves. Hadoop solved those problems well when it was initially conceived, and the data lake concept has become a new way of approaching them. With the Alluxio project, we have been able to add functionality that we couldn't have added otherwise.
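As a loose illustration of that overlap (a sketch I'm adding here, not something from the original post), the same Hadoop FileSystem API can enumerate data in a data lake whether it lives in HDFS, an object store, or behind Alluxio; the URIs below are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DataLakeListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder URIs; any of these work with the right client jars on the classpath:
        //   hdfs://namenode:8020/lake
        //   s3a://my-bucket/lake
        //   alluxio://alluxio-master:19998/lake
        URI lake = URI.create(args.length > 0 ? args[0] : "hdfs://namenode:8020/lake");

        // The storage backend is chosen by the URI scheme; the listing code stays the same.
        FileSystem fs = FileSystem.get(lake, conf);
        for (FileStatus status : fs.listStatus(new Path(lake))) {
            System.out.printf("%s\t%d bytes%n", status.getPath(), status.getLen());
        }
    }
}
```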

“Do you think Oracle may have been interested in the project as well?”

Yes, I think they were. I'm sure they were when we initially started development, since it was the only project being built around these concepts at the time. That said, I believe the projects within the Apache Foundation are all treated pretty similarly, and none of them serves the interests of any one company.

“But Hadoop is open source! Isn’t that the same thing as Oracle?”

I’m going to come back to that point for a second, just so we’re clear about what we are talking about.

We at Alluxio are not interested in creating Hadoop clones. We have contributed to the Apache community, as a company, with a lot of passion and enthusiasm, and I don't believe we've brought proprietary technologies from our own research work into the project.

For instance, we created a project called Hadoop-HDFS-Plus with the intention of adding things to HDFS that we needed and that companies outside our own needed. For example, we needed a way to better manage large amounts of data and data processing, and Hadoop doesn't have great solutions for those problems. Companies have reached out to us through the Alluxio project about HDFS-Plus.

But that doesn't change the fact that Hadoop was founded on the philosophy of an open project with a shared mission, and the goal was to have that mission succeed as well as possible regardless of whether a company like Oracle stepped in to fund the efforts it deemed important. We have had the same goal of getting companies to use our project, even when a company had a competing product. We just wanted to make it better for all the people who needed a data distribution platform.

“Aren’t all databases based on open source? Oracle doesn’t develop anything proprietary; that would mean it would stop being able to run on hardware from other companies.”

They do sometimes have proprietary code, just as Linux is licensed under the GNU General Public License but ships alongside proprietary elements in the kernel. Hadoop and all the data in it would still be available from the Apache Foundation and many companies even if Oracle hadn't bought some of them. We make a lot of our code available in a public GitHub repository, and we license it under the Apache License.

The Alluxio project isn't about putting proprietary pieces into a public project. We have used free software for our own project from the beginning in 2009, and we have never had a problem licensing our own code. Many companies use it, and the project has become an umbrella for several other Apache projects that use its functionality, since it is such a large component.

We don't have any interest in having proprietary pieces of Alluxio. The Alluxio project is about making HDFS better; it was originally a fork of the HDFS project for the same purpose we are pursuing with Alluxio today. If Hadoop ever adopts a proprietary model built on Hadoop Plus, which covers the features our company wants to deliver to other companies and the way we think Hadoop should continue, we would have no problem with it. As of now, we have no interest in that.

We are working on open source technologies with the other Apache projects, and we think people would like to see Hadoop get better for us, and for other companies that also need a data distribution platform, and to have as many people as possible use Hadoop.

“Are there other competing platforms within the Apache Foundation?”

Garett MacGowan

© Copyright 2023 Garett MacGowan.