How to Build a Data Lake On-Premises


Finally, to summarize all of this: what you're doing is [00:24:30] basically speeding up time to insights. You're consolidating all your data onto this one single, massively scalable platform, so you can go from a few terabytes to multiple petabytes and more, consolidating all your storage into a single device, and then you can bring compute to it whenever you need it. Again, it delivers consistent performance and consistent security. The second shift that we're seeing is, of course, with the cloud, people are moving towards object storage for unstructured data. Like Dave said, I'm a senior solutions manager for analytics and AI at Pure Storage. In this session, learn how to modernize your legacy data lake or warehouse to create a modern Kubernetes-based platform with Dremio and FlashBlade, and how this architecture enables radical improvements in time-to-value for your data. The storage that I would build to meet these requirements would be an object store; it would be capable of many things, and let's see what capabilities we should bring here. So let's see how our data architectures have evolved to [00:07:30] take into account this paradigm shift that we've seen over the last 10 years. Back in 2015 and before, we all had these Hadoop clusters and data warehouses where your data and compute were co-located.
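As a rough sketch of what that open data layer looks like in practice, here is a hedged example of writing a Parquet dataset to an S3-compatible object store (a FlashBlade bucket, for instance). The endpoint URL, bucket name, and credentials below are placeholders for illustration, not values from the talk.

```python
# Hypothetical sketch: land a Parquet dataset on an S3-compatible object store
# so any engine (Dremio, Spark, etc.) can read it later. The endpoint, bucket
# and keys below are placeholders, not real values.
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(
    key="ACCESS_KEY",            # placeholder credential
    secret="SECRET_KEY",         # placeholder credential
    client_kwargs={"endpoint_url": "https://objectstore.example.com"},  # assumed on-prem S3 endpoint
)

# A tiny sample table standing in for IoT / log / telemetry data.
table = pa.table({"device_id": [1, 2, 3], "reading": [0.7, 1.3, 2.1]})

# Writing open-format files into a shared bucket is what makes the layer
# "open": no engine-specific silo, just Parquet on object storage.
pq.write_to_dataset(table, root_path="datalake/telemetry", filesystem=fs)
```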

First, we have unpredictable performance. You've got data pipelines that service various teams with various requirements, and their jobs [00:04:00] might be slow, their queries might be slowing them down; anybody whose query gets stuck is just going to give up and not use the system, right? There's also MLOps, which is a super big buzzword in the industry right now. You'd have to have people who are extremely technically smart to manage petabytes and petabytes [00:35:00] of data storage, and with Pure you can literally set it and forget it; it will just exist, you don't have to manage it, you don't have to tune performance on it, it's just going to keep delivering that performance and scale, simply, that's it. And again, go check out that Field Day talk by Brian Gold from Pure Storage, where he explains how we built this ground-up architecture to scale. So you can bring compute to wherever you need it, rather than allocating specific compute silos. Actually, if you go to YouTube and just search for FlashBlade, there's something about Field Day with Brian Gold, where he actually walks through all of the design that he's gone through. Today, unstructured, machine-generated data is just growing exponentially, everybody knows that: IoT data and geospatial data generated by devices, video generated by cameras, log data. His team curates best practices to simplify management while delivering performance at petabyte scale for software such as Elasticsearch, Apache Spark, Apache Kafka, TensorFlow, etc. It's got something called SafeMode, which locks it against ransomware attacks, so you bring consistent [00:25:00] performance, security, and everything. FlashBlade is managed from the cloud, so it's very simple to manage; you can literally forget about it, and it can be managed with the APIs and the latest [inaudible 00:25:10], so you can just forget about managing storage.
It's actually not only storage that's going to do it; there's storage, [00:35:30] compute and networking built into every blade in FlashBlade, so that you have that linear increase in performance as you scale, and you'll see the details in that video. Let's go ahead and open it up for Q&A. You can put a bunch of SSDs together and make it work for, like, a few terabytes of data. Here, I'll do it for you, I'll paste that question into your Slack channel and I'll post the link, just give me a second here. Okay. We'll try and get that [00:32:30] over Slack, maybe. I don't think it'll get added in, but just for reference, there's another question that was, "Is FlashBlade different from a standalone box of SSDs?" There's no need for any of that. Great. This is what you have in mind, so let's look at [00:11:00] storage and how we bring this paradigm to storage. Good evening, wherever you may be. Years of experience in AI/machine learning research and leading engineering teams in various areas: software development, DevOps, data science and MLOps. Thank you, guys. Or if I have a lot of queries but on very little data, then I just have to buy this much storage, right? Unified Fast File and Object helps you bridge the gap between existing infrastructure, which is maybe an HDFS cluster, and a modern data lake. You want code that's going through a CI/CD process and that's ready for production anytime. And I know DataOps is a very buzzy word right now. And finally, multi-protocol support: you don't want to bank all your dollars on one particular protocol. This is a fantastic shift; it really brought elasticity and agility to the cloud world. What we're seeing in 2020 and beyond, especially with innovators just like Dremio, is you're [00:09:00] seeing cloud data lakes that are built on open data, where there's a separation between compute and data, where you have an open data layer on top of your storage that may be built on open metadata standards, open file formats like Parquet, and open table formats such as Delta Lake and Iceberg and other data formats, and then you've got this open data layer on top of your storage [00:09:30] and that open data layer is accessed by various applications via Dremio, Spark or [inaudible 00:09:37] or whatever the application may be. One is, most people buy FlashBlade for simplicity; if you just put a bunch of SSDs together, you need to manage those, and like you said, performance is going to be inconsistent, you have to tune it for the different application workloads. Good afternoon. So [00:33:00] all right, let me copy the link. Hey guys. So these are the requirements driving modern data platforms today, and if you can capitalize on these, you'd be well ahead of the competition. It's going to bring you consistent disaggregation between compute and storage, so you can bring a lot of compute to a problem for a few minutes, and then take it away for another problem, right?
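To make that multi-protocol point concrete, here is a minimal, hedged sketch of plain S3-protocol access to the same on-prem store using boto3; again, the endpoint, bucket, and keys are assumptions for illustration only.

```python
# Minimal S3-protocol access against an on-prem object store endpoint.
# All names and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",  # assumed on-prem S3 endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# The same Parquet files written earlier are visible here over the S3 API,
# so a cloud-style application needs no special connector for on-prem data.
resp = s3.list_objects_v2(Bucket="datalake", Prefix="telemetry/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```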
And also it should be native to that platform, so the performance is good no matter what protocol the application is using [00:18:30] to access data. We got one more question here. It's clear that, as forward-thinking companies say, in 10 years most of the code generated will be AI and ML code. Different applications use different protocols; cloud-like applications [00:18:00] use an S3, it'll be a [inaudible 00:18:03] protocol, right? It is difficult to plan capacity ahead of time: you plan for something, and then you add a node or remove a node from a cluster and suddenly your data starts rebalancing and you have to move data from one location to another, you need to install a patch, it's just complex, [00:05:30] and the complexity scales with the data, so you start with a few terabytes of data or less than that and you start scaling to more users, you start scaling to more data, you start scaling to more clusters and nodes, and what happens is complexity goes through the roof along with your scale. It could be some kind of deep learning software; you have to keep performance tuning, and users are always complaining about query speeds not being there or something not functioning, so you have to keep performance tuning. All of these add cost and complexity, and you guys are well aware of that. So what you want to do is run some nodes with just the operating system and the basic functions on the local SSDs, on the local drives, and keep all your data on a centralized file and object store that we call UFFO, and that way you create an open data layer that can be used by any application, so you're not locking yourself into silos; to me, that's the difference between FlashBlade and a RAID array of SSDs. That's the value that [inaudible 00:35:18]. All right, let's take a look and see what we got over there. So please join us there, or you can go to the [inaudible 00:31:20] and check out the booths and the demos that you'll find there and some awesome giveaways, and thanks everyone, enjoy the rest of the conference, and thanks, Naveen. Naveen: Thank you so [00:31:30] much for the session. And if you want to learn more about this, there are many customers that are doing this today, and we've got a very technical document written by Joshua. Well, we're way over now. You would [00:34:00] have multi-protocol access; there's a lot of engineering that's gone into building FlashBlade from the ground up. Next, it needs to be cloud-ready even if you're on premises; it needs to be an agile infrastructure, which [00:16:30] gives you the flexibility to bring compute to the data, with disaggregated compute and storage, and also provides you consumption choices that are cloud-like, right? And finally, each piece of analytics software you have in your pipeline, whether it's Spark or Splunk [00:07:00] or Elastic or whatever it may be, maybe Dremio. So thanks again, Naveen, for sticking around a little extra, and thanks for your talk.
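As a sketch of that "stateless compute, centralized data" pattern, the following assumes a PySpark session with the hadoop-aws (s3a) connector on the classpath and points it at an S3-compatible endpoint; the host, credentials, and paths are placeholders rather than anything from the talk.

```python
# Sketch: ephemeral Spark compute reading from a central object store.
# The cluster can be torn down after the job; the data never moves.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("open-data-layer-demo")
    # Point the s3a connector at the on-prem endpoint instead of AWS (assumed URL).
    .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the shared Parquet data and run a simple aggregation.
df = spark.read.parquet("s3a://datalake/telemetry/")
df.groupBy("device_id").count().show()

spark.stop()  # compute goes away; the Parquet files stay on the shared store
```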
Here is Portworx. [00:24:00] You build an open data lake on top of a Pure FlashBlade, with whatever metastore, open data formats, tables or files, Parquet files, Delta Lake tables on top of it, and we've tested this and we've seen this works very, very well. Folks, if you have any other questions, let's go ahead and get them in. Let's get started with the agenda. MinIO is just like the quick, cheap and dirty version of that, and it's basically yes, but you could use our Portworx software to completely manage all your storage. As you support various use cases, more data sources, going from simple dashboards to machine learning, to actually [productionizing 00:25:31] [00:25:30] machine-learning-based software, right? For those of you who are still here, it looks like there's still 20 or so people here, just in the chat here. Yeah. What [00:02:30] developers want, what organizations want, is to automate their data pipelines and make them self-service. [00:26:00] And finally, it simplifies operations as you scale; like I said, Pure [inaudible 00:26:05] is managed from the cloud, it can be consumed as a service, it's completely storage as a service, you only pay for what you use, and you never have to be down for any upgrade or patching, and even if you need to do a controller upgrade, that's all covered with Pure's Evergreen guarantee. If you have more nodes, then you have frequent failures, so with hundreds of nodes, again, managing hundreds of nodes [00:06:30] is complex, patching them and securing them, and there are going to be lots of failures happening all the time; either one has its problems. [00:19:30] The other layer, on top of this storage: we spoke about building that open data platform, and we also need another piece of software to manage storage for your containers. As you spin up and spin down containers, you want those to be automatic, and the storage needs to be allocated when a container is spun up, and also when there's a failure scenario, [00:20:00] when there's a need for backup, when there's a need to migrate data, when we need to create a dev/test environment, when there's a need to encrypt that data, when one container fails and Kubernetes takes action to create another container to cover failure scenarios or scaling scenarios; all of those storage needs do need to be addressed, and you need a [00:20:30] Kubernetes data services platform to address all those requirements, and Pure Storage acquired a company called Portworx, which is the industry's leading Kubernetes data services platform, available for that. It can be used for building, automating, and protecting your cloud-native applications, with modules for core storage, backup, disaster recovery, application data migration, security, and infrastructure automation; all of that is taken care of [00:21:00] with this hundred-percent software solution, the Kubernetes data services platform.
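For the Kubernetes data services layer, here is a hedged sketch of requesting a volume from a Portworx-backed StorageClass with the official Kubernetes Python client; the StorageClass name, namespace, and size are assumptions, and the actual class would be whatever your Portworx installation exposes.

```python
# Sketch: request container storage declaratively; the data services layer
# (e.g. Portworx) provisions it when the claim is created. Names are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="analytics-scratch"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="portworx-db-sc",  # assumed Portworx-backed StorageClass
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
    ),
)

# Submitting the claim is all the application has to do; replication, backup
# and migration policies live in the data services layer, not in the app.
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="analytics", body=pvc
)
```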

Let's say you have a certain analytics cluster. You can add higher-capacity nodes, [00:06:00] and when you add higher-capacity nodes, you know what's going to happen: when one of those nodes fails, it's going to cause a huge amount of rebalancing in your cluster, especially with direct attached storage, right? I'm talking about direct attached storage, where you have a hyper-converged infrastructure, where you have nodes. And if you're using multiple clusters, different types of clusters, you may be under-utilizing resources in one area and over-utilizing resources in another area, and you cannot keep trying to rebalance those. Below the Kubernetes layer, you're going to have a layer for data management services for Kubernetes. So as a container is spun up or spun down, the data management services layer is going to provide the storage to the Kubernetes layer, and then you're going to have a layer, [00:11:30] which is your modern data lake layer, which is based on open data formats, and this layer is going to be built on top of block or object store, or it could be more legacy systems; it's going to be built on a [inaudible 00:11:49]. So let's double-click into that storage layer. I'm from Pure Storage, obviously I'm going to double-click into that storage layer and just find out, like, what are some [00:12:00] of the requirements of that storage layer in this modern data analytics world? And what are some of the key market drivers for this layer, for data today; actually, not just the storage layer, what are the key market drivers for modern data delivery today? So let's talk about how, in the context of Dremio, these applications and this architecture are going to help you. Okay, thanks Naveen. So that's going to help you tremendously speed up time to insights, and also increase your agility. We want to take that cloud-like approach where the storage and the compute [00:29:00] are disaggregated. [00:15:00] So, let's look at, like, if we had some magic pixie dust, what kind of storage would we build to meet these requirements? First, multidimensional performance: no matter what application I throw at it and whatever [00:15:30] the data is, the data could be of different sizes, it could be either sequentially accessed or randomly accessed, it could be batch or real-time jobs, it could be a large number of small files or a small number of large files; whatever the file sizes, whatever the characteristics of the app, I need to deliver high-throughput, low-latency, consistent performance, and that's key. The second, it needs to be an intelligent architecture built on today's technologies; today's storage demands flash, [00:16:00] right?
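To illustrate "bring compute to the data for a few minutes and then take it away", here is a small, hedged sketch using DuckDB's httpfs extension to query the same Parquet files directly on the S3-compatible store; the endpoint, credentials, and path are placeholders, not values from the talk.

```python
# Sketch: short-lived, ad-hoc compute pointed at the shared object store.
# Nothing is copied or rebalanced; the query runs and the process exits.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_endpoint='objectstore.example.com';")   # assumed on-prem endpoint
con.execute("SET s3_access_key_id='ACCESS_KEY';")           # placeholder credential
con.execute("SET s3_secret_access_key='SECRET_KEY';")       # placeholder credential
con.execute("SET s3_url_style='path';")

# Aggregate straight over the Parquet files sitting on the object store.
rows = con.execute(
    "SELECT device_id, COUNT(*) "
    "FROM read_parquet('s3://datalake/telemetry/*.parquet') "
    "GROUP BY device_id"
).fetchall()
print(rows)
```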