About this transcript: This is a full AI-generated transcript of Data + AI Summit Keynote 2026 — Day 1 from Databricks, published June 27, 2026. The transcript contains 32,692 words with timestamps and was generated using Whisper AI.
"The world created an artificial brain, then locked it in a room. It can answer almost any question, but it's isolated, cut off from your operations, your world. It knows how to think. Now let's show it what's real. Real is every decision, every transaction, every signal, every discovery, every..."
[00:00:00] The world created an artificial brain, then locked it in a room.
[00:00:15] It can answer almost any question, but it's isolated, cut off from your operations, your world.
[00:00:25] It knows how to think. Now let's show it what's real.
[00:00:33] Real is every decision, every transaction, every signal, every discovery, every promise, every life changed.
[00:00:46] Together, we give the brain context with all of your data.
[00:00:50] But you're in control. You decide what it knows, what it can access, who can access it, what it can do.
[00:01:00] Run it on anything, any way you want.
[00:01:06] Because the ultimate brain was just the beginning.
[00:01:10] The real breakthrough is what you build with it.
[00:01:14] The systems you transform.
[00:01:17] The discoveries you unlock.
[00:01:20] The lives you change.
[00:01:24] The possibilities are in your hands.
[00:01:27] Five, four, three, two, one.
[00:01:33] Let's make it real.
[00:01:42] Welcome to the stage, Databricks co-founder and CEO, Ali Ghodsi.
[00:01:48] All right, super excited to be here.
[00:02:02] That was an awesome video.
[00:02:03] I don't know if many of you in the audience maybe identified.
[00:02:05] Many of those clips that we were showing are actually customers and their use cases that you could see.
[00:02:10] Hopefully we play it again and you can see it again.
[00:02:12] Okay, so super excited to be here.
[00:02:15] We have an awesome event.
[00:02:16] This is actually the largest Data + AI Summit that we've ever done.
[00:02:19] Okay, so super cool.
[00:02:24] So just a couple of days ago I was looking and we only had 97,000 people signed up.
[00:02:29] But we now crossed 100,000.
[00:02:30] So super excited about that.
[00:02:33] My favorite thing is actually like we have 174 countries represented, which is an insane number.
[00:02:39] Okay, this is the number of languages.
[00:02:40] So it's truly an international, global event.
[00:02:43] There's people from all over the world that have actually flown in for this.
[00:02:47] So super exciting.
[00:02:49] And we actually have over 31, actually 30, they have the number exact down.
[00:02:53] Okay, so 31,309 people, which makes it the largest data and AI conference in the world.
[00:03:01] So super cool.
[00:03:05] And this conference, as you know, is really a community event.
[00:03:10] And it started in its roots with open source.
[00:03:13] We started the Data + AI Summit.
[00:03:15] It was called Spark Summit back in the day in 2013.
[00:03:18] And we started with the open source project Spark, which now actually has over 3 billion downloads a year.
[00:03:25] But we've been adding open source projects over the years.
[00:03:28] So, you know, Delta Lake, Iceberg, MLflow, Unity Catalog that we open sourced.
[00:03:33] But we've also added really Postgres.
[00:03:34] We've really embraced the Postgres open source project.
[00:03:38] So you'll see much more about that throughout this conference.
[00:03:41] In fact, Postgres, just Databricks is really embracing it.
[00:03:45] We have one-seventh of the governors of the project that actually work for us.
[00:03:49] And over 6% of the committers of the project. We're very excited about the Postgres project.
[00:03:53] We think that's the future of databases.
[00:03:56] And this weekend, my co-founder, Matei, open sourced the project called Omnigent.
[00:04:01] Okay, so it's a meta harness.
[00:04:03] So check it out.
[00:04:04] I'll talk more about it.
[00:04:06] And you'll also hear from him talking about it.
[00:04:08] Okay, so that's the community.
[00:04:10] We also have a lot of people from the community that are speaking at this conference.
[00:04:13] I'm very excited about that.
[00:04:14] We're going to hear from the co-founder and CTO of OpenAI, Greg Brockman, Satya Nadella,
[00:04:19] many other leaders and CEOs and co-founders of various companies, AI companies like Anthropic,
[00:04:25] Cognition, Glean, Decagon, Agno, LlamaIndex, CrewAI.
[00:04:30] So all of those are here.
[00:04:31] Super exciting program.
[00:04:33] And they're all part of a bigger ecosystem that's gathered here this week.
[00:04:37] Ecosystem includes system integrators, people who built on top of Databricks, or data sharing and AI partners that we have.
[00:04:45] This is super exciting.
[00:04:47] And I want to give a special thank you to all of our partners that made this event possible.
[00:04:51] So let's give them a round of applause.
[00:04:57] So please check them out.
[00:05:00] They, as well as the companies that I showed in the previous slides, are in the expo hall.
[00:05:03] They have a lot of interesting innovations.
[00:05:05] So go check those out.
[00:05:06] And this conference would not be possible without all of our customers.
[00:05:11] They're the ones that are actually doing the truly amazing things on top of the platform.
[00:05:16] All those use cases that we talk about, that's really where it's happening.
[00:05:19] It's all of you that are actually implementing all these use cases and actually driving the data and AI change in the world that's happening.
[00:05:26] So very excited about those.
[00:05:28] I want to highlight just three of them because I think they're inspirational.
[00:05:32] Number one is Insulet.
[00:05:35] So Insulet has a product called Omnipod.
[00:05:38] And what the Omnipod does is that it helps people with diabetes.
[00:05:43] It collects, actually, has a glucose monitor.
[00:05:45] And the glucose monitor just measures the level of glucose in the blood.
[00:05:50] And then, of course, leveraging lots of data and AI, they can actually personalize exactly what's happening with your glucose level and release insulin so you can manage your diabetes.
[00:06:02] So game-changing for all the folks that have diabetes in the world.
[00:06:06] So that's just one of the main, many 20,000 impacts that, you know, we've talked about here.
[00:06:11] Another favorite example of mine, and actually many of the early Databricks employees have one of these devices, is a Tonal device, okay?
[00:06:18] So this is a really cool device.
[00:06:19] It's just that big, and you put it on the wall, but then you can do all your workouts on it, and it just creates this crazy resistance so you can do the workout on the device.
[00:06:29] But what's really cool about it is that they're collecting 100 billion seconds every year of workout data,
[00:06:34] and they're personalizing it for you so that you get, actually, a personalized workout.
[00:06:40] So it's like your personal PT with diet advice, and all of that, of course, possible thanks to leveraging data and AI on the platform.
[00:06:49] So super awesome.
[00:06:51] Okay, and then the final one, I'm really excited about this one, is Merck.
[00:06:56] Merck actually trained a transformer-based language model.
[00:07:01] So it's called TEDDY.
[00:07:02] TEDDY stands for Transformer-Enabled Drug Discovery.
[00:07:06] Okay, so it's one of those transformer models.
[00:07:08] It's like an LLM, you know, next token predictor.
[00:07:11] It was trained on Databricks, actually, with the Databricks research team, but it's not a language model.
[00:07:16] It doesn't predict the next word.
[00:07:18] It predicts gene regulatory networks.
[00:07:21] So it can predict which genes in the body are actually causal for diseases.
[00:07:28] So it can predict, okay, if you change this cell, then this will happen with the next cell.
[00:07:32] This cell is just a causal cell.
[00:07:34] This cell is, you know, just a reactive cell.
[00:07:36] So this helps speed up drug development significantly.
[00:07:40] So over 100 million cells that it's been trained on.
[00:07:42] So it's super, super cool.
[00:07:44] So truly inspiring AI use case.
[00:07:47] So AI is top of mind for all of us.
[00:07:50] We're here at the Data and AI Summit.
[00:07:52] So I want to have a little bit of audience participation here today.
[00:07:56] Okay.
[00:07:57] So is AGI here yet today?
[00:08:01] So that's a question I have for the audience.
[00:08:03] I want to ask everybody, raise your hands if you think AGI is already here.
[00:08:07] That we already have AGI.
[00:08:09] So let's see, raise your hands.
[00:08:12] Okay, I'm going to do like a statistical estimation here.
[00:08:15] I'm going to say 5% at least in this area.
[00:08:18] I can't see all the way down there, okay, are raising their hands saying that AGI is here.
[00:08:22] So I would say like 90 to 95% of you say that AGI is not here.
[00:08:27] Okay, so I have a question for you then.
[00:08:30] Okay, which one of you can solve this?
[00:08:33] So the question is, compute the reduced 12-dimensional spin bordism of the classifying space of the Lie group G2.
[00:08:39] Okay, so audience participation, you can just shout it out if you know the answer.
[00:08:44] Okay, actually you can come up here and just share the answer with us if you know the answer.
[00:08:47] Okay, how many know the answer?
[00:08:49] Can we see?
[00:08:49] Raise your hands.
[00:08:52] Oh, there's like one person.
[00:08:53] Okay, I should get them on stage.
[00:08:55] Okay, all the frontier agents and AIs today can solve this.
[00:09:01] Okay, so this is like a simple question.
[00:09:02] This is not a hard question.
[00:09:03] It's a hard question for us.
[00:09:05] But for the frontier agents, this is trivial.
[00:09:09] It's part of actually Humanity's Last Exam, which is a benchmark that they're all really focused on.
[00:09:16] That benchmark has 2,500 questions.
[00:09:19] They look kind of like that, but in all kinds of categories.
[00:09:22] You know, it could be like image recognition or various kinds.
[00:09:25] And today, all of them pretty much can solve half of those questions.
[00:09:30] So we believe that AGI is already here.
[00:09:33] Okay, so artificial general intelligence has already arrived.
[00:09:37] In fact, I came to the United States to do research at UC Berkeley at the lab called the AMPLab,
[00:09:43] where the A was for algorithms and AI.
[00:09:45] It was the largest AI lab at this time, 2009.
[00:09:48] And actually, this conference came out of software that we built at that lab.
[00:09:51] And the definition of AGI that we had back then, we've already satisfied that by leaps and bounds.
[00:09:57] In fact, I went back and I asked many of the people that were in the lab,
[00:10:00] what do you think?
[00:10:01] Do we have, have we passed AGI or not, the way we defined it back then in 2009?
[00:10:05] And everybody says, yeah, of course.
[00:10:07] Of course, we did, the way it was defined back then.
[00:10:09] But, you know, goalposts are changing and so on.
[00:10:12] So AGI is already here.
[00:10:13] It's plenty smart.
[00:10:15] We don't need it to be more intelligent.
[00:10:18] AI does not have an intelligence problem right now.
[00:10:20] These are not the kind of problems that we need inside of our organizations.
[00:10:23] The problem we have is that AGI is not really completely permeating our organizations.
[00:10:32] So inside of your companies, you don't each have hundreds of agents working for you,
[00:10:37] doing autonomous work, collaborating with each other, coming up with proposals,
[00:10:41] negotiating, and then we're just managing agents.
[00:10:44] Most of us are just using chatbots and asking them.
[00:10:47] Or maybe we're using coding tools and they're writing code for us and we're using it agentically like that.
[00:10:52] But we're not actually sort of fully realizing that vision that everybody's talking about.
[00:10:56] And that's why maybe many of you didn't raise your hand when I said, is AGI here?
[00:11:01] So the question is, how do we actually enable this at work?
[00:11:05] So that's what the next couple of days will be about.
[00:11:08] We want to figure out how do we enable all of you to sort of lead this revolution.
[00:11:13] It's basically our jobs here to figure out how can we actually make this AI work.
[00:11:17] That's what you all are doing.
[00:11:19] So that's what we're focused on.
[00:11:20] So in a nutshell, what we want to do is we have all these AIs.
[00:11:23] They've been released by many of the vendors.
[00:11:25] There's open source.
[00:11:26] There's frontier models.
[00:11:28] How do we figure out to take all the data, all the processes in your organization,
[00:11:33] everything that you have in your head, and provide it as context to the AI?
[00:11:38] If we can do that, we think that the AI can already solve many of the tasks that we want it to solve.
[00:11:44] So if we just give the context to these very smart models that can solve those super hard problems,
[00:11:50] they'll be able to do amazing things inside of our companies.
[00:11:53] Most of the tasks we're doing are not about solving 12-dimensional spin bordisms.
[00:11:58] They're about getting the information out of Salesforce, summarizing it, putting it in another place,
[00:12:03] going through it, preparing it, and those kind of tasks.
[00:12:05] So you just need to get all that context.
[00:12:08] So it's as simple as that.
[00:12:10] But getting this context into the AI is harder than one could imagine.
[00:12:15] It's actually really, really difficult.
[00:12:17] And getting the perfect enterprise context is elusive.
[00:12:20] So this is the problem we've been focused on, and we're really focused on this problem at Databricks.
[00:12:24] So context is actually itself many other things.
[00:12:27] It's not just saying context, how do we give it to the AI.
[00:12:31] Context means how do we get all the data that you have in your organization and actually connect it to the AI.
[00:12:38] Data could be, what are all the meeting notes?
[00:12:40] So every meeting that happens in the organization should be recorded, and we should get the transcripts of that.
[00:12:45] And we need to reorganize our processes so that we can do that.
[00:12:48] And actually, at Databricks, we have forward deployed engineers that come and help reorganize things like that.
[00:12:52] But it also means connecting to data silos that you have, or data that's sitting tucked away,
[00:12:58] and you don't have any access to it because maybe security has not approved it.
[00:13:03] So this is step number one, how do we get your data AI ready so that we can actually give it to the agents.
[00:13:09] Okay, but let's say we figure this out, and we figure out how to get access to all the data.
[00:13:13] Then we're still not done.
[00:13:15] The context problem isn't solved because we could just take OpenClaw and just unleash it on all of your data.
[00:13:21] Okay?
[00:13:21] When OpenClaw was released, over 10% of the skills in it were actually malicious.
[00:13:27] Okay, so step number two, how do we do this securely?
[00:13:31] How do we do that with control?
[00:13:32] Okay, so first is context.
[00:13:34] Now, how do we do this with control?
[00:13:36] So that's really, really important.
[00:13:37] If we're going to have agents roaming around doing various things, we have to make sure that they are satisfying our security policies,
[00:13:44] that we can audit them, and that they don't do something that we don't want them to do.
[00:13:49] Okay, let's say we solve this as well.
[00:13:51] It's still really hard.
[00:13:53] If we just have agents running around, for-looping, like, checking through all of the data,
[00:13:58] then it's going to become extremely costly.
[00:14:00] So we have to also solve the cost problem.
[00:14:02] Okay?
[00:14:03] This is sort of just shooting up at most of our customers.
[00:14:05] We're seeing the cost just completely run away.
[00:14:08] Uber CEO said that in one quarter, they blew through the whole year's annual AI budget.
[00:14:13] Okay?
[00:14:14] If this continues, it's completely unsustainable for most organizations out there.
[00:14:19] You cannot have your costs completely outrun your revenue.
[00:14:23] You will go bankrupt.
[00:14:24] Okay?
[00:14:25] And your companies will not allow that.
[00:14:26] So they'll just shut it down or stop it.
[00:14:28] Okay?
[00:14:28] So it's just infeasible.
[00:14:30] So we have to also solve the cost problem.
[00:14:32] But even if we solve all of these things, including the cost, we get to the most important thing, in my opinion,
[00:14:38] which is how do we make sure that we can do that without lock-in?
[00:14:42] How can we do that in a way where you have choice?
[00:14:45] Okay, this is super important.
[00:14:46] This is the hardest one because some of you are representing organizations that are less than 10 years old.
[00:14:52] So this maybe doesn't apply to you.
[00:14:54] But most of the organizations in the room, over 90% of you represent companies who have a long history.
[00:14:59] They've been around for 20, 30, 50, 100 years, some of them.
[00:15:03] And the thing I've seen in the last 13 years is that the stack that you have at your companies is just very complicated.
[00:15:11] Over the years, you keep buying new software.
[00:15:14] You keep adding more things to that stack.
[00:15:16] And you just get locked into all those vendors.
[00:15:18] And they come and go, but the software remains.
[00:15:20] And there's always someone who wants that piece of software in your stack, so it's hard to rip it out.
[00:15:24] So how do we do this without getting locked in both to the data, the context, but most importantly also now to the AI?
[00:15:30] We don't want to get locked in just to one particular AI.
[00:15:33] We want to have choice when it comes to the AI.
[00:15:35] So that's what we're focused on today.
[00:15:37] My whole keynote is just focused on how do we solve these four.
[00:15:40] Okay?
[00:15:41] So I hope I can convince you today that we're making some steps towards solving these four.
[00:15:47] I'm not going to say we've completely solved it.
[00:15:49] Context is solved.
[00:15:50] But at least we're making some big leaps towards solving these four things.
[00:15:54] So the four things we're focused on is we want you to have choice and not get locked in.
[00:15:58] We want to be able to get the security and the control and the cost control as well as the context
[00:16:04] and then get that to the AI.
[00:16:05] And that way we can have more impact together.
[00:16:08] Okay.
[00:16:08] So how do we do that?
[00:16:09] So let's start with choice.
[00:16:12] Okay?
[00:16:12] It started with your data, as I mentioned.
[00:16:14] So we have all the data and we want to get that data in an AI-ready form.
[00:16:19] So step number one is we've actually made huge progress just in the last year to get that data
[00:16:25] and move it into an open lakehouse.
[00:16:28] So open infrastructure on top of any of the clouds and we get your data ready there.
[00:16:33] That's what Lakeflow does.
[00:16:34] So Lakeflow has actually been developing a lot of connectors over the last year.
[00:16:39] So two years ago when I was here, we barely had any connectors.
[00:16:43] And today we now have over 100 connectors.
[00:16:46] So Lakeflow can now connect to all these different systems that you have.
[00:16:48] It's not just Salesforce, Workday, NetSuite.
[00:16:50] It can pull data from Meta or Google Analytics or from any of your docs or any of the corpus.
[00:16:56] And if there's any data source that you do not have, please let us know.
[00:16:59] We can help you get those into the open lake.
[00:17:04] Okay.
[00:17:05] And it's not just getting the data in there.
[00:17:06] We can do it in many different modalities.
[00:17:08] So three things that I would like to highlight.
[00:17:10] One is we released ZeroBus last year.
[00:17:12] It is now GA.
[00:17:13] Then actually we've added a lot of capabilities to it.
[00:17:15] So what does ZeroBus do?
[00:17:16] So as the name implies, you don't need a bus anymore for the data infrastructure.
[00:17:22] So instead of using something like Apache Kafka, you can just hit this API of ZeroBus and you can give it data.
[00:17:29] As it comes in, it could be bursty.
[00:17:30] It could be tiny rows.
[00:17:32] You don't need to worry about having millions of files or any buffering or anything like that.
[00:17:36] ZeroBus just takes care of taking all those entries that you hit it with at a very high rate.
[00:17:40] And it just makes sure that those appear in your open lake in a way that you can probe it later.
[00:17:47] Okay.
[00:17:47] Scalable, manageable.
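The point about tiny, bursty rows not turning into millions of files can be illustrated with a minimal sketch. This is not the ZeroBus API; `RowBuffer`, `max_rows`, and the in-memory `batches` list are hypothetical names used only to show the general idea of coalescing many single-row writes into a few larger batches before they land in storage.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RowBuffer:
    """Toy ingest buffer: accepts single rows, emits them as larger batches."""

    max_rows: int = 1000  # hypothetical flush threshold
    _pending: list[dict[str, Any]] = field(default_factory=list)
    # Each entry stands in for one file written to the lake.
    batches: list[list[dict[str, Any]]] = field(default_factory=list)

    def ingest(self, row: dict[str, Any]) -> None:
        """Accept one row; flush automatically once the threshold is reached."""
        self._pending.append(row)
        if len(self._pending) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        """Write out whatever is pending as a single batch."""
        if self._pending:
            self.batches.append(self._pending)
            self._pending = []


buf = RowBuffer(max_rows=1000)
for i in range(2500):
    buf.ingest({"event_id": i})
buf.flush()  # push out the final partial batch

print(len(buf.batches))               # 3
print([len(b) for b in buf.batches])  # [1000, 1000, 500]
```

In this sketch, 2,500 individual writes become three batches rather than 2,500 small files. A real service would also flush on a timer and persist data durably, which this toy version leaves out.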
[00:17:50] Okay.
[00:17:50] So that's great.
[00:17:52] And then last year we announced Spark Realtime Mode.
[00:17:56] Spark Realtime Mode, or RTM as we call it, is an open source framework that lets us actually get down to tens of milliseconds.
[00:18:03] So this is really amazing.
[00:18:04] So we can do now operations super fast.
[00:18:07] Historically, Spark was great for everything, but you couldn't really get down to ten milliseconds.
[00:18:12] Okay.
[00:18:12] So the Realtime Streaming would always take this micro batch of one second or so.
[00:18:17] But with Realtime Mode, we can now really make that super fast.
[00:18:20] So many of you, for your streaming infrastructure, you might have Apache Flink.
[00:18:25] That was the only option that many of you would pick.
[00:18:28] But now you can actually do a lot of those workloads directly in Apache Spark with Realtime Mode, which is open source.
[00:18:34] Okay.
[00:18:34] And the final one is Lakeflow Designer.
[00:18:37] I personally actually really like doing all of my data engineering and data movement using Lakeflow Designer.
[00:18:43] So the way Lakeflow Designer works is that it kind of looks like Alteryx.
[00:18:48] It's visual.
[00:18:49] You don't need to see any code.
[00:18:50] But you speak to the AI, which is Genie, and you tell it, hey, I want to do this transformation.
[00:18:54] I want to join these two tables, and it just visually shows you what that looks like, and then you approve it.
[00:19:00] But under the hood, it's all just Spark, open source Spark, and you can actually inspect the code.
[00:19:06] You can version control it.
[00:19:07] You can make it production ready.
[00:19:09] So that's super awesome.
[00:19:11] Okay.
[00:19:11] So those are three things.
[00:19:12] They're all GA'd.
[00:19:14] So you can now get all this data, use any of these modalities, all these connectors from these different systems, and you can get them into, you know, well, which format should you pick?
[00:19:23] Should you pick Delta Lake or should you pick Apache Iceberg?
[00:19:27] Okay.
[00:19:27] So that's a big question, and actually, I want to welcome to stage the original creator of Apache Iceberg, Ryan Blue, who was the CEO, co-founder of Tabular that we acquired last year.
[00:19:37] So let's welcome him on stage.
[00:19:48] Hey, Ryan, how are you?
[00:19:49] Good.
[00:19:50] How are you?
[00:19:50] Good.
[00:19:51] I want to embarrass you with a video from a bunch of years ago.
[00:19:54] So this video actually had a very interesting title.
[00:19:57] That said, why you shouldn't care about Apache Iceberg.
[00:20:00] I mean, listen to it.
[00:20:01] That's something that I care deeply about.
[00:20:03] And I sincerely mean that.
[00:20:05] You shouldn't care about Apache Iceberg.
[00:20:09] Okay.
[00:20:10] We shouldn't care about Apache Iceberg.
[00:20:11] Actually, I agree.
[00:20:12] I think we shouldn't also care about Delta Lake.
[00:20:14] But why did you say that?
[00:20:16] No one should have to care about formats.
[00:20:18] You should be able to use whatever tools you want with all of your data, use the right tool for the job, which is why we built both formats.
[00:20:27] In the beginning.
[00:20:29] Yeah.
[00:20:29] In fact, it's super funny.
[00:20:31] A few years ago, you know, I would get to customer meetings and they would get into the nitty gritties of how Iceberg was implemented versus Delta.
[00:20:38] And they would ask us, like, yeah, but can you do this, you know, deletion vectors this way and that way.
[00:20:42] And I was like, this is not where the industry should be going.
[00:20:44] So how are we doing?
[00:20:45] Where are we?
[00:20:46] What's going on with the versions and all of that?
[00:20:48] Well, we just did the GA release of Iceberg V3 support, managed Iceberg tables in DBR.
[00:20:56] That is going really well.
[00:20:58] We're following that up later this year with the unified metadata layer that is going to be built into Delta 5 and Iceberg V4.
[00:21:07] So we are very close to the full unification vision.
[00:21:11] So what's special about V3?
[00:21:13] V3 is the unified data layer.
[00:21:15] So you no longer have to rewrite any data files to share them across Delta and Iceberg tables.
[00:21:21] That's awesome.
[00:21:21] So basically, the data that's laid out, whether it's Iceberg or Delta, looks identical, the big data now, right?
[00:21:26] Exactly.
[00:21:27] Yes.
[00:21:27] Awesome.
[00:21:28] And then with V4, just a little bit of metadata that's left.
[00:21:31] And hopefully, when do we get that?
[00:21:33] We're aiming for later this year.
[00:21:35] I don't want to make too many promises, but I think that we should finish the spec sometime in maybe Q4.
[00:21:42] Awesome.
[00:21:42] Super excited about this.
[00:21:43] Great job by Ryan and the whole community, Iceberg community, Delta community.
[00:21:48] You shouldn't have to care about this.
[00:21:49] Can we stop talking about this now?
[00:21:51] Yeah.
[00:21:51] I've got work to do if we're done.
[00:21:53] All right.
[00:21:54] Thank you.
[00:21:55] All right.
[00:22:02] So it doesn't really matter.
[00:22:03] Actually, if you're using Databricks, it doesn't matter if it's in Delta or Iceberg.
[00:22:06] We don't care.
[00:22:07] It's actually the same now.
[00:22:09] Thanks to V3, it's the same data format that's on the disk.
[00:22:12] Okay.
[00:22:12] Perfect.
[00:22:13] So that's that.
[00:22:14] So now we're done with how you can actually get your data into this open format.
[00:22:17] It's really seamless.
[00:22:18] Just one way of doing it.
[00:22:20] It's all Spark-based under the hood.
[00:22:22] So let's move on and look at how we actually then start doing analytics on that data.
[00:22:27] So the data warehouse that we announced about four or five years ago, we call it Lakehouse.
[00:22:32] In fact, our tagline was the best data warehouse is the Lakehouse because it's open, so you control
[00:22:38] the data yourself, and it works with machine learning workloads, and we made huge progress
[00:22:42] on our Lakehouse in the last year, since the last Data + AI Summit.
[00:22:46] We've added over 110 features that legacy data warehouses have so that we can more easily
[00:22:52] migrate those over.
[00:22:54] We also released a lot of AI functions.
[00:22:57] This is the main thing that people want with Lakehouse.
[00:23:00] So there are all these AI functions you can just call.
[00:23:02] So you have your data.
[00:23:04] Maybe they're in tables.
[00:23:05] That's the kind of thing you work with in data warehousing.
[00:23:08] But now you can just, in your SQL queries, call these SQL AI functions, and you can say,
[00:23:14] can you do sentiment analysis?
[00:23:16] Tell me for each of these cells, what do people think?
[00:23:18] Or you can tell, hey, for each of these cells, can you extract for me the company names that
[00:23:22] you see in it?
[00:23:23] Or can you do a prediction of what this number is going to be?
[00:23:25] And it can just do that at really high scale.
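The pattern described here, calling a function from inside a SQL query so that it runs once per row, can be sketched with Python's built-in `sqlite3` module. This is only an analogy: `toy_sentiment` is a hypothetical keyword-matching stand-in registered as a user-defined SQL function, not a Databricks AI function and not a model call, and the `reviews` table is made up for the example.

```python
import sqlite3


def toy_sentiment(text: str) -> str:
    """Keyword stand-in for a model call; a real AI function would invoke a model."""
    lowered = text.lower()
    if any(word in lowered for word in ("love", "great")):
        return "positive"
    if any(word in lowered for word in ("hate", "broken")):
        return "negative"
    return "neutral"


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call it by name.
conn.create_function("toy_sentiment", 1, toy_sentiment)

conn.execute("CREATE TABLE reviews (id INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [
        (1, "I love this product"),
        (2, "It arrived broken"),
        (3, "It is a box"),
    ],
)

# The function is applied to every row, directly inside the query.
rows = conn.execute(
    "SELECT id, toy_sentiment(body) FROM reviews ORDER BY id"
).fetchall()
print(rows)  # [(1, 'positive'), (2, 'negative'), (3, 'neutral')]
```

The query stays ordinary SQL; the only new piece is the function call in the `SELECT` list, which is what makes this style approachable for people who already work with tables.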
[00:23:27] So this is the most popular thing that people do actually on Lakehouse.
[00:23:31] And we've taken Lakebridge, which is an automated converter that lets you migrate from existing
[00:23:36] data warehouses that you might have on-prem or elsewhere in the cloud.
[00:23:39] And we can do that now really, really well.
[00:23:41] So very excited about that.
[00:23:42] And actually, you all are using this.
[00:23:45] So we've more than doubled the consumption of our data warehouse just in the last year.
[00:23:49] So we're very excited about this.
[00:23:51] But what I'm really excited about is my co-founder, Reynold, getting on stage and talking about
[00:23:56] the new engine.
[00:23:58] Okay, so we have a new engine.
[00:23:59] He's going to talk about it.
[00:24:00] It's called Raynald.
[00:24:01] And what's really cool about Raynald is that it's the world's fastest engine that we've
[00:24:06] ever seen.
[00:24:06] We've benchmarked it against anything else.
[00:24:08] We've gotten the latency of it down to just tens of milliseconds.
[00:24:12] So far faster than anything we had done before.
[00:24:15] So it's very, very exciting for us.
[00:24:17] So we'll have him in a couple of talks in the keynote today talking about this.
[00:24:23] Okay, so that's our data warehouse.
[00:24:24] That's the Lakehouse.
[00:24:26] What about transaction processing?
[00:24:28] So I mentioned in the beginning, we really love Postgres.
[00:24:31] We've really embraced Postgres.
[00:24:32] We think AIs today really prefer Postgres.
[00:24:36] If you ask any of the AIs today, what kind of database should I pick?
[00:24:39] They will say open source Postgres.
[00:24:41] Okay.
[00:24:41] But we're excited about what we call Lakebase, which is a way in which you take Postgres and
[00:24:48] you separate it out so that you have all of its storage just sit on cheap lakes.
[00:24:53] Okay.
[00:24:54] So now we're talking Lakebase.
[00:24:55] So this is OLTP, right?
[00:24:56] It's like MySQL, Oracle, transaction processing.
[00:25:00] And once you have that data sitting on the lake, you can now start doing amazing things on top.
[00:25:04] So what can we do on top?
[00:25:06] Well, we have a little animation here.
[00:25:07] Where if you look, you can get serverless auto-scaling.
[00:25:10] So you can scale up and down Postgres.
[00:25:13] And this is really fast.
[00:25:15] So you can do it in less than a second.
[00:25:16] You can scale it even down to zero.
[00:25:18] That's really awesome.
[00:25:19] Because if nobody's using it in the middle of the night, you don't have to pay for it.
[00:25:22] You just pay for cheap storage that's sitting in the lake.
[00:25:25] Okay.
[00:25:25] This was not possible before.
[00:25:27] You know, others had specialized storage.
[00:25:29] But this is just sitting in open source format on your lake.
[00:25:33] So why do you care about that?
[00:25:34] Why should your company care about that?
[00:25:35] Because it significantly cuts the cost of running databases down.
[00:25:40] So you get significant reduction in TCO by doing this kind of auto-scaling.
[00:25:44] So that's a game changer.
[00:25:45] But that's actually not even the coolest feature of Lakebase Postgres.
[00:25:49] The best thing about Lakebase Postgres is that it has this feature that agents really love.
[00:25:54] And that feature is called branching.
[00:25:56] So it kind of looks like that.
[00:25:57] I don't know.
[00:25:58] That's our attempt.
[00:25:59] To animate and visualize that.
[00:26:01] But what that means is that you can take a gigantic database, petabytes, and you can say, hey, can you make a clone of this for me?
[00:26:07] And it will, in less than a second, say done.
[00:26:09] What it's doing is not actually copying the whole database.
[00:26:12] It's just keeping track.
[00:26:13] The data is anyway sitting on the lake.
[00:26:15] It'll just say, okay, we have two copies.
[00:26:16] But it's actually just one copy.
[00:26:17] And then when you make a change to one, it'll just keep track of, like, okay, this change is in this copy.
[00:26:22] This other change doesn't exist in the first copy.
[00:26:25] So it's what's called copy-on-write.
[00:26:27] So it actually does that for you.
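The copy-on-write branching described above can be illustrated with a minimal Python sketch. This is a toy in-memory model under stated assumptions, not how LakeBase is implemented: the `Branch` and `_Layer` classes and their methods are hypothetical names, and deletes are left out for brevity.

```python
class _Layer:
    """Frozen set of writes shared by every branch forked after it."""

    def __init__(self, data, parent):
        self.data = data      # dict of key -> value, never mutated again
        self.parent = parent  # older frozen layer, or None


class Branch:
    """Toy copy-on-write branch (hypothetical, for illustration only)."""

    def __init__(self, base=None):
        self._base = base     # shared, frozen history
        self._changes = {}    # private writes made on this branch

    def branch(self):
        # Constant-time fork: freeze the current private writes into a
        # shared layer and hand the new branch a pointer to it.
        # No existing data is copied.
        if self._changes:
            self._base = _Layer(self._changes, self._base)
            self._changes = {}
        return Branch(self._base)

    def put(self, key, value):
        # Writes land only in this branch's private dict.
        self._changes[key] = value

    def get(self, key, default=None):
        # Look at private writes first, then walk the shared history.
        if key in self._changes:
            return self._changes[key]
        layer = self._base
        while layer is not None:
            if key in layer.data:
                return layer.data[key]
            layer = layer.parent
        return default


prod = Branch()
prod.put("order:1", "paid")
dev = prod.branch()              # instant "clone" of prod
dev.put("order:1", "refunded")   # change is visible only in dev
prod.put("order:2", "paid")      # change is visible only in prod
```

After the fork, `dev` and `prod` share the frozen layer and each records only its own later writes, which is why the fork cost does not depend on the size of the database.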
[00:26:29] Why does that matter?
[00:26:30] Well, it matters because it actually completely changes the way you organize databases inside of an enterprise.
[00:26:35] Most enterprises have thousands and thousands of database instances that they're running.
[00:26:39] And we have DBAs that are running around trying to manage all of these.
[00:26:42] And the way you do that, the reason you have thousands of these and you pay for all these thousands of copies of the data that are siloed inside of the enterprise is because you need to have a user acceptance testing database.
[00:26:53] You need to have production database one, production database number two.
[00:26:57] You need to have staging database, R&D database.
[00:26:59] So you have all these copies, okay?
[00:27:01] But branching, what it enables is to just have one instance.
[00:27:04] And then you just have different branches of it.
[00:27:06] And that's just lower TCO, easier to manage.
[00:27:10] And agents love that because agents love to just branch out and just experiment with the data, try something out.
[00:27:17] And they want to do it quickly, so that one-second latency is very important for them.
[00:27:20] They don't want to wait 10 minutes on a database to come up.
[00:27:23] And then, finally, if something goes wrong, they want to be able to go back to an earlier snapshot.
[00:27:28] If the data gets corrupted or the agents did something by mistake, they should be able to go back to an earlier version.
[00:27:33] So this is really awesome.
[00:27:36] So please try this out.
[00:27:37] If you're doing any development going forward, please use LakeBase Postgres.
[00:27:42] In the next 12 months, we're going to see more software be written than ever in the history of mankind, okay?
[00:27:49] All that software that your organizations are going to write using LLMs and coding tools needs a database behind the scenes.
[00:27:55] Let's make sure it's just open source LakeBase Postgres, with low cost.
[00:27:59] That's what these agents love.
[00:28:01] But today, we have another talk that I'm really excited about.
[00:28:03] And I'm not going to talk too much about it here.
[00:28:05] But Reynold's going to come back.
[00:28:07] And he's going to talk about something we call LTAP, okay?
[00:28:09] LakeBase and LakeHouse Unification.
[00:28:11] So it's like a breakthrough that the industry has actually been working on for 40 years.
[00:28:15] We think we finally pulled it off.
[00:28:17] So he's going to come and talk about that as well.
[00:28:19] Okay.
[00:28:20] So we talked about LakeFlow to get the data in.
[00:28:22] We talked about LakeHouse, which is a data warehouse to do analytics.
[00:28:25] We talked about LakeBase, which is transaction processing and the talks that we're going to have on that.
[00:28:30] So hopefully, we now have the choice, the data layer at the bottom ready for the AI, okay?
[00:28:36] So it can be consumed fast by the agents.
[00:28:39] And it's seamless for them.
[00:28:40] And they can experiment with the data.
[00:28:41] So this is really important.
[00:28:42] We've got to get that data layer right.
[00:28:44] But as I mentioned, we also need to figure out control, okay?
[00:28:48] So we can't just unleash the agents on all of the data like that.
[00:28:51] We need to make sure that we're controlling what they're doing.
[00:28:53] So that's why we started the Unity Catalog project a bunch of years ago.
[00:28:56] When we started that project, what the industry was doing is that they were having catalogs for structured tables.
[00:29:03] So data warehouses, we're looking at structured tables, rows and columns, and they were saying, okay, well, this column is somebody's salary information.
[00:29:10] So can we hide that?
[00:29:11] So that's what this was about.
[00:29:12] Access control on structured data like that.
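The salary-column example above amounts to column-level access control. A minimal sketch of the idea in Python follows; the policy table, the `read_rows` function, and the group names are all hypothetical and do not reflect the Unity Catalog API.

```python
# Hypothetical policy: column name -> groups allowed to see raw values.
COLUMN_POLICY = {"salary": {"hr"}}


def read_rows(rows, user_groups, policy=COLUMN_POLICY, mask="***"):
    """Return copies of the rows with restricted columns masked."""
    user_groups = set(user_groups)
    result = []
    for row in rows:
        visible = {}
        for column, value in row.items():
            allowed = policy.get(column)
            # Columns without a policy entry are visible to everyone.
            if allowed is None or user_groups & allowed:
                visible[column] = value
            else:
                visible[column] = mask
        result.append(visible)
    return result


employees = [{"name": "Ada", "salary": 100}]
as_engineer = read_rows(employees, ["eng"])  # salary is masked
as_hr = read_rows(employees, ["hr"])         # salary is visible
```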
[00:29:15] But we called it Unity because we think you shouldn't just do control on structured data.
[00:29:21] You should do it on all the data assets that you have in an organization, any data and AI assets, okay?
[00:29:26] So you could have unstructured files, PDFs, you might have AI models, and 50 other asset types that we added over the years.
[00:29:33] We want to be able to do it with all of these, okay?
[00:29:35] So that's idea number one for Unity Catalog.
[00:29:38] Number two, you don't want to just do access control, where you lock everything down and you're done.
[00:29:44] You want to also be able to do discovery and democratize the data in the organization.
[00:29:47] You want to also be able to track the lineage of the different data sets.
[00:29:50] Where does this data come from?
[00:29:51] And visualize that graph.
[00:29:53] You want to be able to do cost controls on everything that's being done because otherwise the costs will just run away.
[00:29:59] You want to be able to look at the data quality of everything you have in your organization.
[00:30:03] So these are all the things that are part of Unity Catalog, and it's extremely important.
[00:30:06] That's why we never actually charge for Unity Catalog.
[00:30:09] Unity Catalog is just free, and we actually open sourced it a few years ago, and we're doubling down on open sourcing more.
[00:30:13] In the last month, we added more people that are open sourcing pieces of it.
[00:30:16] So it's moving very, very fast.
[00:30:17] We're excited about that.
[00:30:19] And also, we announced that you want to be able to share data.
[00:30:22] So this was really important for us.
[00:30:25] The rest of the industry, they were doing this kind of locked-in sharing, where they were saying, buy my software, and then if you want to share it with someone else, they should buy also my software, and they can share data sets between it.
[00:30:34] That's why we open sourced Delta Sharing a bunch of years ago as part of the Delta Lake project.
[00:30:39] And now you can take any data assets, and you can share them with anyone in the ecosystem.
[00:30:43] It's had great growth, lots of partners in that.
[00:30:45] But what we're very excited about this year is that we're actually open sourcing a new project that's a superset of Delta Sharing.
[00:30:51] We call it Open Sharing.
[00:30:53] And what this enables you to do is share not just with Delta, but also with Iceberg, because they're now the same format, but also not just share data assets, but also AI assets.
[00:31:02] Okay, so now you can share your agents, your skills, and your models, and you can do that also on-premise.
[00:31:09] Okay, so we're very excited about that.
[00:31:11] This is very cool.
[00:31:11] Okay, so we got all these different things you can do in Unity Catalog.
[00:31:20] We have Open Sharing.
[00:31:21] But in the last year, what has happened is that AI governance, which has always been part of Unity Catalog, is becoming very complicated in most organizations.
[00:31:32] Okay, so we're seeing that organizations now have lots of agents.
[00:31:37] You're all building agents, maybe using LlamaIndex, maybe using LangChain.
[00:31:41] The SaaS providers are now each providing you an agent that they want you to use.
[00:31:46] They have agent builders that they want you to use.
[00:31:47] They have their own MCP servers that the SaaS vendors are saying, please use our MCP servers.
[00:31:51] Here are the skills you can use with those.
[00:31:53] Your organizations, your developers, are themselves developing MCP servers.
[00:31:57] They're themselves developing skills files.
[00:31:59] There are all these models that are being developed now by the frontier vendors.
[00:32:04] There's many different versions of them.
[00:32:05] They all have different capabilities.
[00:32:07] You know, they exist.
[00:32:08] They don't exist.
[00:32:09] There's coding tools out there, vibe coding things.
[00:32:12] And then now we're starting to see more and more co-work-like agents in the organization.
[00:32:16] So it's kind of a quagmire.
[00:32:18] It's a complete mess.
[00:32:20] Okay.
[00:32:20] So what is the problem with this?
[00:32:24] So problem number one is that actually cost is running away.
[00:32:26] I think this is the biggest problem right now.
[00:32:28] Anyone we talk to, they're really worried.
[00:32:30] The costs of this are just skyrocketing.
[00:32:33] And no one has visibility.
[00:32:35] Is it someone's agents?
[00:32:36] Which model is it?
[00:32:37] What were they doing?
[00:32:38] So all this token maxing is happening.
[00:32:40] And there's no way to actually set limits on everything holistically or rate limit it.
[00:32:44] So that's problem number one.
[00:32:46] Problem number two is that there's no way to control what the agents are doing.
[00:32:49] How do we make sure that the agents actually have access to the appropriate data, that we can audit them, that we have guardrails for their inputs and outputs?
[00:32:58] And how do we manage their identities?
[00:32:59] Who are they acting on behalf of?
[00:33:02] How do we track this?
[00:33:03] And then finally, and this is the most important one, as I said earlier, how do we do this with choice?
[00:33:07] Without lock-in?
[00:33:08] Because the state-of-the-art models now have a lifespan of one month.
[00:33:12] November last year, we were super excited when Gemini came out from Google.
[00:33:16] And it was the best model.
[00:33:17] And we all were switching over to it.
[00:33:19] And we said, okay, that's it.
[00:33:20] This is the best model ever.
[00:33:22] Then February timeframe, January, February timeframe, it was like, clearly Opus is the best ever.
[00:33:26] And then OpenAI came out with GPT-5.5, and that was the greatest model ever.
[00:33:30] And then last week, we had Mythos, or Fable, and it was out, and we were all going to switch to it.
[00:33:35] But now they canceled it, and then, you know, it's back and forth.
[00:33:38] So how do we make sure that we can actually have flexibility so that we can pick any of them, and that we will have choice here?
[00:33:46] Okay?
[00:33:47] So those are three problems that are really important.
[00:33:49] And that's why we're super excited to announce, as part of Unity Catalog, we are breaking out one place where you can manage
[00:33:56] and control all of your agents, all of your AI spend.
[00:34:00] We call it Unity AI Gateway.
[00:34:02] Okay?
[00:34:03] So excited about this.
[00:34:08] So what is Unity AI Gateway?
[00:34:10] It's a single pane of glass.
[00:34:12] And there, you can actually have your one entry point for all of your agents, for all of your AI, for everything that happens.
[00:34:21] Any of the harnesses that you're using can go through this.
[00:34:24] Again, this is part of Unity Catalog, which we open-sourced.
[00:34:27] We don't charge for this.
[00:34:28] And also, the AI Gateway is part of MLflow, so it's also open-sourced.
[00:34:33] But one of the most important things that we do that not maybe everyone in the room knows is that we will actually provide you capacity for the frontier models.
[00:34:40] So let's say you committed to pay $100,000 to Databricks.
[00:34:44] You might use Lakeflow for that to consume things.
[00:34:47] You might use our Lakehouse.
[00:34:48] But you can also just use tokens for that $100,000 directly from us on OpenAI's models or Anthropic's models or Gemini.
[00:34:57] And you can do this on any cloud, whether you're on Azure, GCP, or AWS, you can just use this.
[00:35:01] So this is available on the platform.
[00:35:03] So we actually offer that.
[00:35:05] Two, you can actually get perfect observability on all of the spend inside the organization with dashboards that give you the observability of how that spend is happening.
[00:35:14] And you can set limits.
[00:35:16] You can set budgets.
[00:35:17] You can set budgets on different groups, subgroups, all the way down to the individual level.
[00:35:22] And when someone exhausts that budget, you will get an alert.
[00:35:26] Or you can actually just stop it and rate limit it so that you can make sure that nobody's spending too much.
[00:35:31] Or that they every day get an email saying how much they've spent.
[00:35:34] So this is super important.
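The budget behavior described above (limits on groups, subgroups, and individuals, with an alert or a hard stop) can be sketched as a chain of nested budgets. This is only an illustration under assumptions: the `Budget` class, the 80% alert threshold, and the dollar figures are hypothetical, not the Unity AI Gateway API.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push any level over its limit."""


class Budget:
    """Hypothetical nested budget: individual -> subgroup -> group."""

    def __init__(self, name, limit_usd, parent=None, alert_at=0.8):
        self.name = name
        self.limit_usd = limit_usd
        self.parent = parent
        self.alert_at = alert_at  # fraction of the limit that triggers an alert
        self.spent_usd = 0.0
        self.alerts = []

    def _chain(self):
        # Yield this budget and every ancestor above it.
        node = self
        while node is not None:
            yield node
            node = node.parent

    def charge(self, cost_usd):
        # Check every level first, so a rejected request records nothing.
        for node in self._chain():
            if node.spent_usd + cost_usd > node.limit_usd:
                raise BudgetExceeded(f"{node.name} budget exhausted")
        # Record the spend at every level and alert near the limit.
        for node in self._chain():
            node.spent_usd += cost_usd
            if node.spent_usd >= node.alert_at * node.limit_usd:
                node.alerts.append(f"{node.name} is near its budget")


org = Budget("org", 100.0)
team = Budget("team", 10.0, parent=org)
alice = Budget("alice", 5.0, parent=team)

alice.charge(3.0)  # fine at every level
alice.charge(1.5)  # crosses alice's 80% alert threshold
```

A further `alice.charge(1.0)` would raise `BudgetExceeded` without recording any spend, which corresponds to the "just stop it" option mentioned in the talk.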
[00:35:36] And then finally, we can enforce safety, compliance, auditing, and identity management of all your AI.
[00:35:43] Okay, this includes harnesses.
[00:35:44] This includes models and agents.
[00:35:47] So very excited about this.
[00:35:48] So that's what Unity AI Gateway is.
[00:35:51] And now you can manage all these asset types.
[00:35:53] MCP tools, models, agents, skills that you have in the organization.
[00:35:58] And I just want to emphasize, this is not just MCP or the tools that you have in Databricks.
[00:36:03] These are any MCPs that you have in the organization by any vendor.
[00:36:06] You register them in Unity AI Gateway, and you will track all the usage, and you will see how it's going.
[00:36:11] And in fact, it has a really awesome capability where you just authenticate once for the MCPs, and it will do it for all of them so you don't have to authenticate again and again and again.
[00:36:19] So this is really awesome.
[00:36:20] Check this out.
[00:36:22] Okay.
[00:36:23] So now we have also control, and we have cost.
[00:36:28] So now we've dealt with how do we get your data ready.
[00:36:31] So hopefully we have the data ready in an open format that's fast for the agents.
[00:36:35] We figured out how to do control so we can do governance, and how do we do cost management with Unity AI Gateway.
[00:36:42] Okay.
[00:36:42] So that brings us to the context layer.
[00:36:45] So as I said, AIs are plenty smart.
[00:36:48] AI does not have an intelligence problem.
[00:36:50] It has a context problem.
[00:36:51] So how do we actually get that context into the AI?
[00:36:53] So this is the most important thing.
[00:36:55] So what do I mean here?
[00:36:57] So let me give you an example of how it works today.
[00:36:59] Today, you're using these agents.
[00:37:02] You might be using a coding agent.
[00:37:04] And if you look at the best agents out there, they kind of work like this.
[00:37:07] There's a big graph of all the documents and everything that everyone is doing inside of your organization.
[00:37:13] And what they're doing is they're in this red dot you can see.
[00:37:15] They go when you ask a question.
[00:37:17] For instance, how was my product revenue the last two weeks in Europe?
[00:37:22] It goes and it looks at one document.
[00:37:23] And then it reads that document.
[00:37:25] And it says, okay, is there anything interesting?
[00:37:26] And then, oh, there's a link.
[00:37:27] And then it opens up an MCP server to another document.
[00:37:30] It reads that document.
[00:37:31] And it's sort of for looping live through your corpus to get the data.
[00:37:36] Okay.
[00:37:37] But there's a big problem with this approach that all the agents today are using, all the frontier ones.
[00:37:42] Problem number one is that it's very time consuming.
[00:37:45] This live loop of going searching for your data live is just going to take a very long time.
[00:37:50] Okay.
[00:37:50] And we all see it.
[00:37:51] Like you ask a question, even a simple question, and it takes it 10 minutes, 15 minutes
[00:37:55] just to come back with an answer.
[00:37:56] Because it's live going through the MCP servers.
[00:37:58] It doesn't know where to search.
[00:38:00] It's just trying live to do this.
[00:38:03] It doesn't have the context.
[00:38:05] Number two, the costs are super high.
[00:38:07] The joke is you tell it, please rename this file for me.
[00:38:11] And it takes 10 minutes, and it charges you five bucks.
[00:38:13] Okay.
[00:38:14] So that's kind of how it works.
[00:38:15] Because, again, it's going through, and it's doing the search, and it's using up a lot of tokens to construct all of this.
[00:38:20] But these two are not even the biggest problem.
[00:38:22] This gives you really bad user experience, and it's expensive.
[00:38:24] The biggest problem is that the quality suffers.
[00:38:26] The reason is that the set of all the documents and all the data assets in your organization is gigantic.
[00:38:31] Okay.
[00:38:31] So you're really just doing a small random walk of a small subset to get you the answer.
[00:38:36] It's remarkable the answers it gives you.
[00:38:38] It's game-changing.
[00:38:39] We're all excited about it, and we all were using a bunch of it, and the costs were going up.
[00:38:42] But actually, if you look at the accuracy, it's quite problematic.
[00:38:46] Okay.
[00:38:46] So how do we solve this?
[00:38:49] So we are very excited to announce, and this is one of the most important things I'm going to announce, Genie Ontology.
[00:38:55] Okay.
[00:39:01] So what's Genie Ontology?
[00:39:02] So let me explain.
[00:39:04] So Genie Ontology, what it does is that it actually connects with all the data that you have in an organization, and in the background, so this is not when you're asking a question.
[00:39:13] This is happening in the background.
[00:39:15] It's constructing a graph of the most important knowledge that you have inside of your organization, and it's making it ready so that we can give that context to any of the agents.
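The contrast between the live link-following loop described earlier and work done ahead of time in the background can be sketched in Python. This is a simplified stand-in, assuming a tiny made-up corpus and a plain word index in place of the ontology graph; `live_search`, `build_index`, and the document names are hypothetical.

```python
from collections import deque


def live_search(corpus, start, keyword, max_reads=3):
    """Follow links from `start`, reading at most `max_reads` documents."""
    seen = {start}
    queue = deque([start])
    hits = []
    reads = 0
    while queue and reads < max_reads:
        doc_id = queue.popleft()
        text, links = corpus[doc_id]
        reads += 1  # each read costs time and tokens in a real agent
        if keyword in text.lower().split():
            hits.append(doc_id)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return hits, reads


def build_index(corpus):
    """Background pass: map each word to the documents that contain it."""
    index = {}
    for doc_id, (text, _links) in corpus.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


# Made-up corpus: doc id -> (text, outgoing links).
corpus = {
    "wiki": ("team wiki home", ["roadmap", "oncall"]),
    "roadmap": ("product roadmap", ["budget"]),
    "oncall": ("oncall rotation", []),
    "budget": ("europe revenue forecast", []),
}

hits, reads = live_search(corpus, "wiki", "revenue")  # bounded live walk
index = build_index(corpus)                           # done ahead of time
```

With a read limit of three, the live walk never reaches the `budget` document, while the prebuilt index finds it with a single lookup of `index["revenue"]`.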
[00:39:25] Okay.
[00:39:25] And one thing that's really important is that it's not just everything that you have in the Lake House and everything you have in Databricks.
[00:39:31] It also connects to your Google Drive.
[00:39:33] It connects to your SharePoint.
[00:39:35] It connects to your email, Google Calendar, anything that you set up, and it can actually branch out and access all of these things that you have there.
[00:39:43] And the research team developed an algorithm called OntoRank.
[00:39:47] Okay.
[00:39:48] I know it's a bad name, but, you know, the research team came up with it.
[00:39:52] And it's like Ontology Rank.
[00:39:55] But what is OntoRank?
[00:39:57] So what OntoRank does, it's kind of like PageRank, for those of you who remember.
[00:40:00] So PageRank was how Google, 25 years ago or 30 years ago, figured out how to take the whole web and create an index for what are the most important sites.
[00:40:11] Before Google, AltaVista, Yahoo, these sites, they were just looking at how many times is this word mentioned on a site.
[00:40:17] It led to this SEO, search engine optimization, where people were adding the same word a thousand times on a web page to get a higher ranking.
[00:40:24] But PageRank, it could just figure out from the graph itself what's important.
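For readers who want to see how importance can fall out of the link graph alone, here is a minimal sketch of classic PageRank by power iteration. It is the textbook algorithm only, not OntoRank, whose details are not given in the talk; the damping factor of 0.85 and the tiny example graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Three pages link to "hub"; repeating a word on a page changes nothing here.
example_links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(example_links)
top_page = max(ranks, key=ranks.get)
```

In this example the page that the others point to ends up with the highest rank, purely from the structure of the graph.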
[00:40:28] So OntoRank does a similar thing on all of the assets that you have in an organization, but it's not PageRank.
[00:40:35] This problem is harder because it's not just web pages with links.
[00:40:40] It's all the assets that you have, but they're actually different asset types.
[00:40:43] So it needs to identify that, you know, source code is different from a Google Doc.
[00:40:48] Also, here we have users.
[00:40:50] And the users inside of your organization are part of an org chart.
[00:40:52] And it actually matters how often they access things.
[00:40:54] So we have this information as well.
[00:40:56] So we take all of these different asset types, all the people in the organization, and we construct this ontology graph that then helps us actually do amazing things with this for the agent.
[00:41:06] So this is the context that we're going to feed.
[00:41:09] Now you might say, hey, I already put together context.
[00:41:11] I already built semantic layers at my company.
[00:41:13] I already have a business glossary that I built over the years.
[00:41:17] Genie can take that from any of the sources.
[00:41:19] So you can actually bring your own semantics as well.
[00:41:22] So as part of Unity Catalog, we have something called Unity Catalog Semantics where you can bring from our partners, whether it's from Atlan or AtScale or from existing BI tools that have semantics or any business glossary that you already have.
[00:41:35] And you can have humans curate that.
[00:41:37] It'll happily take that, and it'll help it get smarter.
[00:41:40] So depending on whichever camp you're in, if you think this should be completely autonomic and happen by itself, Genie Ontology does that.
[00:41:47] If you want to have lots of manual curation, you can do that with model semantics.
[00:41:51] They both work well together.
[00:41:53] Then these are fed into our agents.
[00:41:55] So we'll talk about the agents that we have, and we'll give you demos.
[00:41:58] But there's three categories of agents that are important, and I'll talk about them.
[00:42:01] Genie 1, Genie Agents, and Genie Code.
[00:42:04] And they all are fed this context, which makes them faster.
[00:42:08] It makes them much more cost-effective, and it makes them much higher quality.
[00:42:14] So you can much faster get the quality answers you want without paying too much.
[00:42:17] So that's what this is.
[00:42:19] If you're using other agents, you want to use other agents, they can use Genie as an MCP server, and they can just connect to that agent, Genie, and they can actually get the semantics from Genie.
[00:42:30] So you can also do that this way.
[00:42:32] You can bring any client, any agent you want.
[00:42:34] You can connect to this ontology that you have there so that you can get that semantics and the context into any existing AI that you have.
[00:42:42] So this is what it looks like in the product.
[00:42:44] You can actually click on the graph, and you can sort of explore it.
[00:42:46] In this particular graph that we're exploring here, this is just one instance of Databricks for ourselves.
[00:42:51] This is our own Databricks that we have for Databricks employees.
[00:42:53] It has over 4.5 million ontology data snippets that it has collected and constructed.
[00:43:00] So obviously nothing you could do live.
[00:43:02] Like, it would be infeasible for any of these agents to go live, one by one, and look through 4.5 million ontology snippets.
[00:43:09] But it's done that, and it's constructed this so that you can actually have this.
[00:43:13] Okay.
[00:43:14] So that's super awesome.
[00:43:16] So this is now fed to our agents.
[00:43:19] The most important agent that I'm excited about is Genie 1.
[00:43:22] Okay.
[00:43:22] So let me tell you about Genie 1.
[00:43:24] Some of you might already be using Genie at Databricks for the last three years.
[00:43:29] Genie has been part of the platform.
[00:43:30] This is not that Genie.
[00:43:31] This is Genie 1.
[00:43:32] So it's the one place you go to access all of the data assets in your organization.
[00:43:38] Okay.
[00:43:38] So Genie that we had previously that we're using, it's still there.
[00:43:42] But that's confined to a certain domain, and you have, say, marketing people asking just marketing questions in a Genie space or room.
[00:43:50] Genie 1 is just one place any of your business users in your organization can log into.
[00:43:56] So you should just enable automatic identity management for all of the employees.
[00:43:59] If you have 100,000 employees, they then log in.
[00:44:02] They just get Genie 1.
[00:44:03] They don't get the data warehouses.
[00:44:04] They don't get all the rest of the Databricks infrastructure.
[00:44:07] You know, they're not going to be able to go and spin up data warehouses and spend a lot of money.
[00:44:12] They're just getting a simple interface that looks like this, and they just ask their questions there.
[00:44:16] So it looks like any other agent that you might have.
[00:44:19] But there's some differences here.
[00:44:20] First of all, it now can get data from any of the sources that you have at the bottom of the screen.
[00:44:24] So whether it's Google Drive or SharePoint or whatever it is, it, of course, has all the goodies that other agents have.
[00:44:30] Like, you can have skills, you can schedule tasks, and do co-work activities for you.
[00:44:36] But the key thing is, it has access to this Genie Ontology.
[00:44:39] Okay?
[00:44:40] And that changes everything.
[00:44:42] So I'm very excited about Genie 1.
[00:44:43] Please enable your organization to have it and also use it.
[00:44:46] My favorite way of using it is on the phone.
[00:44:48] Most of us use it on the phone, on our iPhone or Android.
[00:44:51] And you can see it's visual.
[00:44:53] You see all the graphs.
[00:44:54] And it's integrated with our dashboards, AI/BI.
[00:44:57] So it's basically like a BI tool or a dashboard.
[00:45:00] Okay.
[00:45:00] So that's Genie 1.
[00:45:01] We're also announcing Genie Agents.
[00:45:04] So what's Genie Agents?
[00:45:05] You can take any conversation that you had with Genie 1, and you can say, hey, turn it into an agent.
[00:45:11] You can take that agent.
[00:45:12] You can now make it available to the whole company.
[00:45:14] So, for instance, you can say, hey, here's all my HR documents.
[00:45:17] And create that agent.
[00:45:19] Have a conversation.
[00:45:19] Make sure that it can answer questions about your HR documents.
[00:45:22] Now you can make that available to the whole organization.
[00:45:24] Anyone can ask any HR questions they have from that agent.
[00:45:26] You can put it in Slack.
[00:45:27] You can put it in Teams.
[00:45:29] Any modality they want.
[00:45:31] But it also can do more.
[00:45:33] It can actually completely do autonomous work for you.
[00:45:35] It can go every day, fetch information from, say, Salesforce, prepare it, put it in Workday, send you a report, work with other agents.
[00:45:45] So very excited about Genie Agents.
[00:45:48] And then we have Genie Code.
[00:45:49] Genie Code is special because Genie Code is similar to other coding agents that are there, but it's really good at two things.
[00:45:57] It's really good at data engineering.
[00:46:00] So, you know, 13 years of data engineering on the platform, we made it really good at writing pipelines for you, writing code for you, doing data engineering.
[00:46:08] So that's super awesome.
[00:46:10] It's also really, really good at machine learning and data science.
[00:46:13] So it can go and build you a machine learning model, it can train it on GPUs, it can test that the accuracy of those machine learning models is really high, and it can really democratize access to machine learning inside of your organization.
[00:46:24] So that's really, really important.
[00:46:26] So these are two skills that it's uniquely capable of on the platform.
[00:46:30] So check out Genie Code.
[00:46:32] And we also are launching something I'm really excited about.
[00:46:36] I think this agent actually is going to be the biggest transformative agent for most of your organizations.
[00:46:42] We call it Genie Zero Ops.
[00:46:44] So Zero Ops, as the name implies, what it does for you is it runs in the background, it looks at all your data pipelines, or if you have data science, machine learning models, and it sees that, okay, this pipeline is down at 2 a.m.
[00:46:56] It will then go investigate, why is this pipeline down?
[00:46:59] Oh, the pipeline is down because there's an error.
[00:47:01] It's saying that, you know, this data, I'm not familiar with it.
[00:47:04] There seems to be a new column in this data.
[00:47:06] It'll go see, okay, what's the name of that column?
[00:47:08] Can I add that column?
[00:47:09] It'll actually experiment with writing the code in a separate environment without affecting your production pipeline.
[00:47:14] And once it's ready, it'll prepare it for you, and it'll send you a notification on your phone, and you can see and you can review it.
[00:47:20] And if it looks good, you can just click accept, and it'll fix your pipeline, okay?
[00:47:24] So you're still a human in the loop.
[00:47:25] Of course, you can just click do it automatically, but I would not recommend that.
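The new-column scenario above, with a fix prepared separately and applied only after a human accepts it, can be sketched as follows. This is a hedged illustration, not how Zero Ops works internally: the function names, the proposal dictionary, and the sample schema are all hypothetical.

```python
def detect_new_columns(expected_columns, record):
    """Return columns present in the incoming record but not in the schema."""
    return sorted(set(record) - set(expected_columns))


def propose_fix(expected_columns, record):
    """Build a proposal on a copy of the schema; production is untouched."""
    new_columns = detect_new_columns(expected_columns, record)
    if not new_columns:
        return None  # nothing to fix
    return {
        "new_columns": new_columns,
        "proposed_schema": list(expected_columns) + new_columns,
        "status": "pending_review",
    }


def apply_fix(proposal, approved):
    """Only an explicit human approval promotes the proposed schema."""
    if not approved:
        raise PermissionError("proposal has not been approved")
    return proposal["proposed_schema"]


production_schema = ["id", "amount"]
incoming = {"id": 1, "amount": 2.0, "currency": "EUR"}

proposal = propose_fix(production_schema, incoming)  # prepared, not applied
new_schema = apply_fix(proposal, approved=True)      # the "click accept" step
```

Keeping the proposal separate from `production_schema` mirrors the point in the talk that the experiment runs without affecting the production pipeline until someone reviews it.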
[00:47:28] Yes, people are excited.
[00:47:33] Yeah, I'm personally really excited about this.
[00:47:35] No one wants to wake up at 2 a.m.
[00:47:36] You know, we have all these folks that are on call, and, you know, this pipeline broke.
[00:47:40] Why did it break?
[00:47:41] Oh, you're going to have to go investigate, and you're going to spend, you know, two hours in the middle of the morning, and everybody's yelling.
[00:47:45] This is the world of data engineering.
[00:47:47] So with Zero Ops, we're really excited that you can actually make this much simpler.
[00:47:50] And then finally, we'll talk about apps, building your own apps to democratize access to data and AI
[00:47:55] inside of your organization.
[00:47:57] But now we make that really simple to vibe code your own apps.
[00:47:59] You can, of course, use our partners.
[00:48:01] You can use Lovable, Replit, Vercel, any of these.
[00:48:03] But we also now give you, on Genie, Genie App Builder, which can do that for you.
[00:48:09] Okay.
[00:48:09] That's Genie, and key thing here is Genie ontology that we automatically construct for you.
[00:48:14] That's really the secret sauce.
[00:48:16] But the co-work and accessing the data outside of the lake house is also really important.
[00:48:20] A lot of people think of Databricks as, okay, this is the data I have in my lake house.
[00:48:23] No, this is any of the data you have in your organization.
[00:48:27] But what about developers who want to do low-level coding?
[00:48:30] They want to do coding agents.
[00:48:31] They want to write software.
[00:48:33] You know, they want to use different harnesses.
[00:48:34] Maybe they're using Claude Code.
[00:48:35] Maybe they're using Codex.
[00:48:36] Maybe they're using OpenCode or Pi or something else.
[00:48:39] So that's where Agent Bricks comes in.
[00:48:41] So this is for developers who are doing development work.
[00:48:44] And we've expanded that developer platform so that we now have sandboxes so that you can run your agents in really fast sandboxes that come up and isolate them.
[00:48:53] We have agent memory.
[00:48:54] And we announced this weekend a meta harness called Omnigent.
[00:49:00] So Matei announced that.
[00:49:01] And he'll be giving a talk on this and explain what that is and how that works.
[00:49:05] But it's a harness of harnesses.
[00:49:07] So it can leverage all the existing harnesses.
[00:49:09] And it can actually have them play against each other.
[00:49:11] You can have Claude Code fight Codex and actually get much better results than either of them.
[00:49:16] So he will give a talk on that as well today.
[00:49:18] So I'm very excited about that.
[00:49:20] Okay.
[00:49:20] So that's what it looks like.
[00:49:23] So hopefully I'm showing you how we can now take all the data that you have in your organization.
[00:49:28] Okay.
[00:49:28] So it's the first half of my talk.
[00:49:29] How do we take the data, make it ready for AI?
[00:49:32] How do we actually make sure that we have cost and governance, control, and security figured out?
[00:49:39] And then how do we get with Genie ontology, this context into the agents?
[00:49:42] So that's great.
[00:49:44] But I think actually something more profound is happening in the industry.
[00:49:49] Something is happening with software.
[00:49:50] Okay.
[00:49:50] And people are talking about it.
[00:49:52] So what is that?
[00:49:53] So what does the classic software as a service SaaS stack look like over the last two, three decades?
[00:50:00] So it kind of looked like this.
[00:50:02] You had a system of record, which was a database.
[00:50:05] That's, yeah, the most important golden records about your HR data or your CRM data or whatever it was.
[00:50:11] And you had some custom workflows inside of that, somewhat rigid, but you could customize them if you brought in contractors that would augment that for you.
[00:50:20] It would take quite a while.
[00:50:21] And it had a UI.
[00:50:22] It was customizable.
[00:50:24] No one really loved those UIs, but most of us learned them.
[00:50:27] And then once we learned them, we didn't want to switch to a new one.
[00:50:30] And we had one of these for each of the domains inside of the organization.
[00:50:33] So you have one for sales.
[00:50:34] You have one for ops.
[00:50:36] You have one for finance.
[00:50:37] So there's one of these for each.
[00:50:39] But what's happened in the last two years is that each of these vendors is now providing you an agent of their own.
[00:50:44] So they're saying, hey, we have an agent for this.
[00:50:46] Here's my agent to access my system of record data here.
[00:50:50] You know, we're headless.
[00:50:51] You can just access it with the agent.
[00:50:53] But the problem is, when you go to these agents and you start asking questions, you never have just questions about the data in that system of record.
[00:50:59] Your question spans things.
[00:51:01] And you need to access data that's elsewhere.
[00:51:02] So it needs to go and access systems that are elsewhere.
[00:51:05] Okay?
[00:51:05] And this gets quickly very complicated.
[00:51:07] And most of these vendors don't want to play well together anyway because they're competing with each other.
[00:51:11] So it's sort of very messy inside of the enterprise right now.
[00:51:14] And it's very confusing.
[00:51:14] How do you make that actually work and go across these different systems?
[00:51:17] So the big question is, what does the software stack of the future look like?
[00:51:21] You know, or the SaaS software, or the apps of the future.
[00:51:26] What do they look like?
[00:51:27] What are they going to look like?
[00:51:28] Okay?
[00:51:28] And we don't know.
[00:51:29] But we think it should look something like this, right?
[00:51:32] You have all your agents.
[00:51:34] And they should just access agent system of record, or agentic system of record.
[00:51:39] They should just be there, and you should be able to access that seamlessly.
[00:51:42] It should be very fast.
[00:51:44] You shouldn't have to go through lots of other agents.
[00:51:45] You shouldn't have to try to go to a production database.
[00:51:48] That's too risky.
[00:51:49] You should just be able to have it at your fingertips and let these agents do that really
[00:51:52] fast, quickly, cost-efficiently, securely.
[00:51:56] So what would the agent system of record look like in the future?
[00:52:00] What do we think that's going to look like?
[00:52:01] Well, we think it probably looks something like this.
[00:52:04] You have all your data, and you can access it all in one place in an open format.
[00:52:09] Okay, you see where this is headed?
[00:52:11] Okay.
[00:52:12] So I can just click through the next few slides.
[00:52:14] On top of that, you want to have unified governance, right?
[00:52:16] So you want to be able to have control.
[00:52:18] You want to be able to do cost management on top of that.
[00:52:20] And, of course, you don't want these agents to go one doc at a time and spin up
[00:52:26] tokens and not know what they're doing.
[00:52:28] You want them to have the enterprise context so they can quickly do that.
[00:52:30] So we think that the new system of record is actually the data and AI platform that we've
[00:52:40] been talking about all along.
[00:52:41] We think that the world will look more like this, whether it's Databricks or someone else,
[00:52:44] but this is going to be the structure of the agentic system of record of the future.
[00:52:48] That way, you can get the choice.
[00:52:50] You get the choice of running this on any cloud, any SaaS vendor, any of the AI models.
[00:52:54] You can do the cost controls.
[00:52:55] You can do the governance control.
[00:52:57] And, of course, you should have the context for this.
[00:52:59] So that's what we think the future looks like.
[00:53:01] So let's augment the stack with that as well.
[00:53:03] So this is what I already presented to you, those three layers.
[00:53:06] But now, on top of that, we also get agentic apps.
[00:53:09] So Databricks apps is very exciting.
[00:53:12] We'll have a whole talk dedicated to this.
[00:53:14] Databricks apps is how you democratize access to data and AI inside of your organization
[00:53:17] to hundreds of thousands of people.
[00:53:18] And it could be custom apps that you've built.
[00:53:22] Maybe you've vibe-coded these.
[00:53:24] There's a huge proliferation of this.
[00:53:25] This is the fastest growing, actually, product for us.
[00:53:28] So that's what apps look like.
[00:53:30] And we actually now introduce the marketplace where you can actually get these apps.
[00:53:34] So anyone can go and access these apps.
[00:53:36] And one of the things that we're also excited about that we're announcing is that you can actually
[00:53:38] transact and pay for these apps so that you can buy them from those vendors that are actually
[00:53:42] offering those.
[00:53:44] Of course, you can have your own.
[00:53:45] Your company can build your own.
[00:53:46] But all of these that are listed here also exist as apps that you can actually get access to
[00:53:51] or work with.
[00:53:52] So that's a very exciting development.
[00:53:54] Okay.
[00:53:55] So that's what that looked like.
[00:53:57] But there's a few apps that we're really excited about and we want to build ourselves.
[00:54:00] That we think that we uniquely have a point of view and where we can really add value.
[00:54:06] Where we think it's really data intensive and it's really kind of our forte.
[00:54:09] So the first one is security.
[00:54:11] And we announced two months ago that we're launching a SIEM.
[00:54:15] It's called Lakewatch.
[00:54:16] It's an agentic SIEM.
[00:54:18] We call it the security lakehouse.
[00:54:20] It looks kind of like this.
[00:54:21] Let me explain what this is.
[00:54:22] It's built on the agent system of record that I mentioned earlier.
[00:54:26] So you collect all of your security data in a lakehouse.
[00:54:30] We have to do it that way.
[00:54:31] The current approach of taking data and ingesting it into these proprietary SIEMs is super expensive.
[00:54:39] So people can't afford it.
[00:54:40] And the data is exploding.
[00:54:42] So you don't want to actually ingest all this expensive data into these classic legacy SIEMs.
[00:54:47] So a lot of people, what they're doing is they're filtering out the data before to make sure that only the most important data is in there.
[00:54:53] Well, this is not going to work because we're getting now massive attacks from automated agents.
[00:54:58] In the last six months, the best hacker in the world is an agentic system called XBOW.
[00:55:02] You can look it up on the HackerOne top list.
[00:55:06] People are finding exploits using things like Mythos.
[00:55:08] And even before Mythos and actually after Mythos, we're going to see lots and lots of exploits.
[00:55:12] So we need all of the data.
[00:55:14] It needs to be very cheap.
[00:55:15] It can't be skyrocketing costs.
[00:55:17] And we should just store it all in any means we have in an open, secure lakehouse.
[00:55:21] And then on top of that, we've built agents in Lakewatch.
[00:55:25] Those agents, what do they do?
[00:55:26] They automatically make sure that they create detections.
[00:55:31] When you wake up and there's like 100 alerts of what's going on with your data, they go and triage it for you automatically.
[00:55:37] Very similar to zero ops, but this is really for the SOC.
[00:55:40] SOC is Security Operations Center.
[00:55:42] So the SOC analysts wake up and can do that.
[00:55:46] And then finally, it can do threat hunting for you.
[00:55:49] It can actually find out that there's a new zero-day attack and it'll add the pipeline to it and automatically do all of that work.
[00:55:54] So we're very excited about Lakewatch.
[00:55:56] So check that out.
[00:55:58] And we are very excited to announce a piece of news, which is that Databricks has agreed to acquire Panther Labs.
[00:56:05] So, yes.
[00:56:11] So Panther is a really amazing, ahead of its time, SIEM that's based on Python.
[00:56:18] Okay?
[00:56:18] So Jack Naglieri and team, who we're very excited to welcome to Databricks, and you'll actually see him on stage here tomorrow, really bet that everything should be Pythonic.
[00:56:29] So they were way ahead of their time, actually.
[00:56:30] I remember meeting him five years ago, and he came from Airbnb.
[00:56:33] He was a SOC analyst himself.
[00:56:34] And he said that, you know, we can't have these complicated drag-and-drop UIs.
[00:56:38] It needs to be Python-based, which turns out to be really prescient, because today, all the AIs want to speak Python.
[00:56:45] So that's really awesome.
[00:56:46] And he has hundreds of connectors that they've already built that we're all adding.
[00:56:49] So now we have all these connectors, and we have all these Pythonic detections.
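To make the idea of a Python-based detection concrete, here is a minimal sketch. It assumes a detection is just a function that takes an event dictionary and returns True when an alert should fire. The function names, event fields, and threshold are hypothetical choices for illustration, not the actual Panther or Lakewatch schema.

```python
from typing import Any, Dict

# Hypothetical threshold and field names, chosen only for this illustration.
FAILED_LOGIN_THRESHOLD = 5


def rule(event: Dict[str, Any]) -> bool:
    """Return True when the event should raise an alert."""
    return (
        event.get("action") == "login"
        and event.get("outcome") == "failure"
        and event.get("failed_attempts", 0) >= FAILED_LOGIN_THRESHOLD
    )


def title(event: Dict[str, Any]) -> str:
    """Build a human-readable alert title for the SOC analyst."""
    return f"Repeated failed logins for {event.get('user', 'unknown user')}"


# Hypothetical sample event.
sample = {"action": "login", "outcome": "failure", "failed_attempts": 7, "user": "alice"}
if rule(sample):
    print(title(sample))
```

Because the detection is plain code rather than a drag-and-drop configuration, it can be unit tested, reviewed, and generated or edited by an AI agent like any other Python file.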
[00:56:53] So very excited about this.
[00:56:54] And they have really marquee customers.
[00:56:56] So Anthropic is using them.
[00:56:58] And Coinbase is using them.
[00:57:00] Plaid is using them.
[00:57:01] So very excited to welcome that to the family.
[00:57:03] So that's the agentic SIEM.
[00:57:06] Okay.
[00:57:06] There's one more space that we're excited to get into, which is the marketing stack.
[00:57:12] So what is that?
[00:57:14] So we're very excited to announce what we call Customer Lake.
[00:57:18] Okay?
[00:57:19] Yeah.
[00:57:20] So what's...
[00:57:21] Okay.
[00:57:24] So what's Customer Lake?
[00:57:25] So it's an agentic CDP, Customer Data Platform, or Customer 360, if you will, built on the lakehouse.
[00:57:31] It looks like this.
[00:57:33] And so, of course, you can get now all of your data into a lakehouse.
[00:57:38] But it's unique.
[00:57:39] So it has two agents that are really special.
[00:57:43] One agent, which is a profile agent.
[00:57:45] What it can do is it can automatically do all of the identity dedupe that these frameworks in the past have done.
[00:57:51] But it can do that using LLMs.
[00:57:54] So it's really accurate.
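The identity dedupe step can be sketched as grouping records that a matcher judges to be the same person. This is a rough illustration only: the talk says the profile agent uses LLMs for the matching decision, while the `same_person` function below is a simple stand-in that compares normalized email addresses. The record fields and function names are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def same_person(a: Record, b: Record) -> bool:
    """Stand-in matcher: same email after trimming and lowercasing.
    The talk describes using LLMs for this judgment instead."""
    return a["email"].strip().lower() == b["email"].strip().lower()


def dedupe(records: List[Record],
           match: Callable[[Record, Record], bool] = same_person) -> List[List[Record]]:
    """Group records into identities with union-find over pairwise matches."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if match(records[i], records[j]):
                parent[find(j)] = find(i)

    groups: Dict[int, List[Record]] = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


# Hypothetical customer records.
customers = [
    {"name": "Ana Diaz", "email": "ana@example.com"},
    {"name": "Bo Chen", "email": "bo@example.com"},
    {"name": "A. Diaz", "email": " Ana@Example.com "},
]
for identity in dedupe(customers):
    print([r["name"] for r in identity])
```

The matcher is passed in as a parameter so that a more capable judgment, such as a model call, could replace the email comparison without changing the grouping logic.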
[00:57:55] It also has a campaign agent, which implements what we call a new concept called Infinity Campaigns.
[00:58:01] And Infinity Campaigns basically let you tailor, personalize every interaction that you have.
[00:58:08] So in the past, what people would do is they would say, the whole world, let's split it up into 100 different audiences,
[00:58:12] and then let's market to each of the 100 audiences.
[00:58:15] But now we can really personalize with really cheap, small, distilled LLMs, one-to-one, continuous campaigns.
[00:58:23] So we're very excited about Customer Lake as well.
[00:58:25] And we've developed this in partnership with all the vendors that you see down here.
[00:58:30] And you can see at the top the logos of our development partners that are helping us actually develop this.
[00:58:35] So very excited about this.
[00:58:36] There will be a talk by Tassel on this as well.
[00:58:39] Okay.
[00:58:39] So that brings me to the end of my keynote.
[00:58:42] We have an exciting program.
[00:58:43] This is what I kind of showed you.
[00:58:46] So let's simplify this a little bit and zoom out.
[00:58:48] What have I told you today?
[00:58:50] So I basically said that we have this data.
[00:58:53] We have all these processes.
[00:58:54] If we can just capture all of that and make it AI-ready, we can get the context, and we can give it to the AI,
[00:58:58] then we can have a huge impact in the organization.
[00:59:01] So really, it's about how do we get choice so that you're not locked in?
[00:59:05] How do we get governance?
[00:59:06] How do we get cost control?
[00:59:08] And how do we get context?
[00:59:09] Okay.
[00:59:10] So the first one is really about any data, any model, and any cloud without lock-in.
[00:59:15] The second one is really one place where you can do all of your governance and all of your cost controls.
[00:59:19] And the final one is about how we get enterprise context to your agents.
[00:59:24] So that concludes my talk.
[00:59:26] I hope I've shown you that Databricks Data and AI Platform will help you do that.
[00:59:30] Thank you so much.
[00:59:30] Thank you.
[01:00:30] Welcome to the stage, Databricks Senior Director of Product Management, Ken Wong.
[01:00:54] All right.
[01:01:01] Good morning.
[01:01:03] So today, I get to introduce you to Genie1.
[01:01:07] Genie1 is an AI coworker that can connect to all of your data and all of your apps so that
[01:01:14] everyone in your organization can get insights and even automate action.
[01:01:18] You can even create autonomous agents just by talking to Genie.
[01:01:23] But I know what you're all thinking.
[01:01:25] In 2026, why do we need another one of these AI agents?
[01:01:31] Well, the answer is actually very simple.
[01:01:33] The ones that you have today aren't very good with enterprise data.
[01:01:37] They either can't connect and reason about data at all, or they're just not accurate enough
[01:01:43] for you to actually trust them to make real decisions, let alone automate your work.
[01:01:49] So let me show you what I mean.
[01:01:50] So yesterday, we had our product advisory board meeting at Databricks.
[01:01:55] Shout out to the PAB.
[01:01:57] And I wanted to prepare myself for it by getting Genie1 to help me with a comprehensive profile
[01:02:04] of all the customers we have, what's going on in their accounts.
[01:02:07] So this required Genie1 to connect to all the systems that we have, including my personal email
[01:02:13] and Slack and my calendar to figure out who's in the PAB and what we've been talking to them about.
[01:02:18] It required Genie1 to connect to Salesforce to pull in the use case information.
[01:02:23] But it also meant that Genie1 needed to connect to Databricks in order to pull all the consumption data
[01:02:30] that they had because I had a lot of nuanced questions that I wanted to include in the brief
[01:02:35] about what they're using Genie for.
[01:02:37] And this is what Genie1 was able to come up with.
[01:02:40] As you can see, it gave me a comprehensive profile.
[01:02:42] I had to blur some of it so you don't see the details.
[01:02:45] But after querying Databricks and querying these systems,
[01:02:49] it was able to give me a comprehensive and visual profile of all of our customers.
[01:02:53] So what I wanted to do was try this exact same prompt with five of the leading AI agents
[01:02:59] that were out there.
[01:03:00] I tried two of the AI assistants that came bundled with some of the software we had
[01:03:04] and three of the leading coding and co-work agents.
[01:03:07] And this is what I saw.
[01:03:09] So the first AI assistant is actually one that I used a lot.
[01:03:12] One that I used to write docs and things like that.
[01:03:15] And it came back with a response to my prompt very, very quickly.
[01:03:19] Actually, a little bit too quickly.
[01:03:21] And there was something a little bit off about it.
[01:03:23] And the first thing I noticed was that it said that we had 24 customers in the PAB.
[01:03:28] And I knew this was not true.
[01:03:29] So I asked it and said, hey, where did you get this 24 number from?
[01:03:34] And guess what it said?
[01:03:36] It confessed that it just completely made it up.
[01:03:39] I was stunned, right?
[01:03:41] Because this is actually -- I use this thing a lot, not just at work,
[01:03:45] but I also have a personal subscription.
[01:03:47] But it just goes to show you how important enterprise grounding actually is.
[01:03:53] Now, the other AI assistant that I used actually did a lot better.
[01:03:56] It actually searched existing documents and pulled out account briefs.
[01:04:00] But they were all sort of stale.
[01:04:02] You know, they were all from a quarter ago, information that was no longer relevant to the conversation I wanted to have.
[01:04:09] So the results of this thing, completely unusable.
[01:04:12] The good news, it was very, very quick.
[01:04:14] Now, the coding agents and the co-work agents actually made a genuine attempt and connected to all the systems and the structured sources in order to pull live data out.
[01:04:25] But if you've ever used one of these things, you've probably experienced what I experienced, which is that it went off and tried very, very hard to pull this information out.
[01:04:33] It actually hit its time limit, where it asked for permission to continue, burning Lord knows how many tokens in the process.
[01:04:40] And then after I allowed it to continue, it went away and it iterated again and again.
[01:04:48] And it came back with some cool ASCII art, but then confessed again that actually we were far from done.
[01:04:55] You know, we had only gone part way through.
[01:04:57] So we're 30 minutes into this thing.
[01:04:59] I had a partial report.
[01:05:00] It's like a huge contrast with what Genie1 was able to do.
[01:05:04] So you might say, hey, this is sort of a one-off test.
[01:05:08] You know, you cherry picked an example.
[01:05:10] But you know, it's actually very, very consistent with the research that Databricks research had been doing on the ability of coding agents to answer novel questions.
[01:05:19] Actually, what we saw, we curated a benchmark out of real questions that our employees were asking of Genie, and we asked coding agents to solve the same problems.
[01:05:28] Now, these weren't run-of-the-mill simple questions that could be easily, you know, solved with some semantic layer.
[01:05:35] These were nuanced questions like, hey, of all the customers who use Genie, what percentage also use a third-party agent?
[01:05:41] Questions that, you know, hadn't been asked before.
[01:05:44] And we saw that coding agents were able to do it, the best of them, about half the time, and it took minutes.
[01:05:51] And remember this chart, we'll come back to it, but 50% of the time for a real data question is basically unusable, right?
[01:05:59] You literally can flip a coin.
[01:06:02] And if you know how these agents work, and I'll explain this a little bit, it shouldn't be surprising,
[01:06:09] because they basically go through this process of the agentic loop, where really getting to the right answer is a matter of it, you know,
[01:06:17] happening upon the right sequence of reasoning steps to chance upon the right context it needs to answer the question.
[01:06:28] So it kind of goes through this walk where it just does its best.
[01:06:31] And there is a very real trade-off between the accuracy that you get and the cost and the runtime of these things.
[01:06:38] And this trade-off results in what we see.
[01:06:41] Now, our current best approach to solving this problem is to inject some context, right?
[01:06:46] Like to give it some information that gives it hints on the right thing.
[01:06:49] And if you come from the AI world, you might call these things skills.
[01:06:52] And if you come from the BI world, you might call these things semantic layers, but they're really the same idea.
[01:06:57] It's like, hey, we can just tell it the things that it needs to do, and then it can do a better job.
[01:07:01] And the problem with this is that, well, while it's totally reasonable and in fact best practices to do it,
[01:07:07] for the most stable business concepts you have, it's unreasonable to do it for the breadth of things that you do within your organization.
[01:07:15] Think about how each and every single marketing campaign defines leads a little bit differently,
[01:07:20] or how every single Scrum team uses all the fields in Jira a little bit differently.
[01:07:24] You just can't realistically write all this stuff down.
[01:07:28] And before you say, hey, can't we just use AI to solve this problem by generating these models?
[01:07:33] Well, it actually just doesn't work.
[01:07:35] Actually, Anthropic published a great blog on the limits of this, talking about the fact that if you do this,
[01:07:41] what you end up doing is just encoding or like anchoring the AI on one specific use case,
[01:07:47] and then that actually negatively affects the results overall.
[01:07:52] So what is our solution to this problem?
[01:07:55] Well, I mean, Ali stole my thunder a little bit.
[01:07:58] It is the Genie ontology, right?
[01:08:01] The Genie ontology is an automatic context layer that adds a learned layer on top of any modeled context that you have,
[01:08:09] whether that's in Unity or in one of your existing semantic models or modeling tools.
[01:08:15] And it basically allows Genie and any agents attached to the ontology to do a much better job of accurately answering questions with data.
[01:08:25] Let me tell you how it works.
[01:08:27] So first, with the Genie ontology, what we do is connect to the systems that you have that contain knowledge.
[01:08:34] So for example, using the Databricks system, you have a ton of information in your pipelines, queries, dashboards,
[01:08:40] and things like that that tells an agent how they can reason about data.
[01:08:44] And we extract all of those things, things like expressions and relationships,
[01:08:49] but also expertise, like the idea that this person is really knowledgeable about a specific topic.
[01:08:55] And we extract all of this into our internal knowledge store.
[01:08:59] And then we go over this knowledge with an algorithm which, you know,
[01:09:04] this is how you know the engineers still run the show at Databricks,
[01:09:07] we call OntoRank.
[01:09:09] And it's a PageRank algorithm that helps us determine which snippets are actually authoritative.
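The talk only says that OntoRank is a PageRank algorithm for deciding which snippets are authoritative. As a rough illustration of that family of algorithms, here is a minimal sketch of classic power-iteration PageRank. The graph, in which assets point at the snippets extracted from them, and the node names are assumptions made for this example and not a description of how OntoRank is actually built.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a small directed graph.

    links maps each node to the nodes it points at. Every target
    must also appear as a key.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical graph: dashboards and queries reference snippets.
graph = {
    "dashboard_a": ["snippet_engagement"],
    "dashboard_b": ["snippet_engagement"],
    "query_c": ["snippet_engagement", "snippet_blocker"],
    "snippet_engagement": [],
    "snippet_blocker": [],
}
scores = pagerank(graph)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

In this toy graph the snippet referenced by three assets ends up with the highest score, which is the intuition behind treating widely reused definitions as more authoritative.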
[01:09:15] And then with this information, when a question comes in to Genie,
[01:09:20] we're able to look up this information, apply permissions,
[01:09:24] because we also extracted out the permissions of the underlying source assets
[01:09:28] so that we don't have to worry about leaking permissions,
[01:09:31] and then we just inject that context into the agent loop so that Genie is able to answer
[01:09:37] questions much better, much more accurately.
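The lookup, permission check, and context injection described here can be sketched as follows. This is a simplified illustration under stated assumptions: the `Snippet` class, the `build_context` function, and the idea of modeling permissions as a set of readable asset names are all hypothetical, and the relevance lookup itself is left out.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Snippet:
    text: str
    source_asset: str   # asset the snippet was extracted from
    authority: float    # score from a ranking step, higher is better


def build_context(question: str, snippets: List[Snippet],
                  readable_assets: Set[str], top_k: int = 2) -> str:
    """Keep only snippets whose source asset the user can read, take the
    top_k by authority, and prepend them to the question."""
    allowed = [s for s in snippets if s.source_asset in readable_assets]
    best = sorted(allowed, key=lambda s: s.authority, reverse=True)[:top_k]
    lines = [f"- {s.text} (source: {s.source_asset})" for s in best]
    return "Context:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"


# Hypothetical snippets and permissions.
snippets = [
    Snippet("Engagement counts sessions longer than 30 seconds.", "mobile_kpis_dashboard", 0.9),
    Snippet("Revenue excludes refunds.", "finance_dashboard", 0.8),
    Snippet("A blocker is a ticket flagged as blocking.", "jira_board", 0.4),
]
context = build_context(
    "How is engagement trending?",
    snippets,
    readable_assets={"mobile_kpis_dashboard", "jira_board"},
)
print(context)
```

Filtering before ranking means a snippet from an asset the user cannot read never reaches the prompt, no matter how high its authority score is.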
[01:09:40] And the result of this is that if I ask a question like how many people are registered,
[01:09:45] without ever having to model it, Genie knows about the nuances of how we captured registrations
[01:09:53] in our internal systems because it was able to extract that information out of some source asset.
[01:09:59] And if you want to know where it came from, you can just inspect it and see how it actually learned
[01:10:03] about these calculations from our underlying dashboards.
[01:10:07] The marketing team is a little bit mad at me about this because I took this video from two weeks ago.
[01:10:12] But the numbers are a lot higher now.
[01:10:16] So, coming back to this chart.
[01:10:18] So we saw what happened with generic coding agents.
[01:10:22] How does Genie with the ontology do?
[01:10:25] Well, in our internal testing, what we've seen is a consistent 30 percentage point plus improvement
[01:10:30] in accuracy and roughly half the runtime of the leading agents, right?
[01:10:37] And we're really confident that as we continue to tune the OntoRank algorithm
[01:10:42] and expand the breadth of the ontology, we'll be able to push this accuracy much, much higher.
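A small worked example helps separate percentage points from percent here. It takes the "about half the time" baseline quoted earlier for the best coding agents and the "30 percentage point plus" improvement, and treats both as round numbers. The exact figures on the chart are not in the transcript, so these values are illustrative assumptions.

```python
baseline_pct = 50       # "about half the time", per the talk
improvement_pp = 30     # "30 percentage point plus", taken as exactly 30

new_pct = baseline_pct + improvement_pp
relative_gain_pct = 100 * improvement_pp / baseline_pct
wrong_before = 100 - baseline_pct
wrong_after = 100 - new_pct

print(f"accuracy: {baseline_pct}% -> {new_pct}%")
print(f"relative gain in accuracy: {relative_gain_pct:.0f}%")
print(f"wrong answers per 100 questions: {wrong_before} -> {wrong_after}")
```

Under these assumptions, a 30 percentage point gain on a 50% baseline is a 60% relative improvement, and it takes wrong answers from 50 to 20 out of every 100 questions.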
[01:10:50] So, with this level of accuracy enabled by the Genie ontology,
[01:10:54] now we can truly create a data smart coworker.
[01:10:58] One that you can give to all of your employees to actually make decisions with data
[01:11:02] and even automate decision making, right?
[01:11:04] It's the foundation.
[01:11:06] So, I'd like to welcome now Elise Joris onto the stage to show you what you can do
[01:11:12] now that you have this foundation and what you can do with Genie 1.
[01:11:25] Hello, everyone.
[01:11:27] My name's Elise.
[01:11:28] I'm a product manager here at Databricks for Genie.
[01:11:31] And I use Genie in a lot of my day-to-day work.
[01:11:35] We actually have an exec review for our Genie OKRs that's coming up.
[01:11:39] And I'm going to use Genie to prepare for it.
[01:11:42] I'm starting here in Genie 1.
[01:11:44] And as you can see, it's a pretty simple interface.
[01:11:47] I have my chat here in the middle.
[01:11:49] We have some buttons that we can use to draft a doc or create a skill.
[01:11:56] On the left side, I have my prior chats as well as all the agents and the assets that are available to me.
[01:12:03] But more on that in just a second.
[01:12:06] I'm going to start by asking Genie to create a document for my review.
[01:12:10] Every team at Databricks uses the same template.
[01:12:13] It's got metrics, highlights, lowlights.
[01:12:15] So, I've given Genie a link to that.
[01:12:17] I've also asked it to poke the people who have opened Jira tickets so that we can keep those moving.
[01:12:23] Now, as Genie is running, we can actually click into what it's thinking.
[01:12:27] So, as you can see here, it clearly found something interesting in the Unity Catalog glossary.
[01:12:32] Seems like it found some potentially relevant ontology snippets.
[01:12:36] And this is actually all from data that I do already have access to.
[01:12:41] Genie isn't forcing you to create a separate permission system.
[01:12:44] It's enforcing the permissions that you already have through Unity Catalog.
[01:12:49] Looks like it also grabbed some Google files and found those Jira tickets.
[01:12:54] It used MCP to comment on those tickets, just like I asked.
[01:12:58] And now it's actually running some SQL across both Databricks and BigQuery to get the latest ground truth.
[01:13:05] And that's actually really important.
[01:13:07] A lot of generic agents are just reciting what already exists in a document.
[01:13:12] But there's no document for this week's metrics.
[01:13:15] That's why I'm doing this.
[01:13:16] Genie understands the numbers and is actually able to compute the results based on our live operational data.
[01:13:23] And to that end, it looks like I already have a doc for my review.
[01:13:27] Looks like Genie pulled the metrics and broke them down by region.
[01:13:32] Gave me a bunch of nice visualizations.
[01:13:35] Grabbed the highlights, the lowlights, anomalies.
[01:13:39] And notice that it actually did that fairly quickly.
[01:13:42] Rather than just grinding through an agentic loop of trial and error and eating up a lot of tokens along the way,
[01:13:48] Genie just went straight to the ground truth.
[01:13:51] And that is really the Genie ontology at work.
[01:13:54] So how do we know that we can actually trust this?
[01:13:57] Well, for that, we're going to pop down here into the citations.
[01:14:00] This is everything that Genie actually used to answer my question.
[01:14:03] And again, we have tables, foreign catalogs, tickets, docs.
[01:14:07] Looks like it did actually use the definition from the glossary.
[01:14:11] You can see here that it's verified.
[01:14:13] But I actually want to click into the ontology.
[01:14:16] I want to make sure that Genie inferred everything correctly.
[01:14:19] So what you're seeing here are ontology snippets.
[01:14:22] These are, again, the facts about our business that Genie has learned and evaluated automatically.
[01:14:28] It looks like there's a snippet for how we define engagement.
[01:14:32] A snippet for what qualifies as a blocker in Jira.
[01:14:36] And one for the SQL that we can use to pull engagement by region.
[01:14:41] I'm going to click into this first snippet.
[01:14:43] And here we see the full definition.
[01:14:46] We're looking at sessions that lasted longer than 30 seconds.
[01:14:50] And it looks like Genie pulled this from our mobile KPIs dashboard,
[01:14:54] which was authored by our colleague Chung.
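The engagement definition shown in this snippet, sessions that lasted longer than 30 seconds, can be sketched in a few lines. The demo mentions SQL for pulling engagement by region. The Python below is only an illustration of the same rule, and the function name, tuple layout, and sample sessions are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

MIN_SESSION_SECONDS = 30  # threshold stated in the demo


def engaged_sessions_by_region(sessions: Iterable[Tuple[str, float]]) -> Dict[str, int]:
    """Count sessions that lasted longer than 30 seconds, per region."""
    counts: Counter = Counter()
    for region, duration_seconds in sessions:
        if duration_seconds > MIN_SESSION_SECONDS:
            counts[region] += 1
    return dict(counts)


# Hypothetical (region, duration in seconds) pairs.
sample_sessions = [
    ("EMEA", 45),
    ("EMEA", 30),   # exactly 30 seconds is not "longer than 30"
    ("EMEA", 120),
    ("AMER", 31),
    ("APJ", 5),
]
print(engaged_sessions_by_region(sample_sessions))
```

Details like whether a session of exactly 30 seconds counts are the kind of nuance that a learned snippet captures and that a generic agent would otherwise have to guess.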
[01:14:57] It's also got a high authority score.
[01:15:00] This is a function of things like how often the snippet is used
[01:15:03] or the asset that it was pulled from.
[01:15:05] It goes back to that OntoRank algorithm that Ken was talking about.
[01:15:08] And I can actually click into Chung's profile.
[01:15:11] And here I see that he's engaged with a lot of the domains our team cares about.
[01:15:15] And yeah, it actually looks like Genie has pulled quite a few snippets
[01:15:19] from the assets that he's created.
[01:15:22] Now, while I'm here, I might as well schedule this to run every week
[01:15:25] so I don't have to manage Jira myself.
[01:15:29] Great.
[01:15:30] And now I'm just going to take this one step further.
[01:15:32] I'm going to ask Genie to add a forecast to my doc.
[01:15:35] One of my teammates already has a forecasting skill,
[01:15:38] so I'm just going to use that.
[01:15:39] And she didn't do anything fancy.
[01:15:40] She just used the built-in forecast function and added a few instructions
[01:15:44] for how we like to review these forecasts.
[01:15:46] And again, this is really only possible because Genie is able to compute these results
[01:15:51] based on our live operational data.
[01:15:54] And it looks like it added my forecast.
[01:15:57] And it also gave me a pretty nice visualization here.
[01:15:59] So now I'm going to ship this.
[01:16:02] I'm going to send it over email and Slack.
[01:16:05] Now, this OKR review isn't a one-time thing.
[01:16:09] People are asking me for updates on the mobile app literally all the time.
[01:16:12] So I'm going to create an agent that can answer questions using context from this conversation.
[01:16:21] I'm going to name my agent.
[01:16:29] Great.
[01:16:30] Now I have an agent.
[01:16:31] This is an expert coworker that can answer questions and take action related to the Genie mobile app.
[01:16:38] And unlike an OKR doc, which is probably going to get stale pretty quickly,
[01:16:43] this agent stays up to date.
[01:16:45] And again, it can compute results on the fly.
[01:16:48] So while I'm here, might as well share this with the other PMs at Databricks.
[01:16:52] And now I'm going to click into this agent.
[01:16:55] So as you can see, I'm actually chatting directly with my agent.
[01:17:00] I'm not going to have to specify what product we're talking about
[01:17:03] because it's already baked into the agent.
[01:17:05] It's customized to the mobile app.
[01:17:07] And you can also see that in these agent details that Genie has automatically generated.
[01:17:12] And more broadly, you can probably imagine all the different teams that might use something like this.
[01:17:17] Think about any function that has to answer a bunch of questions.
[01:17:21] HR, IT, finance.
[01:17:23] Now everyone in the organization can create and collaborate on these types of expert coworkers in one place.
[01:17:30] Now from here, I can go in and manage my agent.
[01:17:35] You know, tighten the scope, sharpen the instructions, add some context.
[01:17:39] Or I can certify it.
[01:17:41] And this is going to appear on our organization's agent list page as a certified agent.
[01:17:46] And the Genie ontology is also going to prioritize it.
[01:17:49] Now, I don't do all of my work on the desktop.
[01:17:54] I actually have to do a lot of work on the go.
[01:17:57] Which is why I have the Genie mobile app.
[01:18:00] And if I open my app, or my phone that is, it looks like I just got a notification.
[01:18:09] Engagement with the app is spiking.
[01:18:11] It's all very meta when you have to work on Genie with Genie.
[01:18:14] And I can click into the app to learn more or maybe ask a follow-up question.
[01:18:20] And to be clear, I didn't actually have to set up this insight.
[01:18:24] Genie knows that I care about engagement with the app because of my activity.
[01:18:29] And it can actually proactively notify me because something interesting happened.
[01:18:33] And I think that's actually really powerful.
[01:18:36] Genie is no longer just a reactive Q&A tool.
[01:18:39] It's now a proactive AI coworker that can bring me things that matter before I even know to ask.
[01:18:45] So in just a couple of minutes, we moved the project along.
[01:18:49] Prepped for an exec review.
[01:18:50] Got some colleagues' questions answered.
[01:18:52] And even learned something that we didn't know to ask about.
[01:18:56] That is a super high quality AI coworker.
[01:18:59] And it's really only possible because Genie understands our business through the ontology.
[01:19:04] It can compute results rather than just reciting documents.
[01:19:08] And it can enforce our organization's governance along the way.
[01:19:12] So we'd encourage you to try this on your own data and share your feedback.
[01:19:16] And with that, Ken is going to come back out.
[01:19:18] Thank you all.
[01:19:23] All right, what do you think?
[01:19:25] Thank you.
[01:19:26] So Genie 1 is a data-smart AI coworker.
[01:19:31] It has full coworker capabilities.
[01:19:34] You can create documents.
[01:19:35] And again, compute them live.
[01:19:37] Not just summarize and recite existing things that you have, but net new analysis.
[01:19:42] You can connect it to all of your systems thanks to the connectors that we have.
[01:19:46] You can schedule it to work in the background.
[01:19:49] And you can schedule it without having to keep your laptop open.
[01:19:53] And you can integrate it with all of your systems.
[01:19:55] Even custom ones.
[01:19:56] Because we have custom MCP tools all managed through Unity AI Gateway as well.
[01:20:02] And you can develop custom skills that you can share and create with your coworkers.
[01:20:06] Even create a dedicated domain-specific agent just by talking to Genie so you can improve the performance of that focused agent over time.
[01:20:16] And Genie, again, works on all of your data thanks to the large and growing ecosystem of connectors that we have through federation and ingestion.
[01:20:26] And it's all managed by Unity.
[01:20:29] And we'll have a deep dive on all the governance features of the Unity AI Gateway tomorrow.
[01:20:34] And what this means is that customers like Warner Music Group are able to use Genie for both Databricks and non-Databricks data.
[01:20:43] And customers like General Motors are able to deploy hundreds of Genie agents into production to improve the performance in every single one of their teams.
[01:20:53] And it allows customers like Foot Locker to be able to think about how to transform their business to a world where every employee is armed with the ability to make decisions with data.
[01:21:05] We really want every single user to be able to leverage data to do their jobs, which is why we're striving to make Genie available everywhere.
[01:21:14] So I'm pleased to announce that Genie is integrated now into Teams and Slack available today.
[01:21:20] And it will be available on your mobile devices on both major platforms on Android and iOS again today.
[01:21:28] So as Ali mentioned, this has already become the most popular way for Databricks employees to experience Genie.
[01:21:34] And I'm pretty sure it's going to be that way for your employees as well.
[01:21:39] Now, for those of you who are using one of these existing AI agents or have developed your own and would like to benefit from the superior accuracy and data smarts of Genie,
[01:21:51] we're also pleased to announce the availability of the Genie MCP app.
[01:21:56] And that gives your existing agents or existing tools the ability to tap into Genie's ability to compute results accurately.
[01:22:05] And that is also available today.
[01:22:09] So we really believe that Genie 1 is unique.
[01:22:12] With the ontology, it's the only AI agent that allows everyone in your organization to actually consistently compute results and not just summarize information from existing documents.
[01:22:24] And we want to make sure it's available to absolutely everybody.
[01:22:28] So in order to kind of encourage that to happen, we're giving $10 worth of tokens for every single user inside your organization every single month.
[01:22:38] So there's really no reason not to get started with Genie 1.
[01:22:43] And we're going to make Genie 1 generally available today.
[01:22:53] All right, so please connect Genie 1 to everything inside your organization and give it to everybody.
[01:22:59] Well, next, I'm excited to introduce to you PepsiCo, who's going to tell you a little bit about the work that we've been doing together.
[01:23:09] One, two, three.
[01:23:10] So, let's go.
[01:23:40] Welcome to the stage, PepsiCo's global chief data and AI officer, Magesh Bhagavathi, and Databricks co-founder, Arsalan Tavakoli.
[01:24:10] So, Magesh, we saw a pretty cool demo out there talking about Genie, but here we're about to talk about all the amazing things that Pepsi's been doing.
[01:24:18] I'm sure everybody out here is familiar.
[01:24:20] They just saw the video, but you guys are a pretty massive and complex business, but best in class when it comes to AI transformation.
[01:24:27] So, how about you tell me a little bit about that?
[01:24:30] Well, thank you, Arsalan.
[01:24:31] It's fantastic to be here.
[01:24:33] Great crowd, very talented crowd, and I'm here to represent Team PepsiCo.
[01:24:37] And, you know, this journey started off for us around six years back as we were really initiating our full-blown end-to-end digital transformation and process transformation as well.
[01:24:47] And the focus has always been faster for our customers and consumers, stronger for our people, products, and our businesses, and ensuring we're better for the planet.
[01:24:58] So, that's been the core mission of PepsiCo.
[01:25:00] And as we talk about the size and scale of PepsiCo, sometimes it amazes me, right?
[01:25:07] We have 320,000 employees globally, operate in around 200 countries, and this is continuous, non-stop operation.
[01:25:16] We have, I don't know, some of you might not know, we have one of the largest fleets in North America.
[01:25:21] So, when you have this level of complexity, you also want to, most importantly, be able to service your consumers.
[01:25:28] And one of the ways we measure how we serve our consumers is occasions.
[01:25:33] We do around 1.4 billion occasions a day.
[01:25:36] An occasion is, you saw all the great products in the video, it is really how you all consume our fantastic products every day.
[01:25:43] So, that's what we call an occasion.
[01:25:46] So, when you have to service these occasions through six million retailers and also your e-B2B and D2C avenues, it's a pretty complex endeavor.
[01:25:54] So, for us, what we truly believe is an AI transformation is a data transformation.
[01:26:00] And that's what we've undertaken.
[01:26:01] So, what the team has done really is fundamentally over the last six years really moved us from a legacy architecture, where we had 60-plus data lakes, to one lakehouse right now running on Databricks.
[01:26:14] And also from a monolithic architecture to a truly pluggable Lego-based architecture.
[01:26:23] That transformation for us is really how we start unlocking the value of the data that's in this platform because now 90% of PepsiCo's universe, data universe, is now on the platform.
[01:26:35] And this is what we call the Enterprise Data Foundation.
[01:26:38] So, for us, this is a living, breathing platform that we double down on.
[01:26:41] That's awesome.
[01:26:43] And look, you were one of the early ones that talked to us about Genie.
[01:26:46] I know you had efforts internally and you basically came to Ali and I and you're like, we have to build something like this.
[01:26:51] Why has Genie been such a game changer for you guys?
[01:26:55] Yes, we did come to you guys a few years back.
[01:26:58] And this was when we were actually, as we started onboarding all this data into the platform, we saw that there's going to be so much richness this platform can offer for us in terms of the core capabilities.
[01:27:09] We call this AI for BI and our ability to dialogue with data.
[01:27:13] Whether it's a warehouse operator, whether it's a frontline employee, whether it's a knowledge worker, we want to be able to dialogue with data just as Ken showed.
[01:27:21] And for us, as we were looking at these AI for BI consoles, it was taking us anywhere between three to six months.
[01:27:27] You want to vectorize it, you want to do RAG patterns and try to bring the data onboard and really then start executing this.
[01:27:34] As we were having conversations, we were starting to see, you know, the accuracy was not where we wanted it to be.
[01:27:40] So that's why it was taking three to six months to productionize.
[01:27:43] And this is when we reached out to you guys.
[01:27:46] So at this point in time, having almost 95% accurate and cataloged data in the Enterprise Data Foundation, for us, Genie is all about unlocking the value from the Databricks environment.
[01:27:59] So that's kind of an amazing story.
[01:28:02] Now, bring that home for me.
[01:28:03] Make it real, right?
[01:28:04] You guys actually have actual use cases of how you've been using it.
[01:28:07] And even though you're getting started, talk to me about some of the value and how you've been using it.
[01:28:11] Yeah, it's not real until you productionize it, right?
[01:28:14] And what we have in PepsiCo as an ambition is we want to agentify PepsiCo.
[01:28:20] And as we think about agentifying PepsiCo, we think about concepts such as the supply chain AI brain.
[01:28:26] One of the components of the supply chain AI brain is the procurement brain.
[01:28:30] Very strong sponsorship from our chief procurement officer to really start looking at insights for all things procurement, both direct and indirect spend.
[01:28:40] We're a $95 billion revenue company.
[01:28:42] So there is a decent amount of direct and indirect spend.
[01:28:45] So within this platform called SpendWise, where our procurement leaders and our key senior leaders operate to understand what their overall indirect spend is, we've had dashboards, capabilities, and reports.
[01:28:57] But as we unleashed Genie within that space, it was truly a big game changer.
[01:29:04] Just in the first few weeks of launch, we almost had 30,000 queries, engagements back and forth.
[01:29:09] And all of a sudden, you start seeing this user population shift from reports and dashboards into a Genie type engagement.
[01:29:16] One of the things I love about Genie, by the way, is a little bit of what Ken showed, was that ability for you to go into agent mode and really then start looking at deep research.
[01:29:27] So you have a tremendous amount of flexibility within the platform, which is what me and a lot of the leaders within PepsiCo really like.
[01:29:34] And now what we're looking to do is we're now looking to expand this use case across all parts of PepsiCo, including commercial reporting for Europe and also Latin America as well.
[01:29:44] Fair enough.
[01:29:44] And then I know they're going to yell at us because we're running late, but I'd be remiss to not ask you, you know, what's next for you guys?
[01:29:50] You guys have such ambitions of what you want to drive.
[01:29:52] So what's next?
[01:29:54] You know, for us, it's all about how do we get to an agentic PepsiCo.
[01:29:59] And we've got huge ambitions.
[01:30:01] For us, it's really about going from insights to action to outcomes for our end user and for our population.
[01:30:09] So our goal is now to go from these 30,000 plus reports down to 50 plus AI for BI or Genie consoles.
[01:30:16] That's a big outcome for us.
[01:30:18] That's how we're going to drive our business going forward.
[01:30:21] This will really help us agentify PepsiCo.
[01:30:23] This will help us agentify our 320,000 associates.
[01:30:27] Awesome.
[01:30:28] Well, it's been an incredible journey and excited to see what comes next.
[01:30:30] Thanks so much.
[01:30:31] Thank you, Arsalan.
[01:30:31] Thanks for the participation.
[01:30:32] Appreciate it.
[01:30:32] Thank you all.
[01:30:46] Welcome to the stage, Databricks Senior Director of Product Management, Bilal Aslam.
[01:30:58] Good morning.
[01:30:59] Good morning.
[01:30:59] Good morning.
[01:31:01] I'm going to talk to you about Lakeflow, but just a little quick note first.
[01:31:05] I wore this purple jacket two years ago.
[01:31:08] Talked about Lakeflow.
[01:31:09] I did not wear it last year.
[01:31:10] And the biggest piece of feedback was bring it back.
[01:31:13] So here it is.
[01:31:14] My name is Bilal Aslam.
[01:31:15] I'm a Senior Director of Product Management at Databricks.
[01:31:17] Super excited to talk to you about Lakeflow.
[01:31:20] Now, first of all, thank you to Magesh and Arsalan and Ali and Ken and everybody who's come
[01:31:24] here and talked about AI and agents.
[01:31:26] Now, super exciting future.
[01:31:28] But if you're anything like me, if you're a data engineer, you're probably sitting there
[01:31:32] thinking, this is great.
[01:31:33] But all of this needs data.
[01:31:35] All these apps, agents, AI, everything needs data.
[01:31:38] And you might be thinking, my job got a little bit harder.
[01:31:42] And that's because with data engineering, the destination is simple.
[01:31:46] The destination is get to the right.
[01:31:48] Get to the business outcomes we want.
[01:31:49] We want agents, applications.
[01:31:51] We want real-time operations.
[01:31:53] But we're actually starting from the left.
[01:31:55] We're starting from raw data sources.
[01:31:57] We have data in Kafka, Salesforce, NetSuite.
[01:31:59] We have on-premise databases like Oracle and SQL Server.
[01:32:03] And the messy middle, that's data engineering.
[01:32:05] That's our job, to take that raw data, turn it into insights.
[01:32:09] And over the years, we have built architectures like these.
[01:32:12] I put this up here because it's representative.
[01:32:15] You don't have to look at every box here.
[01:32:16] It's a lot of logos.
[01:32:18] It's a lot of boxes.
[01:32:19] It's a lot of lines between boxes.
[01:32:21] And that's just to get data from one place to the other.
[01:32:24] And, you know, we've made peace with complexity.
[01:32:26] Sure, there's a lot of tools.
[01:32:27] But you're probably sitting in here thinking, okay, at least it works.
[01:32:32] Even though you know instinctively that this complicated, not unified architecture is missing a bunch of critical things.
[01:32:38] We want to version control everything as data engineers.
[01:32:41] But in this stack, you can version control some things, not the others.
[01:32:44] You can version control your Spark code, but not your Kafka infrastructure.
[01:32:48] You want governance everywhere, but it's pretty much impossible.
[01:32:50] You're reading and writing to too many different places.
[01:32:53] And there are parts of this stack, as we will see soon, that are just locked up.
[01:32:56] And finally, you want everything to scale.
[01:32:58] And you want to save money so you can build more applications and outcomes for your business.
[01:33:02] But not all parts of this stack scale equally well.
[01:33:06] I'm here to tell you that your job and my job as a data engineer is about to get quite a bit harder.
[01:33:13] What you're seeing here is Genie Code.
[01:33:15] It's a purpose-built coding agent for data science, machine learning, and data engineering.
[01:33:19] If you're using it, that's awesome, because 60% of Lakeflow pipelines, this is a crazy statistic, are already written by Genie Code in just three months.
[01:33:30] We just released this.
[01:33:31] It's pretty awesome.
[01:33:33] On the other hand, this statistic scares me a little bit as a data engineer.
[01:33:36] And that's because I have a crush on agents.
[01:33:39] Agents are great.
[01:33:40] But all agents mean new data, new pipelines, more to build, more to manage on that fragile infrastructure.
[01:33:46] We're going to get swamped.
[01:33:48] So what do we do?
[01:33:49] We start by removing the moving pieces.
[01:33:52] And actually, it's super important.
[01:33:53] We don't just want to remove complexity with more complexity.
[01:33:56] We want to replace that with open formats, open frameworks, open tools.
[01:34:01] We ultimately need to simplify the data stack.
[01:34:04] I'm here to make a pretty bold claim.
[01:34:07] And my bold claim is that Lakeflow is that unified stack.
[01:34:10] That it gives you that open foundation for AI and agents so that you can build the future that your company deserves.
[01:34:16] And now we work with hundreds of partners, lots of tools.
[01:34:19] They're all interoperable with Lakeflow and the Lakehouse.
[01:34:22] You can use them as well.
[01:34:23] So let's get started.
[01:34:24] I'm going to do a little bit of a speed run through this.
[01:34:26] How do you simplify data transformation?
[01:34:28] ETL.
[01:34:29] This is where you're spending most of your time and money.
[01:34:31] You have Spark.
[01:34:33] You have lots of legacy Spark jobs.
[01:34:35] You may have a Spark distribution on EMR or some other vendor.
[01:34:39] I'm going to replace that with Apache Spark declarative pipelines.
[01:34:42] If you're not using them, please try them out on Lakeflow.
[01:34:45] They're open.
[01:34:45] You can run them anywhere, including on your laptop.
[01:34:48] And they're declarative.
[01:34:49] So you can just focus on the work to be done.
[01:34:52] And because they're open, agents are really, really good at writing these pipelines.
[01:34:56] They unify batch and streaming.
[01:34:58] Many years ago, dbt was the only game in town when it came to SQL.
[01:35:01] I like writing ETL in SQL, but Spark declarative pipelines unify that as well.
[01:35:05] It has SQL and Python.
[01:35:06] You don't have to choose.
[01:35:08] And finally, if you're using Flink for real-time streaming, that's probably because actually low-latency
[01:35:12] streaming was really hard on Databricks and Spark.
[01:35:16] But now, last year, we open-sourced real-time mode, which is a brand-new engine for low-latency
[01:35:20] streaming, and I'm super excited to share that it's now available inside Spark declarative
[01:35:25] pipelines.
[01:35:25] So you can get millisecond streaming on an open framework that runs anywhere.
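To make the word "declarative" concrete, here is a minimal sketch in plain Python. It is not the Spark Declarative Pipelines API: the `table` decorator, `run_pipeline`, and the three table names are hypothetical and exist only for illustration. Each function declares what a table contains and names its inputs as parameters, and a small runner works out the execution order from those declarations.

```python
import inspect
from graphlib import TopologicalSorter

# Hypothetical mini-framework for illustration only (not the Spark API).
_TABLES = {}

def table(fn):
    """Register a table definition; its parameter names are its upstream tables."""
    _TABLES[fn.__name__] = fn
    return fn

@table
def raw_orders():
    return [{"id": 1, "amount": 20.0}, {"id": 2, "amount": -5.0}, {"id": 3, "amount": 12.5}]

@table
def clean_orders(raw_orders):
    # Declares WHAT the table is: raw orders with a positive amount.
    return [row for row in raw_orders if row["amount"] > 0]

@table
def revenue(clean_orders):
    return [{"total": sum(row["amount"] for row in clean_orders)}]

def run_pipeline():
    """Derive the execution order from the declared dependencies, then run it."""
    deps = {name: set(inspect.signature(fn).parameters) for name, fn in _TABLES.items()}
    results = {}
    for name in TopologicalSorter(deps).static_order():
        fn = _TABLES[name]
        args = [results[param] for param in inspect.signature(fn).parameters]
        results[name] = fn(*args)
    return results
```

Nothing in the table definitions says when to run what. The order falls out of the dependencies, which is the property the talk attributes to declarative pipelines.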
[01:35:30] Awesome.
[01:35:31] So we're taking care of this complexity.
[01:35:32] We're getting rid of this.
[01:35:33] We're building an open foundation for agents and AI.
[01:35:36] But what about all the shadow IT that's happening?
[01:35:39] So you see, you may have 20 data engineers in your company, but you have hundreds of analysts
[01:35:43] who are building pipelines in drag-and-drop tools.
[01:35:46] These pipelines are built on their laptops, proprietary formats, proprietary tools, and if
[01:35:52] somebody leaves the team, you're out of luck.
[01:35:55] Super excited to share that Lakeflow Designer, the no-code data prep tool,
[01:36:00] powered by Genie, is now available.
[01:36:02] It's generally available.
[01:36:03] You can try it today.
[01:36:04] You can try it right now.
[01:36:07] Thank you.
[01:36:10] Awesome.
[01:36:11] And by the way, one super cool thing is that Lakeflow Designer doesn't build any proprietary
[01:36:15] pipelines.
[01:36:16] It just builds Spark declarative pipelines under the hood.
[01:36:19] Again, you can just run them anywhere.
[01:36:21] I think you're starting to see a pattern.
[01:36:22] All right, we're going to go a little faster.
[01:36:24] I'm going to go ahead and simplify data ingestion.
[01:36:26] If you're using a SaaS that has hundreds of connectors, that's great, but it's probably landing
[01:36:31] data in proprietary formats.
[01:36:33] Your agents don't know how to operate this SaaS.
[01:36:36] You cannot automate this.
[01:36:37] You can barely monitor it or observe it or govern it.
[01:36:40] As Ali mentioned, I'm super excited to share that Lakeflow Connect now has more than 100 connectors.
[01:36:46] If there's a connector you're thinking of, we've either built it, we're building it, or we're
[01:36:50] going to build it.
[01:36:51] And there's a new community that's building open source connectors.
[01:36:53] You can even build your own.
[01:36:55] And guess what?
[01:36:56] I think you're going to guess what I'm going to say next.
[01:36:58] This isn't a special type of pipeline or ETL.
[01:37:00] It's just Spark declarative pipelines under the hood.
[01:37:03] That was pretty awesome.
[01:37:05] All right.
[01:37:05] Let's talk about Kafka.
[01:37:07] As data engineers, we use Kafka day in and day out for high-volume telemetry.
[01:37:11] We're bringing that in.
[01:37:12] Kafka's a buffer.
[01:37:13] But we also know Kafka's a real pain to manage.
[01:37:17] Okay?
[01:37:17] Super excited to share that ZeroBus Ingest is a fully managed service that is now 100%
[01:37:23] wire-compatible with Kafka.
[01:37:25] So you can take your Kafka producing code, point it at ZeroBus Ingest, and you can land
[01:37:29] data into the lakehouse in open formats, ready for AI, so no small files, millions of small
[01:37:35] files, at 12 gigabytes a second.
[01:37:38] So it's pretty awesome.
[01:37:39] So you can get rid of Kafka from your stack.
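As a rough illustration of what wire compatibility implies for producer code, the sketch below changes only the connection setting and leaves everything else alone. `bootstrap.servers`, `acks`, and `compression.type` are standard Kafka client settings. The hostnames are made up, `repoint_producer` is a hypothetical helper, and a real migration would presumably also involve authentication settings that the talk does not cover.

```python
def repoint_producer(config: dict, new_bootstrap: str) -> dict:
    """Return a copy of a Kafka producer config with only the broker address changed."""
    updated = dict(config)  # copy so the original config stays untouched
    updated["bootstrap.servers"] = new_bootstrap
    return updated

# Existing producer settings (hostname is hypothetical).
kafka_config = {
    "bootstrap.servers": "kafka-broker.internal.example:9092",
    "acks": "all",
    "compression.type": "lz4",
}

# Hypothetical managed endpoint; the real address would come from your own workspace.
managed_config = repoint_producer(kafka_config, "ingest.example.com:9092")
```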
[01:37:42] Great.
[01:37:43] Okay.
[01:37:44] So we're almost there.
[01:37:45] And you're going to see we're still building everything on an open ETL framework that's
[01:37:50] declarative.
[01:37:51] But you see this little Airflow icon hanging out there.
[01:37:54] That's your orchestrator.
[01:37:55] You have to use that to trigger things, to schedule things.
[01:37:58] Let's simplify that as well.
[01:37:59] So you're probably using Airflow because you like writing workflows in Python.
[01:38:03] Well, Lakeflow Jobs supports writing workflows in Python now.
[01:38:06] It's just pure Python.
[01:38:08] You can also build point-and-click DAGs.
[01:38:11] It's fully serverless.
[01:38:12] So you don't have to manage any infrastructure.
[01:38:14] And one interesting fact is that we did a survey of our customers, and 80% of our customers
[01:38:19] are using old Airflow distributions.
[01:38:21] So they're managing them, maintaining them, they have security vulnerabilities, and they're
[01:38:25] not even getting all the new features.
[01:38:27] Okay?
[01:38:28] And the last thing I'm really excited about, because for the longest time, let's face it,
[01:38:31] with Jobs, it was really good at orchestrating Databricks, not so great at orchestrating
[01:38:35] other systems.
[01:38:37] So super excited.
[01:38:38] This is probably one of my favorites in this conference is we're releasing 50-plus integrations,
[01:38:41] so you can orchestrate everything.
[01:38:43] You can think of these as Airflow operators.
[01:38:44] They're open source, and you can orchestrate everything, including Snowflake.
[01:38:48] Okay, so where does this bring us?
[01:38:50] This brings us to an open stack.
[01:38:53] Everything is pretty much built on open source declarative pipelines, and this is not a brand
[01:39:01] new product.
[01:39:01] We've been at this for a while, and it's ready for the world's toughest data workloads.
[01:39:05] I want to share some statistics with you.
[01:39:08] Spark declarative pipelines now process 200 trillion rows of data every single day.
[01:39:14] Lakeflow Jobs is running the world's biggest data workloads.
[01:39:18] It's about 1.7 billion job runs per month.
[01:39:21] And finally, serverless compute is really popular.
[01:39:24] You don't want to manage clusters.
[01:39:26] You don't want to set up VPCs.
[01:39:27] 50% of our customers have now opted in to use serverless compute.
[01:39:31] That's pretty awesome.
[01:39:33] Great.
[01:39:34] So hopefully I have convinced you to some degree that we made it easier to build,
[01:39:38] and now you might be thinking, okay, we have an open stack.
[01:39:41] I can start building.
[01:39:43] Yeah, I can write the code.
[01:39:44] Maybe you're going to go try out Genie, and it's going to build something for you.
[01:39:48] But what about operations?
[01:39:49] I haven't really solved that for you.
[01:39:51] Actually, I've kind of made the problem worse, right?
[01:39:53] Because now you're going to say, great, now more people can build more pipelines,
[01:39:57] more outages, more problems.
[01:40:00] And here's the reality.
[01:40:01] Simplifying pipeline creation is actually the easy part, although this was pretty tough, right?
[01:40:06] It's data operations that are really, really hard.
[01:40:09] Here's a survey that is by one of our partners, and it shows that data engineering teams spend more than 50% of their time on maintenance.
[01:40:18] So that's where your time is going.
[01:40:19] It's only going to get worse.
[01:40:21] And even with this time spent on maintenance, you still see up to 60 hours of downtime every month.
[01:40:26] So this sucks.
[01:40:28] Now, you might be thinking, but Bilal, you just told me that this is a unified stack.
[01:40:33] Surely it has APIs and endpoints and tables.
[01:40:35] Maybe a coding agent can just fix it.
[01:40:37] And unfortunately, not really.
[01:40:39] Let me walk you very quickly through why.
[01:40:41] And the big idea here is that data engineering is not software engineering.
[01:40:44] It's something a bit more.
[01:40:46] In software engineering, your code is self-describing.
[01:40:49] It literally describes itself.
[01:40:52] And in data engineering, it's data and code.
[01:40:55] In software engineering, your tests either work or they don't.
[01:40:58] They're fully deterministic.
[01:41:00] In data engineering, you have code and data.
[01:41:03] And you can't just test data.
[01:41:04] It's statistical.
[01:41:06] And finally, in software engineering, and this is the big one, failures are loud.
[01:41:10] When something goes wrong, you're going to see a trace.
[01:41:13] You're going to see an exception.
[01:41:14] And then you can roll back your deployment.
[01:41:16] In data engineering, failures are silent and permanent.
[01:41:19] You're probably going to get an angry phone call from a CEO or a stakeholder saying,
[01:41:22] hey, how come this data is bad maybe a week later?
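The contrast above, that a data check is statistical rather than pass or fail, can be sketched with a simple z-score on a table's daily row count. This is an illustrative assumption, not a method described in the talk: the function name, the sample history, and the threshold of 3 standard deviations are all made up.

```python
from statistics import mean, stdev

def row_count_is_anomalous(history, latest, threshold=3.0):
    """Flag the latest row count if it sits far outside the historical spread."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least two points
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# Hypothetical daily row counts for one table.
history = [1000, 1020, 980, 1010, 990]
```

A pipeline that loads 400 rows instead of roughly 1,000 raises no exception, so a unit test never sees it. A check like this one can, because it compares the data against its own history.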
[01:41:26] So why do coding agents trip up here?
[01:41:29] Well, that's because when it comes to detection, they're just missing data.
[01:41:33] They're missing telemetry.
[01:41:35] And you can give them Spark logs, for example.
[01:41:37] But these are multi-megabyte traces.
[01:41:39] You have to be very careful to manage the context window there.
[01:41:42] Similarly, when it comes to assessing the root cause, they're missing lineage.
[01:41:46] Now you have to export your lineage out of the system.
[01:41:49] To some degree, you can do this.
[01:41:50] Coding agents are getting better and better, and context windows are getting better.
[01:41:54] And then you have to remediate.
[01:41:56] They have to write the code fix.
[01:41:58] But it's actually the final step, which I'm going to be a little pedantic about and call
[01:42:01] it verify.
[01:42:02] You can't, with data, you can't just run your unit tests.
[01:42:05] You actually have to take your code fix, and you have to run it on production data.
[01:42:10] Okay?
[01:42:11] And if this is giving you shivers, that's exactly right.
[01:42:13] It really should.
[01:42:14] So what does your agent need?
[01:42:15] Your operations agent needs to combine code and data.
[01:42:19] That's tools and skills.
[01:42:20] To some degree, you can do it.
[01:42:23] To figure out the root cause, it needs read access to production data.
[01:42:29] Who's comfortable with that?
[01:42:30] It's an agent you didn't write and you don't trust.
[01:42:33] I don't see a single hand raised here.
[01:42:36] But actually, your coding agent also needs write access to production data.
[01:42:41] That's the only way it can verify the fix.
[01:42:43] Like, who here wants to give a random agent write access to their production data?
[01:42:48] It's got to be one person.
[01:42:50] Okay, there's no one who wants to do that, right?
[01:42:53] Because it can cause chaos.
[01:42:54] It can drop everything.
[01:42:56] And this is the big insight.
[01:42:58] So we've been looking at this problem.
[01:42:59] It's a tough problem.
[01:43:00] And the big realization we came to was that your operations agent actually needs to live in the data plane.
[01:43:06] It cannot live outside of the data plane.
[01:43:08] And if you think about it, that makes sense.
[01:43:10] Because the data plane has the data.
[01:43:13] It has the lineage.
[01:43:14] It's also the right governance boundary.
[01:43:16] So I'm super, super excited to announce Genie Zero Ops, a new background genie that puts your data and AI operations on autopilot.
[01:43:28] Great.
[01:43:29] I'm going to give you a demo, but let me quickly walk you through how it works.
[01:43:32] So for detection, Genie Zero Ops autonomously builds per-table machine learning models.
[01:43:38] And it continuously fine-tunes them.
[01:43:40] And it has native access to metrics, events, and logs.
[01:43:43] So you don't have to build this plumbing yourself.
[01:43:45] We already have tens of thousands of these machine learning models in production with Genie Zero Ops.
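A toy version of per-table detection that keeps adapting might look like the sketch below. It swaps in an exponentially weighted mean and variance for the machine learning models mentioned in the talk, so it is a stand-in technique and not how Genie Zero Ops works. The class names, the smoothing factor, and the threshold are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TableBaseline:
    """Exponentially weighted mean and variance of one table's row count."""
    mean: float
    var: float = 0.0
    alpha: float = 0.2  # weight given to the newest observation

    def score(self, value):
        std = self.var ** 0.5
        return abs(value - self.mean) / std if std > 0 else 0.0

    def update(self, value):
        delta = value - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)

class Detector:
    """Keeps one baseline per table and updates it with every normal observation."""

    def __init__(self, threshold=4.0):
        self.threshold = threshold
        self.baselines = {}

    def observe(self, table, row_count):
        baseline = self.baselines.get(table)
        if baseline is None:
            # First sighting of a table: start its baseline, nothing to compare yet.
            self.baselines[table] = TableBaseline(mean=float(row_count))
            return False
        anomalous = baseline.score(row_count) > self.threshold
        if not anomalous:
            baseline.update(row_count)  # only learn from values that look normal
        return anomalous
```

Each table gets its own baseline, so a drop that is routine for one table can still be flagged for another.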
[01:43:51] To assess, it does graph ranking on data lineage in Unity Catalog.
[01:43:56] So I'll show you how it walks the lineage forward and backwards.
[01:43:59] And then to figure out the root cause, and this is where you spend a lot of your time.
[01:44:02] The root cause is basically what went wrong.
[01:44:04] It actually has a supervisor agent and a fleet of sub-agents who do research and then come to a consensus on what is the likeliest cause.
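One simple way to picture a supervisor reaching consensus across sub-agents is a majority vote, sketched below. The talk does not say how consensus is actually reached, so the voting rule, the three sub-agent functions, and the evidence fields are all hypothetical.

```python
from collections import Counter

def check_row_counts(evidence):
    # Suspect the table whose row count shrank the most.
    ratios = evidence["row_count_ratio"]
    return min(ratios, key=ratios.get)

def check_recent_changes(evidence):
    # Suspect the table whose pipeline code changed most recently.
    return evidence["last_changed"]

def check_null_rates(evidence):
    # Suspect the table with the highest share of null values.
    rates = evidence["null_rate"]
    return max(rates, key=rates.get)

def supervisor(sub_agents, evidence):
    """Collect one suspect per sub-agent and return the majority pick with its vote share."""
    findings = [agent(evidence) for agent in sub_agents]
    suspect, votes = Counter(findings).most_common(1)[0]
    return suspect, votes / len(findings)

# Made-up evidence for two upstream tables.
EVIDENCE = {
    "row_count_ratio": {"fan_interactions": 0.3, "fan_profiles": 0.98},
    "last_changed": "fan_interactions",
    "null_rate": {"fan_interactions": 0.01, "fan_profiles": 0.02},
}
SUB_AGENTS = [check_row_counts, check_recent_changes, check_null_rates]
```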
[01:44:12] For remediation, Genie Zero Ops works with Genie Code.
[01:44:15] It cooperates.
[01:44:16] If Genie Code has access to your code, to your ticketing system, to your version control, it uses all that.
[01:44:21] So it can also update tickets for you as it goes through the lifecycle.
[01:44:25] And finally, verification is the toughest step.
[01:44:28] To verify, and this is where the 10 years of building an open lakehouse come in, it builds shallow clones of your production data.
[01:44:34] These are super cheap, super fast branches, similar to what we do in Lakebase.
[01:44:38] And then it utilizes native network and code isolation in Databricks.
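The idea behind verifying on a shallow clone can be pictured with Python's `ChainMap`: reads fall through to the original data, while writes land in a separate overlay. This is only an analogy under stated assumptions. Real shallow clones operate on table files, and `verify_on_clone`, `drop_nulls`, `no_nulls`, and the sample data are hypothetical.

```python
from collections import ChainMap

def verify_on_clone(production, fix, check):
    """Apply a fix to a copy-on-write view of production and report whether it passes."""
    clone = ChainMap({}, production)  # reads fall through; writes go to the empty overlay
    fix(clone)
    return check(clone), dict(clone.maps[0])

def drop_nulls(table):
    # Assign new lists instead of mutating in place, so production is never touched.
    for key in list(table):
        table[key] = [value for value in table[key] if value is not None]

def no_nulls(table):
    return all(value is not None for rows in table.values() for value in rows)

# Stand-in for production data: partition name mapped to rows.
production = {"part-0": [1, 2, 3], "part-1": [4, None, 5]}

passed, overlay = verify_on_clone(production, drop_nulls, no_nulls)
```

The fix runs against real values, yet production still holds its original rows afterwards. Only the overlay contains the corrected data.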
[01:44:42] All right, so I'm going to show you a quick demo of what Genie Zero Ops actually does.
[01:44:48] Great.
[01:44:49] So let's see.
[01:44:49] So you're going to notice a couple of things here.
[01:44:54] First of all, this does not look anything like a dashboard.
[01:44:58] We're not flooding you with metrics, events, alerts, red lights and all that.
[01:45:04] What you see is something that looks and feels a lot like an email inbox.
[01:45:08] It's actually, we modeled it after a prioritized email inbox.
[01:45:12] So for example, here, one cool thing is that you get to see severity.
[01:45:15] So every incident is ranked by severity, so you can just focus on what really matters.
[01:45:20] And also, this isn't just a list of jobs and pipelines and failures and things like that.
[01:45:25] Over here, for example, here's an alert, and Genie Zero Ops figured out that, hey, 16 jobs are failing.
[01:45:31] They're actually the same incident.
[01:45:33] So natural grouping saves you time.
[01:45:36] All right, so I'm going to go ahead and look at an actual incident.
[01:45:39] And what's happening here is that there's an upstream table called top fan voters, and something's going wrong with it.
[01:45:45] About 10 minutes or so ago, the row count dropped quite a bit.
[01:45:50] And what Genie Zero Ops does is, by the time you log in, your background agent has actually done all the thinking for you.
[01:45:57] It has done all the investigation for you.
[01:45:59] And as a data engineer, I can start looking at things like, what's the impact?
[01:46:03] Now, when it comes to impact, not all tables are important, not all pipelines are important.
[01:46:08] So it walks the lineage forward.
[01:46:10] But in this case, it figures out that, hey, top fan voters is a pretty important table.
[01:46:14] Downstream tables depend on it, and this fan engagement dashboard depends on it.
[01:46:18] So it marks it as critical.
[01:46:19] Then it deploys the root cause agents to figure out what's going on.
[01:46:24] And what these agents do is, this is where they're saving you hours of time.
[01:46:28] They start from this table, and they fan back using lineage, and they go investigate potential causes upstream.
[01:46:34] So over here, this is a fairly simple DAG.
[01:46:37] For example, you know, it could be any one of these tables.
[01:46:40] But really, it finds this one table, this fan interactions table.
[01:46:43] That's the one that's problematic.
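The upstream walk described here can be pictured as a breadth-first traversal over lineage edges. The following is a minimal sketch, not Genie Zero Ops' actual implementation: the lineage dictionary, the `fan_votes` and `fan_profiles` tables, and the health check are hypothetical, and only `top_fan_voters` and `fan_interactions` come from the demo.

```python
from collections import deque

# Hypothetical lineage: each table maps to the upstream tables it reads from.
# Only top_fan_voters and fan_interactions are named in the talk; the rest are made up.
UPSTREAM = {
    "top_fan_voters": ["fan_votes", "fan_profiles"],
    "fan_votes": ["fan_interactions"],
    "fan_profiles": [],
    "fan_interactions": [],
}

def walk_upstream(table, upstream):
    """Return every table reachable upstream of `table`, in breadth-first order."""
    seen, order, queue = {table}, [], deque([table])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

def find_suspects(table, upstream, is_unhealthy):
    """Keep only the upstream candidates that fail a caller-supplied health check."""
    return [t for t in walk_upstream(table, upstream) if is_unhealthy(t)]

candidates = walk_upstream("top_fan_voters", UPSTREAM)
# Stand-in health check: in practice this would inspect row counts, freshness, and so on.
suspects = find_suspects("top_fan_voters", UPSTREAM, lambda t: t == "fan_interactions")
```

Walking the lineage forward for impact analysis would be the same traversal over the reversed edges.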
[01:46:45] Okay.
[01:46:46] Then it went ahead and autonomously wrote a fix for you.
[01:46:49] But this fix is not deployed to production.
[01:46:52] You're in control.
[01:46:53] What it did do, and I want to share this.
[01:46:55] This is super cool.
[01:46:55] Is that it went and created a shallow clone.
[01:46:58] It's like the Lakebase branch of your production data.
[01:47:01] By the way, you have to give it permission.
[01:47:02] It won't do it automatically.
[01:47:03] You have to tell it you can do it with this pipeline.
[01:47:06] And this could be petabytes of data.
[01:47:08] It's a shallow clone.
[01:47:09] It deployed the fix, and it actually verified that the right number of rows are returned.
[01:47:13] This is awesome.
[01:47:14] This is hours, if not days' worth of work.
[01:47:17] And now at this point, I'm just going to go ahead and create a pull request,
[01:47:20] and I can follow my software development lifecycle.
[01:47:25] Let me show you one other super cool thing that Genie Zero Ops can do.
[01:47:28] The one other thing Genie Zero Ops does is, if you give it permission,
[01:47:32] it will autonomously scan tables for things like PII.
[01:47:36] So in this case, our application is unwittingly exposing the PII of about 1,000 users.
[01:47:41] Okay?
[01:47:41] That's pretty bad.
[01:47:43] And Genie Zero Ops creates a report for me.
[01:47:45] It finds out what the PII, the personally identifiable information, is.
[01:47:49] It gives me a per-table breakdown.
[01:47:52] It looks at lineage to figure out this is important.
[01:47:54] Yeah, this is pretty bad.
[01:47:55] And then in this case, it comes up with a proposed fix.
[01:47:58] That's not a code fix.
[01:47:59] It's a Unity Catalog policy fix.
[01:48:02] And I can deploy it.
[01:48:04] Okay.
[01:48:05] So that's Genie Zero Ops.
[01:48:06] We're super excited about it.
[01:48:08] And for data engineers like me, and hopefully for you,
[01:48:11] it means far less time maintaining and much more time building
[01:48:17] on a unified and simple data stack.
[01:48:20] Thank you very much.
[01:48:26] Welcome back to the stage.
[01:48:28] Ali Ghodsi.
[01:48:32] All right.
[01:48:40] Super cool.
[01:48:40] I'm very excited about Zero Ops.
[01:48:42] It's going to make our lives so much easier.
[01:48:43] Okay.
[01:48:44] So now we're getting to a really exciting part of our show
[01:48:46] where we're going to actually talk about a lot of the announcements
[01:48:48] that I mentioned.
[01:48:49] We're going to see Reynold introduce a lot of those things
[01:48:51] that I mentioned that we have innovated on in the last year.
[01:48:54] So, but before I do that, I want to put it all in perspective
[01:48:56] because it can get kind of confusing in the data world.
[01:48:59] There's so many different technologies.
[01:49:01] There's so many moving pieces.
[01:49:02] So we wanted to simplify that.
[01:49:03] So we call this, what you see here on this slide,
[01:49:06] the known data realm.
[01:49:08] Okay.
[01:49:08] It's kind of like the Game of Thrones, okay, of data infrastructure.
[01:49:12] Okay.
[01:49:12] This is like the troubled waters.
[01:49:14] So I'm going to explain it.
[01:49:15] What does this look like?
[01:49:16] Okay.
[01:49:16] So on the west of this planet, we have the world of OLTP databases.
[01:49:22] Okay.
[01:49:22] So this realm of OLTP databases was invented in the 80s.
[01:49:26] And what it was is people needed a database behind every piece of software.
[01:49:30] So this software, for instance, you might have an ATM.
[01:49:34] You're going to go withdraw money from it.
[01:49:36] That needs to hit the database.
[01:49:37] It needs to do that really fast.
[01:49:38] We want that latency to be milliseconds.
[01:49:41] We want it to never go down.
[01:49:42] And we want it to be super reliable.
[01:49:44] It can't be that it ever messes up the transactions.
[01:49:46] Okay.
[01:49:47] And over the time, we've had these niche databases that also have appeared.
[01:49:50] We have key value stores.
[01:49:52] We have now vector search with AI.
[01:49:54] So that's this island of OLTP.
[01:49:58] Okay.
[01:49:59] On the right-hand side of this map, what we have is the world of data warehousing.
[01:50:03] It appeared shortly after the 80s because people wanted to ask, you know,
[01:50:08] how much data volume went through this ATM here or how much money has actually transacted
[01:50:14] on all the ATMs.
[01:50:16] Asking that question on those plains of data warehousing, or OLTP on the left, would be
[01:50:23] too complicated.
[01:50:24] And it would take down those databases and it would affect the ATMs.
[01:50:27] And we never want that.
[01:50:28] So we want that to be up and running.
[01:50:29] We want it to be isolated.
[01:50:30] Okay.
[01:50:31] So on the right-hand side, we started doing data warehousing.
[01:50:33] And now you could ask these more analytical questions.
[01:50:36] Okay.
[01:50:36] All right.
[01:50:37] But very soon, in the last 15 years, people wanted to ask much more advanced questions.
[01:50:41] Not just what's the total transaction volume on the ATMs, but how much is Ali going to withdraw
[01:50:47] today?
[01:50:48] Can you predict the future?
[01:50:49] So we started doing data science and more and more advanced statistical analysis.
[01:50:54] Okay.
[01:50:54] And then last few years, we've now had real-time analytics appear.
[01:50:59] And real-time analytics, that's, you know, agents or applications or dashboards that want really low latency.
[01:51:06] They can't quite answer complicated questions and doing big joins of data and saying, you know, let's do this analysis of all this data.
[01:51:14] But they can give you dashboards that give you some analytical queries.
[01:51:17] So for each of these, we needed different things.
[01:51:19] And we needed to move data from the left-hand side, from the OLTP databases, all the way to the right.
[01:51:24] Okay.
[01:51:24] So for that, that's where data engineering came in.
[01:51:26] And that's where we had all the complexity of moving data and shuffling it from the OLTP databases all the way to the OLAP engines on the right-hand side.
[01:51:33] Okay.
[01:51:34] So this is what the world looked like.
[01:51:35] All right.
[01:51:35] You just heard Bilal.
[01:51:37] So this world of data engineering, that's what he was focused on, how you actually get your data from all these systems ready so that you can start doing operations on it.
[01:51:44] So we just cleaned this up.
[01:51:46] Okay.
[01:51:46] These plains, he just cleaned them up and added roads.
[01:51:49] But now what we're seeing is that there's this onslaught of agents coming.
[01:51:55] Right.
[01:51:56] We're having agents that require us to ship way more data between these different islands.
[01:52:00] Okay.
[01:52:01] And we're going to get crushed by these storms that are going to come, and we're going to lose ships.
[01:52:05] Okay.
[01:52:05] Our infrastructure is super brittle.
[01:52:07] We've already seen many of the very important infrastructure sites that we depend on go down in the last year.
[01:52:14] This is going to happen to every organization.
[01:52:15] It's almost like everybody just hired double the number of employees.
[01:52:19] Okay.
[01:52:19] Because agents are submitting queries.
[01:52:20] They want more data stuff.
[01:52:22] More data movement is happening.
[01:52:23] So we need to simplify this.
[01:52:25] Okay.
[01:52:25] So how do we simplify this?
[01:52:26] We kind of did that already when we started Databricks.
[01:52:30] And what we did is that we kind of merged data science and data engineering.
[01:52:35] Spark already could do big data movement, and it allowed you to do machine learning with things like Spark ML.
[01:52:42] And five years ago, we announced Lakehouse, which lets us do, in one engine, data warehousing, data engineering, and data science.
[01:52:49] So we've already kind of accomplished that.
[01:52:51] That's kind of great.
[01:52:52] But we still have this real-time analytics island separate.
[01:52:57] And we need to make a separate copy of the data if we want to have really, really low latency.
[01:53:02] Because the analytics stacks today, all the data warehouses that are there, they can't go down to a fraction of a second.
[01:53:08] No matter what you do, there's a hard tradeoff.
[01:53:10] They can't do the complex queries and the fractional second queries that real-time analytics has.
[01:53:15] So we need separate engines for all of that.
[01:53:18] Okay.
[01:53:18] So for this, I'm excited to announce our latest innovation.
[01:53:22] I'm going to let my co-founder, Reynold Xin, do that.
[01:53:24] So please welcome him on stage.
[01:53:32] All right.
[01:53:33] Thank you, Ali.
[01:53:35] I'm super excited coming on stage today to talk to you about this new effort.
[01:53:40] And honestly, it's probably the single largest innovation we have done since our introduction of Lakehouse.
[01:53:46] And as Ali already mentioned, with the Lakehouse, we were able to actually unify data science, data engineering, as well as data warehousing.
[01:53:54] And our data warehousing business has taken off exponentially since its introduction five years ago.
[01:53:59] It was recognized by the Gartner and Forrester both as a leader in data warehousing.
[01:54:04] I think over 60% of the Fortune 500 uses the Lakehouse for warehousing.
[01:54:08] And I joke about we actually scan exabytes of workload, depending on which time zone you are, before you have breakfast.
[01:54:15] And over the course of the last few years, we worked fairly hard to improve performance at every single data and AI summit.
[01:54:22] I think me or Sean, somebody else, would come up on stage and introduce to you some massive performance improvements we've done.
[01:54:28] But the reality is kind of the warehouses or the lakehouses hit a wall at about a second.
[01:54:34] And what do I mean by hitting a wall?
[01:54:36] It doesn't mean you can never get a response from the lakehouse or warehouse in less than a second for a query.
[01:54:42] It just means if you have a very stringent workload that you want to have very tight SLAs in the range of milliseconds or hundreds of milliseconds,
[01:54:50] it would be very, very difficult to accomplish that because you might actually get latency spikes.
[01:54:55] And as a result, I think for many organizations that actually require workloads that have very stringent SLAs,
[01:55:02] they started setting up a separate serving stack.
[01:55:05] They would copy their data into the data warehouse or the lakehouse and copy a portion of it that's required for serving into a separate stack.
[01:55:13] And this caused a lot of problems.
[01:55:15] It's an actual data pipeline to maintain for every data set.
[01:55:19] These serving stacks tend to perform super well on very simple queries, but don't actually work well for the more general workloads.
[01:55:25] And it's a governance nightmare.
[01:55:27] You now have to worry about, hey, did I forget to actually secure my data set in this specific platform, but I actually did in my underlying warehouse.
[01:55:34] So if the whole theme here is unification, what if you never had to move your data?
[01:55:41] What if you never have to copy it?
[01:55:42] And we asked ourselves that question a couple of years ago, and it turned out it's actually fairly fundamental.
[01:55:48] It's very, very difficult to tackle this problem.
[01:55:50] As a matter of fact, Michael Stonebraker, who's a Turing Award winner, and some of you might know him as the original creator of Postgres,
[01:55:57] wrote a whole academic paper about it called "One Size Fits All": An Idea Whose Time Has Come and Gone.
[01:56:03] And in this paper, Stonebraker actually argued it's very, very difficult to engineer a single database engine that's capable of running a wide variety of workloads.
[01:56:13] And ultimately it's because of the increase in complexity makes it very, very difficult to actually engineer the system.
[01:56:18] And the fundamental reason for that is, if you look at how database engineering was done in the last four decades or even five decades,
[01:56:26] every database engineering team started working approximately this way, Databricks included.
[01:56:31] We'd go targets of, hey, here's a new workload we might want to actually study and then understand and then actually build a system for it.
[01:56:39] And the way to do it is to study all the latest and the coolest academic papers about this specific type of workloads.
[01:56:46] And some of the papers actually go back 40, 50 years, some of the papers are published in the last few years.
[01:56:51] You try to study them, understand, hey, it might apply in real life.
[01:56:55] And you spend a lot of time implementing it and at some point you try to roll it out to test how well it works with real world workloads.
[01:57:02] And inevitably this happens.
[01:57:05] So the new technique you implement, it could be a new algorithm, could be a new data structure.
[01:57:11] It works remarkably well for a subset of the queries and it works terribly for a subset of the queries.
[01:57:18] So the database ends up, hey, we're doing better here and actually regressing.
[01:57:22] And many of the algorithms and data structures that actually work really, really well for very, very low latency, to guarantee maybe a millisecond response time,
[01:57:29] tend to actually backfire when we have larger data sets, and vice versa.
[01:57:33] So this is one of the biggest problems.
[01:57:36] So we, two years or so, decided, hey, can we break the status quo?
[01:57:41] How can we challenge it and do the best we can?
[01:57:44] And we actually went back to the drawing board with an amazing engineering team.
[01:57:48] We actually flipped the way database engineering was done.
[01:57:51] Instead of starting with ideas and techniques, we went back to the workloads.
[01:57:56] So luckily, on Databricks, we have collected over zettabytes of data. We scanned traces from all of this data, which is in the range of quadrillion, actually, traces.
[01:58:05] I have to learn how to pronounce that word.
[01:58:07] And all coming from trillions of queries that were executed.
[01:58:10] And based on all of these traces, we're able to actually build a machine learning model that gives you a very high fidelity estimation of what might happen in practice, given a specific new technique.
[01:58:24] And based on this machine learning model, we can do two things.
[01:58:26] The first is we can select the right algorithm to dispatch very quickly at runtime, but it's not an LLM.
[01:58:32] Because LLMs have too high latency.
[01:58:34] We can actually pick the right algorithm to dispatch at runtime, but even more importantly, we can decide and predict what algorithms to even implement before it hits production.
[01:58:45] Because there are millions of algorithms and data structures out there, it would be virtually impossible to actually find them all.
[01:58:52] Knowing what to implement is one of the biggest advantages one can get.
[01:58:55] So, the resulting engine of that is Raiden, which you all know is a reference to Mortal Kombat.
[01:59:04] So, Raiden, and the resulting engine is actually pretty remarkable.
[01:59:09] And you might have actually used some of the other serving stacks out there, and if you have, you run into many of the issues you're talking about.
[01:59:16] For example, some of them can't run queries longer than five seconds, and almost all of them require you to copy the data into a very specific format that's very engine specific.
[01:59:25] They can't run complex joins and all of that, but Raiden can actually do it all, right?
[01:59:29] And the way Raiden will show up is going to come in the form of a new SQL warehouse.
[01:59:34] And this is kind of abstract, so let me give you maybe a quick demo for you to help understand what Raiden actually does.
[01:59:42] All right, can I get a demo on stage?
[01:59:46] Okay, so here I have a very simple setup here, honestly just a very simple query, the New York taxi dataset query.
[01:59:53] You don't really need to understand what it does, but for those of you that actually know what the dataset is, it's fairly small.
[01:59:58] And this runs, first of all, I ran this query in Lakehouse today, all right?
[02:00:04] It finished, it's not terrible, it finished in about a second, all right?
[02:00:08] And on the right-hand side, I have the Raiden setup, which we call Lakehouse RT, and let's try to run this.
[02:00:16] Okay, I don't know if you can see this number.
[02:00:25] Let's try to run it one more time.
[02:00:27] Nope, did we hard-code it, 0.007?
[02:00:30] Nope, no, we didn't.
[02:00:33] Okay, but that's just one query, all right?
[02:00:36] What's the fun of running just one query?
[02:00:40] Here I want to show you a demo of what would happen in practice when you start hammering the system.
[02:00:45] And we built a simulator simulating real dashboard application workloads hitting against the Raiden
[02:00:51] engine.
[02:00:51] And I'm going to start with just one active session here.
[02:00:54] So just imagine this is an actual dashboard running, and we're just going to load this dashboard.
[02:00:59] It loads fairly quickly, all right?
[02:01:00] It's probably what we expect.
[02:01:01] But what happens if you scale up to 10, 100, or why don't we try 1,000?
[02:01:07] So now we're going to simulate 1,000 active agents hitting the same engine at the same time.
[02:01:14] Each of the agents, by the way, is generating more than one query.
[02:01:17] The dashboard actually takes about eight queries for it to load.
[02:01:20] And they will all hit at exactly the same moment.
[02:01:23] Let's see what happens.
[02:01:25] Done, all right?
[02:01:28] So we executed 8,000 queries at 6,000 queries per second with a tail latency of 37 milliseconds.
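The numbers quoted in the demo can be sanity-checked with a few lines of arithmetic. This is an illustrative sketch only: the session count, queries per dashboard load, and throughput come from the talk, while the sample latencies and the nearest-rank percentile helper are assumptions added to show how a tail latency is typically read off a set of measurements.

```python
import math

sessions = 1000            # simulated active sessions, from the demo
queries_per_load = 8       # queries needed to load the dashboard, from the demo
throughput_qps = 6000      # reported queries per second

total_queries = sessions * queries_per_load
# At the reported rate, the whole burst drains in a bit over a second.
wall_clock_s = total_queries / throughput_qps

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up latencies in milliseconds, only to show how a tail percentile is computed.
latencies_ms = [5, 6, 7, 7, 8, 9, 10, 12, 20, 37]
p90 = percentile(latencies_ms, 90)
```

The talk does not say which percentile the 37 millisecond tail latency refers to, so the helper takes the percentile as a parameter.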
[02:01:37] This is incredible.
[02:01:39] None of the existing systems can do this.
[02:01:41] That's a very simple demo.
[02:01:45] Okay, but when I told you, hey, none of the existing systems can do that,
[02:01:51] let's put it into the perspective of existing systems.
[02:01:54] So here, I want to show you a benchmark measuring latency versus throughput.
[02:01:58] Throughput here means queries per second, right?
[02:02:00] So we have the y-axis, which is the P90, the 90th percentile latency.
[02:02:05] And the x-axis, the number of queries executing.
[02:02:08] And we're running a very simple query, TPC-H query 6, which is a canonical query of scanning a little bit of
[02:02:14] data, running some aggregation.
[02:02:16] And let's see what happens when we test the different systems out there.
[02:02:19] So for a competitive vendor, using their latest warehouse, we're able to actually push the warehouse.
[02:02:26] It started fairly low latency, but as you start pushing up about 160 queries per second,
[02:02:31] the latency started climbing up and hitting a wall.
[02:02:35] So the same vendor actually announced a fairly innovative offering last year called interactive warehouse.
[02:02:41] It's so interactive, they can't actually read the same data with the general warehouse.
[02:02:45] The only way to actually have it work is to copy the data into a separate warehouse.
[02:02:50] And the interactive warehouse does do much better than their latest generation general warehouse.
[02:02:55] You were able to push to like 300 queries per second before the latency starts spiking up.
[02:03:01] And we also tested sort of an open source query engine that's known for being able to run very simple queries
[02:03:09] very, very fast with the yellow color.
[02:03:11] And it's actually running much better on this specific workload compared with the gen 2 and the interactive warehouse.
[02:03:19] And it's able to push all the way to close to actually, I think, 15,000 queries per second.
[02:03:25] That's pretty good, right?
[02:03:26] But then it just started crashing right after.
[02:03:28] It doesn't actually time out, it just crashes.
[02:03:30] So how can Raiden do, right?
[02:03:32] How does Raiden do?
[02:03:35] On the same workload, Raiden could actually run all the way to 12,000 queries per second,
[02:03:40] while keeping the tail latency still below a second.
[02:03:51] So there's one other benchmark I'm going to show you, which is what if we run all of TPC-H,
[02:03:56] which is a standard data warehousing benchmark.
[02:03:59] As you can see on the screen, when data set is small, actually all systems run reasonably well.
[02:04:04] I mean, it's like a few seconds running 22 queries, some a little bit faster than others, but they all run pretty well.
[02:04:10] But as you scale up, for example, to 100 gigabytes, you start seeing the yellow system's run time spiking.
[02:04:17] And this is actually, I'm trying to point out the very specific problem I talked about earlier,
[02:04:22] which is a specialized serving engine doesn't actually work very well on the more complex joins,
[02:04:27] which will show up in the full TPC-H queries.
[02:04:29] And of course, when we scaled up to actually a bigger amount of data, both systems actually started crashing.
[02:04:35] The interactive warehouse can't even finish because it cannot run any queries longer than five seconds,
[02:04:41] and the yellow system couldn't actually complete because it started running out of memory.
[02:04:44] And Raiden was actually able to complete all queries at all scale factors.
[02:04:49] You might have benchmark fatigue at this point, so many benchmarks.
[02:04:54] Who cares about synthetic benchmarks?
[02:04:56] So one of the things that we're really excited about in the process of developing Raiden is the design partners
[02:05:01] we've been working with, some of which I've shown their logo on the screen here.
[02:05:05] I want to share with you two results from all of these design partners.
[02:05:09] The first one is Enverus.
[02:05:11] Enverus is a global leading data and AI platform for the energy sector,
[02:05:16] and they tested Raiden with 11 of their most representative production queries.
[02:05:21] And what they found compared with the existing serving stack is that with Raiden,
[02:05:26] the same extremely low latency queries remain sub-hundred milliseconds,
[02:05:30] whereas the longer running queries can now actually dramatically shrink that run time.
[02:05:35] And across the board on average, they're actually getting a 16 times speed up.
[02:05:39] So Paul from Enverus was super excited to see that they're able to now execute queries in tens of milliseconds.
[02:05:46] But also shrink the runtime of the longest running queries by almost 100x.
[02:05:53] And Meta also tested Raiden and they found very similar things.
[02:05:56] They were able to have the typical queries come back in tens of milliseconds.
[02:06:00] And the most important thing is you don't need a separate system on the side just for your serving.
[02:06:06] So what Raiden does, it can give us the best low latency possible
[02:06:10] at extremely high concurrency.
[02:06:13] It can actually run real-world workloads really, really well.
[02:06:15] And it can also handle the complex EDW workloads super well.
[02:06:21] And today, very excited to announce the first product offering powered by the Raiden engine.
[02:06:26] It's called the Lakehouse RT.
[02:06:28] And Lakehouse RT is a new warehouse that's showing up starting with read-only workloads.
[02:06:33] It will support millisecond performance and massive concurrency.
[02:06:36] But the most important thing is it runs directly on the lake, governed by Unity Catalog, on the same Delta or Iceberg data.
[02:06:47] So what can you do differently with Lakehouse RT, where RT stands for real-time?
[02:06:55] Challenge your conventional wisdom.
[02:06:56] If you have a separate serving stack, test it out, and you might actually collapse the two platforms into one.
[02:07:04] And that will give you millisecond performance, massive concurrency, directly against your data lake in open formats.
[02:07:13] And the beta is starting today. Talk to your account team to get early access.
[02:07:19] And there's also two talks at this conference starting soon today on Raiden and Lakehouse RT that you can learn more about.
[02:07:26] And that's it.
[02:07:29] And next, I think I would like to invite Nikita on the stage to talk about Lakebase.
[02:07:38] Not quite.
[02:07:41] So, Reynold refuses to say what Raiden stands for, so I'm going to get up here and embarrass him a little bit.
[02:07:47] Okay, so Reynold runs all of the data platform stuff, all this innovation you see that's part of his team.
[02:07:53] So his team actually called the engine Raiden because it stands for Reynold's Dream Engine.
[02:07:59] Okay, it's true, true story.
[02:08:01] You know, A, I think they wanted to embarrass him, so that's why he doesn't want to call it that.
[02:08:05] But secondly, I was actually talking to one of the team members, and he said, look, there's so many projects at Databricks,
[02:08:10] and you guys always, you know, kill the projects that, you know, you think you don't want us to be spread out thin.
[02:08:15] So we thought, what if we would name this after a co-founder?
[02:08:19] Okay, so that's the Raiden engine.
[02:08:22] So this is the map that we had of the data world.
[02:08:24] And then you saw that now we had real-time analytics on a separate island.
[02:08:28] Now we can shift these tectonic plates and merge that in.
[02:08:31] You get the same format of the data under the hood.
[02:08:34] You get the same Unity Catalog governance.
[02:08:37] And you can now do all of your low latency processing, as well as the big data processing,
[02:08:42] in this one island using the same technology stack under the hood.
[02:08:45] So it comes as part of Lakehouse RT, which stands for real-time.
[02:08:49] And we've actually priced it super competitive, so it's essentially the same price.
[02:08:52] So very excited about this.
[02:08:54] Okay, so now we're going to turn our eyes to the left, and we're going to look at this island of OLTP.
[02:09:00] And I'm very excited to welcome on stage Nikita Shamgunov, who was CEO of MemSQL, later SingleStore.
[02:09:08] And then Neon, which was acquired by Databricks, where he leads all of this.
[02:09:11] And he's been doing OLTP for a very, very long time.
[02:09:14] So he'll tell us about Postgres.
[02:09:16] So let's welcome Nikita on stage.
[02:09:26] Thanks to AI, we now generate software.
[02:09:30] We're not writing it by hand.
[02:09:32] And we're looking at the trends for the first six months of year 26.
[02:09:36] And now we can confidently say we're going to have more software generated in the next year
[02:09:42] than in the history of humanity.
[02:09:45] And every application still needs a database.
[02:09:50] So what do next generation applications and agents need from a database?
[02:09:58] So the first thing is it needs to be familiar.
[02:10:01] And the reason to that is the more agents know about the systems,
[02:10:06] and they know by scanning the internet and scanning all the Stack Overflow forums and Reddit,
[02:10:10] the better they are at running and operating it.
[02:10:13] So it needs to be open source.
[02:10:15] It needs to be popular.
[02:10:17] And it needs to be extensible.
[02:10:18] So you can consolidate all your main and also niche workloads on the single database platform.
[02:10:26] It also needs to be nimble.
[02:10:28] With this onslaught of applications that are coming at us and all the various dev tests and staging environments,
[02:10:35] the system needs to be serverless.
[02:10:37] And it needs to be branchable so we can easily create environments for your agents in which they can safely operate.
[02:10:44] And finally, as you run those things at scale and at volume, it needs to be cost-effective.
[02:10:51] The final thing, it needs to be mission-critical. Operational databases run your business.
[02:10:57] And what that means is it needs to be infinitely scalable, it needs to be fast, and it needs to be extremely reliable.
[02:11:06] So familiar, nimble, and mission-critical.
[02:11:14] So we decided to start with Postgres, the most advanced open source database in the world.
[02:11:20] It has the largest ecosystem in the world and also lots and lots of extensions.
[02:11:26] Again, you can consolidate all your additional workloads to the main one, to the operational database system one, into the platform.
[02:11:34] It's also open source and understood by every agent on the planet.
[02:11:39] But Postgres is a monolith, and the compute and storage in the monolith are tightly coupled.
[02:11:46] So if you want to make it nimble, you need to do something, you need to re-architect that system.
[02:11:53] The main idea we had is we can decouple storage and compute and move storage into the lake.
[02:12:01] But it's actually, you know, much harder than you might think.
[02:12:05] So for that, we need to re-architect storage from the ground up.
[02:12:09] The lake storage is, you know, has a lot of very, very good properties.
[02:12:13] It's inexpensive, it's very easy to scale, but it's also slow and transactionally inconsistent.
[02:12:21] So as we are rebuilding the storage for Postgres to run it on the lake,
[02:12:26] we had to introduce two additional services, one for reads and one for writes.
[02:12:30] For writes and transactional consistency, we built safekeepers.
[02:12:35] They implement a consensus protocol called Paxos, so they give you low latency writes.
[02:12:41] And then we introduced another service called PageServers that serve pages to Postgres compute.
[02:12:47] Those systems are called PageServers, and they deliver on low latency reads.
[02:12:52] Then we put it all together and integrate with the lake.
[02:12:55] And that gives us Lakebase, the fully managed serverless Postgres that runs on the lake.
[02:13:05] So what does Lakebase give us?
[02:13:08] Well, it certainly is nimble, right?
[02:13:11] So we can provision Postgres in under 500 milliseconds.
[02:13:15] And as you have lots and lots of developer staging environments for potentially every PR that you want
[02:13:23] to move into GitHub, we can scale that also to zero, right?
[02:13:28] So with lots and lots of environments and they're serverless, we need to deliver on TCO.
[02:13:34] So we automatically shut down environments that you don't use.
[02:13:39] We introduced two more patterns that are incredibly useful in this new agentic world.
[02:13:45] The first one is branching.
[02:13:46] Every Postgres database, you can easily branch, and it also branches in about 500 milliseconds.
[02:13:54] So you can create an additional environment for dev, test, or staging.
[02:14:00] Another pattern that kind of emerged in that agentic development is snapshot restore.
[02:14:07] With any -- with a simple mouse click or an API call, you can snapshot your database.
[02:14:12] Then unleash your agent to do work, to build some software, change the schema, change the data,
[02:14:17] whatever.
[02:14:19] So if the work is to your liking, you can proceed.
[02:14:21] If not, you can instantly roll back to the previous snapshot.
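The snapshot, run-agent, keep-or-roll-back loop described above can be sketched as follows. This is an illustration only: `ToyDatabase` is an in-memory stand-in that copies its data, and `run_agent_with_rollback`, `agent`, and `accept` are hypothetical names, not a Lakebase client or API.

```python
import copy


class ToyDatabase:
    """In-memory stand-in for a database; snapshots are deep copies."""

    def __init__(self):
        self.tables = {}
        self._snapshots = {}

    def snapshot(self, name):
        self._snapshots[name] = copy.deepcopy(self.tables)

    def restore(self, name):
        self.tables = copy.deepcopy(self._snapshots[name])


def run_agent_with_rollback(db, agent, accept):
    """Snapshot, let the agent change the database, keep or roll back."""
    db.snapshot("before_agent")
    agent(db)
    if accept(db):
        return True
    db.restore("before_agent")
    return False


db = ToyDatabase()
db.tables["orders"] = [{"id": 1, "total": 42}]


def careless_agent(database):
    # Simulates an agent that makes an unwanted schema change.
    del database.tables["orders"]


kept = run_agent_with_rollback(db, careless_agent, lambda d: "orders" in d.tables)
print(kept, db.tables)
```

A real system would take the snapshot at the storage layer rather than by copying data, which is what makes it cheap enough to do before every agent run.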
[02:14:27] So now it's built for agents.
[02:14:30] But is it mission critical?
[02:14:31] Can it support your business?
[02:14:34] Let's start with scalability.
[02:14:37] With Lakebase, all you need to set is error bars for your compute.
[02:14:41] You can say, well, don't scale my compute below this minimum or over that maximum.
[02:14:47] And you want to do it for potentially cost controls.
[02:14:50] But within those error bars, you actually can scale up and down automatically as your workload
[02:14:58] changes.
[02:14:59] So we'll scale the system up at your peak hours, scale things down maybe on weekends or nights.
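As a rough illustration of scaling between a configured minimum and maximum, the rule amounts to clamping the compute the workload asks for. The names `ComputeBounds` and `target_compute` and the unit values are assumptions made for this sketch, not Lakebase settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeBounds:
    """User-chosen limits; the field names are hypothetical."""
    min_units: float
    max_units: float

    def __post_init__(self):
        if not 0 <= self.min_units <= self.max_units:
            raise ValueError("need 0 <= min_units <= max_units")


def target_compute(load_units, bounds):
    """Clamp the compute the workload asks for to the configured range."""
    return max(bounds.min_units, min(load_units, bounds.max_units))


bounds = ComputeBounds(min_units=0.5, max_units=8.0)
for label, load in [("night", 0.1), ("weekday", 3.0), ("peak", 20.0)]:
    print(label, target_compute(load, bounds))
```

The maximum acts as the cost control mentioned in the talk, and the minimum keeps a floor of capacity available when the workload is quiet.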
[02:15:08] Storage, of course, is on the lake.
[02:15:11] And that gives you infinite scalability.
[02:15:13] You will never run out of storage.
[02:15:15] And you will never run out -- you will never need to run management operations if you're approaching,
[02:15:20] you know, your disk size or whatever.
[02:15:25] So now you might be wondering, so you changed the architecture of Postgres at the storage level.
[02:15:32] So is it fast?
[02:15:35] How does it compare with the industry implementation of Postgres as a service?
[02:15:39] So this is a very similar graph to what Reynold was showing for Raiden, where we're measuring latency as we scale throughput.
[02:15:48] Obviously, here, it's an operational system.
[02:15:51] So it runs much smaller queries than in analytical systems, but they run it at very high concurrency.
[02:15:58] And we're comparing it with the first cloud vendor, which is quite popular.
[02:16:03] And that cloud vendor taps out at about 130 operations per minute.
[02:16:09] And by the way, we're running a fairly standard industry benchmark called TPROC-C.
[02:16:15] Another cloud vendor with slightly different architecture was able to push
[02:16:21] its throughput to about 350,000 operations a second.
[02:16:25] But after that, of course, latencies started to spike as well.
[02:16:29] And I'm very excited to showcase some of the incredible performance work that we've delivered.
[02:16:35] And Lakebase can scale under 10 milliseconds for each transaction, for each operation,
[02:16:41] all the way north of 600,000 operations per second.
[02:16:48] Lakebase is ready for almost any workload.
[02:16:53] Mission critical also means a lot of features, right?
[02:16:56] And those features are security, compliance, encryption, you name it, and we have it.
[02:17:02] But I also wanted to showcase something that we're in a very, very unique position to deliver on.
[02:17:10] So cloud outages don't happen every day.
[02:17:14] But when they do happen, they're devastating to your business.
[02:17:18] And now that a lot of your business is increasingly automated and run by agents themselves,
[02:17:24] cloud outages could be really, really disruptive to your operation.
[02:17:29] So I'm incredibly excited to introduce the first fully managed cross-cloud disaster recovery.
[02:17:45] This is what we mean by truly mission critical.
[02:17:48] With fully managed cross-cloud disaster recovery, you can set up your system to run cross-clouds.
[02:17:55] So you can provision Lakebase, let's just say US West AWS, and then a replica US East Azure.
[02:18:06] And in the case of an AWS outage, you can instantly fail over from one cloud to another
[02:18:13] and continue your uninterrupted business operations.
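The failover decision in that example can be sketched in a few lines, assuming a primary on one cloud and a replica on another. `Endpoint` and `pick_active` are hypothetical names, and how health is detected or how replication works is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    cloud: str
    region: str
    healthy: bool = True


def pick_active(primary, replica):
    """Prefer the primary; fall back to the cross-cloud replica if it is down."""
    if primary.healthy:
        return primary
    if replica.healthy:
        return replica
    raise RuntimeError("no healthy endpoint available")


primary = Endpoint(cloud="aws", region="us-west")
replica = Endpoint(cloud="azure", region="us-east")

print(pick_active(primary, replica).cloud)
primary.healthy = False  # simulate an outage on the primary cloud
print(pick_active(primary, replica).cloud)
```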
[02:18:18] So this is Lakebase. It's fully managed serverless Postgres that runs on the lake,
[02:18:23] and it is mission critical. We've only been on the market for a year,
[02:18:28] but we already have over 3,500 enterprise customers that trust us with their mission-critical workloads.
[02:18:36] One of them I would like to invite here on stage. Please welcome Federica Cohen from Mastercard.
[02:18:43] Thank you for having me.
[02:18:57] Doing good. Excited to have you.
[02:19:00] So Mastercard aims to have data and services account for over half of its revenue.
[02:19:07] To get there, you consolidated 80 different services into one agentic platform.
[02:19:13] So what didn't work in the past and what's working now?
[02:19:17] Well, it's a great question. And maybe just to level set, just to give a sense of the dimension that we're operating in.
[02:19:23] We're operating in about 200, over 200 countries. We have billions of cardholders and we process more than 150 billion transactions every year.
[02:19:33] And so at the scale that we have to operate, bringing together our services was very critical.
[02:19:38] Now, Lakebase helped us accelerate that by creating a shared foundation where we can bring in differentiated insights,
[02:19:45] our governance, and isolation when we need it to basically bring together agents that can reason in real time,
[02:19:54] but more importantly, can work together with that shared context and that architecture.
[02:20:02] So, you've built a reusable architecture here using Unity Catalog and Lakebase to help with the creation of a virtual C-suite.
[02:20:12] How does standardizing on Lakebase help?
[02:20:14] Well, standardizing on Lakebase has been pretty critical.
[02:20:19] For example, in March, we announced a new solution called virtual C-suite.
[02:20:23] It's designed to help small businesses operate with executive level decisioning.
[02:20:27] Think about the average small business owner. They're very busy.
[02:20:31] It is very difficult for them to take decisions on cash flow, on forecasting.
[02:20:37] And so, what we've done is we've created a suite of agents that will each act as a digital executive,
[02:20:42] aiding in a number of digital responsibilities.
[02:20:45] The first one we're starting with is the virtual CFO.
[02:20:47] It helps make decisions on cash flow, on payments, and on how to manage working capital,
[02:20:53] all of which are make-or-break decisions.
[02:20:55] And so, we use Lakebase to create a foundation that allows us to take in insights from their data,
[02:21:02] from their performance, bring it all together. And the benefit is, because we're using Lakebase,
[02:21:07] we're able to have that shared understanding. So, as one agent takes action or generates insights,
[02:21:12] it becomes instantly available for the next one.
[02:21:14] So, let's talk about multi-tenancy. With thousands of issuer banks on a single platform,
[02:21:21] multi-tenant isolation is a high-stakes challenge. How did you use Lakebase to approach a secure,
[02:21:27] bank-level data separation, while enabling AI-driven insights that can continuously learn and improve
[02:21:32] over time? It's a great question. So, there is another solution that we've been working on that's
[02:21:37] coming soon called Performance Pulse. And for some of these solutions, we have to support thousands
[02:21:43] of issuing banks, all on a shared platform. Now, think about the challenge that we have to have.
[02:21:47] For us, data isolation isn't optional. It's foundational to the trust that our clients have with us,
[02:21:53] and it helps us manage things like data residency and other requirements that we have.
[02:21:57] And so, this capability that is coming soon, that helps us bring always-on insights into action,
[02:22:03] to measurement, and a continuous loop, requires us to keep our issuing data isolated and separate
[02:22:10] for protecting each of our clients. But at the same time, we have to have the ability to consolidate
[02:22:15] insights in a secure and transparent and anonymized way to draw it. Lakebase allowed us to basically create
[02:22:21] this solution by establishing a single foundation that then can apply to thousands of issuers at once.
[02:22:28] And so, for us, it's enabled us to have that trust and that embedded foundation within
[02:22:33] that we can create for one client, and then expand and scale horizontally to all of them.
[02:22:38] Your team went from concept to a scale-ready MVP demo in seven weeks. My mind is kind of blown. I've never
[02:22:45] seen anything like it. And for a company with Mastercard's regulatory bar, this is kind of wild. So,
[02:22:54] how did keeping agent memory, model serving, governance inside of a single platform allow you to move
[02:23:01] fast without sacrificing trust? We're incredibly proud of the speed that we've been able to achieve,
[02:23:07] but part of the key piece is that we are uncompromising in our approach to how we manage and govern our AI
[02:23:13] and our data. And so, what we were able to do with Lakebase was to build all of those foundations from the
[02:23:18] start. So, when we start developing a new MVP, what we have the ability to do is embed that governance,
[02:23:25] the guardrails that we need, and a number of the other isolation and foundations that we need to have
[02:23:29] in place from the get-go. By building this in with the Lakebase architecture, that allowed us to move
[02:23:35] much faster than usual because all of those trust components become non-negotiable. So, as opposed
[02:23:41] to accelerating by cutting corners, we accelerated by building the foundation from the start into something
[02:23:45] that, frankly, now we can reuse every time we build additional agents. So, looking ahead, what becomes
[02:23:52] possible, as your agents get more autonomous, what are you looking forward to building on top of the Databricks platform?
[02:23:59] Well, I'll go back to something you said at the start, Nikita. We aspire to have our services power more and more
[02:24:05] of Mastercard. Well, what that means in the agentic world is that over time, we're going to have an increasing
[02:24:10] set of agents that our clients and the ecosystem use, but also it means that they have to be trusted
[02:24:17] and that they have to have that shared foundation within. And so, as we start moving from agents that
[02:24:22] not only provide insights and information, but can help you take action, help drive the ecosystem,
[02:24:28] that shared foundation is what's going to help them learn from each other and drive their power
[02:24:32] even further. So, when you think about some of the things I mentioned, like the virtual C-suite,
[02:24:36] like Performance Pulse, like capabilities we're driving in agentic commerce, bringing that level of trust for
[02:24:41] us is the currency of innovation. And we believe it's going to let us scale much further thanks to those shared
[02:24:46] components that we'll be able to build in. Thank you for being an incredible customer, Fed. Thank you for having
[02:24:51] us. Thank you, everybody. And now, a co-founder of Databricks, Patrick Wendell.
[02:25:10] Welcome to the stage, OpenAI's president and co-founder, Greg Brockman, and Databricks co-founder
[02:25:22] and Vice President of Engineering, Patrick Wendell.
[02:25:34] So, you'll sit here.
[02:25:42] Well, my guest needs no introduction, but I will do it anyways. Greg Brockman is co-founder of OpenAI,
[02:25:49] and we're thrilled to have him here. Greg, I'd actually love to start by talking about your ever
[02:25:56] expanding role at OpenAI. You're known as sort of the godfather of the model production process,
[02:26:02] the infrastructure and talent and everything needed to power that core engine, but I know you've
[02:26:08] recently started taking on a much broader role as a founder, and I'd love to hear the motivation behind
[02:26:13] that and where your focus is these days. Well, first of all, thank you for having me. It's amazing
[02:26:18] to see so many people who are passionate about Databricks, about where technology is going in one
[02:26:23] place. And at OpenAI, you know, I've been involved in basically every aspect of our technology that I
[02:26:29] can, you know, help with since our very early days. And the way that I tend to operate is whatever the
[02:26:36] most important pressing problem of the day is. I'll try to go and make a difference there. There was a
[02:26:42] time like for GPT-4 where, you know, I was up at 2 a.m. every night, like, you know, getting the model
[02:26:47] back up and running, trying to figure out, like, you know, deep binary search through our nodes to go
[02:26:51] find which one was, you know, causing a problem, things like that. Built a lot of the training frameworks,
[02:26:58] and these days, one of our most important problems is that we've made, we have such a great model
[02:27:04] production process, right, that we really have spent a bunch of time over the past couple years to be
[02:27:09] able to innovate at every single layer. And connecting those models to the world is actually becoming this
[02:27:16] most important critical challenge. And so I've been spending a lot of time really helping the teams
[02:27:20] focus on what do we need to do to bring all this amazing research to bear, building the products,
[02:27:25] having a really focused effort to have one way of building agents that are able to add value to,
[02:27:32] you know, people in this room. Yeah, and certainly allowing these models to do useful work inside of
[02:27:38] an organization is something everyone here cares about a lot. And I will get into that. But before we
[02:27:42] get into it, I do want to talk quickly about the models themselves. You've been on such a tear in the
[02:27:48] last six months, or even really three months. If I have this right, you know, GPT 3.3 came out in February,
[02:27:54] 3.4 in March, sorry, 5.3, 5.4 in March, and 5.5 in April. So it was almost once a month. And as a user of
[02:28:02] those models, I use them largely in Codex, like each model was just a huge leap from the prior one.
[02:28:08] I'm curious, what accounts for this like really rapid development and model quality? And how do you see
[02:28:13] things moving forward? Well, it's very much the, you know, the duck analogy, where it's like, you know,
[02:28:18] it looks very smooth and seamless. And you know, here's these models on a six week cadence, that kind of thing.
[02:28:22] Under the hood, it is a lot of work. It is a lot of teams coordinating together. And the interesting
[02:28:27] thing for me, as a technologist, you know, I have, you know, before OpenAI, I was working at Stripe. And
[02:28:33] there, you know, you sort of see what it's like to build a technology company. And, you know, you might
[02:28:38] think that building these models, it's like totally different, but it's actually pretty much the same,
[02:28:42] just in a very different domain. That requires a lot of like really sort of, you know, parsing apart this
[02:28:48] complicated space into a bunch of different layers. So there's pre training, there's post training,
[02:28:53] there's a bunch of just like, dealing with like, you have some race condition, you got to go find
[02:28:57] that figuring out how to get the inference into good shape. So there's a lot of technological problems
[02:29:02] that all stack together, thinking about the safety, how we actually are going to release this pricing,
[02:29:07] packaging, the whole thing. And that we basically have a flywheel internally, that I think it just goes
[02:29:15] faster and faster in terms of just like, how does every team contribute to each of these releases.
[02:29:20] And so at a technology level, I think that we've just made huge strides, like we were kind of,
[02:29:25] you know, two years ago, I think that releasing a model, even just like training a big model,
[02:29:28] was a huge amount of pain. And that the teams at every single layer, infrastructure, the research,
[02:29:34] everyone has come together in order to make that process smoother, better, and to be able to just work
[02:29:39] together as one to get these models out. So it really is a repeated process where we're building a muscle.
[02:29:43] Like, we at Databricks certainly saw when we rolled out 5.5 in our development process,
[02:29:48] there was like huge wins. And I think our engineers certainly saw it as like absolutely at the frontier
[02:29:54] around those types of workloads.
[02:29:55] Yes. And the exponential continues, right? And that is the wild thing about this field, right?
[02:29:59] It's like every single time a model drops, you try it, you're like, wow, like mind is blown,
[02:30:03] here's this problem I couldn't do before, now it's easy. And you just project that forward, right? And it's like,
[02:30:09] every single problem, both in software engineering, but also now in broad knowledge work, right? Any sort
[02:30:14] of work you do with the computer, it's starting to be something these models can accelerate, right?
[02:30:18] It's like this rocket ship that just moves you forward on anything. And the exponential,
[02:30:23] it's not going to stop. And I think that's something that's worth pricing in as people think about
[02:30:27] how they're building their AI strategy.
[02:30:28] So am I hearing that you're going to be able to tell us something about the next model timing-wise,
[02:30:34] capabilities-wise? How much, how much can you, we'll keep this just between us.
[02:30:37] It's just between us. Don't worry about these guys.
[02:30:39] Of course, of course. Yeah.
[02:30:40] I mean, look, like the, the, we are continuing to work super hard delivering the next level of
[02:30:46] capabilities. I think that the way to think about the moment, right, that we're going through is
[02:30:50] something like GPT-4 that, that was the, that was, that was a model that was great for chat because
[02:30:55] it was the first time that you had this interactivity. It was worth reading the output of the model when
[02:30:59] it was something that was very crafted for you. Whereas in the current moment, we have these
[02:31:03] models that can really do useful work, right? That they actually are able to accomplish things.
[02:31:06] They're able to use tools. And so you want those to be brought into your workflows, right? Software
[02:31:11] engineering is the first place that I think really felt it, right? Of, wow, if you actually utilize
[02:31:15] these models, it goes from 20% of the work of the, you know, sort of grind work being done by the
[02:31:20] models to 80%. Suddenly it changes the tools you're going to use. And I think we're going through that
[02:31:24] moment right now for workers everywhere, right? For any kind of slides, presentations, PowerPoints,
[02:31:29] like all of these things that you do are just connecting data between different systems. And all
[02:31:34] of that is just going to get much, much better. And so the way I'd look at it is like, we're going to
[02:31:38] keep having step function, better models that are more useful to you across anything that you need to
[02:31:43] do in your business. We're figuring out how to make these models broadly accessible. We think that's
[02:31:47] part of our mission is to bring these models to the world in a safe way. And that that is something that we're doing kind of in partnership with many companies,
[02:31:52] and bringing it to bear. So that pace of progress, the, you know, some model release cadence, those are all things that we intend to keep up.
[02:32:01] Yeah, fantastic. So, you know, this conference is the data plus AI conference. And so you have folks that are data practitioners, folks that are focused on AI and the synergies and combinations between them.
[02:32:11] One thing that I think is sort of underappreciated about OpenAI's work is the degree to which you use data in a
[02:32:19] way to inform the improvement of quality in models and in your own products. And it's been, it's also been a privilege to be a part of that, you know, Databricks is publicly used by OpenAI for certain things.
[02:32:31] And I love seeing when I grant, when you or Sam tweet something and I see a graph, I'm thinking, oh, I know that's a Databricks graph. I'm always texting you about that.
[02:32:40] But yeah, I'd love to understand a bit. What role does data play even just for your own development of products and models at OpenAI? And, and yeah, anything about also how Databricks or other pieces of infrastructure are important in that?
[02:32:52] Well, fundamentally, I think that data is almost the core ingredient upon which, I don't know if you want to think of our models as like a cake or something, right?
[02:33:04] Yeah, layer cake.
[02:33:05] Layer cake can work. Data's probably on the bottom.
[02:33:07] Exactly. Something like that. And that this is true both for how we actually produce the models and the amount of return on just like actually making sure your data is clean is like not to be understated, but also more broadly understanding how are people using our products, right?
[02:33:21] And really understanding, you know, ChatGPT is the most broadly deployed AI chatbot. And that there's a huge diversity of things that people do with it, right? And that a lot of what I really try to, that I really care about is thinking about what are the use cases, right?
[02:33:36] And there's a lot that we dig into where people are using our models in like these surprising and amazing ways we never would have anticipated. And sometimes, you know, it's applications like in healthcare, people using AI to get information about, you know, you have a medical report that you upload.
[02:33:50] You get an insight from that that you wouldn't have gotten any other way. You don't have access to a doctor. And just being able to find and understand the different ways that people are getting value from these models is something we spend a lot of time thinking about and doing. On the agentic side,
[02:34:02] I think that there, just like even just really looking through, just understanding how is an AI actually solving a problem, it becomes much harder to just read a trace, right? You need some better way of looking at these things analytically.
[02:34:17] But the amazing thing about where we are is we have this technology that's capable of sort of ingesting information at a much larger scale and rate than humans alone could.
[02:34:26] And so sometimes we can actually use our own models in order to understand how the models are operating and behaving. And this is true on the insight side, but it's also true on the safety side, right?
[02:34:35] For example, one of the, you know, sort of most painful parts of using Codex has always been approvals.
[02:34:40] And now we have an AI layer that's able to do those approvals for you.
[02:34:43] Yeah, fantastic. Any Databricks feature requests? You know, I can try to make it happen.
[02:34:48] I mean, for us, it's always scalable. We want it faster. We want better latency. So I think it's just in general, the big thing we want is just like, we have such major workloads and everything is scaling and falling over constantly.
[02:35:02] And so just like the kind of basics of infrastructure reliability is always going to be our top one.
[02:35:06] Better, faster, stronger. I mean, the usual. Great. So I'd love to talk also, you know, Databricks and OpenAI last year announced a partnership together.
[02:35:15] That partnership was a $100 million partnership initially focused on allowing our customers to have access to the GPT family of models to combine with their enterprise data.
[02:35:25] And in, you know, the last six months, as you've continued to innovate around Codex, we've added support for that in our AI gateway.
[02:35:32] And we've also worked together on allowing Genie, our AI, to be used natively from inside of Codex. So lots of exciting stuff happening there.
[02:35:39] I'd love to hear how OpenAI thinks about, first, partnerships in general. What roles do partners play as you kind of are expanding your presence and then anything specifically, you know, we have thousands of joint customers.
[02:35:51] A lot of them are in this room. You know, what do you want them to hear about our future work together?
[02:35:56] Well, number one is that we view what we're doing is that we're kind of a spark in the overall engine of the economy of what, you know, this AI revolution that we're all building together.
[02:36:07] And so partners are a critical part of that, that we really want to be building with the ecosystem.
[02:36:12] You know, there's certain things that we're able to do in house and we'll double down on.
[02:36:16] But I think that the only way that we get there is through the leverage of what everyone is doing and working together with companies like Databricks is absolutely essential for us.
[02:36:23] And so it's been really amazing to see how people are using systems like Genie powered by our models in all these different contexts, you know, people using it internally within their own companies.
[02:36:32] And so how many people in this room have tried Codex?
[02:36:38] So I think that everyone in this room should absolutely try it because it's like we have been evolving what Codex is in a very significant way.
[02:36:47] Like I think that, first of all, the Codex app is something that is very differentiated.
[02:36:52] It's something that is worth trying.
[02:36:54] It's like there's nothing like it right now.
[02:36:55] It feels like a very different way to use a computer.
[02:36:57] And Patrick, maybe you could talk about your own use of Codex.
[02:37:00] Yeah, I mean, we've like rolled it.
[02:37:02] I would consider us early adopters at Databricks.
[02:37:05] And not only are we seeing huge productivity gains for our developers, but just the pace at which it's improving is remarkable.
[02:37:12] I mean, I come back, you know, either a new model release or an update to the desktop software.
[02:37:16] And like everything is just getting better very, very fast.
[02:37:19] So, you know, we're always encouraging our own customers as we do to look at multiple AI solutions,
[02:37:25] AI vendors, kind of empower your engineers with the best models and tools that they can find.
[02:37:31] And, you know, certainly it's become a very critical part of our development stack.
[02:37:34] Yes. Yeah.
[02:37:35] And I think it's been very interesting to watch kind of the competitive dynamics where, you know,
[02:37:39] I think we were relatively late to the game in some ways for building the, you know, Codex type form factor.
[02:37:45] But we have focused extremely hard and the rate of improvement is so high.
[02:37:48] And so we're bringing this technology to make it more accessible available to everyone.
[02:37:52] One thing people actually underappreciate is Codex itself is open source, right?
[02:37:56] And that that is like in our DNA is how do we actually make this be something people can hack on and use in all the context that they want.
[02:38:02] So it sounds like your message is try Codex out and my message is you can do it with Databricks.
[02:38:08] It's super easy, especially, you know, if you're already working with OpenAI and Databricks together.
[02:38:12] So, yeah, we'd love to have everyone try it out.
[02:38:15] There's also many talks at this conference about different folks through our partnership building cool applications and internal tools with Databricks plus OpenAI.
[02:38:23] So a lot of exciting stuff for folks here.
[02:38:25] Yeah, it's never been a better time to be a builder.
[02:38:28] Like I think that that to me is the most amazing thing about this moment is that creativity is being unleashed that anything you want to do, suddenly it's possible.
[02:38:38] If you can imagine it, you can build it.
[02:38:40] So we're almost out of time.
[02:38:42] I think for my last question, I will ask you, you know, earlier this morning, Ali was talking about AGI and in his mind, AGI is here in the sense that the types of tasks that, you know, our partners, our customers care about are almost already perfectly executed by AIs.
[02:38:58] Now, AGI means different things to different people.
[02:39:00] What's your take on, you know, AGI?
[02:39:02] Do we have AGI?
[02:39:03] Is it very far away?
[02:39:04] Is it nearby?
[02:39:05] Is it undefined?
[02:39:06] You know, how do you where do you sit in that one?
[02:39:08] I do think it's kind of wild that AGI is this like very personal thing.
[02:39:12] It's almost like AGI is a feeling, not a defined thing.
[02:39:15] It's the friends we had all along.
[02:39:16] Exactly.
[02:39:17] But I do think that there is something significant that is worth thinking about.
[02:39:21] And I think maybe one of the category errors that I think about is that AGI is almost a spectrum, not a moment, right?
[02:39:28] Or it's like AI progress is not a thing that has an endpoint, right?
[02:39:31] It's going to continue.
[02:39:32] And I think if the AGI moment is something we haven't hit yet, like the way that I think about it is when you really do
[02:39:38] have an assistant that is able to really autonomously go and work on any task for you and accelerate you and do that at the level of the top humans.
[02:39:49] And we have this jagged intelligence, right?
[02:39:51] It's something that actually really accelerates people in certain ways, but it's also like very limited and stunted in other ways.
[02:39:56] And I think this is a great opportunity because one thing we really stand for at OpenAI is keeping humans at the center, right?
[02:40:01] So we really want humans to be setting the goals, to be in control, and to make sure these systems are in benefit of humanity.
[02:40:07] And so that means that we have this opportunity to sort of leverage these systems in cases that we really want, but also have a lot of time to think about where do we want the humans to fundamentally remain and to remain the drivers of everything that happens from here.
[02:40:19] And so I think we're not, I think, I think we're, we're not nearly at the end of this exponential.
[02:40:24] It's going to continue.
[02:40:25] And again, everyone in this room, I think, has something to contribute to, to that change.
[02:40:28] So in a sense, is it fair to say the, you know, the framing of AGI versus non-AGI is like not a helpful framing at this point?
[02:40:34] Or how do you feel about that?
[02:40:35] Because, because everyone has a different definition, I think it's not the right framing.
[02:40:38] I think that the right framing is what are the capabilities?
[02:40:41] What do we want these AIs to be doing?
[02:40:44] How can we benefit from them?
[02:40:45] How can they help us?
[02:40:46] Right?
[02:40:47] And I think that keeping that at the center and thinking about what we're building together, to me, that is number one.
[02:40:52] Awesome.
[02:40:53] Well, I think our parting recommendation is try out Codex.
[02:40:57] If you're a Databricks customer, you can use it with the AI coding gateway extremely easily.
[02:41:00] It takes about one minute to set up.
[02:41:02] I think Ali will come on next to keep the program going.
[02:41:06] But Greg, thank you so much for being here.
[02:41:08] And everyone, please join me in a round of applause to thank Greg.
[02:41:15] Thank you for having me.
[02:41:16] Yeah.
[02:41:27] All right.
[02:41:28] Okay.
[02:41:29] That's awesome.
[02:41:30] We love partnering with OpenAI.
[02:41:31] They're awesome.
[02:41:32] You should check out Codex.
[02:41:33] They're also using Databricks.
[02:41:35] You know, they have almost a billion users.
[02:41:38] And they put all that data in Databricks.
[02:41:40] And they analyze it to make sure that, you know, nobody's jailbreaking and nothing weird is happening.
[02:41:45] So let's bring us back to the data realm.
[02:41:47] We have the most exciting, actually, announcements still left.
[02:41:50] So you don't want to miss that.
[02:41:52] Okay.
[02:41:53] So what does the data realm look like?
[02:41:54] We already talked about the right-hand side.
[02:41:56] This is where you're asking your analytical questions.
[02:41:58] Data warehousing, data science.
[02:42:00] And then Reynold showed us that now we can get super low latency.
[02:42:03] World's fastest engine with Raiden and Lakehouse RT.
[02:42:07] And then what we did is we went over to the OLTP side, and Nikita, who's really our expert on OLTP databases, started focusing on how we can actually get all of that to be fused, so that you can do everything Postgres on top of the lake, really fast, really cheap, with disaster recovery across the clouds.
[02:42:25] So how do we deal with this onslaught and the storm of agents that is coming our way?
[02:42:30] We still have these waters that we have to shuffle things between OLTP and OLAP.
[02:42:35] So let's welcome Reynold of Raiden fame back to the stage.
[02:42:45] Thank you again.
[02:42:46] So you heard from Bilal about simplifying data engineering, from myself about how we're going to unify analytics, and from Nikita about modernizing OLTP.
[02:42:58] But honestly, there's still one big problem remaining, which is that we have the giant continent of OLTP databases, where applications put their data first.
[02:43:09] And if your applications become successful enough, you want to reason about that data and analyze it.
[02:43:14] And so we end up actually building CDC pipelines that ship the data from OLTP databases over to your analytics continent.
[02:43:22] And can I get a show of hands, how many of you love your CDC pipelines?
[02:43:27] Oh, there's some.
[02:43:29] I would love to learn from you how you actually maintain them.
[02:43:32] CDC pipelines are basically change data capture pipelines.
[02:43:36] They read the binlog of OLTP databases, get all the little delta changes, ship those deltas over to analytic systems, and reconstruct the state of the OLTP databases in them.
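The replay step described here can be sketched in a few lines of Python. This is a minimal illustration, not any particular CDC tool: the `ChangeEvent` type, its field names, and the `apply_changes` helper are all hypothetical, and a real pipeline would also have to handle schema changes, retries, and ordering across partitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    """One entry from a change log (hypothetical shape, for illustration only)."""
    lsn: int                    # position of the change in the log
    op: str                     # "insert", "update", or "delete"
    key: int                    # primary key of the affected row
    row: Optional[dict] = None  # new row image; None for deletes


def apply_changes(replica: dict, events: list) -> dict:
    """Replay change events in log order to rebuild the source table's state."""
    for ev in sorted(events, key=lambda e: e.lsn):
        if ev.op in ("insert", "update"):
            replica[ev.key] = ev.row
        elif ev.op == "delete":
            replica.pop(ev.key, None)
        else:
            raise ValueError(f"unknown op: {ev.op}")
    return replica


events = [
    ChangeEvent(1, "insert", 1, {"id": 1, "balance": 100}),
    ChangeEvent(2, "insert", 2, {"id": 2, "balance": 50}),
    ChangeEvent(3, "update", 1, {"id": 1, "balance": 75}),
    ChangeEvent(4, "delete", 2),
]

# Full replay reproduces the source state.
state = apply_changes({}, events)

# Losing a single event silently leaves the replica with a wrong balance.
lossy = apply_changes({}, [e for e in events if e.lsn != 3])
```

The second call hints at why these pipelines are fragile: one dropped or reordered event leaves the analytical copy quietly out of sync with the source.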
[02:43:46] They're super fragile.
[02:43:48] They're very annoying to maintain.
[02:43:49] Many data engineers have been woken up at 3:00 a.m.,
[02:43:52] actually, because it caused data corruption.
[02:43:54] As a matter of fact, some people at Databricks actually joke that CDC doesn't stand for change data capture.
[02:44:00] It really stands for continuous data corruption.
[02:44:03] All right.
[02:44:04] And to solve this problem, the industry started introducing zero CDC, zero ETL, mirroring.
[02:44:12] And all of these are just fancy terms for "let's do managed CDC by hiding the pipeline so it happens automatically."
[02:44:18] But in practice, it's still a pipeline under the hood.
[02:44:20] It suffers from exactly the same issues
[02:44:22] that CDC has.
[02:44:23] All right.
[02:44:24] So, some of you might have heard of another term called HTAP, hybrid transactional analytical system or processing.
[02:44:32] And HTAP is sort of the holy grail of database engineering.
[02:44:35] The idea is you create a single database system that's capable of handling both your OLTP workloads and your analytics workloads in one system.
[02:44:43] But HTAP has...
[02:44:44] I think most of you probably haven't actually used an HTAP system in practice.
[02:44:48] And the reason is HTAP as a category has largely failed.
[02:44:51] There's very little adoption of it.
[02:44:53] And all of the HTAP systems out there are kind of very proprietary.
[02:44:57] They don't have a big ecosystem.
[02:44:59] And by building a single system, they started compromising on both the performance for OLTP and the performance for analytics.
[02:45:05] So, the question is how do we truly unify the two, right?
[02:45:11] If HTAP doesn't work and CDC is very annoying and causes corruption all the time, how do we do it?
[02:45:17] And the solution to this actually goes back to the very fundamental problem, or dichotomy, of row-oriented storage and column-oriented storage.
[02:45:26] As Ali and Nikita both alluded to earlier, OLTP systems need row-oriented storage because they need to do needle-in-a-haystack lookups
[02:45:34] and need to do updates on those rows very quickly.
[02:45:38] Whereas analytic systems do a lot of scans of large amounts of data and really benefit from column-oriented storage.
[02:45:44] Look at the architectural diagram Nikita already showed you, where we have safekeepers and page servers that write data in an actual row-oriented format to the data lakes.
[02:45:55] The solution actually lies right there.
[02:45:58] When we looked at and profiled the safekeeper and page server systems, those are storage services.
[02:46:03] It turned out they are very underutilized from a CPU perspective but very I/O bound because all they are doing is local disk reads, local disk writes, and network reads and network writes.
[02:46:13] So, it actually gives us the opportunity to leverage those idle CPUs on those storage services to do transcoding at that very moment, turning the data from row-oriented format into column-oriented format.
[02:46:26] All right.
[02:46:27] And it turned out that converting the data from row-oriented format to column-oriented format doesn't actually increase the CPU utilization that much, but it dramatically shrinks the data volume, because column-oriented format has a better compression ratio.
[02:46:42] It's often somewhere between 10 to 1 to 100 to 1.
[02:46:45] And because all the services are I/O bound, by having a better compression ratio, we can now actually benefit from this transcoding instead of hurting the performance.
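To make the compression argument concrete, here is a toy Python sketch of transposing rows into columns and run-length encoding each column. It is only an illustration under made-up data: the table, the `rows_to_columns` and `run_length_encode` helpers, and the column names are hypothetical and say nothing about how the storage services described in the talk actually transcode. Real columnar formats such as Parquet combine encodings like dictionary and run-length encoding with general-purpose compression.

```python
from itertools import groupby


def rows_to_columns(rows, schema):
    """Transpose a list of row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(schema)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical table: 1,000 accounts spread over 4 branches, one currency.
schema = ("account_id", "branch", "currency")
rows = [(i, "branch_" + str(i // 250), "USD") for i in range(1000)]

columns = rows_to_columns(rows, schema)
encoded_branch = run_length_encode(columns["branch"])
encoded_currency = run_length_encode(columns["currency"])

print(len(columns["branch"]), "values ->", len(encoded_branch), "runs")
print(len(columns["currency"]), "values ->", len(encoded_currency), "runs")
```

In the row layout, the repeated `branch` and `currency` values are interleaved with the unique `account_id`, so there are no long runs to collapse. Once each column is stored contiguously, the repetition becomes directly encodable, which is the intuition behind the better compression ratio.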
[02:46:55] And with that, we can actually write the OLTP data into the data lake in column-oriented format.
[02:47:08] So, with that data, you can actually apply any analytic compute directly against that data.
[02:47:13] So, how does this work in practice?
[02:47:15] Imagine inserting a new row into LakeBase.
[02:47:18] And the LakeBase compute eventually sends that row to the underlying storage system.
[02:47:22] And the storage system will do the transcoding right there and actually write the rows directly into the column-oriented format in Delta or Iceberg.
[02:47:29] And now, you can actually fire up a query against your lakehouse.
[02:47:33] And the lakehouse will read the Delta and Iceberg format directly and return your query on the freshest copy of the data.
[02:47:41] So, we think this is actually revolutionary.
[02:47:43] It's probably at least as big as the Raiden talk I was talking about.
[02:47:47] And the whole concept of lake house plus lake base, we can actually unify the storage for both of them.
[02:47:52] And we're giving this storage technology a name: LTAP.
[02:47:56] And LTAP stands for Lake Transactional Analytical Processing.
[02:47:59] We think LTAP is HTAP done right, and it accomplishes the goals of HTAP without actually having a single query engine for it.
[02:48:06] But we're able to actually unify the storage, which is by far the most important part.
[02:48:09] And now, you have one copy of data for OLTP databases and one copy of data to govern.
[02:48:23] So, LTAP can truly unify your data infrastructure.
[02:48:26] You start with the OLTP database for applications, but you don't have to copy your data.
[02:48:30] There's no pipelines to maintain.
[02:48:32] And most importantly, there's actually no compromise in performance at all for both your OLTP systems and your analytic systems.
[02:48:38] And all of this is built on open, interoperable formats: Postgres, Delta Lake, Iceberg.
[02:48:44] So, with LTAP, we can finally actually combine the separate continents of OLTP and analytics into a single giant continent.
[02:48:56] But we don't want to do this alone.
[02:48:58] So, we'll be open sourcing actually a very fundamental library that allows converting Postgres data directly into columnar format in Parquet.
[02:49:05] The logo is actually -- it's kind of a cute logo that has a Parquet elephant.
[02:49:10] So, it indicates what they do.
[02:49:11] And this should be coming fairly soon.
[02:49:14] So, again, same question.
[02:49:22] What do you need to do differently if you want to leverage LTAP?
[02:49:25] Actually, nothing.
[02:49:26] Just start using LakeBase.
[02:49:28] We'll be rolling out the LTAP capability in the coming weeks.
[02:49:32] And once it's rolled out, every single one of your LakeBase tables will automatically appear in your lakehouse.
[02:49:38] There's nothing you need to do.
[02:49:39] And we can only do that because there's no pipeline to maintain under the hood.
[02:49:43] It's available for every single table.
[02:49:45] All right.
[02:49:46] And in order to demonstrate this to you, this is the last demo of the day.
[02:49:50] We invite Holly onto the stage to actually show you a demo of LakeBase combined with LTAP.
[02:49:58] I'm going to give you a tour of both LakeBase and LTAP and why you need to combine your analytical and operational data.
[02:50:14] I have business banks across the Americas creating billions of transactions a day.
[02:50:19] And I want an agent that spots potential VIP customers and matches them to an advisor based on their qualifications and local regulations.
[02:50:27] All in time for a teller to tell them that information before they've left the door.
[02:50:33] Previously, this would have been impossible.
[02:50:35] The delay would have been too long and the intervention would have come too late.
[02:50:40] But not anymore.
[02:50:41] So, let me show you Databricks and let's go ahead and start with LakeBase.
[02:50:46] LakeBase is open source Postgres fully managed by Databricks with separated storage and compute.
[02:50:53] And it's really easy to get started.
[02:50:55] I'm going to create a new project.
[02:50:58] I'm going to name it.
[02:50:59] And I'm going to stick with the defaults.
[02:51:01] And that's it.
[02:51:02] It's made.
[02:51:03] The compute size isn't fixed.
[02:51:05] So, I can scale that up and down within the UI and set a minimum and maximum size.
[02:51:10] And on the left, you can see we've got branches here.
[02:51:13] And I can click on it.
[02:51:14] And I can go and create a new branch.
[02:51:16] We can name it.
[02:51:17] We can have it auto delete.
[02:51:19] We can make a branch of production.
[02:51:21] We can make a branch from a point back in time.
[02:51:23] But we can also branch from a branch.
[02:51:25] And it's created almost instantly because there's no need to copy data.
[02:51:30] Speaking of back in time, we also have backup and restore.
[02:51:34] This is configured to retain seven days of history by default.
[02:51:38] But you can do more or less if you like and pick any time window to restore to.
[02:51:43] Or if you wanted, you could restore from a snapshot which is scheduled on a regular basis.
[02:51:50] Now, all of this is empty.
[02:51:52] Here's what I set up earlier that's processing all of our data.
[02:51:55] And we can go into the tables.
[02:51:57] And I can see all of my transactional data here.
[02:52:00] And I'm going to need a query to scan across all of these historic balances to calculate who is an outlier and potentially a VIP.
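The demo does not show the query itself. As a rough sketch of what "scan the balances and flag outliers" can mean, the Python below flags accounts whose balance sits far above the mean. Everything here is an assumption for illustration: the `flag_outliers` helper, the z-score rule, the threshold of 2.0, and the sample balances are not taken from the demo.

```python
from statistics import mean, pstdev


def flag_outliers(balances, threshold=2.0):
    """Return account ids whose balance is more than `threshold`
    population standard deviations above the mean balance."""
    values = list(balances.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every balance is identical, so nothing stands out
    return sorted(
        account for account, balance in balances.items()
        if (balance - mu) / sigma > threshold
    )


# Hypothetical data: nine ordinary accounts and one very large one.
balances = {f"acct_{i}": 1_000.0 for i in range(9)}
balances["acct_vip"] = 100_000.0

vips = flag_outliers(balances)
print(vips)
```

The relevant property is that the function has to read every balance to compute the mean and spread, which is the full-scan access pattern that column-oriented storage is built for.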
[02:52:08] And for those queries, we care about some metrics.
[02:52:11] So first is TPS.
[02:52:13] So this is the line chart in green that we're going to care about on the right-hand side.
[02:52:18] If this number drops, our business is impacted.
[02:52:21] And if it stops, our branch staff are going to have an outage.
[02:52:25] The second metric we care about with our queries is how long our analytical queries take.
[02:52:30] And this is going to be the top left number in blue here.
[02:52:33] And finally, how fresh or stale our data is.
[02:52:36] And this is measured in seconds.
[02:52:38] And it will be the top right number in purple.
[02:52:40] It defines how up-to-date our data is.
[02:52:42] And if I'm querying in real time, I better get real-time answers.
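One simple way to think about the staleness metric is as the gap between the newest commit on the transactional side and the newest commit visible to the analytical query. The sketch below assumes that definition. The `staleness_seconds` helper and the timestamps are made up for illustration, and the dashboard in the demo may compute its number differently.

```python
from datetime import datetime, timedelta, timezone


def staleness_seconds(latest_source_commit, latest_visible_commit):
    """Seconds by which the analytical view lags the transactional source."""
    lag = (latest_source_commit - latest_visible_commit).total_seconds()
    return max(lag, 0.0)  # a view can never be "ahead" of its source


# Hypothetical timestamps: the analytical copy is 25 seconds behind.
source_commit = datetime(2026, 6, 1, 12, 0, 30, tzinfo=timezone.utc)
replica_commit = source_commit - timedelta(seconds=25)

print(staleness_seconds(source_commit, replica_commit))  # copy lags the source
print(staleness_seconds(source_commit, source_commit))   # same storage, no lag
```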
[02:52:46] So earlier, Reynold said there were three ways of running a query like this.
[02:52:49] And the first is what an agent might try by default.
[02:52:53] And that's to submit the complex analytics query to existing LakeBase compute.
[02:52:58] I'm running this from the Databricks SQL editor.
[02:53:00] It's possible to connect to LakeBase from there.
[02:53:03] And when I'm running this query, I can see that this is not quick to come back.
[02:53:07] This is not good for a one-off analytics query.
[02:53:10] But it's terrible for our agents who are going to need answers a lot faster.
[02:53:14] And we can also see that it's starting to impact our TPS.
[02:53:19] We can see that this green line is going down and shows a bit of a worrying trend.
[02:53:23] Because realistically, it's not just going to be one dip here.
[02:53:26] It's going to be thousands of them.
[02:53:29] You can barely do analytics like this, let alone flag in real time who would be a VIP customer.
[02:53:41] And it's still going.
[02:53:51] Okay.
[02:53:52] Well, you all know my name.
[02:53:54] It looks like we have time for everyone to go around the room and introduce themselves.
[02:53:59] We'll start at the front, maybe.
[02:54:02] And...
[02:54:07] Oh, it's done.
[02:54:08] Okay, finally.
[02:54:09] Maybe we'll do that later.
[02:54:11] And in blue, we can see that this took over a minute to complete.
[02:54:15] We can see that the staleness is also quite bad.
[02:54:19] And we risk taking down production.
[02:54:21] Okay, so that clearly didn't work.
[02:54:23] So what if instead we took a copy of this data using CDC?
[02:54:35] I'm using the LakeBase change data feed feature, which is meant to be for audit workloads, not up-to-date analytics.
[02:54:42] And with this, I know that there's going to be a bit of staleness, at least 15 to 30 seconds.
[02:54:48] But this is within the Databricks ecosystem.
[02:54:50] If I was using a different Postgres offering or a different CDF tool, this could be into multiple minutes, not to mention the additional bill and the additional team to maintain it.
[02:55:00] We can see that the TPS is flat, but it's finished running, and it took 13 seconds to run, but it was out of date by 25 seconds.
[02:55:10] So finally, the LTAP way of doing this.
[02:55:13] And with LTAP, these tables are available in Unity Catalog.
[02:55:16] And here it is looking like a managed table, but we can see from the icons that, yes, it's a LakeBase table, but it's also available to both Delta and Iceberg readers.
[02:55:27] I didn't have to set up any additional pipelines to make this happen.
[02:55:30] I just specified a catalog, and it appeared.
[02:55:34] And you might notice a slight difference in our UI here, and that's that we have an S3 path.
[02:55:39] And if I go to this S3 path, I would be able to see all of my tables in here.
[02:55:45] This is great for openness.
[02:55:47] I can query this with open source Spark if I wanted to.
[02:55:51] That's a bit of a tangent.
[02:55:53] Let's head back to LTAP.
[02:55:56] So here is our query again, and I'll be querying it with a new type of warehouse, a real-time warehouse.
[02:56:03] And so now I click run, and it's done.
[02:56:07] It ran in milliseconds.
[02:56:11] And we can see that this is very real-time data without any staleness, and it had no impact on our TPS.
[02:56:18] So we can throw as many agents at it as we like.
[02:56:21] And because this ran almost instantaneously, it also means it's significantly cheaper in compute cost, too.
[02:56:27] It is also cheaper to maintain because there were no intermediary tools to manage.
[02:56:32] For all of us, this is huge.
[02:56:34] We have a single system of record where your data can be used for both your operational and analytical systems
[02:56:40] without having to compromise on performance.
[02:56:43] All of this happened automatically.
[02:56:46] I didn't have to set up any steps in the background.
[02:56:49] And finally, we no longer have a divided view of the world between operations and analytics.
[02:56:54] We no longer have to choose.
[02:56:56] Thank you.
[02:56:58] All right.
[02:57:10] So that's it.
[02:57:11] We were able to unify all these different islands, remove all of the data movement into just this one big Pangea.
[02:57:16] All right.
[02:57:17] So we have the-- with the LTAP or LakeTAP, you can just, you know, on tap, get your lake data out.
[02:57:23] The underlying data is just stored in Iceberg.
[02:57:25] Okay.
[02:57:26] So it's just open source Iceberg and Delta.
[02:57:28] And you can do transaction processing.
[02:57:30] And you can do analytics.
[02:57:31] You can do real-time on it.
[02:57:32] You can do data science.
[02:57:33] So that's a game-changer.
[02:57:34] Okay.
[02:57:35] So just to recap, we already unified data engineering and data science with Spark.
[02:57:39] Then we had data warehousing with the Lake House that we unified there.
[02:57:43] And then we were able to, with the Raiden engine or Lake House RT, unify all of these into one.
[02:57:48] OLTP databases.
[02:57:50] And then that got us LTAP.
[02:57:53] Okay.
[02:57:54] So that's what that looks like.
[02:57:55] So to zoom out and recap today quickly, we have our agents, but they're lacking context.
[02:58:02] We want to give them all that context on the platform.
[02:58:04] So we did a bunch of announcements today.
[02:58:06] So we introduced the agentic data foundation.
[02:58:09] We had the Lake House RT.
[02:58:11] That's super, super fast.
[02:58:12] We had LakeBase with disaster recovery across the different clouds.
[02:58:16] LTAP.
[02:58:17] Lakeflow with all the connectors.
[02:58:18] So super exciting.
[02:58:19] Then we went to the context layer.
[02:58:22] There, Genie ontology is extremely important.
[02:58:24] That's your context graph.
[02:58:26] We have the Unity AI Gateway for control, cost, and governance of all your models and agents.
[02:58:32] Open sharing.
[02:58:33] And then we had Omnigent, which you're going to hear more about tomorrow.
[02:58:38] And then finally, we have the agentic dev work.
[02:58:41] This is all the Genie agents.
[02:58:42] There's three main ones.
[02:58:44] There is Genie One, Genie Code, and Genie Agents.
[02:58:48] Those are the three main ones.
[02:58:49] And then finally, we talked about the apps that we're getting into.
[02:58:52] And we're very excited about those.
[02:58:53] Lakewatch for your SIEM and Customer Lake for CDP.
[02:58:57] So that's a wrap for the keynote session today.
[02:59:00] 11:30, we have the sessions.
[02:59:02] And also tomorrow morning, don't miss Matei's keynote on Omnigent.
[02:59:06] We have a great show for you.
[02:59:08] Thanks for coming.
[02:59:09] Thank you.
[02:59:10] Thank you.