Ryan is a member of the Azure Document DB team at Microsoft.
Ryan describes himself as a 1st class world citizen. Born & raised in South Africa, spending time in the UK, before moving Downunder to the very fine New Zealand, before moving again to Redmond to join the Azure DocumentDB team.
Luckily wherever he goes there are mountains to ride and slopes to ski (in between building awesome Azure services and engaging with the developer community)
Greg Low: Introducing Show 64 with guest Ryan Crawcour.
Welcome, our guest today is Ryan Crawcour. Ryan describes himself as a first-class world citizen, born and raised in South Africa and spending time in the UK before moving down under to the very fine New Zealand and then moving on again to Redmond to join the Azure document DB team. Luckily wherever he goes he says there are lots of mountains to ride and slopes to ski in between building some awesome Azure services and engaging with the developer community. So welcome Ryan!
Ryan Crawcour: thanks Greg, good to be on your show.
Greg Low: awesome, listen so where I came across you I saw you recently on the Edge web cast out of Channel 9 where they were sort of discussing document DB. I gather you seem to be the main one evangelizing is in that at the moment. What I will get you to do before you start is just tell us how on earth you come to be there?
Ryan Crawcour: yes sure, I have grown up being a developer back in the day before I join Microsoft. I have always been a developer that has been fascinated with data. You know like most people, I have got my history I roots with SQL Server and relational databases. All that was good and then I left the development business and kind of got into consulting and solution architecture business. Eventually I found myself working for Microsoft and through my travels like you mentioned I ended up with Microsoft in a role in Auckland where I worked with developers from a number of different organisations as they were trying to on-board their solutions into the Azure platform. Whilst I was working with them and I worked with all sorts of organisations from the one-man start-ups to 100 to 200,000 kind of organization. As I started working with these guys, I came across more and more of these other solutions that we were dealing with data day in and day out that I was on familiar with. People were storing data in Mongo, people were storing data in Raven, Couch, Cassandra and Craft DB. All these kinds things that I was on familiar with, so I started as an effort to familiarise myself with these technologies and to help those companies bring their solutions into Azure. How to run those we will call them foreign technologies or at least they were foreign to me and how to run those technologies on Azure properly. So that’s where our kind started, you know I found this awesome book 7 databases in 7 weeks. I started reading up on these different style databases and that just got me back into being the kind of data guy that I have always been.
I became really interested in doing that and we had some really successful projects with running other technologies on Azure VMs and seeing how these other kinds of data technologies enable solutions that previously we were struggling to do with relational databases. Some things they were just really difficult and seeing just how easy it was for these other technologies to cope with that.
Greg Low: what sort of things do you think are difficult in those areas? I suppose this is the move to a no SQL, sort of movement in the first place but what do you categorise as the main reasons why people head down that path?
Ryan Crawcour: I think there are two big drivers, well two big drivers that I have seen at least maybe there is three. SQL and MySQL and Oracle are those ones that we know and love so well they are really good at doing relational databases. By relational I’m sure you know that that is a piece of data related to some other piece of data which relates to some other piece of data and you end up building these kind of deep relationships structures. You know relational databases are also met doing that but there are other kinds of data modelling that is really difficult to do in a relational world. You know if you look at let’s take an example like LinkedIn’s data, where I am linked to you and you are linked to someone else and that person is linked to someone else and therefore I have got a fifth degree of separation between myself and that other person. You are creating these really kind of strange relationships, they are still relationships but they are really difficult to model in a relational database.
You now have got graft databases which are absolutely perfect for bottling that sort of stuff and if you look at other sorts of things where you have got these very flat kind of data structures that aren’t really related to anything else. There are just these flat structures, those are particularly good and can easily be modelled all thought of as stand-alone isolated documents. I think what we have seen as applications have evolved and as applications have been changed we are dealing with different kinds of data. Rather than kind of forcing things into a particular technology or a particular way of working we may have had to do I do know it 10 years ago because there weren’t these other kinds of technologies up there we have seen some guys like Facebook and Twitter realise that pay when you’re operating at this kind of scale that these guys do. Rather than trying for something let’s look at other technologies you know things like Cassandra DB and CouchDB, Cloud Ads, mongoDB, and RavernDB. They have all kind of started arriving on the scene, making the world a little bit easier for certain kinds of things.
Greg Low: it is interesting though although one of the things I find is that I find people sometimes push these technologies into areas that they probably don’t suit. There is a sort of thing, for example when I deal with the Twitter API, every now and then when I deal with it, it just comes back and says “oops” and so on. I mean I am okay with that losing a tweet, I am not okay with that losing a business transaction.
Ryan Crawcour: yes absolutely, I think the same argument holds true when you look at it from the other way. We say don’t force things into a particular technology when it’s not suited there. The same can be said about these other technologies as well. If you have a particular need, don’t force something that it is not designed to do. Pick the right tool for the right job. More and more we are seeing more solutions that are not just composed of a single data technology or they are not composed of a single piece of web technology. There are so many different pieces out there and different frameworks and different technologies that our applications now become these kind of compositions. Your data tier is now actually made up of three or four different data technologies. One specifically really good at storing and retrieving unstructured data for example like images. So don’t go and stick your images into a relational database, put your images in somewhere that is really good and is really cheap for storing them there. Don’t go in do things that feel a natural, if there is a better technology out there. Rather look to how I can take the things and compose them to gather and build my solution there.
Greg Low: it is interesting, even in the Mongo situation, there was an interesting example recently where somebody had built their bit coin exchange on that and a user ended up managing to completely empty that bit coin exchange simply because they haven’t really worked in transactions. Oh well, and so on.
Ryan Crawcour: it is this world of we are going to move to, I think if you read that article it is actually published online somewhere. They exploited the fact that a particular technology doesn’t have transactions and is running in eventual consistency and they knew how to do that. The guys that were doing as were actually pretty smart but you know I think there are still things that bit coin exchange could have done differently to ensure that that sort of stuff didn’t happen.
I think we let have learned from our mistakes, I don’t think there are other bit coin exchanges that are going to be. I think they may have learnt from that mistake.
Greg Low: indeed. So listen, in that case we have got sort of no SQL things at one end and we have had the relational at the other end. One of the ones that is kind of interesting of course is JSON, if I look at SQL Server for example we have got XML support built-in in SQL Server 2005, including X query which I thought was an interesting addition to the database product there. I think the industry basically moved to JSON in 2006 and if you look at the list of top requested items for SQL Server. The addition of native JSON support has probably been one of the top items on the entire list ever since then. There is a crying need for someone somewhere to stored JSON and to work with that prospectively. Now Document DB seems to be some middle ground between the no SQL side and be relational side, where we can stored JSON natively.
Ryan Crawcour: yes I think you are right the industry has somewhat kind of ditched XML and move towards JSON for a number of reasons. I’m not getting into those, XML with its opening and closing tags is kind of verbose and wastes a lot of space. Where we are sensitive to what goes across the wire and kind of stuff, so JSON is a more compact kind of structure. What is interesting is someone has invented a XML schema to represent JSON which is just bizarre.
Greg Low: that is awesome! I must admit the first place I used to come across JSON worked in Ajax calls out of websites and again it were short, sharp, burst the things that didn’t want the bulk of XML when they were communicating.
Ryan Crawcour: so I think, you would see a number of databases store JSON natively. Azure table storage has had JSON support for a while. Mongo DB those store a variant of JSON. There are a number of database technologies now that work would JSON directly. Document DB we took when we set out to build the servers, we looked at it. We are not just going to build JSON on top of it, we really want JSON to be the heart of database because it lends itself so well to this concept of no schema, and varying kind of structure. It just lends itself really, really well and we can do some really cool things just by taking JSON and building our indexing technologies on top of that. We took a big bet on JSON, that is our native data format. When you passed up to us it comes across as JSON, when you query back from us it comes back as JSON. It is stored as JSON in the database, it is indexed very similar kind of stuff.
Greg Low: yes it certainly ended up being everywhere at the moment I am certainly seeing it almost everything that needs a past data around. It is surprising the number of places that are now JSON base.
Ryan Crawcour: we will probably see JSON added to SQL Server at some stage.
Greg Low: yes that is certainly still in our list of we would love to see that. Certainly would be interesting to see the level at which that would end up in there. For example they could add storage, but they could also add the ability to query within the JSON and so on and so and or some sort of indexing.
Ryan Crawcour: yes I mean it would be interesting to see where it goes. If it is just going to be the same as we have had it, the select for XML or select at XML.
Greg Low: yes maybe for JSON instead.
Ryan Crawcour: yes I think there is more to what Document DB is doing, rather than just storing JSON. I think it is fundamentally given us the ability to store things that are different together and not be constrained by this concept of a table. Where you has columns with data types, the rows and as soon as you get a record that slightly doesn’t fit into that table you then have to force it in there.
Greg Low: yes what are you going to do. Yes that’s right.
Ryan Crawcour: or you have to restructure the table or something. With this you can take, we have a collection of documents and literally it is just that. It is a bag of documents, and each document is JSON so it has a structure but every single one of them can be entirely different. Yet we still indexing everything and allow you to query on everything.
Greg Low: yes, look I think that is going to be an awesome option that is available. Look down the track, my guess is that applications increasingly will just be things that string together services that are provided externally. Rather than things necessarily build and store and store all themselves. I think it would just be, the skills in the future will be how you tied together different services to a given outcome. So having it is a service is perfect.
Ryan Crawcour: I heard someone say many many years ago, I can’t ever remember where it was. I think it was when I first join Microsoft many many years ago, someone said well, as you are either a developer that builds kind of database technologies and network transport stacks and that sort of stuff or you are an application building line of business applications, adding value to your business. If you want to build databases and you want to build network protocols and transports and/or that kind of stuff go work for someone who builds stuff. If you want to be the person who is building a business application to add value to your business to get a job done and to solve a problem and to be more profitable and whatever the case may be. If I was that kind of developer, I really don’t want to have to be mucking around with stuff.I do want to be worried that how I manage indexes, and how do I build indexes.
A customer of ours said the other day, for him his mantra is not how to index better, but how not to index at all. Let’s let the database technologies to figure it out and you can guide it and you can give it some hints but let it figure it out. I don’t want to have to worry about indexing, just let it do that. I don’t want to have to worry about backups.
Greg Low: yes I think it is the same target eventually with SQL Server as well. My take on that is that I think very much it is a service will which it already kind of these but again I think you will eventually become a much self tuning, self everything service and in the end you would just have servers with the T-SQL endpoint and just talk to it.
Ryan Crawcour: yes that is the idea, I know it is a long way away. The server knows when the queries not performing well. Server knows when the index is optimal all when the index is not optimal. It should be I would learn from that and tuner index itself. Why does it have to wait for someone to come along and say oh by the way this index that you already know about is not very optimal? Let me change like this which you probably know about anyway. Let me change like these so it can become more optimal. I think we are a long way away from that.
Greg Low: yes I suppose the thing also with indexing, is at the moment you have to know in advance what the queries would look like all you have to wait until there is pain to try and sort it out any way.
Ryan Crawcour: I remember the things that we were trying to come to solve with Document DB as well. I’m not sure if you realise that you don’t really know when you set out to design your data model write our front before you build your application. You have an idea of what your application is going to need. Even if you know what your application today needs, in six months’ time the queries are different or the workload is different. You have gone from write heavy to read heavy or a verbose the kind of workload or whatever, or that affects indexing right? Or that affects how you are storing your data and that is one of the kind of goals with Document DB. To try and be more flexible and be more agile to let you just focus on storing your data and retrieving it and leave the indexing up to the database engine.
Greg Low: and now in terms of the interface for people talking to it. Unlike a T-SQL end point, this is just, this is a rest style endpoint.
Ryan Crawcour: yes.
Greg Low: we POST to put new things in, we PUT to update them, we DELETE to delete them and use GET to retrieve them.
Ryan Crawcour: yes absolutely.
Greg Low: we basically use just stock standard verbs.
Ryan Crawcour: yes, it is REST. We have a bunch of client SDKs, for your various languages, sitting on top of the REST APIs which we wrap and enhance and we do some things. We will handle retries, and we will handle some stuff inside the.net SDK. We will handle serialization, serialising so you can just work with your.net object, we will worry about converting to JSON on the way.
Greg Low: yes, as well is.net I was pleased to see you also had no JS option and python and java and so on. It doesn’t really matter so much what the client is, there are good connectivity options.
Ryan Crawcour: yes we have had some guys, the response has been kind of overwhelming. If you go and have a look today you will see people evolving other SDKs for us. So someone has started an initiative to build an SDK for GO. Shortly there would be a GO SDK published, although there are some other ones like PHP I saw the other day and stuff like that. It is again over the standard REST API with the standard verbs you are able come and build your own SDK or customise your SDK.
We will have all our SDKs, it will be all open source anyway so anyone in the community if you want to contribute and change the SDK. Or whatever you can join the community and contribute to it.
Greg Low: yes that is one of the real powers of having a REST API sitting under it. It is so easy to go off and build whatever is needed. You are not going to get boxed into a corner doing that. That is great now listen, so basically we have a JSON document store and it also has the ability to store attachments I noticed as well.
Ryan Crawcour: yes, so basically you can have a binary attachment and those we off load to BLOB storage under the covers. So the JSON itself is stored on our nodes on SSDs, with low latency and really low to write operations and read operations.
Greg Low: yes I notice that, it is actually optimised for write operations and SSD back under the covers.
Ryan Crawcour: yes if you give us any binary kind of stuff that you want to attach to that document, you can give it to us as an attachment we then offload that attachment into BLOB storage. Where we will manage the life cycle that for you. So if you then delete the document, we go in delete the attachment. If you query for the document, you can query for it with its attachments or without its attachments. All you would just be attachments whatever you want to do. We have sort of kind of built a file stream data type.
Greg Low: yes.
Ryan Crawcour: into BLOB storage.
Greg Low: so yes, like any of these in this case so it is schema free. In that case we can put basically arbitrary JSON documents in. The indexing is the thing that is interesting, from what I can see it seem to also index an awful lot of stuff but you could put influence over what it did and didn’t index in terms of performance.
Ryan Crawcour: yes by default, if you left all the defaults on we automatically and consistently hash index everything in your JSON document. There are a lot of work with and collaboration with some internal teams, we have come up with an indexing technology that we believe can keep up with higher rights and volumes over a sustained long period of time. So you don’t have to worry about the fact that because we are indexing everything your write performance is going to suffer. It can kind of keep up, now if you want to get even more write performance out of the system, you can kind of start playing around with changing be consistency all the indexing mode from consistent too lazy for instance which will then kick in and async indexing process which will give you even higher write speeds but then you get some inconsistent reads until the indexing is have caught up.
You can also do things like you have a JSON document that’s got arbitrary are hundred attributes in it and you know that you will own a query on 10 of them. What you can do to say space and make the charge for it kind of operations to kick out the best performance that you can. You can go into it and exclude out the 90 properties that you are never going to query on.
Greg Low: yes, actually that raises a good point because again in terms of performance the unit you buy this in I gather is capacity units and that seems a combination from what I could see storage in gigabytes and also request units which sounded a bit like a nebulous sort of being to what that was.
Ryan Crawcour: so what we have done is effectively it is a multi-tenanted service right so when you are running multi-tenants on a node and we need to provision up all slice up the capacity of that machine into tenants. So what we have done is that we have slice things up into these capacity units and that is how we break or slice up the boxes into these things called capacity units. Then we said all right each capacity unit is going to have a certain amount of storage and ease going to have a certain amount of throughput capability. When we looked at it we went well, some operations are going to be a lot more expensive than other operations.
For example if I’m writing a 1K document with five properties in it, it is going to have a particular kind of impact on my throughput or CPU or on my memory or on my disk I/O or whatever the case may be. If I am writing a 256K document with about 150 properties and they are quite jagged and nested, that is going to have a very very different impact on your performance. To be able to give you this predicted, or a way to calculate a predictable model we came up with this normalise concept called request unit. So everything that you do on the server whether it is a request, whether it is an update, whether it is and execute a stored procedure, it doesn’t matter what it is everything we calculate. We have got like an in accounting engine basically, that says write this operation cost you five request units, this operation cost you are hundred request units, this operation cost you ten request units. Now if you do that same operation tomorrow, the same costs will apply. So you can start getting this level of predictability, so you can actually plan your capacity. So you will would say well I do so many of these, so many of those, I estimate that we would need so many of these, you would be able to work out to a fairly accurate level what your total request units spend is. Then you can actually know how many capacity units you need to buy in order to satisfy your number of requests units that you are using. That is all the request unit is, it is kind of a virtual currency and when you buy a single capacity unit, currently you get 2000 request units per second. You know that you can spend 2000 of these things are second, go do some queries, go do some inserts, go fire some stored procedures. See what your charge is, we tell you in the response of every operation that we do we tell you what the charges. So you can see that a this query by ID, cost me 2 ½ units.
Greg Low: oh, so that is actually in what the head up all the body of the return?
Ryan Crawcour: yes it is in the header of the return.
Greg Low: oh that’s awesome!
Ryan Crawcour: so if you look at the return there is a header called charge, and that will quickly tell you the charge that we are accounting for this operation.
Greg Low: Ahh I must play more with Fidler and go and have a look. I had a notice that sitting in there that is awesome. This is really good for being able to estimate and also I suppose the performance of different types of queries in terms of costs.
Ryan Crawcour: so shortly we will have in the portal at some point, a way to be able to visualise which operations are costing you a lot, which operations are not as effective and efficient. We will allow you to be only change your own indexing policy maybe excludes out some of the paths or maybe change the precision of some paths whatever the case may be. Then rerun the same operation and see what that does to your charge. You will see that your charge could come down, hopefully your charge would not go up because you have gone the wrong way. If your charge has then come down, that means you can now do more of these operations per second then you could before.
Greg Low: and transactions as well, it looks like a level support for transactions. What are the boundaries for that though?
Ryan Crawcour: so the transaction at the moment is bound to a single collection, so think of a collection as a unit a partition. A lot of people in fact, it is even in some of the documentation for other document databases where they say if for a record in SQL is the same as a document in a document database. Well then a table in SQL is the same as a collection in a document database and that is fundamentally wrong I believe.
Our collection to us because it is no schema database, you can store any schema you like inside a collection. So why limit yourself to storing only a records of the same type together because that then limits you in terms of what you can do around things like collections, joins and all that kind of stuff. So for us a collection is really a partition. All of our stuff at the moment is bound to a single collection, transactions, joins, queries, all of those things are bound to a single collection. If your data spans multiple collections, then you need to start doing fan out type queries.
Greg Low: yes so it is the client that needs to do that, not the service that does that or some other tier.
Ryan Crawcour: so currently there is the client that needs to do that, so we are looking at what some of our other so SQL DB or Azure DB or what is it called today?
Greg Low: it is Microsoft Azure SQL Database.
Ryan Crawcour: or that one, or SQL Azure as we still all call it, I mean they just recently released the preview of something called elastic scale which is a change from federations before. Where federations used to trying to it on the server level and try to be one solution for everybody. Elastic scale is now kind of a library that you can kind of plug into your application and your application actually talks elastic scale. Elastic scale kind of manages your background for you which actual database you know has been working.
So we were looking at how they have done it and we were looking at some other databases and saying world these guys have decided to try and do it at the server level. These guys have said we are not even going to enter into that, it is all up to you. We will try and look and see what is the best approach, so you will see stuff coming from us in the future where we will make this easier for you. For today, it is the application that needs to manage that sort of stuff.
Greg Low: yes and look I think in the future, if you are looking at all sorts of scale things often I think having the application do that is often a very good solution anyway. I look at people who are trying to do fan out queries with SQL database and I think that actually misses the point. I would actually rather the application send out a whole lot of requests concurrently in parallel rather than the one service at the back end try to feed it out.
Ryan Crawcour: well I mean it is interesting that you kind of say that. We have even seen some queries where you do this kind of hey SELECT me a record where this ID is in a list of other values. I mean that is a common query that people write, now that is a fairly expensive query to process in the backend on a server, because you have to do it with a bunch of scans, so it is a fairly expensive query. Now imagine if there were five records in the list that you were doing. It is pretty easy to write and fire off those five individual queries in parallel, sit in .net and wait for them to return and then aggregate the results back together again.
A lot of times that is actually more efficient than things doing a big nasty ugly SELECT WHERE IN kind of query. You were firing smaller requests to the server, I am sure you are doing five and kind of network round-trips but not doing them synchronously.
Greg Low: yes that’s right you are not doing them sequentially, you are doing them at the same time and I think the other thing is probably gives it a more natural threading
Ryan Crawcour: absolutely.
Greg Low: in terms of each one is going to be picked up by a separate thread to go into it and you just wait for the more to complete.
Ryan Crawcour: yes you just wait for them or to complete and then you have the results sitting there and you can kind of filter and aggregate and join and do whatever you like in the application.
Greg Low: now one I did want to ask about though is the query syntax, which again is sort of SQL like from what I’ve seen as well.
Ryan Crawcour: it is mostly is SQL.
Greg Low: mostly SQL, there are kind of view clauses in there that I haven’t seen in SQL though.
Ryan Crawcour: yes there are a few things in there and I think what we have tried to do is kind of keep it as close to SQL as we possibly can. We use SELECT, FROM, WHERE like you would expect them to be used. We use the field AS some alias, like you would expect. We use the notion, now understand that a document is kind of hierarchical so we use the notion of dot kind of syntax, so SQL development is kind of familiar with server.schema.table whatever. So we use those same kind of syntax to kind of navigate a hierarchical tree.
Greg Low: yes.
Ryan Crawcour: but then there is some kind of interesting ones and we have had to introduce those just because of the fact that we are dealing with schema free kind of business. A lot of times in SQL Server, the engine knows before start executing the query, it knows a lot of things because it can infer a lot of that stuff from the schema.
Greg Low: yes.
Ryan Crawcour: now we don’t have the luxury, we have no schema. So we don’t know before we start processing those records necessarily what we are going to run into until we run into it. So there are some things that we have had to introduce into that SQL language for those specific things. You will see a particular key record called VALUE, and that is just do you want the value of the element or do you want the actual element itself?
Greg Low: the actual element, yes and I noticed there was a from memory there was an ON or FOR clause or something?
Ryan Crawcour: I think there is an ON when you are doing some joints, so.
Greg Low: yes it might have been, yes that was a little bit different some of that. But you know overall.
Ryan Crawcour: yes we struggle with that one for quite a while, in terms of what was going to be the easiest syntax to kind of make sense. I think some of the stuff that we done now is setting us up for where we want to go with a query engine in the future.
Greg Low: yes.
Ryan Crawcour: so like a lot of people
Greg Low: Scott was saying, it was just like grab the SQL and run but it didn’t quite look like that to me.
Ryan Crawcour: simple SQL like SELECT id FROM collection name WHERE a field equal a value that is just normal SQL. But when you start digging into a little bit more complicated things and you starting to traverse the hierarchical tree, then it starts getting a little bit more complicated. When we want to star introducing things like, find all the documents that don’t have this particular attribute, or find me all documents that have this particular attribute. You know those kind of things, they don’t have a SQL syntax for them today because there is no matching kind of query in SQL Server.
You can’t query a SQL table and say find me records that you know don’t have this field because all elements in that table have that field, they may be null.
Greg Low: actually it has been very similar for XML. I ended up having to write things that said look fine me this value if this happens to be there. You know that sort of thing, yes it is a different way of doing things.
Ryan Crawcour: yes it is something that we have had to deal with because someone may say well my query where ID equals a particular value, well not every document okay ID was a bad the example but if you said Name equals Greg while not every document in this collection actually has an attribute called Name.
Greg Low: Name yes. No indeed, I also noticed that there is a concept of something that I suppose stored procedure you might call it for familiarity for server side programmability.
Ryan Crawcour: yes.
Greg Low: yes, I mean we use at all over the place in client applications that is for sure and there is so many people who use it is node on the server and that would be an easy thing for them to then get their head around.
Ryan Crawcour: yes when we sat down and said well, what are some of the things people miss about SQL Server or a typical database server technology? Some of the things they said they missed was the ability to do things on the server. Now when we announced Document DB and we announced the fact that there was server programmability. Some people took their pitchforks out and sharpened it and said you are going back 20 years and forcing us to do things on the server, but we are not. I mean you can do everything client side in your app but there are certain cases where we are doing things on the server and it makes sense.
Greg Low: yes look I mean it certainly does in SQL land as well. I mean I cringe every time I see somebody rehydrate every single object in a table individually back up to middle tier and write them all back again all things like that. That is just ridiculous!
Ryan Crawcour: there are some cases where, where stored procedures make really good sense and in the Document DB world one of the really good things about stored procedures is that stored procedure runs within the context of a transaction so you can actually now make changes to multiple documents and different types under the covers all within the boundaries of a transaction.
Greg Low: yes, that is awesome!
Ryan Crawcour: you know you don’t have this world where you don’t have transactions any more, I can actually now debit an account or credit an account within a transaction.
Greg Low: which a lot of the document ones don’t currently do, which is very important.
Ryan Crawcour: yes with that kind of stuff.
Greg Low: so listen, in terms of scale out it is the next area of interest to me. So again we can have replicas of it but that is where we start to trade off the consistency? Or is it we choose whether to trade that off or not?
Ryan Crawcour: yes that is an interesting thing about Document DB, is that we allow you to choose your consistency rather than dictating it to you. We allow you to choose what consistency you want to run. So we have got four levels of consistency that you can choose. So right from strong consistency which everyone is already familiar with because that is what SQL Server does by default. Right down to eventual consistency which people are becoming more familiar with because that is what a lot of these other No SQL systems do.
Greg Low: yes.
Ryan Crawcour: we also have two that are kind of in between. There one that is really interesting is the one that we has said is the default which we call session consistency. What session consistency is, is it is a good trade-off between strong and eventual where to you on that session you are operating like strong consistency mode where you can read your own writes. You get a consistent kind of behavior, so if you change your name from Greg to Sam and immediately do a read you will see the name is Sam you won’t get this weird inconsistent kind of behaviour which is sometimes get on eventual databases.
But everybody else that is connecting to that database they are able to see you update immediately. To them it is an eventual kind of thing, they will eventually get the fact that you have changed your name but to then it is not that important. To them what is important is what they are currently doing with the piece of data that they are currently working on.
Greg Low: yes look that would answer very large number of scenarios that we come across.
Ryan Crawcour: yes, that is why we made that wonder default because we think that one for a lot of people will be a good kind of compromise between strong and eventual. But if you are running an application that has got massive write kind of speeds and you don’t really care about the fact that you can potential get some inconsistent reads, for then run eventual consistency. It is going to be your quickest one. If you are running an application where you absolutely have to have consistent reads every single time and you can’t afford a dirty read then strong consistency is the way to go and we allow you to kind of change that, which is kind of cool.
Greg Low: yes what about what write guarantees though? So if it is eventual consistency, what does that mean in terms of writes? Does the have to commit to a replica before it tells you he has done it?
Ryan Crawcour: so eventual consistency it will return back be act that the writers happened as soon as it has been committed to one of the replicas or one of the nodes and eventually in the background we will replicate that out to the other replicas. In an eventual consistency world we will enable our reads this still happen and the reads could be landing on any of the replicas anywhere. We will still allow, our replicas are actually serving reads and it is not just the primary. Their primary use for the writes, our replica are is for our reads.
Greg Low: I think what I was getting at is that tells you that it has written, is it actually written?
Ryan Crawcour: Yes if it tells you it has written, I guarantee you it has written. It may not be replicated to the other nodes yet but it has been written. Absolutely!
Greg Low: yes that’s good. Now what about security? Now I gather there are item level permissions in place? And so how is that done? We have a series of roles?
Ryan Crawcour: so today what you can do is we have the notion of users, so like SQL has of users.
Greg Low: yes.
Ryan Crawcour: you can also create this notion of permissions, so you can say this is a permission object that grants read access to a particular document or read/write access to a particular connection. And then what you do is that you can assign users or user to that permission object. So you can actually say, use these users have this permission or those users have that permission.
Greg Low: what we normally say in SQL server though is that we don’t like individual user permission, we try and drop people in roles and assign permissions to roles instead.
Ryan Crawcour: yes, that has been identified as one of the downsides that we need to work on is role based authentication. We are kind are waiting to see what happens to see the rest of Azure and it’s role based, you know permission structure. As you know we are starting to add back into the Azure portal, we are starting to do things like this role in the Azure portal has read only access or write only access. What we are trying to do is that we don’t want to go off and build this notion of roles and then have to roll that back out and implement something else. So we are kind of in a waiting game at the moment to see where backers are kind of going and how we can leverage some of that stuff and how we going to integrate into active directory or Azure active directory for instance.
Greg Low: yes because I must admit, that is one of the big ask in terms of the database formerly known as SQL Azure. It tends to be, people after authentication or things like that rather than just SQL server authentication.
Ryan Crawcour: absolutely, I think that has been an ask since I have been working, since before I was working in this role. It is an ask that people are having on Document DB and it is an ask that we will and so some point we are just waiting for some things to kind of catch up before we can make better on it.
Greg Low: so if you have users at the moment so I presume you have users and passwords. What does that mean about passwords policies?
Ryan Crawcour: so we don’t actually have, we are not an identity provider. So we don’t actually have passwords, right? Users have request tokens, say every user you create in the system gets a token. Like when you work with storage or when you work with Document DB, we would give you these master keys.
Greg Low: yes an access key.
Ryan Crawcour: an access key, so every user gets an access key. Now what will end up happening is the user needs to go off and authenticate wherever your users are authenticating today. Your application would then say yes I trust this user is who they say they are because I trust this identity provider. This user has access to my application, all is good. You would then go off to Document DB potentially and you would say hey Document DB for this user give me their permission levels and we will then craft up a token for you that contains those permissions. It contains an expiry date and all that kind of stuff and we start to encrypt it and we give that token back to you. You can then take that token and your application tier can take that token and give it to the user and the user can then use that token to talk directly to Document DB.
Greg Low: yes all makes sense. Another capability then that we would normally have in database is some way profiling and this is something that is also missing from the database that is formerly known as SQL Azure, is there some way of attaching like a profiler? Because if I want to collect all the queries that have been run against Document DB, is there a way to do that at the server end rather than at the client end?
Ryan Crawcour: at the moment no. It is something that is definitely now ask list and we are looking at it and determining the best way to do it. We log everything down to an individual activity ID but it is how we expose that out to users in a meaningful kind of way. So it is definitely something that we want to look at in terms of how we do this.
Greg Low: I suppose, at the moment we can build just a proxy that sits as a service in Azure, and something anyway that hits that and have it re-routed and so we can do that to pull it out but.
Ryan Crawcour: We are going to add to the portal soon I think, I can remember the exact roadmap that soon we will add the ability to see what queries have been executed.
Greg Low: yes.
Ryan Crawcour: SQL Azure has the ability to see your top five executed queries, that kind of thing. We will have the ability inside the management portal before too long and then you will be able to get some level of indication but it won’t be every single query. Whether we create these management views or something like that we don’t really know yet where we are going to go with it.
Greg Low: yes I think if we build like a redirect proxy or something that would be kind of interesting because you could also catch the cost from what you were saying. Basically details and things as well, that will be actually kind of cool.
Ryan Crawcour: Yes that can be possible.
Greg Low: look on a similar vein, the next question is usually about auditing. So again is there anything in place or in play?
Ryan Crawcour: No today, we don’t have automatic auditing in play today. We have got a couple of customers that are using triggers to do auditing so theirs the notion.
Greg Low: Yes, that is right there is the concept of a trigger isn’t there?
Ryan Crawcour: Yes there is a pre-trigger and post-trigger, so you can use that to do auditing if you want.
Greg Low: can you talk to the SQL Server guys and get them to think about pre-triggers?
Ryan Crawcour: I know that SQL DB guys have kind of now released native kind of auditing capabilities. We are speaking to them, we are all on the same team ultimately. We are speaking to them to see what we can learn from that but today the best way to do that would be with a pre-or post-trigger.
Greg Low: yes, that is another good scenario where server side bit of code actually makes sense.
Ryan Crawcour: yes absolutely! The nice thing about triggers is that you will see when you use them is you actually have to opt into them. So the trigger can be defined on the server but when you actually execute be command you actually have to opt into the fact to get into them.
Greg Low: yes interesting!
Ryan Crawcour: so won’t actually execute and two-year actually tell it, hey a look with this operation five the following trigger.
Greg Low: yes alternately you could make the operation and be one that cause a procedure there and it could do the logging as well as doing the operation.
Ryan Crawcour: yes, so exactly. The operation could just be a call to a stored procedure and these stored procedure can do your audit log and you can actually do the operation and he could do it all in a transaction if you wanted to.
Greg Low: yes, that is nice. So look for people to. That is awesome, actually, that sounds awesome, so for people to try this, the obvious thing is that Azure is a really easy place for them to go and try this as well.
Ryan Crawcour: yes absolutely, this will work on any Azure account, whether it is a full blown pay as you go account or MSDN account or a trial account. You just got to go to the preview portal and then the preview portal would just asked you to sign up for Document DB account. Then create a database then create a collection and start putting some documents in and off you go.
Greg Low: yes that’s cool!
Ryan Crawcour: we have just release a new portal capabilities that allow you to, write some queries, add some documents. That kind of stuff right there in the portal, so you don’t actually need to write any code at all but when you are ready to write some code, you just grab the SDK of your choice and have a look at some of the samples that we have got published and you should be good to go.
Greg Low: Yes, also fellow Kiwi, I notice that Chris Auld also had a Document DB session at Azure Conf the other day. I will put a link to that in the show notes as well.
Ryan Crawcour: I noticed I think the same session or a similar session at Tech Ed Australia.
Greg Low: Tech Ed Australia yes.
Ryan Crawcour: those are really good sessions I think, and then there is Azure Friday and Cloud Cover, you will see I have done two sessions there, so have a look at those and you know if anyone wants more information they can reach out to me.
Greg Low: yes.
Ryan Crawcour: cool.
Greg Low: Magic!
Ryan Crawcour: Alright.
Greg Low: thank you so very much for your time today right.
Ryan Crawcour: it was a pleasure thanks for inviting me.
Greg Low: Awesome, we would talk again soon.
Ryan Crawcour: Thanks Greg, chow!