Erin Welker
SQL Down Under Show 55 - Guest: Erin Welker - Published: 23 Feb 2013
SDU Show 55 features former SQL Server MVP Erin Welker discussing common design issues and patterns in SQL Server dimensional modelling.
Details About Our Guest
Erin Welker is a BI consultant at Artis Consulting in Dallas, Texas, and her passion is SQL Server BI. She has been working with SQL Server since version 1.11 and has served as a data warehouse architect or lead DBA on several large projects for globally recognized companies. She has also presented on SQL topics at several conferences and SQL Saturdays, and is the author of various white papers, articles, and book chapters on SQL Server and data warehousing.
Show Notes And Links
Show Transcript
Greg Low: Introducing Show 55 with guest Erin Welker.
Welcome, our guest today is Erin Welker. Erin is BI consultant at Artis Consulting in Dallas Texas, her passion is SQL Server BI. She has been working with SQL Server since 1.11, she serves as a data warehouse architect or lead DBA on several large projects for globally recognized companies. She has also presented on SQL topics at several conferences and SQL Saturday’s and is the author of various white papers, articles and book chapters on SQL Server and data warehousing. So welcome Erin!
Erin Welker: Thank you so much, glad to be here.
Greg Low: Indeed, I haven't managed to catch up with you for probably a couple of years now, at one of the PASS events or something. You are working in and around Dallas as a consultant though?
Erin Welker: That's correct. I try not to travel too much; that explains why we haven't been running into each other much.
Greg Low: Indeed.
Erin Welker: I do try to make it to the PASS summit every year though.
Greg Low: Yes, I try to; in fact last year I had a thing come up that stopped me doing that, but hopefully this year anyway. What I get everyone to do when they first come onto the show is tell us: how did you ever get involved with this?
Erin Welker: With this?
Greg Low: With SQL Server and or BI.
Erin Welker: SQL Server, that's actually so long ago that it is almost hard to remember. I am old enough to have come from a mainframe environment, and we worked with a product called Model 204. At the time I was a programmer and I was just fascinated by how the database worked, and I was constantly in the DBA's office saying, how does this work? They would tell me about caching and locks and this and that, and I thought, wow, I want to do that. Then, just when PCs were really starting, the opportunity came up for a PC DBA, and the platform we would be using was SQL Server version 1.11. That is what we were using at the time, and I was given that opportunity, and I took it and I never looked back.
Greg Low: Wonderful.
Erin Welker: I love it!
Greg Low: I do have a copy of 1.1 running on my machine and it is totally fascinating now to sit and look at that and compare where the product is now. It must feel like quite a journey.
Erin Welker: Yes, there are actually a few things that are still the same. If you go into SSMS and press Control-R you will still get the results pane toggling back and forth.
Greg Low: Control E executes code.
Erin Welker: Exactly, right.
Greg Low: Indeed, it is kind of interesting to work with. But look, your passion is mostly in the BI area; how did you move into that?
Erin Welker: Well, that is an interesting question. I guess it was the mid to late 90s; the users needed to be able to dynamically query claims data. In my head I think it was about a 200 MB database, which at the time was huge, and they needed to be able to query that and mine it, so to speak. Not mining as we think of it now, but ferreting through all that data and various things. We as DBAs gave them an ODS, and they were sort of like super users; they were able to go into the database and query in real time.
Greg Low: Actually, I should play acronym police; so ODS, defined?
Erin Welker: Operational data store, which is essentially another copy of your database, only you can query it without impacting the production OLTP environment. It is a reporting platform, and even though it wasn't a star schema or anything, I kind of consider that my first step towards BI, because we had to deal with all the things we deal with today: load windows, how do you load a large amount of data (it is all relative, but at the time it was a large amount of data) and have it available to the users, dropping indexes and all that stuff. I went through all of that back then.
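As a minimal illustration of the load-window pattern Erin describes, disabling indexes around a large load and rebuilding afterwards, here is a hedged T-SQL sketch; the table, column, and index names are hypothetical:

-- Disable a nonclustered index before the bulk load (hypothetical names).
ALTER INDEX IX_FactClaims_ClaimDate ON dbo.FactClaims DISABLE;

-- Load the data while the index is out of the way.
INSERT dbo.FactClaims (DateKey, MemberKey, ClaimAmount)
SELECT DateKey, MemberKey, ClaimAmount
FROM staging.Claims;

-- Rebuild all indexes on the table once the load completes.
ALTER INDEX ALL ON dbo.FactClaims REBUILD;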
Greg Low: Actually, an interesting question: at the time you considered that a large amount of data; what do you now consider a large amount of data?
Erin Welker: Well, I like to work with large amounts of data; I can't say that I have had the opportunity to do so recently. Certainly, to me a large amount of data would be multiple terabytes. I hear of people working with more than that, but to me that is still pretty large.
Greg Low: I have clients that have tables up in the 6 to 8 TB range. I must admit it makes you very careful about everything you do with it, because everything takes a long time, and everything takes a long time to undo if you get it wrong.
Erin Welker: Right good point.
Greg Low: So that's good. One of the passions that you noted you have at the moment is BI done right, so what I would love to run through is what characterizes BI done right, as opposed to done wrong?
Erin Welker: I hate to be black and white, this is wrong and this is right, but when I got into true BI as we know it, one of the very first things I did was pick up the famous Kimball data warehouse toolkit book, which, at least the very first part of it, really most of it, is not a highly technical book. First of all it is vendor agnostic, and there is a lot of emphasis on front-end business analysis, not gathering requirements in the traditional way that we used to on an OLTP application. That just makes so much sense to me, and when I first started out in BI I was often the only person at my company who did BI, so I got to do things the way I wanted to do them, and I implemented that right away. Now, I am not a business analyst, I am much more a technical person, but I did my best to ask the questions regarding how the business works: not what do you want to report on today, but how does your business work? What would you see yourself doing in five years? That kind of global, futuristic question gave me insight into not only what would be valuable information today, but what could be important to them going forward, and I have since seen cases where doing that resulted in a much better outcome than not doing it. I went to one client, a company everybody would recognise, and we went in to tune their ODS. The more I looked at it, the more I thought, gosh, this should be cube-ified. I am not sure if that is a word, but you know what I mean.
Greg Low: Indeed.
Erin Welker: Slice and dice, and I thought that would solve their performance problems as well, so I said, well, why don't we do that? They said, well, we did that, and it was great for a month or two, but it just didn't grow with the business questions. I thought, that is somebody that did not do their front-end work to ask the right questions. So it is something that I have come to feel very adamant about, and as I went to other companies that had an existing business intelligence department I could see they weren't doing that. I was kind of surprised, and got on my soapbox about doing BI right.
Greg Low: That is good. Actually, it is worth noting that there are two versions of the Kimball book. The Data Warehouse Toolkit was the original, and then there is The Microsoft Data Warehouse Toolkit, and I actually prefer the original one because he spends a whole lot more time on dimensional modeling and things like that in a very vendor-agnostic way. With the Microsoft one, I thought it would be good because it might focus on how that is done with the Microsoft tooling, but sadly I find he spends most of his time talking about what is different about using the Microsoft tooling compared to the Kimball method. The problem I have had when I point people at it is that if they haven't read the Kimball method, it is really weird reading a book that spends a whole lot of time comparing itself to something else. Yes, the original book I think had a really good level of interesting information around dimensional modeling.
Erin Welker: I couldn't agree with you more. The Microsoft toolkit came out post 2005, I believe, so it wasn't even available when I was reading them. There are some other modeling books that he has put out, I think there are two or three of them, and the Toolkit has gone through a second edition, and I think I saw a third edition come out this summer. I don't know what the differences are; I didn't get edition 2, but I do plan to get edition 3. No, I am not getting any money from selling Ralph Kimball's books.
Greg Low: From Kimball no, no.
Erin Welker: To me it is the very best place to start if you are going into BI.
Greg Low: I don't think he's broke, so I think it would be okay. Indeed. So listen, with modeling, when starting to build dimensional models, what are the most common mistakes you see?
Erin Welker: Well, if I go to a client I will see a fair number of mistakes. I guess probably the most common mistake I see is throwing things into the fact table that probably don't belong there. I could go down a whole other road with indexing strategies, that is another passion as far as performance tuning; I have seen no indexes at all, and I have actually gone into one client where they indexed every single column in the dimension tables. As far as logical modeling goes, I think most people by the time they get to that point have a pretty good understanding of dimensional modeling. I think it is somewhere between the logical implementation and the physical implementation that things kind of break down on us, so maybe performance problems are encountered, or we think, maybe we should throw these things in here, it may perform better, or something. Is this something you have experienced?
Greg Low: Yes, very much so. Apart from that, what are some of the things you have seen? For example, are you religious in terms of naming conventions?
Erin Welker: I try to be, and I will admit to you that sometimes that breaks down as project demands start coming along, but I do try. I am not religious regarding what the naming conventions are; I try to be adamant that once you set the standard, you follow it, be consistent. We have all been through those discussions before at various places, it needs to be this or it needs to be that, especially when you go into various environments. I try to be conscious of the fact that people in an environment might be more comfortable with their own type of naming conventions, but to me it is all about being consistent. I will admit I hate underscores.
Greg Low: It is great to hear you say that. I think one of the marks of maturity in consulting is that you don't want to simply replace everything the first time you walk in the door. Even if somebody has a scheme in place, unless there is a compelling reason, in general you need to try to make whatever you build fit in with the scheme that is already there, because otherwise the outcome of having three or five people working on things is such a mess.
Erin Welker: Yes, well, I think a lot of that comes from the fact that I kind of had two careers. I was at a single organisation before I became a consultant, and I was there for 16 years, so I got a feel for it; it was our baby, there was a reason for everything, various things happened and I knew what those things were. Most of the people that did the database design and the development were doing the very best they could with what they knew and the tools they had at hand. I try to think about that as I go into organisations: the people there aren't stupid, they aren't lazy for the most part, they have done things to the best of their ability, so it is not that they don't care. I have heard that from various consultants as they go in: oh, they don't know what they are doing. No, they are doing the very best they can.
Greg Low: Indeed, and look, I even see it in the SQL Server product. You only need to look at tinyint, smallint, bigint, and then you look at smalldatetime, datetime, datetime2; it wouldn't have taken Einstein to come up with bigdatetime instead. Whether or not you like it, that was the scheme that was in place, and I think it is really important to try and fit in with whatever is already there. One of the things I would really like to know, in terms of how you lay out your dimensions and so on: one of the code smells, shall we say, that I find when I go into sites tends to be a large number of snowflake dimensions. I would love to know your thoughts on that.
Erin Welker: Well, it is funny you should say that, and I think this came from my DBA third normal form background. When I first went into BI I was big into snowflake schemas, and I read Kimball, and he said go star and here are the reasons, and I was like, SQL Server can handle the joins as long as you index them well. What I would do from the users' standpoint is create a view and make it appear like a star schema, to simplify it for the users. As time went on I gravitated towards the star schema; I can't remember the last time I did a snowflake design. I think again it goes back to when I was a DBA; it just took me a long time to get rid of those tendencies to not be redundant and to save space, and to move towards, well, this is a different world and this works better.
Greg Low: Yes, I think that is the key thing. I can always tell when I go into organisations if someone from a very relational background has designed the dimensional model and they haven't been doing it for very long; one of the telltale signs is a large number of snowflake dimensions.
Erin Welker: That is a good point.
Greg Low: I presume you tend to flatten them all out so?
Erin Welker: Oh yeah.
Greg Low: So if I have a product in a subcategory and the subcategory is in a category you would simply flatten those out into a product dimension and then assemble a hierarchy within the product?
Erin Welker: Correct, yes.
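To illustrate the flattening Greg describes, here is a hedged T-SQL sketch; all the table and column names are hypothetical, not from any specific system:

-- A product dimension with category and subcategory denormalized into it.
CREATE TABLE dbo.DimProduct
(
    ProductKey int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    ProductID int NOT NULL,                    -- business key from the source
    ProductName nvarchar(100) NOT NULL,
    SubcategoryName nvarchar(50) NOT NULL,     -- flattened from the subcategory table
    CategoryName nvarchar(50) NOT NULL         -- flattened from the category table
);

-- Populated by joining the normalized source tables back together.
INSERT dbo.DimProduct (ProductID, ProductName, SubcategoryName, CategoryName)
SELECT p.ProductID, p.Name, sc.Name, c.Name
FROM src.Product AS p
JOIN src.ProductSubcategory AS sc ON sc.SubcategoryID = p.SubcategoryID
JOIN src.ProductCategory AS c ON c.CategoryID = sc.CategoryID;

The Category, Subcategory, Product hierarchy is then assembled over these flattened columns within the one dimension.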
Greg Low: Lovely. How do you make your decisions about what is going to end up in a fact table?
Erin Welker: Well, that is one of those things that I don't know would be easy for me to express. Once you have done something for a while, and don't get me wrong, I can still learn things, but once you have done it for a while you just kind of do it. You listen to the users, and even as I am listening to them I am thinking: fact table, dimension, measure, yes, that would be a separate fact table. I can't say there is any explicit process that goes on in my mind. I hate to presume that it is second nature, but it kind of is. That is not to say that when I have implemented something I don't sometimes look back and go, hmm, maybe I should have done that a different way; that still happens. But I can't come up with a flow chart and say, oh, if this, I put this here. In general I stay true to traditional fact tables: I have foreign keys to surrogate keys in the dimension tables that are meaningless as far as the business goes, usually identity columns, and keys for the dates. I put measures in there, and every now and then I put in reference dimensions, say an order number or something that makes it easier; as often as I hate to say it, it helps me diagnose things, it helps me during the ETL. I don't do that all the time. Sorry, I meant to say a degenerate dimension. Is that right?
Greg Low: Yes indeed, a degenerate dimension, yes indeed. I think the…
Erin Welker: I try to stay pure as far as what goes into the fact table.
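A sketch of the fact table shape Erin describes, with surrogate keys into the dimensions, measures, and a degenerate dimension carried directly in the fact table; the names are hypothetical:

CREATE TABLE dbo.FactSales
(
    DateKey int NOT NULL,              -- surrogate key into DimDate
    ProductKey int NOT NULL,           -- surrogate key into DimProduct
    CustomerKey int NOT NULL,          -- surrogate key into DimCustomer
    OrderNumber nvarchar(20) NOT NULL, -- degenerate dimension: no separate dimension table
    Quantity int NOT NULL,             -- measures
    SalesAmount decimal(18,2) NOT NULL
);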
Greg Low: In amongst that discussion you mentioned that you tend to have foreign keys in place. That is always a topical discussion. One of the things I find is that so many places I go into don't have any, but when I discuss why with them, they always say it is a performance thing, we can't have them. I always ask them if they have tested that, and they usually haven't.
Erin Welker: Well, I will tell you, when I say foreign keys I don't always mean I have implemented an actual foreign key constraint. I did actually test that back in the day; unfortunately, sometimes I test something and years go by and I haven't retested it to say, okay, five versions of SQL Server have come out, now is the time to retest this and see if it is still the case. If it is a very large data warehouse I usually will not implement foreign key constraints in the database, but that is assuming we have a very strict, formal ETL process in place that is doing that checking beforehand.
Greg Low: Yes, that is the thing that concerns me. As I said, I find people often tend not to put them in at all, and in most places I go into there might be only one or two tables where, if you actually tested, it would make a difference. Even then, in most cases it is not all that substantial; there are some large fact tables where it will matter, but I have a personal preference for putting them in, to the point where I like to see them there until it is shown they can't be there. That is my thinking. One of the things I find all the time is people saying, oh look, the data is coming from a system that does the right thing, it is checked in that system, so it must be okay over here too. Don't you love it? I love the way you said formal around the ETL, because the thing they never think of is that you might have bugs in the ETL process or things like that. That is where it is not necessarily a valid argument.
Erin Welker: Yes, well, I am making a mental note to myself that I do want to test that again, and you have said something that reminds me of something from many, many years ago. Someone on the development side, I suspect, was reviewing a design with a group of us DBAs when I was a DBA. We asked him why he had designed it like that; I believe he had duplicated a column into another table. He said, oh, that is so we won't have performance problems, and my thought was: why are you assuming ahead of time that you are going to have a performance problem, and possibly introducing a problem, when you don't even know you have a performance problem?
Greg Low: That is right.
Erin Welker: So I agree, the best scenario is to put it in there, and if it does become a huge problem you can always drop it afterwards.
Greg Low: Yes, and another thing: nowadays you can disable constraints and so on, so it is nice to have them in the metadata, where they are discoverable if you really want them. The other thing I find, as a corollary to that, is that every time they tell me they don't need them, I go and check the data and it is usually wrong.
Erin Welker: Oh yeah.
Greg Low: That is the other aspect of it. What is interesting is that I find the same in transactional systems. People say, I don't need to do that, you know, the application does that. You say, yes, okay, and then we go and check the data, and you find out that it is wrong, and you find the particular rows, and they go, oh yes, there was that bug that we had the other week. There is always a story, but invariably the data is wrong.
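As a hedged sketch of what Greg describes, keeping the foreign key in the metadata, disabling it around a load, and then actually checking the data; the constraint and table names are hypothetical:

-- Disable the constraint around a large load.
ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT FK_FactSales_DimProduct;
-- ... run the load ...

-- Check the "the application does that" claim: fact rows with no matching dimension row.
SELECT f.OrderNumber, f.ProductKey
FROM dbo.FactSales AS f
LEFT JOIN dbo.DimProduct AS d ON d.ProductKey = f.ProductKey
WHERE d.ProductKey IS NULL;

-- Re-enable and revalidate (WITH CHECK makes the constraint trusted again).
ALTER TABLE dbo.FactSales WITH CHECK CHECK CONSTRAINT FK_FactSales_DimProduct;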
Erin Welker: It reminds me of when you go into a client to implement their first data warehouse and they say, oh, our data is clean. I have never, ever found that to be the case; in fact, they are amazed and horrified at what the data warehouse reveals about their operational data.
Greg Low: Because they can finally actually see the data, and it literally hits them in the face. They realize it isn't a pretty sight. I find it quite fascinating; it is a very, very common thing I see. As I said, I like scenarios where I put things in place, and if I get to a point where I cannot have them, fair enough, that is an argument there. You mentioned along the way that you tend to use surrogate keys as well. I do exactly the same, but I would love to hear you describe why you do so.
Erin Welker: Well, that goes back to the Kimball data warehouse toolkit again. One thing I liked when I was first reading it, and again this was me coming from a relational background, so it was all a foreign world, is that he is good at explaining why you do this, and again it all made perfect sense to me. You want to remove, or abstract so to speak, the key of the dimension from what the business knows, and there are various reasons for that, certainly if you have got slowly changing dimensions, Type II specifically.
Greg Low: So to be able to version a row, you obviously have to have a key different from the source system's.
Erin Welker: Say I am designing a dimension and they don't want Type II, we will just replace any changes; but if they change their mind, now your entire design has changed and you need to update all the keys in the fact table. He also talked about, say, an employee dimension where departments in your organization key employees differently: for example, one department might key by Social Security number and another department might key by what they call an employee ID. It is a way of saying, okay, you can both reside within the same dimension, and we have assigned this totally meaningless key, so you can look at it either way depending on what department you are from.
If you have a corporate merger where now you have got to bring in new information whose keys are different, the surrogate keys abstract you from that. From a DBA standpoint, a surrogate key is usually much smaller, so your indexes are much smaller, and that is always good for performance and memory utilization. Of course, once you put all of those into a fact table, which has millions and billions and trillions of rows, that is going to make those rows narrower.
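A minimal sketch of the surrogate key pattern being discussed: a meaningless identity key, the business key kept alongside it, and Type II effective-date columns so a member can be versioned. All names are hypothetical:

CREATE TABLE dbo.DimEmployee
(
    EmployeeKey int IDENTITY(1,1) PRIMARY KEY, -- surrogate key: narrow and meaningless
    EmployeeID nvarchar(20) NOT NULL,          -- business key from the source system
    EmployeeName nvarchar(100) NOT NULL,
    Department nvarchar(50) NOT NULL,          -- a Type II attribute
    EffectiveFrom date NOT NULL,
    EffectiveTo date NULL,                     -- NULL for the current version
    IsCurrent bit NOT NULL DEFAULT 1
);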
Greg Low: Yes, I always think the basic rationale is that even though you might have a source system key, you might also have multiple source systems, and that is a big issue. The versioning is a compelling reason for doing it, but the others are performance and complexity: where you have source systems with composite keys, multiple columns making up a key, I don't want that flowing through the design of the data warehouse. So I suppose consistency of the design of the data warehouse is another key aspect of it.
Erin Welker: Yes.
Greg Low: That is good. One of the other things I tend to be a bit of a fan of having, for example in a fact table or stored somewhere similar, is lineage information. Somebody will often look at a result in some sort of analysis and say, hang on, I don't believe that number, and I think one of the most important things is being able to identify where that value actually came from. How did it come to be in the data warehouse? What do you tend to do about that sort of thing?
Erin Welker: Well, can you give me an example?
Greg Low: One of the things I would do is, if I have some ETL process that runs, for some sort of chunk, or every time it executes, I might give it some sort of number and record something about it in separate tables, and I would just have the lineage ID itself in the fact table.
Erin Welker: Okay, yes, I do a version of that. Usually I put some kind of batch ID in to cater for it. One reason I went from being independent back to working with a consulting company is because I felt that I was doing all of it by myself and I wanted to learn how other people were doing it, so as I go to various other places I am adopting their frameworks. But in the past, as an ETL process ran, each run generated a batch number identifier that was assigned to any row that it inserted or updated. If a Type II slowly changing dimension attribute resulted in a new row, then the batch ID was assigned to that new row, so I could go back to the logs from that ETL run and say, this batch ID ran on this day and generated these rows, and that is when the change occurred.
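A hedged sketch of the batch-lineage pattern Erin describes; the table and column names are hypothetical, and it assumes the fact table carries a BatchID column:

-- One row per ETL run.
CREATE TABLE etl.Batch
(
    BatchID int IDENTITY(1,1) PRIMARY KEY,
    StartedAt datetime2 NOT NULL DEFAULT SYSDATETIME(),
    PackageName nvarchar(100) NOT NULL
);

DECLARE @BatchID int;
INSERT etl.Batch (PackageName) VALUES (N'LoadFactSales');
SET @BatchID = SCOPE_IDENTITY();

-- Every row written by this run carries the batch number.
INSERT dbo.FactSales (DateKey, ProductKey, CustomerKey, OrderNumber,
                      Quantity, SalesAmount, BatchID)
SELECT DateKey, ProductKey, CustomerKey, OrderNumber,
       Quantity, SalesAmount, @BatchID
FROM staging.Sales;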
Greg Low: What are your thoughts around granularity of data? How do you decide how fine-grained the data you store will be?
Erin Welker: I tend to err on the side of going lower, even though I know that will result in more data. If I hear any hint during the interviews with customers that they might want to go a little lower than initially thought, that is what I will do, I will go lower. We can always aggregate higher, but you can't take something aggregated at a higher level and then, in real time, try to dissect the detail behind it. I was actually recently on a project where that is exactly what happened: they went to a higher level, and the customer came back later and asked, where is all our detail? Well, you didn't say you wanted that, and it took a while to redesign it to go to the next level down, so that is not a mistake that is easy to fix.
Greg Low: No, I must admit I think very much the same. I think if in doubt, store the lot; I tend not to aggregate, but again there will be some scenarios where that is a problem. I have seen things like electricity authorities where they used to get a reading every 3 months from electricity meters and put it into a table, and as they changed over to smart meters they started getting a reading every 15 minutes instead. They have exponential table growth there; there may come a point where it is simply impractical for them to keep it all. Even there, I think with tiered storage and other ways of trying to deal with it, I would really try to store it if I could.
Erin Welker: I think we have both been with SQL Server for a long time, and we know the old reputation was that it was not a platform that could scale well, but that perception in many cases, even among people who are implementing solutions, still very much persists. We now have so many tools available to help us deal with large volumes of data.
Greg Low: Yes it does do a remarkable job now.
Erin Welker: So it goes back to what we were talking about earlier, making the assumption that there is going to be a performance problem. It is easier to go lower, and if it does become a performance problem there are ways of mitigating that. Maybe you partition, maybe you have a summary view for longer periods of time and keep the detail for more recent periods; there are design decisions and technical decisions you can usually make to mitigate it.
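As a minimal sketch of the partitioning mitigation Erin mentions, partitioning a fact table by date; the boundary values and names are hypothetical:

-- Monthly partitions; each boundary date starts a new partition (RANGE RIGHT).
CREATE PARTITION FUNCTION pfSalesDate (date)
AS RANGE RIGHT FOR VALUES ('2013-01-01', '2013-02-01', '2013-03-01');

-- Map all partitions to a filegroup (tiered storage would map these differently).
CREATE PARTITION SCHEME psSalesDate
AS PARTITION pfSalesDate ALL TO ([PRIMARY]);

-- The fact table is then created ON psSalesDate(SaleDate).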
Greg Low: Yes, one of the things I'd like to get your ideas on is the often-discussed question of whether facts can ever be updated or not. Do you tend to come across things with updatable facts, or once they are facts do they tend to stay?
Erin Welker: Well, one would certainly like to think that once you have a fact it stays, and if you have a change, maybe you make an adjustment.
Greg Low: So yes, compensating transactions.
Erin Welker: But I was actually at a client not very long ago where there was just no way to do it. Basically they were a third party that was getting data from clients, and the data was just changing. I scratched my head, and all I could do was update the facts; maybe there was a way of getting around it, but I certainly couldn't find one. If possible, I prefer not to do that, for various reasons.
Greg Low: Yes, most of the time I would prefer some sort of compensating transaction that reverses what was there and puts the new one in, but yes, sometimes you just can't do that. Again, I see people with hard and fast rules on this, and I don't think you can have them. I have seen people who have invoices that maybe are never able to be updated, but if they have some sort of pro forma invoice that is in a nebulous state itself, it is really, really hard to be adamant and say that can't change, because absolutely it can change, and that is the thing you are actually reporting on.
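A hedged sketch of the compensating transaction Greg prefers, assuming the hypothetical FactSales table from earlier and a single original fact row for the order:

-- Reverse the original fact by inserting a negating row.
INSERT dbo.FactSales (DateKey, ProductKey, CustomerKey, OrderNumber, Quantity, SalesAmount)
SELECT DateKey, ProductKey, CustomerKey, OrderNumber, -Quantity, -SalesAmount
FROM dbo.FactSales
WHERE OrderNumber = N'SO1234';

-- Then insert the corrected version of the fact (illustrative values).
INSERT dbo.FactSales (DateKey, ProductKey, CustomerKey, OrderNumber, Quantity, SalesAmount)
VALUES (20130215, 42, 17, N'SO1234', 3, 327.50);

History is preserved: totals are correct after the adjustment, and the original value is still visible in the table.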
Erin Welker: Yes, right, that is what keeps us in business and keeps life exciting: there are no absolute rules, and every now and then we make a big adjustment based on how their business works.
Greg Low: Do you tend to do a lot of calculations during a fact load? I suppose the question is, if I have a quantity, a unit price, a tax rate, and a discount percent, and I need an extended price, do you tend to calculate those and store them in the table?
Erin Welker: I guess the answer is it depends.
Greg Low: No, that is fine.
Erin Welker: The one thing I will say I am adamant about is preserving any information that comes from the source. It is quite possible that I might do some calculations when actually storing it in the data warehouse, or I might wait until it goes into the cube; a lot of that depends on how the data is going to be consumed. Is it going to be consumed strictly from the cube, and do we know they will never go back to the data warehouse? As I am talking, I realize I err on the side of doing it in the data warehouse. Sometimes you just need to go back there; sometimes you can't consume everything out of the cube. I prefer to err on the side of the cube being absolutely consistent with the warehouse.
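One way to keep such a calculation in the underlying warehouse, in the spirit of what is being discussed, is a persisted computed column, so the cube and anyone drilling down see the same number. This sketch assumes hypothetical Quantity, UnitPrice, DiscountPercent, and TaxRate columns on the fact table:

-- The extended price is derived from the preserved source values and stored on disk.
ALTER TABLE dbo.FactSales ADD
    ExtendedPrice AS (Quantity * UnitPrice * (1 - DiscountPercent) * (1 + TaxRate)) PERSISTED;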
Greg Low: Look, it is an interesting point. When you are talking about calculations that could be done either in the cube or in the dimensional model underneath, I would tend myself to always do it in the thing underneath. Part of the reason I think that way is that invariably, if I present something from an analytical model and users drill down into the underlying data, it is really confusing for them if they can't find the value.
Erin Welker: It is a rare case, in my experience when somebody doesn’t ever go back to the underlying model.
Greg Low: Yes.
Erin Welker: Even if they could strictly consume out of the cubes, I always want, in that case, to have the same information in both sources.
Greg Low: Yes, it is the same argument. I tend to try to avoid renaming things too much inside, say, a data source view in a cube. I would rather do it in the underlying model if I can, because again I don't want to confuse someone who drills down to the underlying data.
Erin Welker: Yes good point.
Greg Low: Indeed. Do you get a lot of cases where you need to blow out the grain? I mean, we talked about potentially summarizing the data, but what about pushing the data the other way? The most common case is budgets: things where, given the granularity you are handed, you need to store perhaps a lower granularity?
Erin Welker: Do you mean store a lower granularity in the data warehouse than in the cube?
Greg Low: Yes where somebody gives you a value at one level but you actually end up storing something lower than what was provided.
Erin Welker: Well, I can't say that I get that. I don't know if I have ever encountered it; searching back, it certainly doesn't ring a bell.
Greg Low: Yes, okay, that is good. Another one, I suppose, as part of that: in terms of the processing you have for the ETL work that brings the data in, I presume you are mostly working with Integration Services?
Erin Welker: Yes.
Greg Low: Excellent. How do you make the decision, in your case, if you are getting performance-related questions, between doing things as T-SQL statements as opposed to doing them as transformations within Integration Services packages?
Erin Welker: Well, I hate to say it, but again I am going to lean back on the fact that, for the most part, I know it is going to perform better in SQL Server. You can't really beat SQL Server as far as doing set-based database operations, so things like... I am trying to think of examples.
Greg Low: Oh, for me lookups are the most common thing. Again, you can run things through transformations and do lots of lookups, but I would rather have the data in a staging table and then simply run one pass of an update with a whole series of left outer joins and find all the surrogate keys that I can immediately.
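A minimal sketch of the set-based lookup Greg describes, one UPDATE over the staging table with left outer joins resolving all the surrogate keys in a single pass; the names are hypothetical:

UPDATE s
SET s.ProductKey  = COALESCE(p.ProductKey, -1),   -- -1 = unknown member
    s.CustomerKey = COALESCE(c.CustomerKey, -1),
    s.DateKey     = COALESCE(d.DateKey, -1)
FROM staging.Sales AS s
LEFT JOIN dbo.DimProduct  AS p ON p.ProductID  = s.ProductID
LEFT JOIN dbo.DimCustomer AS c ON c.CustomerID = s.CustomerID
LEFT JOIN dbo.DimDate     AS d ON d.FullDate   = s.SaleDate;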
Erin Welker: One thing I tend to go back to a lot from my performance tuning days is that Profiler is by far the most underrated tool. In saying that, I know a lot of people do use it a lot. In cases like that, if you implement a lookup in SSIS and run a trace on it, you can see right away that it is doing a lookup to the database for every single row, versus doing it in a database join.
Of course you can see it in the elapsed time, which is certainly a lot easier, but I am one of those people that likes to peer under the covers and see what is going on, so I like Profiler for that.
Greg Low: Yes, absolutely it does. So listen, apart from lookups, something we mentioned before when loading up dimensions: you tend to deal with different types of slowly changing dimensions. How do you make your decisions about what should be a changing attribute versus a historical attribute, Type 1, Type 2, Type 0, or a fixed attribute? I don't know if you have come across those things. How do you make a judgment as to what should cause versioning of rows?
Erin Welker: I don't, and when I say that, I mean that to me it is absolutely a business decision.
Greg Low: Yes I’m so glad to hear you say that.
Erin Welker: A caveat is that you have to be careful with how you frame it to the user, because if you explain it to them, their initial reaction is that they want everything to be Type II, version everything. That can always be challenging, but again, it is like the grain question: it is better to err on the side of having too much historical information than not having enough. When in doubt I will make it a Type II, meaning I know it changed on a date, and it used to be this and now it is that.
Greg Low: Yes, one of the things that people are always horribly critical of is, of course, the Slowly Changing Dimension Wizard in Integration Services. I was wondering how you implement your slowly changing dimensions.
Erin Welker: As you mentioned earlier, I usually use a staging table; I usually get into SQL Server and take care of a few things there. I have used the Slowly Changing Dimension Wizard in the past, usually only when the dimension is very, very small. Sometimes I have used it as a starting point, generate the tasks for me, and I will deal with it from there. My preferred approach of late is to go straight to SQL Server and take care of it there.
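A hedged sketch of a Type II change handled in T-SQL, in the spirit of the approach Erin describes; it reuses the hypothetical DimEmployee and staging shapes sketched earlier:

-- Expire the current row when a Type II attribute has changed.
UPDATE d
SET d.EffectiveTo = CAST(SYSDATETIME() AS date),
    d.IsCurrent = 0
FROM dbo.DimEmployee AS d
JOIN staging.Employee AS s ON s.EmployeeID = d.EmployeeID
WHERE d.IsCurrent = 1
  AND s.Department <> d.Department;      -- the Type II attribute

-- Insert a new current version for any member without a current row
-- (covers both just-expired members and brand-new ones).
INSERT dbo.DimEmployee (EmployeeID, EmployeeName, Department, EffectiveFrom, EffectiveTo, IsCurrent)
SELECT s.EmployeeID, s.EmployeeName, s.Department, CAST(SYSDATETIME() AS date), NULL, 1
FROM staging.Employee AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.DimEmployee AS d
                  WHERE d.EmployeeID = s.EmployeeID AND d.IsCurrent = 1);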
Greg Low: Yes, that is interesting to hear you say. Another one I do see people using, up at CodePlex, is the Kimball SCD, or the Dimension Merge SCD component as they call it, which was basically a community effort, and it is certainly dramatically better than the one that is in the product. I note the guys at Pragmatic Works were looking after that, and they have their commercial version as well. It is certainly head and shoulders above the one in the box, but for the vast majority of that work I have a strong preference for doing it in T-SQL wherever possible. What about inferred members? Do you tend to deal a lot with inferred members?
Erin Welker: No, not really. I am trying to think of the last time I did that, and I really can't.
Greg Low: It is one I tend to use quite a bit, actually. I try to design the ETL process so that no matter what order you load anything in, you basically end up with the same outcome. Typically I try to load dimensions, then the facts, and so on, but the world is never quite perfect.
Erin Welker: Yes that’s true.
Greg Low: The systems that the data comes from are not always one hundred percent good either. You get the whole issue of: if I have a sale arrive, and I look up the product and there is no such product, the question is, what do I do next?
Erin Welker: Wow, well, okay, I do deal with that on occasion. I usually throw it out as an exception or put it into an unknown member bucket.
Greg Low: Not necessarily divert the row; I suppose I see different people take different approaches to this. As I said, if I get a sale in and there is no such product when I look it up, I see some people who will push all those rows to one side to deal with later. I think that is bad, because if a lookup in one dimension fails, it means analysis by any dimension doesn't have the data. I think that is nasty, and you have to go back and deal with it somehow.
The second approach I see people take is where they remap everything to an unknown member or something like that, and the concern I have with that one is that even though the analysis is fine, invariably the real thing turns up later, the product that should have been there in the first place, and then you have to work out which rows were mapped off to the unknown member so you can map them back, and so on. It is hard to regain that initial state. I think it is kind of nice that if I have a sale that comes in for a product and there is no such product, I would tend to infer that there must have been a product, basically fill in what I can about it, and set a flag that says, hey, this thing was inferred in the first place. At least when the real thing turns up I can just overwrite it and turn it into the correct one. I do have a strong preference for making it so I can run things in whatever order I need.
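A hedged sketch of the inferred member pattern Greg describes, assuming a hypothetical IsInferred flag on the product dimension from earlier:

-- Create a flagged placeholder row for any product seen in the facts but not yet in the dimension.
INSERT dbo.DimProduct (ProductID, ProductName, SubcategoryName, CategoryName, IsInferred)
SELECT DISTINCT s.ProductID, N'Unknown (inferred)', N'Unknown', N'Unknown', 1
FROM staging.Sales AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.DimProduct AS d WHERE d.ProductID = s.ProductID);

-- Later, when the real product arrives, the normal dimension load overwrites the placeholder.
UPDATE d
SET d.ProductName = s.Name,
    d.SubcategoryName = s.Subcategory,
    d.CategoryName = s.Category,
    d.IsInferred = 0
FROM dbo.DimProduct AS d
JOIN src.Product AS s ON s.ProductID = d.ProductID
WHERE d.IsInferred = 1;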
Erin Welker: Yes, I have a couple of responses to that. There are certain things over the years, I have done things in various ways and learned lessons, as I hope we all have. I absolutely agree with you on the exceptions, because the idea is that the users have got to come back and we have got to work through these exception piles and get them back into the fact tables, and invariably that never happens.
Greg Low: Yes.
Erin Welker: That is a well-meaning way of dealing with it.
Greg Low: Yes but it is not going to happen.
Erin Welker: But typically it has not worked well in my experience, and I have tried it. The thing is, at one of my prior clients the data was so bad; for example, take the state column, which should be an absolute set of values. Again, sometimes the data is not coming from your client but from some external source, so do you create a member for each bad value? I guess it depends.
Greg Low: Of course, that is the downside. If I have data that is really poor quality coming in, I don't want to end up creating members for all of these values that aren't actually valid values. So the second thing is, I think you have to have some sort of monitoring, so that you know this is happening. Otherwise you could silently hide the fact that the data is nonsense.
Erin Welker: Yes.
Greg Low: Yes, until somebody starts looking at the analysis. It is a real challenge, and it does concern me. In a perfectly strict world where everything happens in the right order, I would actually rather see it just go bang in the ETL, and you fix it and continue on; I would actually prefer that. But I think sometimes you try to have something a little more flexible than that, and if you do, you have to have some way of monitoring it.
Erin Welker: Yes.
Greg Low: Yes, it does trouble me that you could go off and create a whole lot of things that will be another mess to deal with, even though you are trying to do the right thing. I think there are challenges there. It is a bit like a friend of mine who I remember spent an entire weekend loading up data using the old DTS, many moons back, and he thought he had his dates in Australian format, but he had them in US date format, and he didn't get a single error on the entire load. What DTS and VB and a lot of those languages used to do is, if you converted something to a date, it would take it as if it was fine, but if the month and day were around the wrong way it would just silently swap them for you. In his case it was the other way around, and what actually happened is he ended up with two thirds of his dates with the month and day around the wrong way, because this was silently being fixed for him on the way in, and of course you don't notice it straight away. It was two or three days into production when people started to say, hang on, that is not when that actually happened. But of course, then you are trying to work out which dates are real and which are not, after people have worked on the data for a few days. That is like, wow! If he had been given a choice between that and the fourth row of the initial load just going bang, let me tell you, he would take it going bang every time.
Erin Welker: Yes I agree.
Greg Low: Another one, in terms of lookups: I just wonder if you have ever used any of the fuzzy components in there?
Erin Welker: I have not. I played with them when they initially came out, just to get a feel. Maybe they need to rename them? Nobody likes the idea of a fuzzy lookup; in my experience, we wanted it to be an exact match.
Greg Low: So I suppose that is one of those challenges, yes, that's right. Where I was sort of heading, I suppose, is trying to help with things like de-duplication, and I think that is a really, really difficult problem. So, in the organizations you are working with, is de-duplication a challenge?
Erin Welker: It has been. I know it has; I am trying to think back to the last time I dealt with it. Even the last place I spent a very long time at had some primary keys on their source systems to prevent that from happening. I think there was a place where I encountered it.
Greg Low: Yes, I was thinking more at a higher level of the data. A good example of the challenge is that so many organizations each month would send me five copies of the same newsletter, or the same whatever; they simply can't work out that I am one person.
Erin Welker: Oh, I see what you are saying.
Greg Low: Yes, I see those transformations as being useful in trying to help resolve some of those things. Maybe I have data that doesn't match up and I am trying to find things that could possibly be related to it. But I suppose my message is that they are not an end in themselves; they might be one tiny little component in how you have to deal with a bigger problem.
Erin Welker: Well, again this kind of goes back to encountering something earlier in your career that was problematic; I find I tend to go back to that. My first career was at a hospital system, and duplication was absolutely a problem there: you would have patients in the system 10 or 20 times. I actually worked on a patient consolidation, and we worked and worked to find combinations of things, phonetic searches and so on, and ultimately the thought of potentially combining two patients that were in fact not the same was so horrific that we erred way on the other side. Ultimately, I believe, we presented an application to the business users: we think these two patients are in fact the same person, you decide. You consolidate them!
Greg Low: I think the challenge in a lot of organizations is that there are humans there who could do quite a good job of that, but usually I find they can't explain what they just did. They might look at two sets of patient details and immediately apply an enormous number of little algorithms in their head to decide whether it is the same person, but they could never tell you what they just did.
Erin Welker: Yes, I love the way you put that. That is why I would like to think we as humans will never be replaced completely. But we may just be kidding ourselves!
Greg Low: Actually, one approach that had me totally fascinated: I got to go to TechFest a couple of years back. It is an internal Microsoft event where they take the people from Microsoft Research and put them in the conference center in little booths, and they get people from the product groups and mix them all together, and it is an interesting thing. The people in the booths are not used to it. A lot of these people, sorry to be harsh, are researchers who almost live in a cupboard and have pizza slid under the door; they are not used to being in a conference booth, and you could just see that a lot of them felt uncomfortable being there. But it was fascinating to watch, because the product people would come in and say, why don't you try this and that, and conversely the researchers would be saying, oh heck, I could use that, that looks really interesting.
One that had me totally fascinated was a guy there using spreadsheets to build Integration Services packages for de-duplication. What he was doing was putting two sets of details about a customer side by side, and he would get a human to go through, yes, no, yes, no, and so on down the list, and when the human finished, he would read that in and spit out a package that would have come up with the same answers.
Erin Welker: Wow!
Greg Low: I thought that was an insanely interesting approach, sort of a direction for maybe doing some of these things: the idea that you may have some sort of human training, and of course the absolute magic is how you turn that into a package. The idea that if you found something it got wrong, you could just add another couple of rows, tick and cross, and then regenerate the package; that is a fascinating direction to head down, I thought.
Erin Welker: Yes, yes. My brain has kind of gone off in another direction: potentially using data mining to say, okay, I consolidated these two but I didn't consolidate these two, what is the difference?
Greg Low: Yes, what is the difference? I think the whole de-duplication area is an intractably difficult problem for a lot of people. If any of the listeners have an interest in finding an interesting career, if you can make inroads into that, it is such a global problem. It is amazing!
Erin Welker: You heard it from Greg Low. He has created a whole new profession.
Greg Low: Yes, indeed. We are getting up to about time, Erin, but where will people come across you or see you? What do you have coming up?
Erin Welker: I don’t plan any travel any time soon and I do hope to go to the PASS summit in Charlotte in October.
Greg Low: Yes, a change of venue this year; instead of Seattle they are off to Charlotte.
Erin Welker: Yes, I know PASS likes to mix it up every now and then. There is obviously a very good reason for holding it in Seattle as often as they do, but it helps some people geographically to go to another part of the country. I am not sure what it does for you; is it better or is it worse?
Greg Low: It is probably slightly worse, but not a big deal. By the time…
Erin Welker: What is another couple of hours, huh?
Greg Low: …we hit LA, whether I go north to Seattle or across to Charlotte, it is not really the end of the world either way. It is a relatively short flight no matter where it is, compared to where we are coming from. When it used to move around a lot, one of the things I really did like was that it was actually really interesting to see the different cities. I had a ball even in Denver, for example, when the summit was there; I still very much recollect the big blue bear standing at the front.
Erin Welker: I am glad you said that because that was my very first image.
Greg Low: It is wonderful seeing all these places, so I think that was kind of good. I totally get the whole Seattle thing too, because it is so valuable to have a large number of the Microsoft staff there, and of course that is where that is going to happen. It is a complete trade-off; I think a mixture of the two is interesting, and it is interesting to get across the country as well. But for me that is simply an outsider coming in and getting to see more of the country.
Erin Welker: Yes I assume when you come you combine various trips to make it a little bit longer.
Greg Low: Yes, sometimes that is the case, but I must admit there are times when I have gone over for just two days or something.
Erin Welker: Wow! My hat is off to you!
Greg Low: Yes, I have had some really, really long days, and it is funny, the length of some of the flights we do; sometimes we start in the wrong city. I remember having to go to Alicante in Spain one day, starting in Canberra, and it was just horrible. I had to fly from Canberra to Melbourne to Hong Kong to London to Madrid to Alicante. It was like, wow, is this day ever going to end? That does happen a bit. Yes, it is a long way to go for just a two-day event or things like that. It is a mighty big trip, but it is great; as I said, it is a wonderful thing to get together. I encourage people who haven't been to one of the summits to come; I think it is clearly the main SQL Server event that happens in the year. It is a wonderful chance to meet just about anybody, really.
Erin Welker: Yes, we could probably both go off the deep end and talk about the SQL Server community. There is no community like it; it is the only one I have been involved with, but I talk with other people, and it is just a special community to be part of. So that is where we can meet up with all our friends.
Greg Low: Another one that struck me just this week, as a final note on the SQL things: I had to do some work with DB2 again last week, and for all the criticisms I hear people make of the documentation and so on in SQL Server, we have it so good; we just have it so very, very good compared to what I see in other products.
Erin Welker: Yes, it has been so long since I have worked with another product that I guess you forget how good you have it until you use something else.
Greg Low: For example, navigating around: I had to download some OLEDB providers, and it was an interesting challenge trying to find the OLEDB providers on the IBM site compared to finding them on the Microsoft site. Let me tell you, there is no comparison in terms of what is involved in those two exercises. It is just amazing! So I take my hat off to all the people doing the documentation work; it is kind of an unsung-hero role, and yes, there are issues, but overall I have not struck another product that has the level of quality of documentation that is there for SQL Server.
Erin Welker: The community contribution as well. Of course, having been around for so long, I remember the days when the only thing I had to rely upon was not SQL Books Online, because they didn't have it, but the hardbound documentation, which, if I had to print it off now, I can't imagine how much space it would take. We have so many good contributors out there who are willing to share the things they learn with the rest of us. I can't remember the last time I had a question that I couldn't answer easily just by Googling.
Greg Low: Magic, listen thanks very much for your time today Erin, that is wonderful!
Erin Welker: Absolutely I so thoroughly enjoyed it.
Greg Low: That’s good and I look forward to seeing you again soon.
Erin Welker: Okay.
Greg Low: Thanks.