Gadi Peleg is the Senior Program Manager on the Data Quality Services team, where his most recent work has been on data matching, knowledge-based and data-driven quality, and user experience. Gadi is now involved with the next generation of Data Quality Services. He joined Microsoft in August 2008 with the acquisition of Zoomix Data Mastering Limited, where he served as VP of Professional Services.
Greg Low: Introducing Show 49 with guest Gadi Peleg.
Welcome, our guest today is Gadi Peleg. He is the Senior Program Manager on the Data Quality Services team, where his most recent work has been on data matching, knowledge-based and data-driven quality, and user experience. Gadi is now involved with the next generation of Data Quality Services. He joined Microsoft in August 2008 with the acquisition of Zoomix Data Mastering Limited, where he served as VP of Professional Services. So welcome, Gadi!
Gadi Peleg: Thank you, Greg, thank you for having me on the show.
Greg Low: Excellent. What I get you to do, the first time anybody is on the show, is explain how on earth you ever came to be involved with SQL Server.
Gadi Peleg: That was interesting because, as you mentioned, I joined Microsoft in 2008 with the acquisition of Zoomix. That was basically the first time I was exposed to anything related to data quality, the quality aspect of data. Zoomix was a small startup based in Jerusalem, focused mainly on data quality for product data. We didn't really have means of cleansing data per se, but we had very, very good tools, based on machine learning, for standardizing and restructuring product descriptions and also classifying products, so we had a hybrid of a Master Data Management and Data Quality tool. For Microsoft it was very, very interesting because, again, it was machine-learning based, a tool that basically minimized the labor for data stewards to normalize and standardize product descriptions. This is basically how I joined Microsoft.
Greg Low: What sort of size company was Zoomix?
Gadi Peleg: Zoomix was a very small company, about 20-25 people. We had many customers in Europe, and we had a large customer in the US. It was a very unique tool, based on very unique algorithms, and again, these algorithms made it possible to reduce the time and labor of working on databases, making databases better.
Greg Low: Great, and given this is the first show we have done on Data Quality Services, maybe we should start at the beginning. Why should we be interested in it?
Gadi Peleg: Well, I think when you are trying to define Data Quality, you have to connect Data Quality with business. When you look at Data Quality, it is basically how good the data is for business usage. So when you talk about business, you obviously talk about the quality of the business, and everybody knows that Data Quality impacts business directly. In the US alone, companies lose more than $600 million a year just because of data quality issues. So it impacts many aspects of businesses, like marketing campaigns, billing, and many more.
Greg Low: Just some examples of the sort of quality issues that you are thinking of?
Gadi Peleg: Data Quality can be measured using well-defined data quality dimensions like accuracy, consistency, completeness, and duplication. I can give you examples of each one. Accuracy is how semantically accurate the data is. Consistency is about having the same values across all the data sets, in the same form of representation. Completeness is whether your data is missing values; for example, if you have an address record and it is missing a zip code, then obviously someone will have to complete this value. Duplication is about duplicates within your data, so if you have a list of customers and some customers are duplicated, then obviously you start sending duplicate mail to the same customer. Basically these dimensions help users to understand the quality of the data and become proactive about cleansing this data, to prepare it for business and for BI.
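[Editor's note: two of the dimensions Gadi lists, completeness and duplication, can be sketched in a few lines of Python. This is purely illustrative; the record layout and field names are assumptions, not anything from DQS itself.]

```python
# Illustrative sketch of measuring completeness and duplication
# over a small list of customer records (field names are assumed).

def completeness(records, field):
    """Fraction of records with a non-missing value for the field."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def duplicate_count(records, key_fields):
    """Number of records beyond the first occurrence of each key."""
    seen = set()
    dupes = 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes

customers = [
    {"name": "Ann Lee", "zip": "98052"},
    {"name": "Ann Lee", "zip": "98052"},   # duplicate customer
    {"name": "Bob Roy", "zip": ""},        # missing zip code
]
print(completeness(customers, "zip"))          # 2 of 3 zips present
print(duplicate_count(customers, ("name",)))   # 1 duplicate
```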
Greg Low: I suppose one of the questions that comes up then is: what sort of measures do you see people putting in place to assess how accurate their data is, or how high quality it is? I am presuming one of the things you need to know is how bad things are, and if you are trying to fix things, you need some sort of measure of how successful you have been in fixing them.
Gadi Peleg: That's true, and that's an excellent question, because it connects directly to Data Quality Services and the solutions that we provide within SQL Server 2012. So how do you measure accuracy? You measure accuracy against a reference set of data, so you know how your data deviates from that reference data. There are many solutions that provide these reference lists, or lexicons; every company has its own terminology. Data Quality Services came up with a very unique solution called knowledge bases, where every user can create their own knowledge base, or knowledge bases, to represent different kinds of data. For example, if you have HR data, product data, or any other type of data, you can capture the semantics of this data in the form of data domains in a knowledge base. Let's assume you have an employee data set. You can create a knowledge base with different data domains in it and capture semantics about your data in the form of values. For example, you can create synonyms between cities, so NYC and New York would be synonyms.
Greg Low: Actually, I should probably get you to describe the meaning of domain in terms of the product, the way you are using that term.
Gadi Peleg: Sure. A domain is an area of data that captures all kinds of rules and values used to correct a specific type of data. If you relate it to a table or a file, it is basically a column. So I can take a column and map it to a domain that belongs to the same kind of data; obviously I would cleanse a country column with a country domain. I have to create a domain that matches the column's type of data, so this domain can keep the values and rules the user adds to it. Think of a domain as an object, an element within a knowledge base, that contains all these rules and values and the relations between values, and think of it also as a quality policy. Once you create a domain within a knowledge base, it persists there and you just maintain it. If you set New York as the leading value and NYC as a synonym, it will remain like that for all the data sets it cleanses, so you always cleanse to New York: every time you see NYC, you just replace it with New York. That's how you keep consistency in the data.
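[Editor's note: the leading-value and synonym mechanism Gadi describes can be sketched roughly as below. The dictionary and function are illustrative assumptions, not the DQS API; a real domain also carries rules, not just a value map.]

```python
# Sketch: a domain persists a mapping from synonyms to a leading value,
# and cleansing replaces every synonym with that leading value.

city_domain = {
    "NYC": "New York",
    "N.Y.C.": "New York",
    "Seattle": "Seattle",   # already the leading value
}

def cleanse(value, domain):
    # Unknown values pass through unchanged for a steward to review.
    return domain.get(value, value)

print([cleanse(v, city_domain) for v in ["NYC", "Seattle", "Redmond"]])
# ['New York', 'Seattle', 'Redmond']
```

Because the domain persists in the knowledge base, the same mapping is applied to every data set cleansed with it, which is what keeps the output consistent over time.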
Greg Low: Actually, one of the things I was going to ask about was who you imagine maintaining that in the organization, because one of the things I find when I talk to people about Data Quality Services is that if you look at something like matching values or substitutions of values, people say, yes, I can just build a transformation to do that. I am imagining that part of the power of these sorts of tools is that I can build the infrastructure that applies those transformations, but somebody with knowledge of the domain itself can be the one who modifies the substitutions that will be made.
Gadi Peleg: That's true, Greg. Basically we can identify two personas working with Data Quality Services: one is a domain expert; the other is a user without this knowledge, who basically uses the policy, the knowledge base, to cleanse the data without even having expertise in a specific data domain. The person that creates the knowledge is usually a domain expert, someone that understands SKUs, part numbers, or even addresses, cities, and countries of the world, and by building this knowledge you basically provide the tool, the semantics, for the user to cleanse the data.
Greg Low: Is the thinking that, for the person or user who has that knowledge, the client tool for Data Quality Services is the actual tool you imagine they will use for doing that?
Gadi Peleg: Well, that's another great question. I imagine the Data Quality client tool as a tool for creating the knowledge, and it really depends on the way the organization operates, or wherever the organization is actually doing Data Quality. You can do Data Quality while you move data from the layer where you acquire data sources downstream to the data warehouse. Or you can do a project type of Data Quality, where you basically take chunks of data out of the Master Data Management tool, cleanse them, and put them back into MDM, or you work with the data warehouse. So it really depends where you want to do the Data Quality. Going forward, we want to provide an Excel experience for the cleansing, which will allow information workers to easily cleanse data. Imagine that in Excel you can select your cleansing agent, which is the knowledge base, or even an external provider of knowledge, which I will explain soon. In almost no time, you just click a button and you cleanse your data. This is something we are already working on.
Greg Low: That is a really interesting idea. If I look at something like Master Data Services in SQL Server, one of the things I absolutely struggled with in 2008 R2 was the user interface, and when you are dealing with those sorts of lists of things, Excel seems like a very natural interface for that sort of work.
Gadi Peleg: It's very natural; it's basically the most popular tool for managing data. It already has capabilities like sorting, functions, anything that you can reuse, and having Data Quality Services integrated with the MDS add-in for Excel is almost the ultimate solution, because you bring data from MDS, you match it with DQS, and you publish it back to MDS. Going forward you will have those capabilities in Excel, so the integration will be much better there.
Greg Low: I thought in SQL Server 2012 the add-in for Excel for Master Data Services took it from an awkward user interface to something quite pleasant, actually very good. So that sort of integration with Data Quality Services in the future would be excellent. How do you see the general process? What's the process, from start to finish, when somebody starts to implement Data Quality Services?
Gadi Peleg: So it starts with the knowledge base. In order to cleanse the data, or even to match the data, you have to have knowledge about the data. Algorithms by themselves, the similarity algorithms, can't actually cleanse the data; you have to cleanse the data against something. Basically you start by creating knowledge from your data sources; you may have different data sources, so you create knowledge bases according to your data sources. You test your knowledge bases using a cleansing project through the DQS client application, and once you know that a knowledge base is comprehensive enough and covers your data, you can put it in production. You can even add automation by using the DQS cleansing component in SSIS, which basically connects to that knowledge base and cleanses the data on the fly. So I see a development phase for the knowledge, then usage by either projects or automation through SSIS.
Greg Low: Indeed. So in terms of that sort of knowledge base, would you see the typical starting point being data that is sitting in some existing table somewhere?
Gadi Peleg: Actually, yes, although there may be another layer before it, which is to discover your data: see where your data is and analyse it before you do anything with it. Some kind of profiling, and you can do that also with DQS when you start creating a knowledge base. We have a profiler embedded in the DQS client, showing you information about the different processes. So when you do knowledge discovery, which is a process that acquires data from your sources into the data domains, we provide profiling information about your data, like the level of completeness and uniqueness, how many values populated the domain, and how many values are invalid based on the domain rules that you created. There is all kinds of profiling information that you can use to assess the quality of your data. You start by pinpointing an area that you want to start cleansing, you create a knowledge base for that, and you work your way up towards the cleansing.
Greg Low: In terms of missing data, if we start there, is that just measuring whether the data is NULL or not, whether there is some value there? Or is it richer than that?
Gadi Peleg: Well, it's richer than that, but it starts by validating NULLs. In a domain you can basically define NULL equivalency. It's very common that users, or IT, put in all kinds of NULL-like values: 999, TBDs, others. You can basically link them to a DQS NULL; that's a value we have in every domain, and this way we can measure completeness much better. So that provides the validation.
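[Editor's note: the NULL-equivalency idea can be sketched like this. The set of NULL-equivalent values below is an illustrative assumption; in DQS you would define these per domain.]

```python
# Sketch: placeholder values like "9999" or "TBD" are linked to NULL,
# so completeness is measured honestly rather than over fake values.

NULL_EQUIVALENTS = {"", "9999", "999", "TBD", "N/A", "UNKNOWN"}

def normalize_null(value):
    """Map NULL-equivalent placeholders to a real None."""
    if value is None or str(value).strip().upper() in NULL_EQUIVALENTS:
        return None
    return value

zips = ["98052", "9999", "", "TBD", "2000"]
cleaned = [normalize_null(z) for z in zips]
print(cleaned)   # ['98052', None, None, None, '2000']
```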
Greg Low: That's very good; that's exactly the sort of thing I often see. Somebody will be loading up some data as part of an ETL process, and maybe it requires some column that isn't present in the source data but is required in the database, maybe a postal code or zip code, and so they will literally shove 9999s into the data. So there is data there, but it is actually the same as if there wasn't data there.
Gadi Peleg: That is exactly true, and we now have a good means of measuring this with NULL equivalency, so that is one aspect of NULLs. Another one is enrichment. Today we have one way of enriching data, by creating a composite domain, which is basically a container that holds individual domains belonging to the same area. Let me explain this; it is a topic that is very hard to explain, but let me try. Think of an address: it is a composite of different domains. An address contains city, street, zip code and so forth, so think of each column as a domain by itself, but if you look at an address, it's one object. You can't really separate the street from the city or the zip code, so it's a composite type of domain. Looking at the process of creating a knowledge base, you create these individual domains for each column you have, then you link them into a composite domain. Once you have a composite domain, you create composite domain rules saying, for example, that if the city is Seattle then the state must be Washington, so if your data says Seattle, New York, it will be automatically corrected to the correct value. This is a kind of cross-domain correction. There are other means of enriching the data; currently the other way is using external data providers. That is another powerful proposition we have: a way to connect a composite domain, or a domain in your knowledge base, to a different company altogether. So if I want to cleanse my addresses, I can send my data to Melissa Data, which is an expert in US addresses. I don't even have to create the values in the domain.
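[Editor's note: the cross-domain correction Gadi describes can be sketched as a small rule table applied across two fields of one record. The rule table and statuses are illustrative assumptions, not DQS internals.]

```python
# Sketch: a composite domain rule spanning city and state.
# If the city dictates a known state, correct the state to match.

CITY_STATE_RULES = {"Seattle": "Washington", "Portland": "Oregon"}

def apply_composite_rule(record):
    """Return a copy of the record with the state corrected if needed."""
    expected = CITY_STATE_RULES.get(record.get("city"))
    if expected and record.get("state") != expected:
        return {**record, "state": expected, "status": "Corrected"}
    return record

print(apply_composite_rule({"city": "Seattle", "state": "New York"}))
# {'city': 'Seattle', 'state': 'Washington', 'status': 'Corrected'}
```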
Greg Low: Actually, that is a very interesting one as well, as I have clients in Australia who do cleansing of addresses and so on. I must admit it intrigues me that today they work in a very batch-orientated mode. They tend to get big batches of data from their clients, they do cleansing on that, involving significant IP, and then they send it back again. I am thinking that down the track all these people need to become service providers online, rather than dealing with these sorts of things in a batch mode.
Gadi Peleg: True. This is a good example of where you shouldn't need to be the domain expert, because somebody already has a huge database and they are the experts in it. You and I don't need to create domains with all the cities in the world and all the addresses in the world; it just doesn't make sense, and they can update this information all the time. So looking at the process, the best thing to do is subscribe through the Windows Azure Marketplace to one of these services. For example, there is a service called Loqate that provides support for addresses in 140 countries, international addresses. You can connect your composite domain to it and cleanse your data in no time, without investing a second in building that knowledge in your own domain. So that is another channel of IP. I mean, you have your own IP, your organization's IP, which is about your products, people, and areas like that, and for areas like addresses you can use external IP like Melissa Data or the other services that we have on the data market.
Greg Low: What is the current charging or monetization basis for those things? When you set up the client or make a connection to the service, how does the subscription work? What is the current thinking around that sort of thing?
Gadi Peleg: The current model is that you subscribe: you open the Windows Azure data market page and you see the DQS, Data Quality Services, category there and all the offerings that belong to this category. It is a subscription-based model. You subscribe by transaction, or monthly; I don't recall the exact model, but you pay per subscription, per usage. Once you subscribe through the data market, when you open the DQS client application, all the services that you subscribed to on the data market will show up in the DQS client. For the user that uses these services, it is a very simple experience. You just use the knowledge base that is defined with these services. It's like a black box: you just connect to the knowledge base, map it to your Excel file or table, run it, then you go through an interactive step where you manage your results, and that is it basically. It's very simple.
Greg Low: Actually, one of the things: if people have that sort of intellectual property, how complicated is it to get set up as a provider?
Gadi Peleg: To be set up as a service or someone that provides it?
Greg Low: To be set up as a provider, yes, one of these providers like Melissa or somebody. How complicated is that?
Gadi Peleg: So this process is done with the data market team; there is a whole business engagement there, with SLAs. There is a test phase and so forth, a kind of onboarding experience to become a reference data provider. What a reference data provider does is maintain a big repository of data which is up to date; they are responsible for the data and they update it. They create a service around it to look up the records that users send and to send the best results back. So you have to develop some kind of service around your data.
Greg Low: I presume that would be ideal if that was hosted in Windows Azure or something like that?
Gadi Peleg: It would be ideal if you consider performance and other aspects of it. Yes, we have some thinking around that, and we will see what the future brings.
Greg Low: Yes, ok. Listen, that does raise the question of performance. In some of the early feedback, I see some people concerned about performance when they start to put a lot of data through Data Quality Services. What is your feeling as to how big an impact these sorts of things have? And when you have an external service like Melissa, what sort of throughput is even achievable? How big an impact do you see that having?
Gadi Peleg: Well, for the reference data services, they have to adhere to a performance number that we give them. For example, they have to send a response within 10 seconds; that is the limit, and it is usually much faster than that. Basically we have a strict number that they have to comply with. Regarding the DQS server itself, it will basically consume any resource you give it. The larger the box is, or if you use many boxes, it will consume basically everything. It scales linearly: the more memory and CPU you give it, the more it will use.
Greg Low: Yes.
Gadi Peleg: We improved the performance greatly in the last update, CU1.
Greg Low: In Cumulative Update 1 there was a significant change in performance?
Gadi Peleg: A significant change in performance both for cleansing, discovery and matching.
Greg Low: Interesting.
Gadi Peleg: The best practice I can give you here on the air is, if someone is using the SSIS component, it is good practice to run a few packages in parallel using the DQS cleansing component; the more you parallelize, the better the performance is.
Greg Low: So you would definitely suggest applying Cumulative Update 1 in that case.
Gadi Peleg: Definitely.
Greg Low: Ok. Listen, one thing I did want to return to a little: you started to mention composite domains. When I think about composite domains, I think about things like, maybe I have got car manufacturers and there is a valid list of those, and there is also a valid list of car models, but only certain ones can go together. Is that the sort of thing that can be dealt with in some way?
Gadi Peleg: Yes, exactly so. Basically, in the composite domain you create cross-domain rules, so for a car type of Mercedes you can say it can have only a V6 engine. So if you have a V8 engine there, it would either be corrected to V6 or marked as invalid, depending on the composite domain rule. It is a conditional type of rule, based on the domains you have in the composite domain.
Greg Low: How complicated can the rules be? How can you express the rules, or the pattern matching, or something along those lines?
Gadi Peleg: Ok, that's a good question. Let's distinguish between matching rules and cleansing rules, and talk about the cleansing rules first. In each domain you can author domain rules, which are rules that only validate the data; they don't correct the data at the domain level. At the composite domain level you can correct data, using the composite domain rules. For matching, there is an experience we call matching policies. When you edit or create a knowledge base, you have three activities you can select from. One is Domain Management; that's where you create domains and composite domains, and where you create the values. Another is Knowledge Discovery, a guided process that basically acquires knowledge from your source data and puts it in the domains. In this process we also discover syntax errors, for example spelling mistakes, based on a statistical algorithm; it is really based on the sample of data that is used for discovery. The third activity is Matching Policy, again another guided experience in the DQS client. It allows you to create matching rules and test them before you publish the knowledge base for users to consume. Basically, in this experience you define the domains you want to match on; the domains represent the columns. You define the weight you want each domain to contribute, and you define the type of similarity you want to use for each domain. We made it very simple, so you can select either exact match or similarity match, without making the user select the underlying algorithm. This tool is more for an information worker than a very technical person, so we made it very, very easy to use.
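[Editor's note: the weighted matching rule Gadi outlines can be sketched as below. The rule table, weights, threshold, and the use of `difflib` for the similarity measure are all illustrative assumptions; DQS does not expose its algorithm.]

```python
# Sketch: each domain (column) gets a weight and a comparison type,
# exact or similarity, and the weighted sum decides if two records match.
from difflib import SequenceMatcher

RULE = [
    ("name", 0.7, "similarity"),
    ("zip",  0.3, "exact"),
]

def field_score(a, b, kind):
    if kind == "exact":
        return 1.0 if a == b else 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1, r2, rule=RULE):
    return sum(w * field_score(r1[f], r2[f], kind) for f, w, kind in rule)

s = match_score({"name": "Jon Smith", "zip": "98052"},
                {"name": "John Smith", "zip": "98052"})
print(round(s, 2), s > 0.8)   # high score: likely the same customer
```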
Greg Low: That is interesting. With the matching, are there options like case-sensitive matching as opposed to case-insensitive? Or is it mostly just a fuzziness sort of thing about the matching?
Gadi Peleg: Actually, we don't consider case, because we keep the data, in a form which is not visible to the user, in one case. So the algorithm is agnostic of case and other things. It is also language agnostic. We use an algorithm that can basically match any type of data.
Greg Low: I was just sort of thinking that I can imagine rules that I might want to have that say that are to do with say capitalization of words.
Gadi Peleg: I think formatting is something that, I agree, can be useful in some cases, maybe in product data. But generally speaking, when you match data, you match on the syntactic value and not on the formatting.
Greg Low: I was just thinking it is still an aspect of quality, where I can imagine in some scenarios the casing of something could actually be the problem, rather than the value itself.
Gadi Peleg: The casing could be the problem, but since the target or goal of matching is not to check formatting but to check, again, at the syntactic level, we really don't care about it there. You can control formatting from the DQS client, though, and export your data in any type of casing you want. That includes standardization of the data.
Greg Low: What about rules where, I was just thinking, rules that look at columns that logically contain sub-columns, you know, it needs to be three letters followed by five digits, or something like that. Is there any way of building those sorts of rules, or would that just be done back in normal components in Integration Services?
Gadi Peleg: Yes, we don't go to that level. You're referring to matching, right? Just to be sure.
Greg Low: Yes, I was almost thinking about applying regular-expression-type checks or something like that to the data as well, to be able to say that the things in this column should look like a particular shape.
Gadi Peleg: Yes, we don't have that. We had some of that capability in Zoomix, using regular expressions and data patterns, because it is mostly useful for product data. That is where you see many more data patterns than in customer data. Customer data has very few data patterns; it is usually strings, with zip codes and a few other bits you can identify by regular expression. Otherwise, data patterns and regular expressions are, in my opinion, more for product data usage. The accuracy of the matching is very good. With the algorithm we have, we didn't reinvent the wheel; we are using an algorithm that probably everybody is using. It is good enough to match your data and gives high accuracy. We also treat the data: we normalize the data in memory, we remove certain characters and break it into small pieces, and that is how we match it. It is a straightforward algorithm and it provides very good accuracy.
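[Editor's note: the pre-matching treatment Gadi outlines, normalizing in memory, removing characters, and breaking values into small pieces, can be sketched as below. The token-overlap (Jaccard) measure is a stand-in assumption; the actual similarity algorithm is not named in the interview.]

```python
# Sketch: lowercase, strip punctuation, split into tokens, then
# compare the token sets. This makes matching case- and order-agnostic.
import re

def normalize(text):
    """Break a value into a set of lowercase word tokens."""
    return set(re.sub(r"[^\w\s]", " ", text.lower()).split())

def similarity(a, b):
    """Jaccard overlap of the two token sets, between 0.0 and 1.0."""
    ta, tb = normalize(a), normalize(b)
    return len(ta & tb) / len(ta | tb)

print(similarity("Smith, John (Seattle)", "john SMITH seattle"))  # 1.0
```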
Greg Low: Welcome back. So Gadi, outside SQL Server and Data Quality Services, is there a life outside?
Gadi Peleg: Yes actually there is a big life outside. I have four kids that keep me in shape.
Greg Low: Any sort of sports or hobbies that you are involved with?
Gadi Peleg: There are two areas that are like childhood hobbies for me: aviation and music. I play musical instruments like trumpet, harmonica, and guitar, and a few more.
Greg Low: That’s excellent.
Gadi Peleg: Aviation is something I grew up with. I love aeroplanes, so this is a good place to watch them. As you know, Boeing is here and you can see them all over the place.
Greg Low: I have thoroughly enjoyed trips to the Boeing factory. My favorite is, I remember them saying that they had a policy that if you break it, you buy it, which I thought was pretty good. The other one I like is that they told me that with things like a 747 there is an easy payment plan: one third deposit, one third when they paint it in your colors, and one third cash on takeoff. That was the deal. Around the US, I was going to say, that is one of my favorite things about the US: the number of good aviation places there are. Have you managed to get around to see many of those?
Gadi Peleg: No actually, I have travelled a lot in the US, I have lived in many places in the US, but I didn't have a chance to see airfields or airshows. Not yet, no.
Greg Low: The other ones I was thinking of were some of my most favorite places in the world, like the Air and Space Museum in Washington. In fact my real favorite in Washington is the annex they have built out at Dulles Airport, where they have the really big things that used to be in the city. The last time I was there they had everything from a Blackbird to the Concorde, and they also have the World War II things like Macchis and Spitfires and Zeroes and all those sorts of things. Things that you would never see anywhere else.
Gadi Peleg: Actually, there is a nice museum right here in Seattle, an aviation museum.
Greg Low: Of course, the Museum of flight.
Gadi Peleg: Museum of Flight, yes. You can see the Blackbird over there, an F-15, a Phantom, all kinds of old aeroplanes, and so forth.
Greg Low: Actually, one I got taken to that I really enjoyed in Seattle, even though there was the formal museum, was the Museum of Flight Restoration Center. I don't know if you have been to that one, but I really liked it because it is where they build and fix all the things that end up in the Museum of Flight, and you can really get into things and climb and look all over them, much more than you can in the museum itself.
Gadi Peleg: Oh I have never heard about it.
Greg Low: That's worth a look. I must admit, the other thing that I really liked in the Air and Space Museum was all the space-type things. They just had everything in Washington that you have ever seen, particularly the old Apollo things that I grew up watching: the Lunar Lander, the little buggies. To somebody who grew up watching it on TV, it was almost a dream world; to be able to see the actual things was truly breathtaking. They also had the whole thing from when they did the US-Soviet link-up in space, and they have got all the spacecraft sitting there, and they had Skylab, because they built two of them and the first one worked, so they put the other one in there. I mean, that whole museum, there is nothing like it.
Gadi Peleg: Yes, it's breathtaking.
Greg Low: I have been in a few other countries where there are some very interesting ones. I know the Deutsches Museum in Germany certainly has some amazing aviation things. My wife has some relatives in Houston, and it is always interesting to go and visit the NASA facilities down there, or Cape Canaveral. I tend to bore my wife with that, but my favorite thing down there was a thing called Rocket Park. During the Apollo program they built the rocket for Apollo 18 but never launched it. The entire rocket is sitting in a big shed, sort of lying on its side and pulled apart in sections.
Gadi Peleg: Wow.
Greg Low: So it's interesting: the tour we were on stopped there for about 15 or 20 minutes, but I was there for like 3 hours, just walking around the rocket. You just never see something like that anywhere else. It was extraordinary.
Gadi Peleg: It's really once in a lifetime. Yes, it's truly amazing.
Greg Low: I think around the country there are just so many rich things to go and look at. If you have an interest in aviation, it’s a good place to be.
Gadi Peleg: We have a very, very nice museum in Israel, which is where I grew up. If one day you have a chance, I truly recommend it; it is in the south of Israel. It is very nice, you can see aeroplanes with the history of the Israeli Air Force, and it is really, really interesting.
Greg Low: Well, that's on my list of things to do. Listen, one of the questions I had as well is: which editions of SQL Server does Data Quality Services ship with? Is it only Enterprise edition, or is there anything in the lower editions?
Gadi Peleg: It appears in the BI edition and Enterprise.
Greg Low: Are there any restrictions in BI compared to Enterprise, or is it exactly the same in both editions?
Gadi Peleg: I don't recall the differences. I believe that Enterprise has the full set and BI focuses on the BI tools, and I think it also has something to do with the memory consumption and CPU you can use in each edition; I don't recall the settings.
Greg Low: So it might be higher performing in the Enterprise edition.
Gadi Peleg: Yes, Yes.
Greg Low: That's good. Well, listen, when you go to integrate this with Integration Services, there is now a DQS component, the DQS Cleansing component, that sits inside there for connecting to Data Quality Services. With that component, are there any sort of best practices? Or maybe we should first just talk about what goes into that component and what comes out of it?
Gadi Peleg: Ok, so we have two client components. One is the DQS client, which basically enables you to create and maintain a knowledge base and run cleansing projects and matching projects.
Greg Low: We should say to people, that's just a Windows application, a client UI for working with the projects.
Gadi Peleg: It is installed through the SQL Server setup. The other client component is the DQS Cleansing component in SSIS, which has a very simple UI and basically allows you to connect to the DQS server instance, select a knowledge base, and map your input columns to the domains in the knowledge base, and that's it. Once you run it, it uses the knowledge in the knowledge base to cleanse your data.
Greg Low: In terms of the output side of that component, what information do you get back out the other end?
Gadi Peleg: Ok, so the output of the cleansing component in SSIS and the DQS client is essentially identical. The nice thing about the SSIS component is that you can channel the different outputs to different destinations. We mark data as invalid, corrected, or suggested (values we didn't automatically correct because they didn't reach a certain threshold). So you can channel different outputs to different destinations: take the corrected data, say, and push it directly into an MDS system, and take the suggested records, or the new records that the knowledge base didn't recognize, to a different destination and have someone look at them. And you know what the nice thing is about the DQS client application? Once you run an SSIS package, you can open the project with the DQS client and look at the results.
Greg Low: So you see that as an iterative process, where the knowledge base just keeps getting better and better over a period of time.
Gadi Peleg: Yes, because once you open an SSIS package, which appears as a project, from inside the DQS client, whatever you approve in the cleansing project you can feed back into the knowledge base to expand the knowledge. So the next time you run the same SSIS package, it is already using the enhanced knowledge base.
Greg Low: The way you split that out coming out of the component, is there a column that indicates whether a record was new or corrected, so you could then use a Conditional Split or something?
Gadi Peleg: Yes, there is a column called Record Status that indicates whether the record was correct, invalid, and so forth. Based on that column you add a Conditional Split transformation and split the table out to different destinations.
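The routing Gadi describes can be sketched outside SSIS as well. Here is a minimal Python sketch of splitting cleansing output on its Record Status column; the status values follow the DQS cleansing statuses (Correct, Corrected, Suggested, New, Invalid), and the record data itself is hypothetical:

```python
from collections import defaultdict

def split_by_status(records):
    """Route cleansed records into per-status buckets, mimicking an
    SSIS Conditional Split on the DQS Record Status column."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["Record Status"]].append(rec)
    return buckets

# Hypothetical cleansing output rows.
rows = [
    {"City": "Seattle",  "Record Status": "Correct"},
    {"City": "Seatle",   "Record Status": "Corrected"},
    {"City": "Sea-ttle", "Record Status": "Suggested"},
    {"City": "Xyzzy",    "Record Status": "New"},
]

buckets = split_by_status(rows)
# e.g. push buckets["Corrected"] straight on to MDS, and send
# buckets["Suggested"] and buckets["New"] to a data steward for review.
```

In the real SSIS data flow the same decision is expressed declaratively in the Conditional Split editor rather than in code.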
Greg Low: Yes. You also mentioned that the product is installed by selecting it as one of the options during SQL Server installation. I notice you get the Data Quality client, but my recollection is it doesn't actually set up the Data Quality databases at that point. So there is a further stage you need to go through?
Gadi Peleg: Yes, that's true. You install the DQS server on the server machine and the client on whatever machine you want to use. But once you install the DQS server and setup is done, you have to run an executable called DQSInstaller. This installer creates the three databases that belong to the DQS server and registers the assemblies that the server uses. You have to run this installer.
Greg Low: When you apply Cumulative Update 1, does that require anything to be run in addition, or is it taken care of as part of the Cumulative Update?
Gadi Peleg: That's an excellent question. You have to run the installer again, but with a flag that upgrades the databases. The syntax is DQSInstaller.exe -upgrade.
Greg Low: Yes.
Gadi Peleg: You have to run it with that flag. All the instructions are on MSDN, so it is approachable.
Greg Low: Now one of the things I noticed also, that recently there was a whitepaper published as well on Data Quality Services Performance Best Practices.
Gadi Peleg: That's true. We published two whitepapers: one for the RTM version and one for the CU1 version. The CU1 whitepaper basically compares the new performance numbers to the RTM numbers. I would recommend going through that whitepaper to understand the new performance characteristics better.
Greg Low: I did notice in that whitepaper there were also some recommendations about how to structure projects for ongoing work. I remember, for example, suggestions around language and geography based specifications and so on.
Gadi Peleg: I don't recall that, but I remember there were some best practices, one of which sticks with me. I don't know if it is in this whitepaper, but for example, when you create matching rules, try to use the Prerequisite flag on the geographical domains like city and state, because the level of uniqueness is usually very low there. It doesn't make sense to use them for fuzzy matching if your data is clean. This flag greatly boosts matching performance; it is set on the domains that you add to your matching rules.
Greg Low: Yes. You also mentioned fuzzy matching there. Is that very much the same algorithm that was already available in Integration Services? Is it very similar to the fuzzy components that were in there, or is it a different algorithm?
Gadi Peleg: It is very similar. The big difference between the fuzzy grouping in SSIS and our matching is that we call ours semantic matching, because it uses the knowledge base. The more knowledge you have in your knowledge base, the faster and more accurately the matching will run. Let me take an example I often use, from the music world: Lady Gaga. When you compare "Lady Gaga" and "Gaga", the similarity between those two terms is very, very small; no similarity function in the world will give a similarity higher than, I don't know, maybe 30%, which is not enough for a match. But once you create a synonym between "Lady Gaga" and "Gaga" in the knowledge base, that is a 100% match, and those records will match. So you basically get new matches there. The fuzzy grouping SSIS component doesn't use this type of similarity. We use a similar algorithm, but we apply a knowledge base on top of it, so it is more accurate and faster.
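The actual DQS matching algorithm isn't public, but the idea of layering knowledge-base synonyms on top of a plain similarity function can be illustrated with a toy Python sketch. Here `difflib.SequenceMatcher` stands in for the edit-distance-style similarity, and the synonym table is a hypothetical stand-in for the knowledge base:

```python
from difflib import SequenceMatcher

def fuzzy_similarity(a, b):
    """Plain edit-based similarity, roughly what fuzzy grouping relies on."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A toy knowledge-base synonym table mapping variants to a leading value.
SYNONYMS = {"gaga": "lady gaga"}

def canonical(term):
    """Normalize a term through the synonym table before comparing."""
    t = term.lower()
    return SYNONYMS.get(t, t)

def semantic_match(a, b, threshold=0.8):
    """Match via knowledge-base synonyms first, then fall back to fuzzy."""
    if canonical(a) == canonical(b):
        return True  # synonym hit: treated as a 100% match
    return fuzzy_similarity(a, b) >= threshold

print(semantic_match("Lady Gaga", "Gaga"))  # prints True, via the synonym table
```

Without the synonym step, the raw similarity between the two strings falls below a typical matching threshold, so the pair would be missed, which is exactly the point Gadi is making.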
Greg Low: Yes that sounds very good. Are there any other general best practices you find in terms of working with Data Quality Services?
Gadi Peleg: There is one rule: before you go to match your data, it is always best practice to cleanse it, and it is obvious why. Once your data is cleansed and consistent, your matching becomes much simpler; you can create a very simple matching policy, even using exact matching, when your data is already sanitized and cleansed. So it is always good practice to do cleansing before matching, and once the cleansing is done, use its output as the input for matching. The nice thing about DQS is that we don't actually touch the original columns; for each source column we create an output column, and that is where we place the corrected value. So you can take those output columns as the input to your matching projects.
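The cleanse-then-match flow Gadi recommends can be sketched in a few lines. This is a toy Python illustration, not how DQS is implemented: the correction table and city records are hypothetical, but the shape matches what he describes, with the original column left untouched and the fix written to a separate output column that then feeds an exact match:

```python
# Hypothetical knowledge-base corrections (leading value per variant).
CORRECTIONS = {"Sidney": "Sydney", "Melborne": "Melbourne"}

def cleanse(records, column):
    """Add a '<column>_Output' field holding the corrected value,
    leaving the original column untouched, as DQS does."""
    for rec in records:
        rec[column + "_Output"] = CORRECTIONS.get(rec[column], rec[column])
    return records

def match_exact(records, column):
    """Group records whose cleansed output values are identical and
    return only the groups with more than one member (duplicates)."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[column + "_Output"], []).append(rec)
    return [g for g in groups.values() if len(g) > 1]

rows = cleanse([{"City": "Sidney"}, {"City": "Sydney"}, {"City": "Melborne"}], "City")
dupes = match_exact(rows, "City")  # "Sidney" and "Sydney" now collapse into one group
```

Before cleansing, "Sidney" and "Sydney" would only be caught by fuzzy matching; after cleansing, a cheap exact match on the output column finds them.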
Greg Low: I notice that in the Integration Services component, the original columns seem to flow through, and then there are the additional output columns plus the status column that indicates what has occurred.
Gadi Peleg: That's true, and the same is true of the cleansing project in the DQS client. So we have the original column, the output column, the reason, and the status: whether it is correct, corrected, invalid, a suggestion, and so on. And if you use reference data providers like Melissa Data, they will append additional columns; this is the enrichment we talked about before. For example, if you send your addresses, they will return the corrected address plus additional metadata about it, like longitude and latitude.
Greg Low: So the Melissa Data one actually does geolocation of the data as well?
Gadi Peleg: Yes, yes, it appends many more columns, and that's true of most of the services. So you pay these services for both correcting your data and enriching it.
Greg Low: Much richer data than what you were looking at in the first place. When you configure the component and pick the provider you are going to work against, does that then determine the available output columns at that point?
Gadi Peleg: When you pick the service, we create the same output column for each source column that you have, but on export we add the additional columns that these services provide.
Greg Low: Yes, I was thinking in terms of the Integration Services component: your input columns come in and go out as well, and you get the indication of what's happened. But if the service enriches the data in some way, I presume it is when you configure the component that it provides the additional output columns?
Gadi Peleg: Yes, true. There is an Advanced tab on the SSIS component where you can choose whether or not to append the additional columns from the service.
Greg Low: Yes, that's good. Listen, one of the things I always ask about, and I know it's not in the first version: is there any sort of extensibility model or anything like that?
Gadi Peleg: Unfortunately not. We didn't have a chance to put public APIs out; it is part of our thinking for the roadmap. One thing I can talk about, because it is public, although I don't recall a published case study for it: we have a very interesting partner in Germany who integrates Dynamics CRM with DQS through the SSIS component. They take the cleansed data from the SSIS component and show it to the user on their CRM page, and the user can select whether to incorporate the cleansed data in that specific context, for example. So there is a way to use the SSIS component as an extensibility point, but we don't have specific APIs in this release.
Greg Low: So I suppose that is the other question. Apart from the Data Quality client application and the component that comes in Integration Services, do we need to treat the Data Quality databases as a black box, or are there other ways I can interact with them?
Gadi Peleg: I think it is best to treat them as a black box. Much of the information is serialized, and it is very dangerous to touch any table there, because the knowledge base is basically a set of schema tables. There is logic for keeping temporary tables when you publish or edit a knowledge base, so once you start to mess around with those tables you can destroy things. I think it is quite dangerous, and we should treat it as a black box.
Greg Low: I was thinking more about reading it. For example, if I wanted to execute a query that told me things like the number of unique values in a particular domain, is there any way to query that sort of thing?
Gadi Peleg: I don't recall if there is a way to do it, and I don't know whether those numbers are stored in binary. I think anyone could reverse engineer it and query the tables there, but we don't provide any best practice for that.
Greg Low: Is there any thinking down the track about the sort of things you might eventually build or add on? I am wondering if there is a developer surface in mind at all?
Gadi Peleg: Yes, we do have that in mind. Right now we are thinking about the next generation, and I can't really discuss too much about it. I can say there is a developer surface in mind; basically we want to promote more automation, and that is about all I can say.
Greg Low: That's ok. If people have ideas on how it might be expanded, is the Connect site the best place for posting those?
Gadi Peleg: They can use the Connect site, and anyone can also post on the DQS blogs and the forums that belong to DQS. Anyone can write there and add suggestions and feedback. We do monitor them and try to answer as fast as possible; these are basically the channels we use.
Greg Low: In terms of interaction with the team, DQS forum?
Gadi Peleg: The DQS forum and blogs, yes. And also, if a customer wants to become a TAP customer, a Technology Adoption Program customer, they can approach Microsoft to be included in that list. Then the interaction with the product group is much tighter.
Greg Low: Yes, that's great. What I will do is find the forum and blog addresses and put them in the notes for the show, so there are direct links to those. In terms of other things coming up that we should look forward to, are there presentations on this at any of the upcoming events, do you know? I presume at TechEd North America there will probably be something.
Gadi Peleg: TechEd North America is one of the events. I don't have the list in front of me, but there are events we always attend, like PASS or any of the Microsoft events; I just don't have the list with me.
Greg Low: That's great, I can find the ones that are coming up. Beyond that, Gadi, is there anywhere people will see and hear from you in the next year or so?
Gadi Peleg: I used to present at PASS and at other events. I am not sure about Matthew; I think he did a show with you a while back.
Greg Low: Oh, Matthew Roche.
Gadi Peleg: Yes, Matthew Roche. He is our great presenter on MDS and DQS. I think he is a really great presenter, and you can see him at almost any event in the world; he is travelling a lot now.
Greg Low: I think you did really well getting Matthew on the team. He is an old friend of the show and of myself; I think he is a very passionate presenter and always entertaining.
Gadi Peleg: Yes, and he delivers the message very, very well. For myself, I am focusing on features for the next generation of DQS. It is no secret that we are considering a cloud experience, and not just for DQS; many teams at Microsoft are. This is the next generation.
Greg Low: That’s great. Well listen, thank you so much Gadi for your time. We will talk to you again soon.
Gadi Peleg: Thank you so much for having me on the show, thank you so much.
Greg Low: That’s great.