Greg Low: Introducing show number two with guest Hilary Cotter.
Welcome, our guest this evening is Hilary Cotter. Hilary has been involved in IT for the last 20 years. Originally from Canada, he has lived in New Zealand and currently lives in the eastern United States. He has worked as a programmer, LAN and web administrator, and over the past five years exclusively as a DBA. He has been working with SQL Server since v6.0 and has been a SQL Server MVP since 2001, with a keen interest in Microsoft search technologies and SQL Server replication. He completed a book on SQL Server 2000 transactional and snapshot replication in 2004, and is working on a book on SQL Server 2000 merge replication for the third quarter of 2005. Welcome Hilary.
Hilary Cotter: Thank you, Greg.
Greg Low: I must admit, the book you wrote on transactional and snapshot replication is one of the largest I have in the library. I wonder if it was sold by the pound; I found it very informative. Was it a good experience to put together?
Hilary Cotter: Hard at times, since I would go off on tangents. There were some areas I really fell down on: I didn't do a lot of work on recoverability, and the performance section really sucks. I've discovered whole new worlds in both those areas since then. For the 2005 one I hope to really work on recoverability and disaster recovery: how to recover your replication solution if something goes south.
Greg Low: Back when I taught a v6.0 class, replication was one of the areas where students found it difficult just to get the labs working, let alone anything else. By the time we got to SQL Server 7, the wizards that appeared seemed to make life easier.
Hilary Cotter: I can't remember the wizards in 6.0 and 6.5 too well. In 6.0 they had a feature where you could diagram your replication topology; beyond that I don't remember a whole lot about the wizards. The problem with replication is that it can fall down in many areas. It's dependent on security and network topology. If you're replicating over the Internet or a WAN, or between domains that aren't trusted, there are problems to look into. You can break it in so many different ways; there are traps all over the place, but if you can get past the traps, it's really quite solid. Then if your network goes down, it'll break, and the first time you see it break you get upset, so you need to learn how to maintain it; it has to be solid. I feel your pain.
Greg Low: What are the main things people get wrong with or have difficulties with when dealing with SQL Server 2000 replication?
Hilary Cotter: I think identity range management is a problem. There are different kinds of problems. First, if you have a big snapshot, it will probably time out on you. Replication first makes an estimate of how long it will take to generate your snapshot, and if that estimate is wrong, your agents will be marked suspect. Then, in deploying your snapshot, when it sends data to your subscriber it is BCP-ing into your subscriber, and sometimes during that bulk copy there's no communication between the distribution agents and the replication sub-system, which can also mark your agents suspect. So there are a lot of problems with large snapshots; they're problematic, and if one fails you need to restart from the beginning.
Greg Low: I was struggling just this week with a 30-gigabyte snapshot and a one-megabyte link. In the end it seemed to me more of a design problem: it was generating a snapshot of seven years of history, when there was only a month and a half of active data.
Hilary Cotter: What I recommend for situations like that is to replicate just the schema. Generate it to a local box, take that schema and pre-apply it on your subscriber, then apply the snapshot through… It's tricky to set up. Sending a big snapshot over the wire isn't a good idea.
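One way that kind of pre-applied subscriber can be wired up in SQL Server 2000 is a subscription that skips the automatic snapshot; a rough sketch, with hypothetical publication and subscriber names, assuming the schema and data are already in place at the subscriber:

```sql
-- Run at the publisher, in the published database.
-- @sync_type = 'none' tells replication not to generate or apply an
-- initial snapshot: the subscriber is assumed to already hold the
-- schema and data (pre-applied by hand, as described above).
EXEC sp_addsubscription
    @publication = N'Orders_Tran',     -- hypothetical publication name
    @subscriber  = N'SUBSCRIBER01',    -- hypothetical subscriber server
    @destination_db = N'OrdersCopy',   -- hypothetical subscriber database
    @sync_type   = N'none';
```

This is only a sketch of the general approach; the exact choreography of pre-applying the schema and seeding the data still has to be done manually, which is what makes it tricky.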
Greg Low: With the large ones, I suppose one of the options is to send a piece of media instead.
Hilary Cotter: There's an option for that. With a pull subscription, there's a dialogue that says…
Greg Low: From an alternate location?
Hilary Cotter: Yes, that’s the one to use.
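That alternate snapshot location can also be set from script in SQL Server 2000; a hedged sketch, with hypothetical publication and share names:

```sql
-- Point the publication's snapshot at an alternate folder, e.g. a share
-- whose contents can be burned to media and carried to the subscriber.
EXEC sp_changepublication
    @publication = N'Orders_Tran',              -- hypothetical name
    @property    = N'alt_snapshot_folder',
    @value       = N'\\FILESERVER\ReplSnapshots';  -- hypothetical share
```

The subscriber-side agent is then pointed at the same folder (or the copied media) instead of pulling the snapshot over the wire.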
Greg Low: What sort of things do you find… one problem I hear about all the time is people whose replication agents stop, and they haven't set up anything to monitor them.
Hilary Cotter: That's problematic, especially if you're using pull subscriptions, because with pull subscriptions the agents run on the subscribers, so you have to visit each subscriber to see what's failed. There's the replication monitor group: go to Tools, show replication monitor group, a dialog will pop up, and you can add each agent or distributor you wish to monitor. Add your subscribers, and you can then see at a glance which subscriber has failed on you, drill down right into the particular agents, drill down into a particular pull subscription agent and restart it, do maintenance on it, or set its retries to a larger value. Visibility is one problem; the other problem you'll run into is that when your agents fail, you probably want them restarting every five minutes if you're running continuously. You want agents to restart themselves: either schedule them to run every five minutes as well as continuously, or, when an agent fails, have it restart by looping back to step one of its job. Then the other thing you want to do is visit each subscriber, keep a replication errors table locally as a repository, and examine those errors to see which ones you're getting. You may find you're losing links a lot, and you'll never know that until you collect the errors.
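A minimal sketch of that kind of error harvesting, assuming a distributor with the default distribution database (MSrepl_errors is the distribution database's error table; column choice should be verified against your version's schema):

```sql
USE distribution;

-- Pull the most recent replication errors so they can be copied into a
-- local repository table and reviewed for patterns, e.g. flaky links.
SELECT TOP 50
    time,
    error_code,
    error_text
FROM dbo.MSrepl_errors
ORDER BY time DESC;
```

A scheduled job can append these rows to a local table on each subscriber, giving the per-subscriber error history the defaults don't keep.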
Greg Low: Each distribution agent has its own event-processing or alerting type options. Which do you tend to configure normally?
Hilary Cotter: The one I find I use is replication alerts. First, I tend to work in a distributed environment, replicating over a lot of WANs, maybe replicating over an ISDN link to Guam. These links go down a lot and generate a lot of errors, so if we use the alerting function to alert us, we're just getting a lot of noise. We want to know that a link is going down, but not every ten minutes. One option in alerting is to have it alert you only once a day, or once every so many time periods. Normally what you do is write to the event log and then have NetIQ handle those events; NetIQ has a more robust system for handling noise, so you can say something like "if it goes down more than 20 times a day, alert me." One of the things I worry about in these distributed environments, since I'm using pull everywhere, is visibility. What I don't know is how often these agents are failing, or how often my subscribers talk to my publishers. If you're offline for more than 48 hours, your subscription can expire, and I want to know what is expiring. So I collect history locally and run jobs which find out the last time each subscriber spoke with the publisher; if it goes beyond a certain time period, I raise an alert. In SQL 2005 you have an option for this: an alert is raised if your subscriber has not talked to your publisher within 80 percent of the threshold for your subscription. By default your subscription expiration period is 72 hours, and there's an alert you can go in and configure that will be raised within 80 percent of that time period.
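Throttling an alert to once a day can be done directly in SQL Server Agent; a sketch, assuming one of the predefined replication alerts (verify the exact alert name on your instance):

```sql
USE msdb;

-- Fire at most once per day (delay is in seconds) rather than on every
-- occurrence, cutting the noise from a link that flaps every ten minutes.
EXEC dbo.sp_update_alert
    @name = N'Replication: agent failure',  -- assumed predefined alert name
    @enabled = 1,
    @delay_between_responses = 86400;
```

NetIQ-style "more than 20 times a day" logic still needs an external tool; this only rate-limits the notifications.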
Greg Low: One question about that 72-hour number: I've come across people who have had long weekends and found that subscriptions expired on them. If you increase that value, is it really affecting anything other than the size of the distribution database?
Hilary Cotter: Here's the real issue. Set the retention period to the same value as your history retention. Your subscription will expire on the lesser of the two, history retention or your distribution retention. If there's no history coming from your subscribers, by default your subscription will expire in 48 hours; if your subscriber doesn't talk to your publisher within 72 hours, your subscription will expire. So the first thing to bite you will be history retention, and you should bump that up to 72 hours. To address your question: the retention period is how long replication metadata stays in the distribution database, that is, the commands and transactions in MSrepl_commands and MSrepl_transactions. For named subscriptions, they are purged as soon as they're applied on all subscribers; for anonymous subscriptions they hang around for the full retention period.
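Aligning those two retention settings can be sketched like this, run at the distributor (values in hours; property names are from sp_changedistributiondb):

```sql
-- Raise history retention to match the 72-hour distribution retention,
-- so the 48-hour history default isn't what expires the subscription.
EXEC sp_changedistributiondb
    @database = N'distribution',
    @property = N'history_retention',
    @value    = N'72';

-- Maximum distribution retention: how long metadata may stay in
-- MSrepl_commands / MSrepl_transactions before cleanup.
EXEC sp_changedistributiondb
    @database = N'distribution',
    @property = N'max_distretention',
    @value    = N'72';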
Greg Low: Because they don’t know which subscribers have received it?
Hilary Cotter: That's it. You want this to be as small as possible, but if it's too short you're not giving yourself enough of a window to get a subscriber up and running. In these distributed topologies replicating out to Guam during hurricane season, Guam might be offline for two to three days; Guam was always a problem for us. We were able to package things up to get our subscription out to Guam no problem, and we did run it with two days there. If you run it anonymously you want it small: that will mean fewer commands in your distribution database, though you might have to wrestle with deploying snapshots to expired subscribers. With a well-connected topology, and if you have lossy links, you might want to make it larger. With merge replication, you want to try to make it as small as possible. A famous deployment we have here is with Kraft Foods' Nabisco, the company that makes Oreos. They have 1,400 trucks; when a truck driver goes into the warehouse, they pick up their Pocket PC from a cradle and can see all the orders they're supposed to do for that day. They go out to their customers and give them, say, two crates of Oreos; the customer, maybe a shopping center, may reject one of the crates, and those details are entered into the Pocket PC. Back in the cradle, it synchronizes from the Pocket PC to the head office, and they pick up new orders over the Internet as they drive, I believe. The point is that for these Pocket PCs involved in a merge replication topology, the retention period is five days, down from 14. They set it to five because they found it was the optimal retention period for them: not too much merge replication metadata, but enough that they didn't have to deploy the snapshot to their Pocket PCs too frequently.
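Tuning the retention period on a merge publication like that can be sketched as follows (publication name hypothetical):

```sql
-- Drop the merge retention period from the 14-day default to five days:
-- less merge metadata to carry, at the cost of a smaller offline window
-- before a subscriber needs a fresh snapshot.
EXEC sp_changemergepublication
    @publication = N'FieldOrders',   -- hypothetical publication name
    @property    = N'retention',
    @value       = N'5';
```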
Greg Low: In general then with SQL Server 2000, what are the most basic mistakes people make with replication?
Hilary Cotter: Most basic mistake is using merge replication when they shouldn’t be.
Greg Low: Because they think they want bi-directional but haven’t looked at partitioning the data or something?
Hilary Cotter: Because they're using it bi-directionally between two servers, and what works best for them is bi-directional transactional replication. That's only good if you have a single pair: one publisher and one subscriber, not one publisher and two or three subscribers. I find that a lot of people have a central office and five branch locations. They go for a central publisher or central subscriber topology, when what they should be doing is remote-controlling into the central office. That works best if they have good links; if they don't, they need a distributed topology. I've also run into people for whom disaster recovery is critical. They use replication, and when you set up replication in a disaster recovery environment, where you have one destination server and a source server that have to be linked, it's sometimes hard to set up. Especially with identity keys, foreign key constraints, and a few other things: triggers, cascading updates and deletes, primary keys.
Greg Low: Often the primary keys end up becoming a unique index.
Hilary Cotter: Exactly. For pure disaster recovery, one server to another, I recommend they use some remote-control solution. The customers are hitting this one server, and if it goes down, they fail over to the disaster recovery site. I find people using replication when it's the wrong solution for their particular problem. The other problem is that their agents fail, this is their first time setting it up, and they don't know what to do. They get a little spooked by the agents failing on them all the time.
Greg Low: Do you think people ever configure agent profiles, or do they leave them at the defaults?
Hilary Cotter: The defaults really aren't that bad. People's replication knowledge evolves: they start off as a newbie, get to a certain point, then discover profiles and start running the "continue on data consistency errors" profile. What happens then is they don't know that their subscribers are getting more and more out of sync with their publishers, and I don't like to see people running with that. There are other profiles: you can set up a debugging profile, or run with no history at all. Create a profile with no history and you get a 10-15 percent performance increase; but if there's any failure, there's no history, and your agent is just marked with an X in a red circle. You change the profile for debugging, get some history, and find out what the error is all about. So yeah, the defaults are not bad.
Greg Low: As you would hope.
Hilary Cotter: Right. They've done a lot of work in 2005. For instance, one of the things that plagues every replication newbie is when your network goes down; in 2005 the replication agents monitor the network link and will restart themselves when it comes back. With the distribution agent there is an option called the independent agent option. If you have multiple tables which are related by DRI, those articles will be assigned a distribution agent, and several agents will be used, one for each group of articles, so you can get three, four, or five streams of data going to a subscriber at a single time. The goal in SQL 2000 was to share these agents.
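The shared-versus-independent choice is made when the publication is created; a hedged sketch, with hypothetical names:

```sql
-- Give this publication its own distribution agent rather than sharing
-- one agent across all publications between the same publisher and
-- subscriber, allowing parallel streams to a subscriber.
EXEC sp_addpublication
    @publication       = N'Orders_Tran',   -- hypothetical name
    @independent_agent = N'true',
    @repl_freq         = N'continuous',
    @sync_method       = N'native';
```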
Greg Low: Let’s actually take a break for a few minutes, when we come back we’ll talk about SQL Server 2005.
Greg Low: Hilary, could you tell us about yourself, you’re a SQL Server MVP?
Hilary Cotter: I live in the States.
Greg Low: In the eastern United States, right?
Hilary Cotter: Yes. I started off as a programmer, but got tired of maintaining other people's code, even my own code. I found as a programmer I was doing tremendous hours, while the LAN administrators were always the people out the door at five, so I started moving more and more into LAN administration; I was interested in it. From there I started having to maintain some databases. We were working with Sybase, and from there I started getting involved in SQL Server, purely on the system administrator side. My T-SQL is not the best; I have worked as a SQL developer, but it's not my strong point. I've always been on the administrative end of SQL Server, things like full-text search and replication. I think I started moving out of LAN administration in 1998, started getting more involved in web technologies, got involved with IIS, and started working with Microsoft search products like Site Server Search and Index Server, which is now Indexing Service. I've been working with that and SharePoint, and with an open-source search engine out there called Lucene. I'm working for a large media company, and we're looking at search engines to handle our search services. We got into Autonomy, and it has not worked out for us, so we're looking at Lucene in a big way. I'm not sure how that's going to work out for us, but since it's open source, we should be able to fix the particular problems we come across.
Greg Low: It's funny you mentioned search technologies. One thing that fascinates me as I watch the industry is that, whether you're on the developer end or the DBA end, people tend to like very exact and precise questions and answers. The dichotomy is that users like fuzzy questions and imprecise answers.
Hilary Cotter: That's not quite true; it depends on what your search is. As DBAs, if we search on "book" we want hits on just "book." Other people, who aren't in technical environments, are looking for fuzzy searches: if they search "book" they want it to come back with "books" too. So the people who really want imprecision are the general public. We in the technical world want precise matches, the Google kind of search where if you search for "book" you only get hits on "book." You're talking more about concept-based search, where we don't know exactly what we're looking for. Say we're looking for help with a problem with the transaction log ballooning, with large transaction logs. How can we search on "large transaction logs" and get hits on documents which use the phrases "big transaction logs" or "ballooning transaction logs"? How does the search engine do a concept-based search?
Greg Low: That's what was fascinating me. People look at it and say "I can apply LIKE clauses," but it's not that at all: different declensions, "driving" being the same as "drives," and "ballooning" the same as "large."
Hilary Cotter: Declension is things like possessives. Microsoft groups declension and conjugation together under the term stemming. Declension covers possessives and gender; conjugation covers verb forms: book, booking, booked. The kind of search which does that fuzzy matching is a freetext search. It's more expensive, too fuzzy, and not a natural way to search, yet the default for SharePoint Portal Server and SharePoint services is a freetext search.
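In SQL Server full-text terms, that precise-versus-fuzzy distinction is CONTAINS versus FREETEXT; a sketch against a hypothetical Docs table with a full-text indexed title column:

```sql
-- Precise: only rows whose title contains the exact word "book".
SELECT title FROM Docs
WHERE CONTAINS(title, N'book');

-- Stemming: also matches inflectional forms, e.g. books, booking, booked.
SELECT title FROM Docs
WHERE CONTAINS(title, N'FORMSOF(INFLECTIONAL, book)');

-- Fuzzy: FREETEXT stems and weighs all the words in the phrase.
SELECT title FROM Docs
WHERE FREETEXT(title, N'large transaction logs');
```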
Greg Low: I think I was talking to Darren, who I work with, and he was describing how all the sites now, if you talk to them, say they want a Google-type search.
Hilary Cotter: That’s a strict search, not fuzzy.
Greg Low: Interesting that’s how they describe it.
Hilary Cotter: Google's algorithm isn't the best out there. The one they're using returns matches with… the technologies are 20 years old. If a site has a lot of links to it, that site will be ranked higher than a site that doesn't. If you do a search on Google and compare the results with a search on MSN.com, the results are different, sometimes quite different. If you're looking for something out of the ordinary you'll get better results on MSN than on Google; on Google the site ranking skews hits too much. The question to ask is: does the "I'm Feeling Lucky" button work well for you on Google? If it does, if you get good results from I'm Feeling Lucky for the kinds of searches you do, Google is your search engine. If not, you should be searching on MSN.com.
Greg Low: In terms of replication, what are the things you’re looking forward to seeing replication-wise in SQL Server 2005?
Hilary Cotter: I've worked with it a little while, and I'm still getting my head around all the pieces. I think there's going to be a period when people try to figure out when to use which particular feature. There's something called peer-to-peer replication.
Greg Low: Brand-new topology, yes.
Hilary Cotter: What it is is bi-directional replication within transactional replication, good for more than a single pair, so you can have three, four, or five nodes within your replication solution. It's intended to be used when each of these nodes has an indistinguishable data set: there's nothing unique about each node, they each have the same data. It's designed for scale-out read solutions. If you have 1,000 users banging away at a box, you'd get better performance if you put 10 boxes in the mix and had 100 users banging away at each box. More bang for your buck with 10 lighter-weight servers than a single monolithic large-horsepower solution; peer-to-peer replication is designed for that scale-out. I also think they've done a lot of work on particular pain points. They've taken replication monitor, a separate executable now, and you can go into replication monitor and get all sorts of statistics on particular replication solutions.
You can drill down on different replication subscribers, find out which particular article is experiencing the most uploads and downloads, and which are candidates for the upload-only or download-only feature on a per-article basis. It's a good window into the statistics of your system. For transactional replication, a lot of people are interested in how old their data is when it hits the subscriber, timing how long it takes to get there. There's a problem with that: it hinges on your data flow patterns, and if they're all over the place, you can expect your latency to be all over the place too, varying widely from one to several minutes. By injecting a tracer token you get a brief window into what replication performance was like at a point in time, and you can collect over a variety of periods and get a little bit of history. It's a volatile measure, and I'm concerned people will say "oh, my latency is 22 seconds," when the next minute it can be several minutes. It also depends on whether single commands or batches are coming from your publisher, and on how your publisher is performing at that time; your replication solution will slow down. But overall these things have really worked out; it's a really good window into your replication. There are auto-sensing agents that will spin up on demand, and the ability to replicate schema changes: all DDL commands are now replicated. I'm not sure what happens if you try to alter a primary key.
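Posting and reading a tracer token in SQL Server 2005 can be sketched like this, run at the publisher in the published database (publication name hypothetical; parameter names should be checked against the procedure documentation):

```sql
-- Inject a tracer token into the transaction log so its travel time
-- to the distributor and subscriber can be measured.
DECLARE @tokenId int;
EXEC sys.sp_posttracertoken
    @publication     = N'Orders_Tran',   -- hypothetical publication name
    @tracer_token_id = @tokenId OUTPUT;

-- Later: report publisher-to-distributor and distributor-to-subscriber
-- latency for that token.
EXEC sys.sp_helptracertokenhistory
    @publication = N'Orders_Tran',
    @tracer_id   = @tokenId;
```

As he notes, one token is one sample; collecting tokens across periods is what builds a usable latency history.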
Greg Low: That's an area I'm interested in. It seems to be one of the big bones of contention when you're doing complicated schema changes that involve replication, more so than what's covered by the existing sp_repladdcolumn.
Hilary Cotter: You'd like to think that at least primary keys would be stable. I thought that was true, and hopefully when the developers start pushing something out the door every couple of weeks they're not going to be modifying their primary keys. Changing a primary key raises some questions: if replication can't handle it, fine, but it also raises questions about what the developers are doing.
Greg Low: A question I get asked quite a bit comes from people who have moved from other environments. One thing they don't like is that developers have to be aware of whether a table is replicated at all; they would prefer an arrangement where developers can work with the tables oblivious to the fact that they're part of replication. They say there should be a distinct break between the two. I don't see any easy way to do that currently.
Hilary Cotter: In SQL 2005 it'll be less painful; it should be handled seamlessly. I'm hesitant to say something like that, in case someone does something unanticipated by Microsoft. This might be bullet-proof, but I don't know how they're doing it. They might be using schema-scoped DDL triggers to replicate the schema changes.
Greg Low: I mentioned in the show that I'm interested in the possibilities with DDL schema triggers. They could lend themselves to being one of the interesting uses for managed code, since a lot of the code in DDL triggers might end up being string-processing type code. We'll be interested to see.
Hilary Cotter: Right.
Greg Low: In general with SQL Server 2005, apart from anything in replication, what are the main things you’re looking forward to?
Hilary Cotter: Well, there's always the challenge of learning it; I'm midway through that right now. I'm a newsgroup junkie, so I know the pain points, and they've done a lot of work, more than five years, to address those pain points; a tremendous job of addressing them. They've tried to divorce replication as much as possible from the underlying network architecture it sits on, so it's more tolerant of network failures, and of subscriber and publisher load. In some cases replication can fail, agents can go suspect, if your subscriber or publisher is under an extreme load. There's been a tremendous amount of work to simplify it and make it less of a management problem; they've completely re-worked the wizards, and they're really simple now. Before, they were always bewildering. Now DBAs are going to be a little bewildered by how to deploy a subscription, but once you get your basic navigation around the replication wizards, it will be a lot simpler for you.
Greg Low: I think sometimes getting your head around what the agents are and where they run is kind of an issue. That's what I enjoyed in your book: a detailed description of what sits where.
Hilary Cotter: They've also changed the security model. It's very secure by default now, so expect lots of questions as to "how do I work with it?" When finding the right security for your agents, people are going to ask "what do I use?" I haven't quite worked that out myself yet. It looks like the accounts you're using just need rights to… the account you'd run the snapshot agent with, it's just that base security, it's not any more… You don't need rights to the database to run these agents under.
Greg Low: What about cross-domain support? I know that’s been one of the other difficult areas when we have untrusted domains.
Hilary Cotter: The way I normally work that is through pass-through authentication. By default, when you log on to an application which is using NT security, the first thing it does is try to log on with a local machine account; if that doesn't work, it bumps it up to the domain level. If you have a local machine account with the same name and password on both sides of your replication solution, the domain security check will be bypassed in favor of the local machine check. That's the way I recommend you get around untrusted domains: run the same account with the same password in the different domains, and it will bypass the actual domain check and use the local credentials. It only comes into play with pull subscriptions, which have to use a security context to connect to the publisher and to that snapshot share; that is where the security context comes in for pull. For push, it's just the distribution agent connecting to the snapshot share, which is on the publisher.
Greg Low: Presume on an Internet connection, FTP is the logical solution for snapshot files?
Hilary Cotter: Right, and that raises a lot of interesting questions. When people hear FTP they freak and say it's not secure, but it's not necessarily insecure. When you do anonymous authentication with FTP, the password doesn't fly over the wire. You are holding your snapshot up to anyone, though, so do you care about anyone potentially accessing your FTP site and downloading your snapshot? In most cases you don't care; if you're a bank, you care very much. What some people do is use NT security for the FTP account, but that means the password is going to fly over the wire, and if someone traps that password with a sniffing attack on your router, they get the password you're using. If you use the same password elsewhere they may be able to use it, and then they may have access to your machine. Most security breaches are caused locally, by someone already in your building, on your machine. A lot of people who are upset about using FTP have no problem using POP, and your password travels there too. So there are some misconceptions. Still, a lot of people don't like using FTP, and I can understand that: there have been some significant breaches using FTP, and if you're a bank, FTP is out of the question. What you do then is use a VPN; Microsoft has a VPN module that ships with the operating system, but it's slow. With SQL Server 2005 replication, with merge publications you can synchronize over the Internet with IIS. With SQL Server CE you connect to a web server to download your snapshot and to connect to your SQL Server; this is similar, going through HTTPS. So with merge replication you have a highly secure way of downloading your snapshots and connecting to your merge publisher, covering all the replication traffic for merge replication.
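Enabling FTP snapshot delivery on a SQL Server 2000 publication can be sketched roughly as follows (names are hypothetical, and the property names should be verified against sp_changepublication's documentation for your version):

```sql
-- Allow subscribers to pull the snapshot over the Internet via FTP.
EXEC sp_changepublication
    @publication = N'Orders_Tran',           -- hypothetical name
    @property    = N'enabled_for_internet',
    @value       = N'true';

-- Point subscribers at the FTP server hosting the snapshot folder.
EXEC sp_changepublication
    @publication = N'Orders_Tran',
    @property    = N'ftp_address',
    @value       = N'ftp.example.com';       -- hypothetical server
```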
Greg Low: That’s great. That’s getting us towards time. It would be great to hear what’s happening in your world in the next few months? Any more books?
Hilary Cotter: I'm trying to push this merge replication book out the door; I'm about halfway through it right now. Some things are killing me, figuring out how things work. It's one of those things you look at and struggle with: why did they make that decision? Then it dawns on you why, and it's a brilliant solution to the particular problem they're dealing with. You end up with so much respect for the merge replication developers. I'm also speaking on full-text search at PASS, gearing up for that, and writing a few articles for some magazines. That's always interesting; writing helps solidify your thinking, because when you write something you want to be accurate, and something will come up, you'll go off on a tangent and discover something interesting. Right now I'm trying to quantify the performance impact of storing retrieved BLOBs in your database as opposed to the file system. I've been going up with larger and larger BLOBs; looking at 10,000, it looks like it's three times… Your solution can handle three times more hits per second if you store the BLOB on the file system and retrieve it from the file system, as opposed to retrieving the BLOB from the database. I want to explore that further with the "text in row" option, and see what the difference is between SQL 2000 and SQL 2005.
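The "text in row" setting he mentions is a per-table option; a minimal sketch, with a hypothetical table name:

```sql
-- Store text/image values up to 7000 bytes in the data row itself,
-- rather than on separate text pages, which can change BLOB retrieval
-- performance for small values.
EXEC sp_tableoption
    @TableNamePattern = N'dbo.Documents',   -- hypothetical table
    @OptionName       = N'text in row',
    @OptionValue      = N'7000';
```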
Greg Low: I spoke myself at PASS Munich a few weeks back, which was very enjoyable, and I'm looking forward to presenting the same session at PASS in Dallas at the end of the year. I look forward to seeing you there. Thank you, Hilary.
Hilary Cotter: You’re welcome, it’s been a pleasure.