Tuesday 29 December 2009

Best Practices in SSIS

I have collected these SSIS best practices from various websites.
Here are 10 SSIS best practices that are good to follow during any SSIS package development:
1. The most desired feature in SSIS package development is re-usability. In other words, we can call them standard packages that can be re-used during the development of different ETL components. In SSIS, this can be easily achieved using the template features. SSIS template packages are re-usable packages that you can use in any SSIS project any number of times. To know more about how to configure this, please see
http://support.microsoft.com/kb/908018

2. Avoid using the dot (.) naming convention for your package names. The dot (.) naming convention is sometimes confused with the SQL Server object naming convention and hence should be avoided. A good approach is to use an underscore (_) instead of a dot. Also make sure that package names do not exceed 100 characters. During package deployment in SQL Server storage mode, it has been noticed that any characters beyond 100 are automatically truncated from the package name. This might result in SSIS package failure at runtime, especially when you are using ‘Execute Package’ tasks in your package.

3. The flow of data from upstream to downstream in a package is a memory-intensive task. At most steps, and at the component level, we have to carefully check and make sure that no unnecessary columns are passed downstream. This avoids extra execution-time overhead in the package and in turn improves the overall performance of package execution.

4. While configuring any OLE DB connection manager as a source, avoid using ‘Table or view’ as the data access mode; this is equivalent to ‘SELECT * FROM <table>’, and as most of us know, SELECT * is our enemy: it takes all the columns into account, including those which are not even required. Always try to use the ‘SQL command’ data access mode and include only the required column names in your SELECT T-SQL statement. In this way you can block unnecessary columns from being passed downstream, as in the sketch below.
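For example, a source query such as the following (the table and column names here are illustrative, not from the original article) lets only the needed columns enter the pipeline:

-- 'SQL command' data access mode: select only the columns
-- that the data flow actually needs (names are illustrative).
SELECT CustomerID,
       FirstName,
       LastName,
       Email
FROM   dbo.Customer;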

5. In your Data Flow Tasks, use the Flat File connection manager very carefully. Creating a Flat File connection manager with the default settings will use the string data type [DT_STR] as the default for all column values. This might not always be the right option, because you might have some numeric, integer or Boolean columns in your source; passing them downstream as strings would take unnecessary memory space and may cause errors at later stages of package execution.

6. Sorting data is a time-consuming operation. In SSIS you can sort data coming from upstream using the ‘Sort’ transformation; however, this is a memory-intensive task and can degrade overall package execution performance. As a best practice, wherever we know that the data is coming from SQL Server database tables, it is better to perform the sorting operation at the database level, where the sort can be done within the query. This is good because sorting in SQL Server is much more refined and happens at the SQL Server level, and it often improves the overall performance of package execution. See the sketch below.
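As an illustration (the table and column names are assumed for the example), the sort below happens inside SQL Server instead of in a Sort transformation; after doing this you can set IsSorted=TRUE and the SortKeyPosition values on the source adapter’s output so that downstream components know the data is already ordered:

-- Sort at the database level instead of using the Sort transformation.
SELECT OrderID,
       CustomerID,
       OrderDate
FROM   dbo.SalesOrder
ORDER BY CustomerID, OrderDate;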

7. During SSIS package development, one often has to share a package with other team members or deploy the same package to other dev, UAT or production systems. One thing a developer has to make sure of is to use the correct package protection level. If someone goes with the default package protection level, ‘EncryptSensitiveWithUserKey’, the same package might not execute as expected in other environments, because the package was encrypted with the developer’s personal user key. To make package execution smooth across environments, one has to first understand the behaviour of the package protection level property; please see
http://msdn2.microsoft.com/en-us/library/microsoft.sqlserver.dts.runtime.dtsprotectionlevel.aspx . In general, to avoid most package deployment errors when moving from one system to another, set the package protection level to ‘DontSaveSensitive’.

8. It’s a best practice to make use of Sequence containers in SSIS packages to group different components at the ‘Control Flow’ level. This offers a rich set of facilities:
o Provides a scope for variables that a group of related tasks and containers can use
o Provides a facility to manage the properties of multiple tasks by setting properties at the Sequence container level
o Provides a facility to set the transaction isolation level at the Sequence container level.
For more information on Sequence containers, please see
http://msdn2.microsoft.com/en-us/library/ms139855.aspx.
9. If you are designing an ETL solution for a small, medium or large enterprise business need, it is always good to have the ability to restart failed packages from the point of failure. SSIS has an out-of-the-box feature called ‘Checkpoints’ to support restarting failed packages from the point of failure. However, you have to configure the checkpoint feature at the package level. For more information, please see
http://msdn2.microsoft.com/en-us/library/ms140226.aspx.

10. The Execute SQL Task is our best friend in SSIS; we can use it to run a single SQL statement or multiple SQL statements at a time. The beauty of this component is that it can return results in different ways, e.g. single row, full result set and XML. You can use different connection types with this component, such as OLE DB, ODBC, ADO, ADO.NET and SQL Mobile. I prefer to use this component most of the time with a Foreach Loop container, to define the iteration loop on the basis of the result returned by the Execute SQL Task, as sketched below. For more information, please see
http://msdn2.microsoft.com/en-us/library/ms141003.aspx & http://www.sqlis.com/58.aspx.
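For example, an Execute SQL Task with ResultSet set to ‘Full result set’ could run a query like the one below (the control table and column names are illustrative), store the result in an Object-typed variable, and a Foreach Loop container using the Foreach ADO enumerator can then iterate over the returned rows:

-- Returns one row per source table to process; the full result set
-- is captured in an Object variable for a Foreach Loop to enumerate.
SELECT SourceTableName,
       LastLoadDate
FROM   dbo.ETL_LoadControl
WHERE  IsEnabled = 1;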
SSIS: Suggested Best Practices and naming conventions
I thought it would be worth publishing a list of guidelines that I see as SSIS development best practices. These are my own opinions and are based upon my experience of using SSIS over the past 18 months. I am not saying you should take them as gospel, but these are generally tried and tested methods and, if nothing else, should serve as a basis for you to develop your own SSIS best practices.
One thing I really would like to see getting adopted is a common naming convention for tasks and components and to that end I have published some suggestions at the bottom of this post.
This list will get added to over time so if you find this useful keep checking back here to see updates!
If you know that data in a source is sorted, set IsSorted=TRUE on the source adapter output. This may save unnecessary SORTs later in the pipeline, which can be expensive. Setting this value does not perform a sort operation; it only indicates that the data is sorted.
Rename all Name and Description properties from the default. This will help when debugging particularly if the person doing the debugging is not the person that built the package.
Only select columns that you need in the pipeline, to reduce buffer size and reduce OnWarning events at execution time.
Following on from the previous bullet point, always use a SQL statement in an OLE DB Source component or LOOKUP component rather than just selecting a table. Selecting a table is akin to "SELECT *...", which is universally recognized as bad practice (http://www.sqljunkies.com/WebLog/simons/archive/2006/01/20/17865.aspx). In certain scenarios the approach of using a SQL statement can result in much improved performance as well (http://blogs.conchango.com/jamiethomson/archive/2006/02/21/2930.aspx).
Use SQL Server Destination as opposed to OLE DB Destination where possible for quicker insertions. (UPDATE: I used to recommend using SQL Server Destinations wherever possible but I've changed my mind. Experience from around the community suggests that the difference in performance between SQL Server Destination and OLE DB Destination is negligible and hence, given the flexibility of packages that use OLE DB Destinations, it may be better to go for the latter. It's an "it depends" consideration, so you should decide what you prefer based on your own testing.)
Use Sequence containers to organise package structure into logical units of work. This makes it easier to identify what the package does and also helps to control transactions if they are being implemented.
Where possible, use expressions on the SqlStatementSource property of the Execute SQL Task instead of parameterised SQL statements. This removes ambiguity when different OLE DB providers are being used. It is also easier. (UPDATE: There is a caveat here. Results of expressions are limited to 4000 characters, so be wary of this if using expressions.)
If you are implementing custom functionality try to implement custom tasks/components rather than use the script task or script component. Custom tasks/components are more reusable than scripted tasks/components. Custom components are also less bound to the metadata of the pipeline than script components are.
Use caching in your LOOKUP components where possible. It makes them quicker. Watch that you are not grabbing too many resources when you do this though.
LOOKUP components will generally work quicker than MERGE JOIN components where the two can be used for the same task (http://blogs.conchango.com/jamiethomson/archive/2005/10/21/2289.aspx).
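Where both inputs live in the same SQL Server database, there is often a third option that is quicker than either transform: push the join into the source query. A sketch, with assumed table names:

-- Join performed by the database engine instead of by a
-- Merge Join (which also needs sorted inputs) or a Lookup.
SELECT o.OrderID,
       o.OrderDate,
       c.CustomerName
FROM   dbo.SalesOrder AS o
INNER JOIN dbo.Customer AS c
        ON c.CustomerID = o.CustomerID;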
Always use DTExec to perf test your packages. This is not the same as executing without debugging from SSIS Designer (http://www.sqlis.com/default.aspx?84).
Use naming conventions for your tasks and components. I suggest using acronyms at the start of the name and there are some suggestions for these acronyms at the end of this article. This approach does not help a great deal at design-time where the tasks and components are easily identifiable but can be invaluable at debug-time and run-time. e.g. My suggested acronym for a Data Flow Task is DFT so the name of a data flow task that populates a table called MyTable could be "DFT Load MyTable".
If you want to conditionally execute a task at runtime use expressions on your precedence constraints. Do not use an expression on the "Disable" property of the task.
Don't pull all configurations into a single XML configuration file. Instead, put each configuration into a separate XML configuration file. This is a more modular approach and means that configuration files can be reused by different packages more easily.
If you need a dynamic SQL statement in an OLE DB Source component, set AccessMode="SQL Command from variable" and build the SQL statement in a variable that has EvaluateAsExpression=TRUE (http://blogs.conchango.com/jamiethomson/archive/2005/12/09/2480.aspx).
When using checkpoints, use an expression to populate the CheckpointFilename property which will allow you to include the value returned from System::PackageName in the checkpoint filename. This will allow you to easily identify which package a checkpoint file is to be used by.
When using raw files and your Raw File Source Component and Raw File Destination Component are in the same package, configure your Raw File Source and Raw File Destination to get the name of the raw file from a variable. This will avoid hardcoding the name of the raw file into the two separate components and running the risk that one may change and not the other.
Variables that contain the name of a raw file should be set using an expression. This will allow you to include the value returned from System::PackageName in the raw file name. This will allow you to easily identify which package a raw file is to be used by. N.B. This approach will only work if the Raw File Source Component and Raw File Destination Component are in the same package.
Use a common folder structure (http://blogs.conchango.com/jamiethomson/archive/2006/01/05/2559.aspx).
Use variables to store your expressions (http://blogs.conchango.com/jamiethomson/archive/2005/12/05/2462.aspx). This allows them to be shared by different objects and also means you can view the values in them at debug-time using the Watch window.
Keep your packages in the dark (http://www.windowsitpro.com/SQLServer/Article/ArticleID/47688/SQLServer_47688.html). In summary, this means that you should make your packages location-unaware. This makes it easier to move them across environments.
If you can, filter your data in the Source Adapter rather than filter the data using a Conditional Split transform component. This will make your data flow perform quicker.
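A sketch with assumed names: the WHERE clause below stops unwanted rows from ever entering the pipeline, whereas a Conditional Split would first pull them into the data flow's buffers and only then discard them:

-- Filter in the source query rather than with a Conditional Split.
SELECT OrderID,
       CustomerID,
       OrderTotal
FROM   dbo.SalesOrder
WHERE  OrderDate >= '20091201';  -- only the rows the data flow needs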
When storing information about an OLE DB Connection Manager in a configuration, don't store the individual properties such as Initial Catalog, Username, Password etc... just store the ConnectionString property.
Your variables should only be scoped to the containers in which they are used. Do not scope all your variables to the package container if they don't need to be.
Employ namespaces for your packages.
Make log file names dynamic so that you get a new logfile for each execution.
Use ProtectionLevel=DontSaveSensitive. Other developers will not be restricted from opening your packages and you will be forced to use configurations (which is another recommended best practice).
Use annotations wherever possible. At the very least each data-flow should contain an annotation.
Always log to a text file, even if you are logging elsewhere as well. Logging to a text file has less reliance on external factors and is therefore most likely to contain all information required for debugging.
Create a new solution folder in Visual Studio Solution Explorer in order to store your configuration files. Or, store them in the 'miscellaneous files' section of a project.
Always use template packages to standardize on logging, event handling and configuration.
If your template package contains variables put them in a dedicated namespace called "template" in order to differentiate them from variables that are added later.
Break out all tasks requiring the Jet engine (Excel or Access data sources) into their own packages that do nothing but that data flow task. Load the data into Staging tables if necessary. This will ensure that solutions can be migrated to 64bit with no rework. (Thanks to Sam Loud for this one. See his comment below for an explanation)
Don't include connection-specific info (such as server names, database names or file locations) in the names of your connection managers. For example, "OrderHistory" is a better name than "Svr123ABC\OrderHist.dbo".
Here are some advanced SSIS best practices.
Get your metadata right first, not later: The SSIS data flow is incredibly dependent on the metadata it is given about the data sources it uses, including column names, data types and sizes. If you change the data source after the data flow is built, it can be difficult (kind of like carrying a car up a hill can be difficult) to correct all of the dependent metadata in the data flow. Also, the SSIS data flow designer will often helpfully offer to clean up the problems introduced by changing the metadata. Sadly, this "cleanup" process can involve removing the mappings between data flow components for the columns that are changed. This can cause the package to fail silently - you have no errors and no warnings but after the package has run you also have no data in those fields.
Use template packages whenever possible, if not more often: Odds are, if you have a big SSIS project, all of your packages have the same basic "plumbing" - tasks that perform auditing or notification or cleanup or something. If you define these things in a template package (or a small set of template packages if you have irreconcilable differences between package types) and then create new packages from those templates you can reuse this common logic easily in all new packages you create.
Use OLE DB connections unless there is a compelling reason to do otherwise: OLE DB connection managers can be used just about anywhere, and there are some components (such as the Lookup transform and the OLE DB Command transform) that can only use OLE DB connection managers. So unless you want to maintain multiple connection managers for the same database, OLE DB makes a lot of sense. There are also other reasons (such as more flexible deployment options than the SQL Server destination component) but this is enough for me.
Only Configure package variables: If all of your package configurations target package variables, then you will have a consistent configuration approach that is self-documenting and resistant to change. You can then use expressions based on these variables to use them anywhere within the package.
If it’s external, configure it: Of all of the aspects of SSIS about which I hear people complain, deployment tops the list. There are plenty of deployment-related tools that ship with SSIS, but there is not a lot that you can do to ease the pain related to deployment unless your packages are truly location independent. The design of SSIS goes a long way to making this possible, since access to external resources (file system, database, etc.) is performed (almost) consistently through connection managers, but that does not mean that the package developer can be lazy. If there is any external resource used by your package, you need to drive the values for the connection information (database connection string, file or folder path, URL, whatever) in a package configuration so they can be updated easily in any environment without requiring modification to the packages.
One target table per package: This is a tip I picked up from the great book The Microsoft Data Warehouse Toolkit by Joy Mundy of The Kimball Group, and it has served me very well over the years. By following this best practice you can keep your packages simpler and more modular, and much more maintainable.
Annotate like you mean it: You've heard of "test first development," right? This is good, but I believe in "comment first development." I've learned over the years that if I can't describe something in English, I'm going to struggle doing it in C# or whatever programming language I'm using, so I tend to go very heavy on the comments in my procedural code. I've carried this practice over into SSIS, and like to have one annotation per task, one annotation per data flow component and any additional annotations that make sense for a given design surface. This may seem like overkill, but think of what you would want someone to do if you were going to open up their packages and try to figure out what they were trying to do. So annotate liberally and you won't be "that guy" - the one everyone swears about when he's not around.
Avoid row-based operations (think “sets!”): The SSIS data flow is a great tool for performing set-based operations on huge volumes of data - that's why SSIS performs so well. But there are some data flow transformations that perform row-by-row operations, and although they have their uses, they can easily cause data flow performance to grind to a halt. These transformations include the OLE DB Command transform, the Fuzzy Lookup transform, the Slowly Changing Dimension transform and the tried-and-true Lookup transform when used in non-cached mode. Although there are valid uses for these transforms, they tend to be very few and far between, so if you find yourself thinking about using them, make sure that you've exhausted the alternatives and that you do performance testing early with real data volumes.
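As one common alternative (a sketch only, with assumed table names): instead of an OLE DB Command transform issuing an UPDATE per row, land the changed rows in a staging table with a fast destination, then run a single set-based statement from an Execute SQL Task:

-- One set-based UPDATE in an Execute SQL Task replaces
-- thousands of row-by-row UPDATEs from an OLE DB Command transform.
UPDATE d
SET    d.Email        = s.Email,
       d.ModifiedDate = GETDATE()
FROM   dbo.Customer AS d
INNER JOIN staging.CustomerChanges AS s
        ON s.CustomerID = d.CustomerID;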
Avoid asynchronous transforms: In short, any fully-blocking asynchronous data flow transformation (such as Sort and Aggregate) is going to hold the entire set of input rows in memory before it produces any output rows to be consumed by downstream components. This just does not scale for larger (or even "large-ish") data volumes. As with row-based operations, you need to aggressively pursue alternative approaches, and make sure that you're testing early with representative volumes of data. The danger here is that these transforms will work (and possibly even work well) with small numbers of records, but completely choke and die when you need them to do the heavy lifting.
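For example, when the data already sits in SQL Server, an aggregation like the following (assumed names) can replace the fully-blocking Aggregate transformation, so nothing has to hold the whole input in memory inside the pipeline:

-- Aggregate in the source query instead of using the
-- (fully blocking) Aggregate transformation.
SELECT CustomerID,
       COUNT(*)        AS OrderCount,
       SUM(OrderTotal) AS TotalValue
FROM   dbo.SalesOrder
GROUP BY CustomerID;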
Really know your data – really! If there is one lesson I've learned (and learned again and again - see my previous blog post about real world experience and the value of pain in learning ;-) it is that source systems never behave the way you expect them to and behave as documented even less frequently. Question everything, and then test to validate the answers you retrieve. You need to understand not only the static nature of the data - what is stored where - but also the dynamic nature of the data - how it changes when it changes, and what processes initiate those changes, and when, and how, and why. Odds are you will never understand a complex source system well enough, so make sure you are very friendly (may I recommend including a line item for chocolate and/or alcohol in your project budget?) with the business domain experts for the systems from which you will be extracting data. Really.
Do it in the data source: Relational databases have been around forever (although they did not write the very first song - I think that was Barry Manilow) and have incredibly sophisticated capabilities to work efficiently with huge volumes of data. So why would you consider sorting, aggregating, merging or performing other expensive operations in your data flow when you could do it in the data source as part of your SELECT statement? It is almost always significantly faster to perform these operations in the data source, if your data source is a relational database. And if you are pulling data from sources like flat files, which do not provide any such capabilities, there are still occasions when it is faster to load the data into SQL Server and sort, aggregate and join your data there before pulling it back into SSIS. Please do not think that the SSIS data flow doesn't perform well - it has amazing performance when used properly - but also don't think that it is the right tool for every job. Remember - Microsoft, Oracle and the rest of the database vendors have invested millions of man years and billions of dollars[1] in tuning their databases. Why not use that investment when you can?
Don’t use Data Sources: No, I don't mean data source components. I mean the .ds files that you can add to your SSIS projects in Visual Studio in the "Data Sources" node that is there in every SSIS project you create. Remember that Data Sources are not a feature of SSIS - they are a feature of Visual Studio, and this is a significant difference. Instead, use package configurations to store the connection string for the connection managers in your packages. This will be the best road forward for a smooth deployment story, whereas using Data Sources is a dead-end road. To nowhere.
Treat your packages like code: Just as relational databases are mature and well-understood, so is the value of using a repeatable process and tools like source code control and issue tracking software to manage the software development lifecycle for software development projects. All of these tools, processes and lessons apply to SSIS development as well! This may sound like an obvious point, but with DTS it was very difficult to "do things right" and many SSIS developers are bringing with them the bad habits that DTS taught and reinforced (yes, often through pain) over the years. But now we're using Visual Studio for SSIS development and have many of the same capabilities to do things right as we do when working with C# or C++ or Visual Basic. Some of the details may be different, but all of the principles apply.



References:
http://blogs.msdn.com/sqllive/archive/2007/05/21/sql-server-integration-services-ssis-10-quick-best-practices.aspx
http://bi-polar23.blogspot.com/2007/11/ssis-best-practices-part-1.html
http://bi-polar23.blogspot.com/2007/11/ssis-best-practices-part-2.html
http://blogs.conchango.com/jamiethomson/archive/2006/01/05/SSIS_3A00_-Suggested-Best-Practices-and-naming-conventions.aspx


Thanks,
Narasimha
