Brainstorming Into the Business Intelligence Clouds

Part II of my Business/Enterprise Intelligence Series, check out part 1.

In the previous part of this ongoing series of getting the business intelligence and analytics pieces of the enterprise intelligence world figure out, I hit on the first step of the entire process. Collection of available data points. Not the desired data points, nice to haves, or might haves, but the available data points.

In this entry I will step past that and get into a little more of the technical depth of how these data points will be connected in the enterprise systems. More to the point, how they are not connected and will need to be in the enterprise.

A Quick Review, The Systems

  • Point Of Sale (POS) w/ 300+ stores
  • Webtrends Analytics tracking the Awe Widgets Inc Website
  • Internal Accounting Software (IAS)
  • In-house Built Customer Lists for Sales w/ Excel & Access

With the four systems we have a huge amount of actionable data points. So how do we connect all of these data points? That is the thorny question that comes up. Again, just like during the discovery of what data is available, we have a simple first step for this process.

Figure out the origin and destination!

That is really it. Of course, as anyone would concede, there is a lot in between. Without knowing where we are starting and where we need to end up though, finding the in between is in vain. Sure, one can make the mistake as many business intelligence projects do, and start building architecture before anyone knows or actually can use the architecture. This is a severe mistake, and I myself have literally seen millions of dollars get wasted only to have to start a project over, right in the middle of the original project. Do NOT become one of those projects, find the origin and destination!

In this case I am again, going to use my creative side and determine the operations of this company. Awe Widgets Incorporated is currently using SQL Server and has in house skills developed among their staff. That provides an easy option of moving towards SQL Server Reporting Services (SSRS) and working with SQL Server Integration Services (SSIS) for the ETL bits. Whatever other platforms are in between will be easily connected with these two tool stacks. So now I know my multiple originations, and my single point destination.

Moving on to the fun bits.

Now we have to figure out how we're going to get from our origination to our destination. This is where the trip becomes interesting. Just to make sure things stay clear, and list out what we have that we are working with.

Originations
Excel & Access (Office 2007) *.mdb, *.xlsx, and *.csv/*.txt data stores
Internal Account Software (IAS) This one is a prospective can of worms.  Proprietary layouts, de-normalized & normalized data, and all sorts of redundant, non-atomic data.  This sounds like an accounting package right?  :p
Webtrends Analytics Data Exchange Web Services (DX) Webtrends web services provide REST style architecture, with the ability for data to be retrieved in XML, JSON, HTML, or other formats (we can add more if need be, just let us know).
Point of Sale System (POS) This system provides two daily exports, one at 6:00am and one at noon for processing.  The export format is *.csv.

Destination
SSIS & SSRS SQL Server Integration Services used to connect SQL Server Reporting Services, with the core underlying data stored in SQL Server.

So now we are starting to get somewhere.  We now where we are, what we have, and where we want to go.  Time to wire some things up, so stay tuned.  That will be in the next entry.  Also, if you are planning on attending Webtrends Engage in New Orleans, but sure to look up the Enterprise Intelligence session that I will be presenting with Heather Crince of Webtrends and Tony G. of Orbitz?

If you missed the previous entry in this series, check out Where is the Other Data Tracking?!  Where are My Acronyms?!?!

Digg It!DZone It!StumbleUponTechnoratiRedditDel.icio.usNewsVineFurlBlinkList

Where Is The Other Data Tracking?! Where Are My Acronyms?!

Where is the business intelligence?

This blog entry may instigate just a bit.  It will also be a little long for a blog entry. You've been warned.  I suppose though, if you know me & the work I do, that is not really something new.  I see something wrong, broken, or otherwise and I am likely to point it out and describe it in detail.

As I roll into 2010 coding, implementing, and rocking with Webtrends, I have noticed something lacking in the analytics industry.  I will add the clause that obviously Webtrends has people thinking about these things and actively working on this topic, but what I want to point out is a general issue.  Where is the other data, where is the existing data?

It seems, even though some company's kind of get to a certain point in connecting data points, not many really do.  The biggest reason is that most companies are just a few steps away from actually being able to do so.  The other even larger reason is, many do not realize what data should or should not be connected.

When someone starts pulling CRM (Customer Relationship Manager/Management), Analytics, POS (Point of Sale), ERP (Enterprise Resource Planning) data, and other sources into a single reporting repository we finally have real business intelligence.  Otherwise so many entities stumble through the land mines of data confusion.  I see this so much it really drives me crazy sometimes.

So how can a company or entity identify and connect these points of data?  It often starts with a ridiculously simple step.  At risk of oversimplifying things, let me just state the first step in getting out of the data confusion land mines is to first figure out your data.  Ask these things:

  • What data does the business have?
  • What data is currently used and available?

Do NOT ask what data you want, do NOT ask what may not be.  What you want to know first, and so many companies make this mistake, is to know what you know.  Do not, at the early stage of business intelligence information gathering start asking too many hypotheticals.  I promise the risk of failure increases exponentially for every hypothetical data point added.

Once you have identified what data is available, start figuring out how the data is related.  Once you understand the data you can then, and only then, make the huge leap to determining what data you want and how to get it to where you want.

Let me draw this out in a real world example.  Beware; I am using my creative mind now!

What we have so far, for Awe Widgets Incorporated, is several data points.

  • Point of Sale/POS Systems in 300+ stores.
  • Web Analytics (by Webtrends of course) tracking all sorts of great data points on the Awe Widgets Incorporated Website.
  • Internal Accounting Software (Almost ERP, not really)
  • In-house Built Customer Lists for Sales.

So there we go, four key pieces of tracking.  So how would they work together?  With a little further analysis (my key creative side now analyzes Awe Widgets Incorporated internal structure) and we find a few connections.

Correlation, POS to Webtrends Analytics

The POS System has a tracking identifier for customers which we can use to sync up with logged in users tracked via Webtrends Analytics.  This data can be used to derive who is and is not in stores purchasing.  In addition trending could follow the user flow to derive some actionable decisions on how to encourage online or store front shopping.  Just these two data points being connected add a lot of value.

Correlation, Internal Account Software ties to POS

Another data point tie in with the aforementioned POS & Webtrends data is the Internal Accounting Software (IAS).  The IAS holds information related to each sale, and other correlated information about how sales are going for the quarter, year, and other performance indicators.

Correlation, In-house Customer Lists for Sales

The sales department, in aggressive technical fashion has built a number of customer lists in Excel & Access.  The Access Application has a partially updated data store with a server based Excel file holding the updated piece of data about each of the sales person's current sales.  I know, I hear it now, every developer that is familiar with this scenario screaming, "OMG, you have your data in Excel AND Access, and it is supposed to have integrity, and be aaaaaaaaaaaaaaaaggggggggggggggggghhhhhhhhhhh noooooo!"  But you know, this @#$% happens.  : )  When things are like this, solutions get creative.

Tying Together the Pieces

Alright, this is when the awesome nerd bits start to happen.  But I have covered enough for this entry.  In the following entries on this topic I will step through this first data finding mission and start discussions on how to connect these sources and get that data mart, warehouse, or other middle tier piece into action.   I will continue on and lead into how the data can finally start telling a real story.  Because in the end, the real story is, somebody needs actionable data to act upon.  Does it really matter where it is?

Check out Part II of this series

Digg It!DZone It!StumbleUponTechnoratiRedditDel.icio.usNewsVineFurlBlinkList

Schedules, Tickets, and Analysis? HighBall Thoughts

OLAP?

I got to thinking about my HighBall Project and thought, wouldn?t it be awesome to use some of my know how to wire up some serious cubes and analyze the results from site usage all the way to scheduling analysis?  It?s an idea, but I?m not really sure how I would do it yet.  I do however have some ideas in regards to scheduling and how a nice cube could be used to analyze effective scheduling usage.  Sounds like something TriMet could even use ? if they don?t already do these types of analysis.  ;)

Here?s my thought.

Take a schedule as the time dimension.  That?s simple enough.  Now take X number of routes as opposing dimensions and use ridership counts, peak load, etc as the fact table sums and such.  The data should align accordingly to the apex of routes and concentration of route needs by riders.  This type of data could be used to find out where ridership peaks and drops and how to fill gaps or even where to increase service.

?not really sure, got a ways to go before I try it, but it is an interesting thought for analysis.  Eventually I?ll give it a shot.

Digg It!DZone It!StumbleUponTechnoratiRedditDel.icio.usNewsVineFurlBlinkList

Posted by: Adron
Posted on: 4/22/2009 at 6:47 AM
Tags: , , ,
Categories: Business Intelligence and Analytics | Highball
Actions: E-mail | Kick it! | DZone it! | del.icio.us
Post Information: Permalink | Comments (0) | Post RSSRSS comment feed

Notice #1 - Web Analytics Wednesday on Wednesday

So, if you live in the Portland area and are interested in web analytics or analytics or just want to meet, discuss over a few beers and chit chat on cool geek topics - swing into Henry's on the 23rd.  We all have a great time and you'll meet some of the top people in the analytics industry!

Check out more info on the event.  Plus, I'll be there!  Big Smile [:D]

Digg It!DZone It!StumbleUponTechnoratiRedditDel.icio.usNewsVineFurlBlinkList

Posted by: Adron
Posted on: 7/9/2008 at 7:02 AM
Categories: Keeping Up | WebTrends | Web Analytics | Business Intelligence and Analytics
Actions: E-mail | Kick it! | DZone it! | del.icio.us
Post Information: Permalink | Comments (0) | Post RSSRSS comment feed

Forget SQL Management Studio, Stick to BIDS

Ok, so lately there has been a decent bit of frustration from our crew about how to deploy a Analysis Services Project from BIDS to XMLA without going into SQL Management Studio and going through that whole mess.  Well my fellow coworker Xander got a solution figured out for this.  One of the other key side effects is that one doesn't actually need the database/cube itself physically on the box once this is done, but simply the project itself (it helps to have a cube to push to).

First just setup the actual XMLA build process.  To do this click on Tools and then External Tools.

Click on Add and in the Title, Command, Arguments, and Initial Directory enter the following;

  • Title:  Build XMLA
  • Command: Microsoft.AnalysisServices.Deployment.exe
  • Arguments: "-enter project name here-.asDatabase" /d /o:..\..\Project.xmla
  • Initial Directory: $(ProjectDir)\bin
  • Check the "prompt for arguments" option.
  • Check the "output window" option.

When the prompt comes up during execution of the Build XMLA you must remember to enter your project name exactly so that the database name matches.

Select OK and then click on Tools again.  You will now see a Build XMLA option on the Tools menu.

Now open a new project.  Since it is the standard template example, I'm going to open up AdventureWorks

Make sure just before executing this that you have done a clean build of the project.  Without a build the necessary *.aspDatabase file won't be available.  Once you have opened and built the solution then click on the Tools -> Build XMLA option.  The argument prompt will appear.  Enter the project name as shown in the image.  Now you should see the output window display the status.

After execution navigate to the project root and you will find the "Project.xmla" file.  Now you're good to go without fighting with SQL Management Studio all the time, nor needing a physical installation of your database/cube.

del.icio.us Tags: ,,,

Technorati Tags: ,,,
Digg It!DZone It!StumbleUponTechnoratiRedditDel.icio.usNewsVineFurlBlinkList

Posted by: Adron
Posted on: 3/20/2008 at 3:15 PM
Categories: WebTrends | Business Intelligence and Analytics
Actions: E-mail | Kick it! | DZone it! | del.icio.us
Post Information: Permalink | Comments (2) | Post RSSRSS comment feed

SQL Server Analysis Services Authentication Freak Out

[Rant On]

What is the deal really with the anti-user intuitive situation with analysis services when it decides that the data connection to a data source needs to use some arbitrary user authentication!  I had my SSAS cube set to process against a nice normal authenticated account and somehow, someway it went and stuck some completely unrelated user in the account login for the data source.  Now how is that supposed to help anything?

[Rant Off]

Needless to say, I'm looking forward to SQL Server 2008.  I hope it truly is cleaned up a good bit and as streamlined for cubes as I've been hearing.

Digg It!DZone It!StumbleUponTechnoratiRedditDel.icio.usNewsVineFurlBlinkList

Posted by: Adron
Posted on: 2/25/2008 at 10:45 AM
Categories: Rants | Business Intelligence and Analytics
Actions: E-mail | Kick it! | DZone it! | del.icio.us
Post Information: Permalink | Comments (0) | Post RSSRSS comment feed

Pre-existing Cube into Visual Studio Project?

I got to wondering, and realized it is not straight forward or even possible.  How does one go about bringing an existing SQL Server Analysis Services Cube into a new Visual Studio Solution or Project?

...this one I'll have to dig on, so far, nothing.

Digg It!DZone It!StumbleUponTechnoratiRedditDel.icio.usNewsVineFurlBlinkList

Posted by: Adron
Posted on: 2/5/2008 at 9:18 AM
Categories: Business Intelligence and Analytics
Actions: E-mail | Kick it! | DZone it! | del.icio.us
Post Information: Permalink | Comments (1) | Post RSSRSS comment feed

Common Sense Cardinality

I've been hearing a lot of discussion about cardinality this and cardinality that. With the topic becoming such a common point of discussion. I decided I would write up something.

First some definitions so that we all have a clear cut idea of what cardinality actually is.  Sometimes cardinality isn't fully understood at a basic level.  With some of the points I'm going to touch on a full understanding of the basis of cardinality will be needed.

Cardinality : 1. Cardinality is part of formal set theory.  A cardinal number is a type of number defined in such a way that any method of counting sets using it gives the same result.  2. Cardinality is a notion of the size of a set which does not rely on numbers.  It is a relative notion.  For instance, two sets may each have an infinite number of elements, but one may have a greater cardinality.

The second definition is the one that I will discuss further.  I must note, that it is odd that even before anyone is confused, the definitions are in opposition with each other.  Cardinality does or does not include numbers, both specifically defined by use.  For a little more information, check my previous entry Tip o' The Day.

Now on to some meat of this topic.  When designing cubes and planning the various dimensions of data, cardinality is a key concept that must be taken into account when doing so.  High cardinality high count items make bad dimensions, high cardinality low count and low cardinality low count items make great dimensions. Now before digressing and grumbling about how this might be a stupid statement, hear me out further.

High Count, High Cardinality Data Sets

If you have a large count data set that has no limits, and a high cardinality this is either a questionable data set to turn into a dimension.  These types of data sets; e-mail addresses, user names, or other high cardinality items are not good data sets to use as dimensions.  Sometimes there may be exceptions, but rarely would that be the case.  Generally when there is high cardinality, counts, or other measures and facts are the candidates for these data sets.

High Count, Low Cardinality Data Sets

These data sets usually make a decent dimension because of the ability to pare them down to a set numbers of unique values.  Keep in mind, that whatever a dimension is slicing into, should be visibly readable.  An example would be business departments, or business zones.  Each business department might have thousands or even millions of data points, but when a distinct value is derived from the set of data the cardinality is low enough that one ends up with a low number of actual unique items, making slicing much easier for the people viewing the actual reports.

Mix and Match of High Count and Cardinality Degrees

In both cases above, high count, low or high cardinality, often data sets can be stuck into either category.  Take the birthday of a user database for example.  There are 365 possible birthdays per year, not a good way to slice data.  But if you break it down to just the month or year, you end up with 12 months or x number of years.  This is a perfect example of something to use for a dimension.

On the same note one might have user data, but then have the user data of a particular department noted.  If the user count and user information could be derived and rolled up via the department the break out of user departments into a dimension makes sense.

There are dozens of other ways to look at data.  One of the interesting ideas I heard recently of a high count, high cardinality data set was e-mails.  The e-mails for a particular set where being tracked.  A user wanted to know where the domain of the e-mails where originating from, which individually wasn't something you'd want a Cube Processing to have to go through.  So instead of trying to derive the origination of each e-mail we had in the data set, we pared the data down to purely just the domain, removing the actual user name part of the e-mail.  From there we where able to pare down the e-mail domain originations into a clear and discernable dimension.

So when it comes to cardinality there can always be more than meets the eye.  Take a second look at high count data sets to make sure they're really high count, sometimes they can be pared down to reasonable amounts of data.

Digg It!DZone It!StumbleUponTechnoratiRedditDel.icio.usNewsVineFurlBlinkList

Posted by: Adron
Posted on: 1/23/2008 at 10:04 AM
Categories: Discussion Points or Ideas | Business Intelligence and Analytics
Actions: E-mail | Kick it! | DZone it! | del.icio.us
Post Information: Permalink | Comments (0) | Post RSSRSS comment feed

Unit Tests Don't Work on Cubes

I've been building web sites for a while now.  I've been writing code, object oriented style for the better part of a decade.  Unit testing I've been doing for all of about 1-2 years to various degrees.  Nothing super advanced yet, and slowly I'm getting some of the more advanced unit testing methods figured out.  Not so much advanced, but just newer methods for testing in different situations.

So now I've figured out web service unit tests, database unit tests, and all sorts of other ways to unit test things.  So my question is, how does one unit test a cube?

An SSAS (SQL Server Analysis Services) Cube can be built and then exported out into XMLA for running against another server to create the cube.  Also it is good for a kind of backup mechanism.  Cubes also can have MDX segments, which might be easier to test.  But amid all of these pieces, and all of the "code" written for cubes, what is a good way to test a cube?  How does one break the parts of a cube into units to be able to test in the first place?

Off hand I'm just going to trace over some ideas I stumbled into while riding the bus out to Scott Hanselman's talk on ASP.NET MVC.

The First Unit Testing Idea for Cubes

I could develop a custom framework using ADOMD.NET that would literally test points of the cubes, dimensions, and such things like that.  Adding the ability to just insert attributes onto classes that are utilizing the cube or some other such association.  Building something like that would work well if one was writing code directly to the cube.  However...

The Second Idea is an MDX Test Framework

This idea sounds much better, but I actually don't have the first idea on how I would prospectively implement it.  Anyone out there ever try to implement such a thing with MDX?

So that is my first two ideas...  and I've arrived at my destination, thus am done thinking of idea for this entry.

Does anyone else have any ideas or thoughts?

kick it on DotNetKicks.com

Digg It!DZone It!StumbleUponTechnoratiRedditDel.icio.usNewsVineFurlBlinkList

Posted by: Adron
Posted on: 1/8/2008 at 5:56 PM
Categories: Business Intelligence and Analytics
Actions: E-mail | Kick it! | DZone it! | del.icio.us
Post Information: Permalink | Comments (3) | Post RSSRSS comment feed

Index, Fragmentation, Statistics - A List

This entry is going to be a collected list of practices for index, fragmentation, statistics, and general VLDB table and view maintenance.

First off, a few obvious index practices.

  1. If a non-clustered index is not used, toss it.  When a table has a record inserted, the duplicate of that is inserted into the non-clustered index.  Each time that record is deleted the same must occur in the index and the same goes for updates.  Most horrendously, if a clustered index is created then the non-clustered index has to be rebuilt.  Major IO hog, if it isn't used, no point in having it.  For information on how to check if a non-clustered index is being used, check out this blog entry on the SQL Server Storage Engine.  At some point I might shrink that into a short entry and post it as a tip o' the day.  I've had to reference it a number of times and would rather just the quick bullet points.
  2. When checking VLDBs, and especially VVVLDBs, for consistency in sizes exceeding hundreds of gigabytes or even multiple terabytes there are a few options that one can follow:
    1. Use the WITH PHYSICAL_ONLY option for CHECKDB.  It will run similar to DBCC CHECKALLOC and read and audit every allocated page in the database.  It will skip logical errors, inter-page checks, and some other things like the DBCC CHECKCATALOG.
    2. Break up the checks by partition.  The DBCC CHECKFILEGROUP run on a read-only filegroup every week or two works great.  On a read-write filegroup DBCC CHECKFILEGROUP should be run every day or every few days.
    3. Break up the checks into smaller groups by doing DBCC CHECKCATALOG or DBCCCHECKTABLE.  Break them into different groups and then segment the job execution of these checks to different maintenance windows.
  3. Logical fragmentation only affects read-ahead performance, only a rebuild/defrag of indexes will help.
  4. Low page density affects UI throughput and memory usage, which could be a sign of page-splits.  Check this frequently.
  5. If defrag is done instead of rebuild, make sure to update stats.
  6. Be cautious when doing large index maintenance jobs if log shipping or DBM is used, index rebuilds are always full-logged when DBM is present.  (DBM = Database Mirroring)

A few items mentioned by Peter Sampson - Technical Director at Centerstance;  Blogs: Moving to Mac and SQL Janitor, over a recent pint o' Beer at Bailey's Taproom.

  1. Regardless of OLTP or OLAP almost any table in SQL Server should...
    1. always have a clustered index on the table.
    2. have a clustered index design that either very rarely or never has the clustered index updated.  SQL Server does a DELETE and an UPDATE if you update a column with a clustered index!
    3. use non-clustered indexes in conjunction with the clustered index since they are interleaved.
  2. DBCC DBREINDEX rebuilds the table and indexes depending on the parameters passed.  This is best done on a regular schedule, often by setting a job to execute the task.
  3. FILLFACTOR is a critical parameter for DBREINDEX, make sure it isn't just set to 100 or some other erroneous value that will cause problems.  FILLFACTOR sets the free space to leave during the rebuild of the Index B-tree structures and is a reverse representation.  100 == no free space, 90 = 10% free, and so on.  For read-only tables do a FILLFACTOR = 95 or perhaps 90.  For other stuff FILLFACTOR = 85 is a good starting point since it gives you some head room for INSERTs and UPDATEs after you run DBREINDEX.
  4. DBCC UPDATE STATS collects new statistics but does not remove fragmentation.  It takes less time than DBREINDEX and can be useful if you've done a lot of transactions and don't quite have time for DBREINDEX.
  5. Peter:  "Auto-update stats I no longer trust to provide accurate statistics over an extended period of time.  Got burned up with this, leave auto-update stats enabled but supplement with regular DBREINDEX or UPDATE STATS as needed."

Someone asked me during a discussion, and it is often forgotten what VLDB stands for.  To alleviate confusion, VLDB stands for very large database.  When it has extra Vs it means it is REALLY frikkin huge!

A last few links for reference material:

Best practice when optimizing indexes on SQL Server 2005

Best SQL Server indexing strategies

TOP 10 SQL Server Indexing Tips to Improve Performance

Hope that's useful.

Digg It!DZone It!StumbleUponTechnoratiRedditDel.icio.usNewsVineFurlBlinkList

Posted by: Adron
Posted on: 12/26/2007 at 11:54 AM
Categories: How-To, Samples, and Such | Business Intelligence and Analytics
Actions: E-mail | Kick it! | DZone it! | del.icio.us
Post Information: Permalink | Comments (0) | Post RSSRSS comment feed