Data Management & Warehousing

Monthly Archives: September 2012

Addressing Business Intelligence Data Quality

Posted on 24 September 2012 by David M. Walker in Editorial

A friend of mine posted a link to Michael W. Dobson’s TeleMapics blog post entitled Google Maps announces a 400 year advantage over Apple Maps, about Apple’s problems with the release of their own mapping solution in iOS 6. These problems are well documented elsewhere, but Michael’s post highlights issues that are not just associated with mapping data – they are common across many, if not all, business intelligence solutions. I have been lucky enough to work on projects that include telematics data and spatial analysis. One project in particular is riddled with data quality problems, but the machine-generated telematics data is the least of them – it is the data quality in the main operational system, and people’s understanding of it, that causes the issues.

Michael identifies five issues with Apple’s current offering from a data quality perspective, but each description maps just as readily onto any business intelligence solution.

  • Completeness – data that is absent, and some data that is included but appears to have erroneous attributes and relationships. These can be described as omissions in the data, but there will also be errors of commission, where the same data is represented more than once (usually due to duplication across multiple data sources). A small sketch of such checks appears after this list.
  • Logical Consistency – the degree of adherence to logical rules of data structure, attribution and relationships. There are a number of sins included here, but the ones that appear to be most vexing to Apple are compliance with the rules of the conceptual schema and the correctness of the topological characteristics of a data set. An example would be having a store’s name, street number and street name correct, but mapping it in the wrong place (town).
  • Positional Accuracy – is the data in the right place? This is most commonly expressed in the way that data is associated with the artificial hierarchies added to the data warehouse. It may appear to be a small error – only a little data is in the wrong place, and the overall total still adds up – but for the people who use the lower levels of the hierarchy for their reporting, trust in the data will be immediately and often irreversibly tainted by the perception of inaccuracy.
  • Temporal Accuracy – particularly in respect of temporal validity: is the data still meaningful at this point in time? Combining data from different temporal contexts without understanding the consequences can lead to a distortion of the reported results.
  • Thematic Accuracy – is it sensible to put certain sets of data together? The ability to join data in large data warehouses does not mean that different pieces of data should be joined. How do the users know what is valid or invalid when it comes to using the data?
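As a small illustration only (the checks below are my own sketch, not from Michael’s post, and the data frame and column names are invented), completeness and commission problems of the kind described in the first bullet can be surfaced with a few lines of R:

```r
# Hypothetical 'customers' data set -- names and values are illustrative only
customers <- data.frame(
  customer_id = c(1, 2, 2, 4, 5),
  name        = c("Alpha Ltd", "Beta Ltd", "Beta Ltd", NA, "Gamma Ltd"),
  town        = c("Bristol", "Bath", "Bath", "Wells", NA),
  stringsAsFactors = FALSE
)

# Errors of omission: count the missing values in each column
colSums(is.na(customers))

# Errors of commission: the same record represented more than once
customers[duplicated(customers[, c("customer_id", "name")]), ]
```

Checks like these are trivial in isolation; the point is to run them continuously against the operational sources rather than discovering the problems once the data reaches the warehouse.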

Recommendations

1. Step back and re-engineer your approach to data quality as a core business intelligence competency.
2. Understand that you have a problem and that you are unlikely to have the experience or know-how to fix it.
3. Employ experts – consultants in the short-term but permanent staff tasked with maintaining the quality in the long-term.
4. Ensure that you have the experience in management to control the outcome of their development efforts. You need to hire someone who knows data quality, management and how to build winning teams.
5. Become active in crowdsourcing. Find a way to harness local business knowledge and invite your users to supply local information, or at least lead you to the local knowledge that is relevant.
6. Don’t hire a high-powered consulting group to solve this problem. This would be the biggest mistake you could make, but it is one that big business seems to make over and over. As an alternative, I suggest that you look to people who actually know something about these applications.

When consulting we often urge people to put data quality at the heart of the process and to instigate Continuous Data Quality Processes. Our white paper How Data Works can provide insight into the types and causes of data quality issues. Other information can be found using the Data Quality tag on this site.

I hope that Michael W. Dobson will forgive my plagiarism of his concepts but he has neatly addressed some of the most common and pressing issues in business intelligence data quality.

Tags: Business Intelligence, Data Quality, Project Management

Building An Analytical Platform

Posted on 23 September 2012 by David M. Walker in Articles

Reprint of an article published by SAP about a recent piece of work

I have recently been asked to build an analytical platform for a project – but what is an analytical platform? The client (a retailer) described it as a database where they could store data, together with a front end where they could do statistical work, ranging from simple means and standard deviations through to more complex predictive analytics: for example, analysing a customer’s past performance to assess how likely it is that the customer will exhibit a particular future behaviour, or using models to classify customers into groups, and ultimately bringing the two together into an area known as decision models. The customer had also come up with an innovative way to resource the statistical skills needed: offering work placements to Masters students studying statistics at the local university and having them work with the customer insight team to describe and develop the advanced models. All they needed was a platform to work with.

From a systems architecture and development perspective we could describe the requirements in three relatively simple statements:

  1. Build a database with a (very) simple data model that could be easily loaded, was capable of supporting high performance queries and did not consume a massive amount of disk space. It would also ideally be capable of being placed in the cloud.
  2. Create a web-based interface that would allow users to securely log on, write statistical programs that could use the database as a source of data, and output reports and graphics as well as populating other tables (e.g. target lists) as a result of statistical models.
  3. Provide a way in which to automate the running of the statistical data models, once developed, such that they can be run without engaging the statistical development resources.

Of course time was of the essence and costs had to be as low as possible too – but we’ve come to expect that with every project!

Step 1: The database

Our chosen solution for the database was Sybase IQ, a technology our client was already familiar with. Sybase IQ is a ‘column store’ database. This means that instead of storing all the data in its rows, as many other databases do, the data is organised on disk by column. For example, in a row store a column containing a country field will have the text of each country, for example ‘United Kingdom’, stored many times. In a column store database the text is only stored once and given a unique ID. This is repeated for each column, and therefore the ‘row’ of data consists of a list of IDs linked to the data held for each column.

This approach is very efficient for reporting and analytical databases. Firstly, replacing text strings with an identifier uses significantly less space. In our example ‘United Kingdom’ would occupy 14 bytes, whilst the ID might only occupy 1 byte – reducing the storage for that one value in that one column by a ratio of 14:1 – and this compression effect is repeated across all the data. Furthermore, because there is less data on the disk, the time taken to read the data from disk and process it for queries is significantly reduced, which massively speeds up the queries too. Finally, each column is already indexed, which again helps the overall query speed.

Another incidental but equally useful consequence of using a column store database such as Sybase IQ is that there is no advantage in creating a star schema as a data model – holding all the data in one large ‘wide’ table is at least as efficient. This is because storing each column against a key means that the underlying storage is, in effect, already a star schema. Creating a star schema in a column store database rather than a large single table would simply incur unnecessary additional join and processing overhead.

As a result of choosing Sybase IQ as a column store database we are able to have a data model that consists of a number of simple single table data sets (one table for each different type of data to be analysed) that is both quick and easy to load and to query.

It should be noted that this type of database solution is less efficient for OLTP (Online Transaction Processing) type applications because of the cost of doing small inserts and updates. However this is not relevant for this particular use case.

The solution can only be deployed on a Linux platform. We use Linux for three reasons: firstly, RStudio Server Edition is not yet available for Windows; secondly, pre-compiled packages are available for all elements of the solution on Linux, which reduces the install effort; and finally, hosted Linux environments are normally cheaper than Windows environments due to the cost of the operating system licence. We chose CentOS because it is a free RedHat derivative.

One additional advantage of this solution for some organisations is the ability to deploy it in the cloud. Since the solution only requires files to be remotely delivered and all querying is done via a web interface it is possible to use any co-location or cloud based hosting provider. Co-Location or cloud deployment offers a low start-up cost, reduced systems management overhead and easy access for both data delivery and data access. The system simply requires SSH access for management, FTP, SFTP or SCP for file delivery and the RStudio Web Service port open. RStudio Server uses the server login accounts for security but can also be tied to existing LDAP infrastructure.

Step 2: Statistical Tools and Web Interface

There are a number of statistical tools on the market, however most are very expensive – prohibitively so in this case – and the associated skills are hard to come by and expensive too. However, since 1993 an open source programming language for statistical computing and graphics called R has been under development. It is now widely used among statisticians for developing statistical software and for data analysis, is used by many universities, and is predicted to be the most widely used statistical package by 2015. The R Project provides a command line and graphical interface as well as a large open source library of useful routines, and is available as packaged software for most platforms, including Linux.

In addition there is a second open source project called RStudio, which provides a single integrated development environment for R and can be deployed on a local machine or as a web-based service using the server’s security model. In this case we have implemented the server edition in order to make the entire environment web based.

Therefore in two simple steps (download and install R, followed by download and install RStudio) we can install a full web-based statistical environment. Note that some configuration and pre-requisite packages may be required depending on your environment; however these are well documented on the source websites and are in general downloaded automatically if using tools such as ‘yum’.

The next step was to get access to the data held in our Sybase IQ database. This also proved to be very easy. There is a Sybase white paper that describes the process, which can be summarised as follows (a rough sketch in R is shown after the list):

  • Install the R JDBC package
  • Setup a JDBC connection
  • Establish your connection
  • Query the table
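
The white paper has the exact details; as a minimal sketch only (the driver class, jar path, server, credentials and table name below are assumed placeholders that depend on your jConnect driver version, not values taken from the paper, and RJDBC itself needs Java plus the rJava package), those four steps in R look roughly like this:

```r
install.packages("RJDBC")   # one-off: install the R JDBC package
library(RJDBC)

# Set up the JDBC driver (driver class and jar path are placeholders)
drv <- JDBC(driverClass = "com.sybase.jdbc4.jdbc.SybDriver",
            classPath   = "/opt/sybase/jConnect/classes/jconn4.jar")

# Establish the connection (server, port, database and credentials are placeholders)
conn <- dbConnect(drv,
                  "jdbc:sybase:Tds:iqserver.example.com:2638/analytics",
                  "analyst", "secret")

# Query the (single, wide) table into an R data frame
sales <- dbGetQuery(conn, "SELECT * FROM sales_analysis")
summary(sales)

dbDisconnect(conn)
```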

We now have an R object that contains data sourced from Sybase IQ that we can work with! And what is amazing is that I have built the platform from scratch in less than half a day’s work.

At this point data has to be loaded and the statisticians can get to work – obviously this is more time consuming than the build, and over the following days and weeks the analysts created their models and produced the results.

For this exercise we used our in-house ETL tool to create a repeatable data extraction and load process, however it would have been possible to use any of a wide range of tools that are available for this process.

Step 3: Automatically running the statistical models

Eventually a number of models for analysing the data had been created and we were ready to move into a production environment. We automated the load of the data into the agreed single table structure and wanted to also run the data models.

Sybase IQ has the ability to create User Defined Functions (UDFs). These C++ programs talk to a process known as Rserve, which in turn executes the R program and returns the results to Sybase IQ. This allows R functions to be embedded directly into Sybase IQ SQL commands. Whilst it requires a little more programming experience to set up, it does mean that all processing can be done within Sybase IQ.

Alternatively, it is possible to run R from the command line and call a program that in turn uses the RJDBC connection to read and write data to the database.

Having a choice of methods is very helpful, as it means the models can be integrated with the ETL environment in the most appropriate way. If the ETL tool only supports SQL then the UDF route is the most attractive; however, if the ETL tool supports host callouts (as ours does) then running R programs from a command-line callout is quicker than developing the UDF.
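
As a rough illustration of the command-line callout route (a sketch only – the connection details, the saved model object and the table names below are assumed placeholders, not the project’s code), the ETL tool might invoke a script along these lines with Rscript:

```r
#!/usr/bin/env Rscript
# Hypothetical scoring job, run e.g. as 'Rscript score_customers.R' from the ETL callout
library(RJDBC)

drv  <- JDBC("com.sybase.jdbc4.jdbc.SybDriver",
             "/opt/sybase/jConnect/classes/jconn4.jar")
conn <- dbConnect(drv,
                  "jdbc:sybase:Tds:iqserver.example.com:2638/analytics",
                  "analyst", "secret")

# Read the new data set and apply a model previously saved by the statisticians
# (assumed here to be a glm, so type = "response" returns probabilities)
customers       <- dbGetQuery(conn, "SELECT * FROM customer_analysis")
model           <- readRDS("churn_model.rds")
customers$score <- predict(model, newdata = customers, type = "response")

# Write a target list back to the database for downstream use
target_list <- customers[customers$score > 0.8, c("customer_id", "score")]
dbWriteTable(conn, "target_list", target_list, overwrite = TRUE)

dbDisconnect(conn)
```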

Figure: Building An Analytical Platform – Architecture

Conclusions

Business Intelligence requirements are changing and business users are moving more and more from historical reporting into predictive analytics in an attempt to get both a better and deeper understanding of their data.

Traditionally, building an analytical platform has required an expensive infrastructure and a considerable amount of time to set up and deploy.

By combining the high performance and low footprint of Sybase IQ with the open source R and RStudio statistical packages, it is possible to quickly and easily deploy an analytical platform in the cloud, using skills that are readily available.

This infrastructure can be used both for rapid prototyping of analytical models and for running completed models on new data sets to deliver greater insight into the data.

Tags: Analytics, Business Intelligence, R, RStudio, Sybase, Technical Architecture

Inspirational Data Visualization

Posted on 14 September 2012 by David M. Walker in Editorial

Data visualization has come of age – there is now no excuse for having good information and not presenting it in an engaging and informative way. Here we look at a number of sites that can inspire and inform the way you present your information.

Nathan Yau has a blog called Flowing Data that gathers together some of the best data visualizations on the web. He has also produced a book called Visualize This.

David McCandless is another blogger, at Information Is Beautiful, who creates his own infographics and reports on others. He has also published a book: in the US it is called “The Visual Miscellaneum: A Colorful Guide to the World’s Most Consequential Trivia”, whilst in the UK it is called “Information Is Beautiful”.

Stephen Few is a consultant and educator. His Perceptual Edge website contains a wealth of information, including white papers and a blog, and he is the author of three books: “Show Me the Numbers: Designing Tables and Graphs to Enlighten”, “Information Dashboard Design: The Effective Visual Communication of Data” and “Now You See It: Simple Visualization Techniques for Quantitative Analysis”.

Alberto Cairo teaches Information Graphics and Visualization at the School of Communication at the University of Miami. His blog The Functional Art, and his new book of the same name, are described by Nathan Yau as “a healthy mix of theory and how it applies in practice, and much of it comes from Cairo’s own experiences designing graphics for major news publications.”

Hans Rosling is an inspirational speaker on both his own topic (global health) and the subject of data visualisation. He has given a number of talks for TED: Ideas Worth Spreading, including this talk about health and poverty, and has developed a suite of tools at GapMinder to help people understand good visualization of statistics.

Hans Rosling bridges the gap between data visualization and information delivery. We conclude with two more examples of information delivery where data visualization is embedded in a wider context to deliver a message.

And once we have good data visualisation we can move on to great information delivery by combining inspirational data visualisation with the narrative provided to the user.

To this end, the London-based RSA has one of the most successful YouTube channels, RSAnimate. These are animations of current topics that are certainly beyond my skill level to produce, but they show just how engaging information can be.

PHD Comics has also used the RSAnimate style to explain the Higgs boson particle.

The Higgs Boson Explained from PHD Comics on Vimeo.

Oh and one last thought – choose the right font for your work – Typography for Lawyers has lots of good advice on this subject.

Tags: Data Visualization, Graphics, Reporting
