In a recent webinar, we explored good data management practices, and the challenges associated with maintaining reliability, accessibility, and the overall quality of captured data once the data strategy is in place. In this article, Ian Pay, Head of Data Analytics and Tech at ICAEW, returns to address some of the unanswered questions from the webinar.
Data quality culture
How do you persuade leadership and all departments to take data quality seriously and recognise the investment that’s needed, rather than leaving it to IT to sort out?
Creating a culture of data quality in an organisation can take time and requires buy-in from all levels and all departments. The first step is to get different teams talking to each other, so they realise how much they depend on data from other parts of the organisation. Often teams will prioritise the quality of the data they need for their own activities, without being aware of the upstream or downstream impact of data they treat as a lower priority. Ultimately, the priority of particular categories of data should be based on a combination of organisational strategy and organisational risk.
Persuading senior managers and leadership of the importance of good data can be as simple as asking them if they have the information they need to be able to make effective decisions, and whether they trust the reliability and accuracy of that information. If the answer is 'no', as it invariably is, then you have implicit buy-in to work with them to solve that problem. Working closely with IT to help with the accessibility of data (not the management of the data itself) can also shine a light on the division of responsibility.
The expectation gap when it comes to data analysis is a very real challenge for many organisations, and not an easy one to resolve. Part of this can be due to the 'black box' approach of some analytics functions - a request goes in, and a few days later the analysis pops out! So, bring those raising the requests along on the journey, be transparent about the steps required, make it a conversation rather than a transactional service. Again, if senior management can be encouraged to recognise the value of good data quality, and can be made to realise where time (and therefore money) is being lost in trying to address poor data quality on an ad hoc basis, then they may be more willing to invest in the transformation that's required to achieve it.
What’s the best structure – a single, central function that sources data and distributes reports, or a decentralised ‘self-service’ approach?
There is no right or wrong answer on whether data solutions should be centralised or self-service, as it depends on the nature and the needs of the business.
Where you currently have a central team handling data requests, there is definite value in spending time analysing those requests - if there are common themes, then this can form the beginnings of a business case for a targeted solution. If lots of teams require similar data, then it makes sense to invest centrally in standardised reporting on that data rather than allow individuals to develop their own self-service reports which may lack consistency, creating a 'single version of truth' challenge. Conversely, empowering individuals to 'answer their own questions' can lead to significant organisational efficiencies, as well as a cultural shift in data literacy.
The timing of any centralised reporting is also key as some teams may need 'real-time' information while others need reports that are accurate as at specific points in time (e.g. for financial reporting) - as always, it comes back to understanding the business requirements and the problem that is being solved.
Ownership of data always comes into play at this point - it is rarely as simple as saying that data is owned by 'the business'. Ultimately, the best approach is usually for those with the best understanding of the data to be responsible for it - in other words, each functional area should own its own data - with a single oversight function to agree common ways of working and standards, to provide a steer on organisational priorities, and to co-ordinate resources, capabilities and training.
Data governance
It would be fair to say that data governance frameworks are a mixed bag. Unfortunately, many organisations don't have such frameworks in place, and many of those that do haven't updated or maintained them in several years - in the world of data that makes them positively archaic!
A good data governance framework should consider more than just security and compliance. It should define roles and responsibilities for data across the organisation; govern collection, usage, accessibility, retention and deletion; set common approaches to data structures, quality and integration; and specify how all of these are monitored.
Organisations that do this successfully are usually startups, and they tend to have one major advantage and one major cultural difference. The advantage is that they carry none of the baggage of legacy tools and systems; the cultural difference is one of agility and strategy. They have really thought about how to use their data effectively, to strategic advantage, focusing less on rules and regulations and more on building a culture of innovative (but responsible) use of data. And that tone often comes from the top.
How do we strike a balance between too much and too little data? And how do we know when data is no longer required?
Many of us have probably seen those reports that get sent around every week, every month, and thought "is anyone actually reading this?".
So, as always, the question starts from the business requirements, which are never static. Review should be baked into the processes for producing reports: each report should be revisited at least a few times a year, with stakeholders brought into that review process not just to say "yes" or "no" to the report as a whole, but to go through it metric by metric and consider what they actually need. This review can and should be tied into organisational strategy and operational planning. And if reports are hosted, for instance on the Power BI platform, it is possible to get quite detailed usage statistics which can feed into the conversation. Once the reporting requirements are defined, the question of data collection should fall into place.
We all have the family member who is a hoarder - so, in the world of data, we all know the person in the organisation who's been there for years and still has every single file they've ever needed. Except, if we're being totally honest, that's literally all of us when it comes to data. Striking the balance between over- and under-collection of data is essentially impossible, but the best way to think of it is to constantly strive for better. On every project, ask whether you got the data you needed first time, whether it was too much or too little, and take it from there.
Also, it is fair to say that most people don't like filing, so you sort of have to take that responsibility away from them and automate it as far as possible. When data comes into an organisation, it should be tagged, it should have a life expectancy, and we shouldn't be afraid to archive (not delete, or at least not immediately!) data that has reached the end of its shelf life. The advantage of archiving is that it gives a window in which you can essentially 'test' whether the data is still needed, it doesn't usually count towards cloud storage allowances, and generally, if no action is taken, it will be deleted after a set period. In the AI-driven world we are now in, where data is so critical to training models, I do understand the fear of deleting something in case it 'might' be useful down the line, but that need has to be balanced against the regulatory requirements and the cost of storage to the organisation. And maybe it isn't as black and white as deleting; maybe it's more about minimising, retaining an aggregated or scaled-down sample of the information rather than the full raw data.
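As a purely illustrative sketch of that kind of automation (the folder names and retention period below are hypothetical, not a recommendation for any particular system), a scheduled job could move files past their shelf life into an archive rather than deleting them:

```python
from datetime import datetime, timedelta
from pathlib import Path
import shutil

DATA_DIR = Path("data")          # hypothetical folder holding working data
ARCHIVE_DIR = Path("archive")    # hypothetical archive location
RETENTION = timedelta(days=365)  # illustrative 'shelf life' for a file

def archive_stale_files(data_dir: Path = DATA_DIR,
                        archive_dir: Path = ARCHIVE_DIR,
                        retention: timedelta = RETENTION) -> None:
    """Move files not modified within the retention period into the archive."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    cutoff = datetime.now() - retention
    for path in data_dir.glob("*"):
        if path.is_file():
            modified = datetime.fromtimestamp(path.stat().st_mtime)
            if modified < cutoff:
                # Archive rather than delete, so the data can still be recovered
                shutil.move(str(path), archive_dir / path.name)

if __name__ == "__main__":
    archive_stale_files()
```

In practice the 'tag' driving the retention decision would more likely come from metadata held alongside the data than from a file's modification date, but the principle of moving it to a cheaper, reversible holding area before deletion is the same.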
Data quality processes
Broadly, the best process for moving data from source to downstream systems is an automated one - an API (Application Programming Interface) or EDI (Electronic Data Interchange). Many web-based applications support APIs to send and receive data in a variety of formats, while larger ERP systems often support EDI standards for the exchange of common financial and non-financial data in a consistent format. Certainly, systems that support EDI can exchange data efficiently and reliably, as there are agreed formats and protocols, but a well-documented and stable API can be just as effective. With tools like Microsoft Power Automate, Databricks, Fivetran and many more, it is easier than ever to build pipelines that connect data to and from different systems.
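To make the API route concrete, here is a minimal sketch of one pipeline step in Python - the endpoint, token and output file are all hypothetical, and a real pipeline would add paging, secure credential storage and error handling:

```python
import csv
import requests

API_URL = "https://example.com/api/v1/invoices"  # hypothetical endpoint
API_TOKEN = "replace-with-real-token"            # hypothetical credential

def pull_invoices_to_csv(output_path: str = "invoices.csv") -> int:
    """Fetch records from a (hypothetical) REST API and write them to CSV."""
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()  # fail loudly if the transfer did not succeed
    records = response.json()    # assumes the API returns a JSON list of objects

    if not records:
        return 0

    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
    return len(records)

if __name__ == "__main__":
    print(f"Wrote {pull_invoices_to_csv()} records")
```

Low-code tools such as Power Automate or Fivetran wrap exactly this kind of request-and-load step behind a visual interface, which is often the more maintainable choice.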
Systems that are designed to handle EDI and/or APIs will typically have built-in checks and logging on all incoming and outgoing data, to ensure it adheres to the expected formats and meets basic quality standards. Incoming data that fails these checks will usually be quarantined in some way, to enable human review and issue resolution. AI is increasingly playing a role here, with anomaly detection capabilities appearing in a range of accounting software solutions.
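To illustrate the principle (rather than any particular product's implementation), a basic incoming-data check with quarantine might look something like this, with entirely hypothetical field names and rules:

```python
from typing import Any

REQUIRED_FIELDS = {"invoice_id", "date", "amount"}  # hypothetical expectations

def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of quality issues found in a single incoming record."""
    issues = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        issues.append(f"missing fields: {sorted(missing)}")
    amount = record.get("amount")
    if amount is not None and not isinstance(amount, (int, float)):
        issues.append("amount is not numeric")
    return issues

def split_incoming(records: list[dict[str, Any]]):
    """Separate records that pass the checks from those needing human review."""
    accepted, quarantined = [], []
    for record in records:
        issues = validate_record(record)
        if issues:
            quarantined.append({"record": record, "issues": issues})
        else:
            accepted.append(record)
    return accepted, quarantined
```

The important design point is that failing records are set aside with a reason attached, not silently dropped or silently loaded.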
When automated data transfer is not possible, best practice would be to implement similar manual processes - before a file is sent and when a file is received, it should be reviewed and checked against a set of data quality standards to verify its completeness, accuracy, validity and other key criteria relevant to the data's purpose. To support this, it is generally recommended that alongside the file itself, some sort of checksum or record count is provided to ensure, at a basic level, that the file has been sent and received successfully.
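As a simple sketch of that idea, the sender of a CSV file could compute a checksum and record count alongside the file, and the receiver could repeat the calculation to confirm nothing was lost or corrupted in transit (the file name below is hypothetical):

```python
import hashlib

def file_checksum_and_count(path: str) -> tuple[str, int]:
    """Return the SHA-256 checksum and number of data rows in a CSV file."""
    sha256 = hashlib.sha256()
    rows = 0
    with open(path, "rb") as f:
        for line in f:
            sha256.update(line)
            rows += 1
    # Assume the first line is a header, so data rows = total lines - 1
    return sha256.hexdigest(), max(rows - 1, 0)

if __name__ == "__main__":
    digest, count = file_checksum_and_count("transfer.csv")  # hypothetical file
    print(f"SHA-256: {digest}")
    print(f"Record count: {count}")
```

If the sender's and receiver's values match, the file arrived complete; if they differ, the transfer should be repeated before any downstream processing takes place.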
How do you document transformations that are made to correct errors in datasets?
Documenting transformations is really important, regardless of the purpose of the transformation. Fortunately, most tools designed to process data at scale have some sort of audit trail functionality, which means the tool itself makes up 90% of the documentation. The bit that will be missing from that is the 'why' - the reason for the transformation. Never assume this will be obvious to others! So, some accompanying narrative (in an appendix or on a 'supplementary information' page in a Power BI report) to outline any assumptions or manipulations can often be valuable.