Occasionally Asked Questions
What is raw data and what does data cleaning mean?
Raw data, such as that straight from a sensor or from human input, is usually dirty. It can take many forms, which often makes it a challenge to clean and process. Some common examples of dirty or unclean data are:
- Duplicate Records: This happens when the same data is entered more than once in the dataset. It can lead to inaccurate results and skew data analysis.
- Incorrect or Inaccurate Data: This could be due to typographical errors, inaccurate entry, outdated information, or even fraud. An example could be a wrong address, incorrect sales figure, or an outdated email address.
- Missing Data: In some cases, information might be missing from certain fields. For example, a dataset with customer information might have missing values for fields like 'Phone Number' or 'Email Address'.
- Inconsistent Data: Data inconsistency can occur when different formats or standards are used in the dataset. For example, dates might be entered as 'dd/mm/yyyy' in some places and 'mm-dd-yy' in others.
- Outliers: Outliers are values that differ significantly from other observations in the dataset. These can be due to entry errors or may represent valid but extreme observations.
- Irrelevant Data: Sometimes, data collected might not be relevant for the analysis at hand. Such data should be identified and removed.
- Misformatted Data: This includes data that is not in the required or standard format. An example might be a phone number that is entered without the area code.
- Mixed Types of Data: This is common in fields that are supposed to contain one type of data (e.g., numbers), but contain another (e.g., text). For example, a field like 'Age' containing the entry 'twenty' instead of '20'.
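As a rough illustration of how a few of these problems can be spotted, here is a minimal Python sketch using only the standard library. It is not how Lintol works internally; the sample records, field names and helper functions are all hypothetical, and the date check simply assumes the two formats mentioned above.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "email": "ann@example.com", "age": "34", "joined": "03/02/2021"},
    {"id": 2, "email": "", "age": "twenty", "joined": "02-03-21"},
    {"id": 1, "email": "ann@example.com", "age": "34", "joined": "03/02/2021"},
]

def find_duplicates(rows):
    """Return each record that appears more than once (exact match on every field)."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(key) for key, n in counts.items() if n > 1]

def find_missing(rows, field):
    """Return the positions of rows where `field` is absent or blank."""
    missing = []
    for i, row in enumerate(rows):
        value = row.get(field)
        if value is None or not str(value).strip():
            missing.append(i)
    return missing

def find_non_numeric(rows, field):
    """Return the positions of rows where `field` is not a whole number."""
    return [i for i, row in enumerate(rows)
            if not str(row.get(field, "")).strip().isdigit()]

def date_format(value):
    """Label a date string with the first assumed format it parses under."""
    for label, pattern in (("dd/mm/yyyy", "%d/%m/%Y"), ("mm-dd-yy", "%m-%d-%y")):
        try:
            datetime.strptime(value, pattern)
            return label
        except ValueError:
            continue
    return "unknown"

print(find_duplicates(records))          # the record with id 1, listed once
print(find_missing(records, "email"))    # [1]
print(find_non_numeric(records, "age"))  # [1]
print({date_format(r["joined"]) for r in records})
```

Note that a value such as '03/02/2021' cannot by itself reveal whether the day or the month comes first, so a check like this can only flag mixed formats; deciding which reading is correct still needs knowledge of the data source.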
So then what is data validation and processing?
Data validation is the process of checking and ensuring that the data collected or being used is clean, accurate, and reliable. It is a crucial step in data processing and can significantly impact the results of any data analysis or data-driven decisions. Incorrect or dirty data can lead to erroneous conclusions and decisions, making validation a critical process. Data validation can include a range of techniques, such as:
- Range Checks: Ensuring the data falls within an expected range of values. For example, age cannot be negative or more than the oldest known human age.
- Consistency Checks: Checking that data is consistent within the dataset. This could involve checking that codes, abbreviations, and classifications are used consistently.
- Uniqueness Checks: Verifying that all data entries in a particular column are unique where applicable. For example, in a database with user IDs, each ID should be unique.
- Existence Checks: Verifying that essential data is present and not left blank.
- Format Checks: Checking that the data is in the correct format. For example, date fields are in the date format, and phone numbers have the right number of digits.
- Cross-Checks or Referential Checks: Checking the data against existing data or sets of rules to ensure it is valid.
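To make the checks above more concrete, the following Python sketch applies an existence, uniqueness, range, format and referential check to a handful of records. It is an illustration under stated assumptions rather than Lintol's implementation: the field names, the age limit of 125, the 11-digit phone rule and the set of country codes are all invented for the example. A consistency check is left out because it depends on the coding scheme of the particular dataset.

```python
import re

# All of these rules are assumptions made for the example.
REQUIRED_FIELDS = ("user_id", "age", "phone", "country")
MAX_AGE = 125                            # assumed upper bound for a human age
PHONE_PATTERN = re.compile(r"\d{11}")    # assumes phone numbers have 11 digits
KNOWN_COUNTRIES = {"GB", "IE", "FR"}     # hypothetical reference list

def validate(rows):
    """Return a list of (row_index, message) pairs, one per problem found."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Existence check: essential fields must be present and not blank.
        for field in REQUIRED_FIELDS:
            if row.get(field) in (None, ""):
                problems.append((i, f"{field} is missing"))

        # Uniqueness check: each user ID may appear only once.
        user_id = row.get("user_id")
        if user_id not in (None, ""):
            if user_id in seen_ids:
                problems.append((i, "user_id is duplicated"))
            seen_ids.add(user_id)

        # Range check: age must fall within the expected bounds.
        age = row.get("age")
        if isinstance(age, int) and not 0 <= age <= MAX_AGE:
            problems.append((i, "age is out of range"))

        # Format check: phone numbers must match the assumed pattern.
        phone = row.get("phone")
        if phone and not PHONE_PATTERN.fullmatch(phone):
            problems.append((i, "phone has the wrong format"))

        # Referential check: country must appear in the reference list.
        country = row.get("country")
        if country and country not in KNOWN_COUNTRIES:
            problems.append((i, "country is not recognised"))
    return problems

# Hypothetical records: the first is clean, the others are not.
rows = [
    {"user_id": "u1", "age": 34, "phone": "02890123456", "country": "GB"},
    {"user_id": "u1", "age": -5, "phone": "12345", "country": "ZZ"},
    {"user_id": "u3", "age": 40, "phone": "", "country": "IE"},
]

for index, message in validate(rows):
    print(f"row {index}: {message}")
```

Returning every problem instead of stopping at the first one means a whole batch can be reviewed in a single pass, which suits the batch uploads described elsewhere on this page.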
Data processing refers to the conversion of raw data into a meaningful format that can be used to perform various functions or support decision-making within an organisation. It involves a series of operations used to collect, transform, and interpret data. Once the data has been cleaned and validated, processing can take place. Depending on the specific requirements, this could involve sorting data, calculations, translations, summaries, analyses, and other operations.
Lintol offers all users the capability to perform cleaning, validating and processing actions in a user-friendly, accurate, consistent and reliable manner.
Does Lintol work with real-time streaming?
Lintol primarily favours batch uploads: if you already have a dataset and want to clean, validate or process it, Lintol is ready for you. However, we have recently been working on real-time streaming and can currently perform actions on high-latency data that arrives in batches; for example, the government COVID data updates were used as a testbed for this feature. Soon we will move on to low-latency data, where high-frequency data, such as machine data in the manufacturing industry, can be processed as a live stream.
How much does Lintol cost?
Every project is different and requires a data strategy that depends on the client's requests, so contact us for a quote. We are working on releasing Lintol as a subscription-based platform with processors for sale on a marketplace, so keep an eye out!
Who were your previous clients?
We have worked with public sector bodies such as the Department for Infrastructure, Department of Finance, Belfast City Council and the Open Data Institute.
Do you offer any other services?
Yes! We offer consulting services, particularly for companies wishing to learn more about, and embark upon, their digital transformation journey. This can be in any sector, but we have the most experience in manufacturing. Our CEO does most of this consulting, so check out her LinkedIn profile for more information on her experience.