Data Types
In the broadest sense, data is any input that can be structured to provide information and drive action. That definition is too broad for practical work, so in data science we categorize data by its format and by how it can be analyzed.
1. Unstructured Data
Unstructured data plays a growing role in modern science. We often think of it as simple “string” or “character” fields, such as a “Name” column or an “Other, please describe” box, but it encompasses much more.
Modern machine learning allows us to extract value from complex unstructured sources, including:
Visuals: X-rays and fundoscopy images.
Audio: Cough recordings or voice patterns.
Sensors: Real-time data from wearables.
This may seem like magic, but it follows a systematic method: the “raw” formats are converted into structured data that models can interpret.
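As a minimal illustration of that conversion, the sketch below turns a few free-text answers into a rectangular table of keyword indicators. The responses and the keyword list are invented for illustration, and real pipelines for images, audio, or sensor data rely on far richer feature extraction. The only point here is that raw input becomes rows and columns.

```python
import re

# Hypothetical free-text answers from an "Other, please describe" box.
responses = [
    "Dry cough and mild fever",
    "No symptoms",
    "Fever, headache and cough",
]

# Hypothetical keywords that we want to turn into structured columns.
keywords = ["cough", "fever", "headache"]


def to_structured(text, keywords):
    """Return one row: a 0/1 indicator for each keyword found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {kw: int(kw in words) for kw in keywords}


# One row per response, one column per keyword: a rectangular table.
rows = [to_structured(text, keywords) for text in responses]

for row in rows:
    print(row)
```

Each dictionary is one observation, and each key is one variable, which is the layout that structured data uses.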
2. Structured Data
Structured data is a more formal entity, typically stored in a “rectangular” format with variables in columns and cases (or observations) in rows. This data is generally split into two types:
Categorical
Nominal: Data that falls into categories with no inherent rank (e.g., Hair Color or “Diseased/Non-Diseased”). These are typically summarized as counts and percentages.
Ordinal: A special type of categorical data where there is a logical order (e.g., Mild, Moderate, Severe). When visualizing this data, maintaining that specific sequence is crucial.
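The sketch below shows both ideas with Python's standard library: counts and percentages as the summary for categorical data, and an explicit level order for ordinal data. The ratings are made up, and a fourth level, "Critical", is added purely for illustration because it makes the difference between alphabetical and logical order visible.

```python
from collections import Counter

# Hypothetical severity ratings for eight patients.
severity = ["Mild", "Severe", "Moderate", "Mild",
            "Critical", "Moderate", "Severe", "Mild"]

# Counts and percentages: the usual summary for categorical data.
counts = Counter(severity)
total = len(severity)
percentages = {level: 100 * n / total for level, n in counts.items()}

# Alphabetical sorting puts "Critical" first and loses the logical order.
alphabetical = sorted(counts)

# An explicit level order keeps Mild < Moderate < Severe < Critical.
level_order = ["Mild", "Moderate", "Severe", "Critical"]
ordered = sorted(counts, key=level_order.index)

for level in ordered:
    print(f"{level}: {counts[level]} ({percentages[level]:.1f}%)")
```

Plotting or tabulating in the `ordered` sequence rather than the `alphabetical` one is what keeps an ordinal variable readable.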
Numeric (Quantitative)
Continuous: Variables that can take on decimal values (e.g., weight or blood glucose). These are typically summarized by a mean or median.
Discrete: Variables that can take only whole-number values (e.g., Number of Children). Since you cannot have 1.5 children, it only makes logical sense to store these as integers.
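A short sketch of the distinction, using Python's `statistics` module and invented measurements. It also shows a detail that is easy to miss: each discrete value is a whole number, but a summary of those values, such as the mean, can still be fractional.

```python
import statistics

# Hypothetical continuous measurements: blood glucose in mmol/L.
glucose = [5.1, 6.3, 4.8, 7.9, 5.6]

# Hypothetical discrete counts: number of children per household.
children = [0, 2, 1, 3, 2, 0]

# Mean and median are the usual summaries for continuous data.
glucose_mean = statistics.mean(glucose)
glucose_median = statistics.median(glucose)

# Every individual discrete value is stored as a whole number.
all_whole = all(isinstance(n, int) for n in children)

# The mean of discrete values may still be fractional.
children_mean = statistics.mean(children)

print(f"Glucose mean: {glucose_mean:.2f}, median: {glucose_median}")
print(f"Children mean: {children_mean:.2f}, all whole numbers: {all_whole}")
```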
Software-Specific Nomenclature
While the principles remain constant, the “names” change depending on the tool you use.
| Concept | Description | R Term | Other Terms |
| --- | --- | --- | --- |
| Free Text | Any text that is typed out in an unstructured way | Character | String |
| Categorical | Data that can only take a specific set of values (structured) | Factor (or Character) | String |
| Ordinal | Categorical data with a predefined order, for example Mild, Moderate, Severe | Ordered factor | |
| Continuous | Numeric data that can take decimal places | Numeric | Double, Float |
| Discrete | Numeric data that cannot take decimal places | Integer | Int |
In SQL database systems, the terminology becomes more granular. Text types carry a maximum length, as in varchar(50), and exact numeric types are defined by precision (total digits) and scale (digits after the decimal point), as in decimal(10, 2). Approximate numeric types such as float(50) take a precision as well.
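To make precision and scale concrete, here is a rough sketch using Python's `decimal` module. The helper names `precision_and_scale` and `fits` are hypothetical, the code assumes finite values, and actual databases differ in how they round or reject values that do not fit, so treat this as an illustration of the two terms rather than a model of any particular database.

```python
from decimal import Decimal


def precision_and_scale(value):
    """Return (total digits, digits after the decimal point) for a finite Decimal."""
    _sign, digits, exponent = value.as_tuple()
    if exponent >= 0:
        # Whole number, possibly written like 1E+2: no decimal places.
        return len(digits) + exponent, 0
    scale = -exponent
    # Values below 1 (e.g. 0.05) store fewer digits than decimal places.
    return max(len(digits), scale), scale


def fits(value, precision, scale):
    """Check whether value has few enough digits for a (precision, scale) column."""
    p, s = precision_and_scale(value)
    integer_digits = p - s
    return s <= scale and integer_digits <= precision - scale


print(precision_and_scale(Decimal("123.45")))
print(fits(Decimal("123.45"), 5, 2))
print(fits(Decimal("1234.5"), 5, 2))
```

Under these assumptions, 123.45 has precision 5 and scale 2, so it fits a column with those settings, while 1234.5 needs four digits before the decimal point and does not.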
The bottom line: understand how your software processes data, then select the correct data type. This helps while cleaning, visualizing, or modelling your data and saves a lot of frustration later in the process.
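One common source of that frustration is numeric data that arrives as text. The sketch below uses invented ages to show how the same values behave differently depending on the type they are stored as.

```python
# Hypothetical ages read from a file, where every value arrives as text.
ages_as_text = ["9", "10", "25", "100"]

# Treated as text, the values sort character by character.
sorted_as_text = sorted(ages_as_text)

# Converted to integers, they sort by magnitude and can be averaged.
ages = [int(a) for a in ages_as_text]
sorted_as_numbers = sorted(ages)
mean_age = sum(ages) / len(ages)

print(sorted_as_text)     # ['10', '100', '25', '9']
print(sorted_as_numbers)  # [9, 10, 25, 100]
print(mean_age)           # 36.0
```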