Data Types

In the broadest sense, data is any input that can be structured to provide information and drive action. However, “any input” is too broad for practical work. To navigate data science effectively, we categorize data by its format and by how it can be analyzed.

1. Unstructured Data

Unstructured data is increasingly vital in modern science. While we often think of it as simple “string” or “character” fields—such as a “Name” column or an “Other, please describe” box—it encompasses much more.

Modern machine learning also allows us to extract value from far more complex unstructured sources.

While it may seem like magic, it follows a systematic method of converting these “raw” formats into structured data that models can interpret.
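As a minimal sketch of what “converting raw formats into structured data” can look like, the example below turns free-text answers into a table of word counts (often called a bag of words). This is only one simple technique among many, chosen here for illustration. The sample responses and the function names `tokenize` and `bag_of_words` are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, rows): one row of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical free-text answers from an "Other, please describe" box.
responses = ["Pain in left knee", "Knee pain after running"]
vocabulary, rows = bag_of_words(responses)
```

Each word in `vocabulary` becomes a column and each response becomes a row, which is the rectangular shape that models generally expect.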

2. Structured Data

Structured data is a more formal entity, typically stored in a “rectangular” format with variables in columns and cases (or observations) in rows. This data is generally split into two types:

Categorical

Numeric (Quantitative)
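To make the rectangular layout concrete, here is a small sketch in which each record is a case (row) and each key is a variable (column). The records and the helper `column_kind` are hypothetical, and the rule it applies (all values are numbers, or not) is a deliberately naive assumption.

```python
# Hypothetical records: each dict is one case (row), each key one variable (column).
rows = [
    {"patient_id": 1, "sex": "F", "severity": "Mild", "weight_kg": 61.5},
    {"patient_id": 2, "sex": "M", "severity": "Severe", "weight_kg": 82.0},
    {"patient_id": 3, "sex": "F", "severity": "Moderate", "weight_kg": 58.3},
]

def column_kind(values):
    """Label a column 'numeric' if every value is an int or float, else 'categorical'."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numeric" if is_number else "categorical"

# Collect each column's values across all rows and classify the column.
kinds = {name: column_kind([row[name] for row in rows]) for name in rows[0]}
```

Note that `patient_id` is labelled numeric because it is stored as a number, even though it is really an identifier. How a value is stored does not always match what it means, so a check like this is a starting point rather than a final answer.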

Software-Specific Nomenclature

While the principles remain constant, the “names” change depending on the tool you use.

| Concept | Idea | R Term | Other Terms |
|---|---|---|---|
| Free Text | Any text that is typed out in an unstructured way | Character | String |
| Categorical | Data that can only take a specific set of values (structured) | Character | String |
| Ordinal Categorical | Data that can take a predefined order, e.g. Mild, Moderate, Severe | Factor | |
| Continuous Numeric | Data that can take decimal places | Numeric | Double, Float |
| Discrete Numeric | Data that cannot take decimal places | Integer | Int |
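The same concepts can be sketched in Python, which has no built-in ordinal type, so the order is represented here by a plain list. This is an illustrative assumption, not a standard API. `SEVERITY_LEVELS`, `severity_rank`, and the sample values are hypothetical names and data.

```python
# Order assumed from the Mild / Moderate / Severe example in the table above.
SEVERITY_LEVELS = ["Mild", "Moderate", "Severe"]

def severity_rank(label):
    """Return the position of a label in the predefined order (0 = lowest).

    Raises ValueError for a label outside the allowed set.
    """
    return SEVERITY_LEVELS.index(label)

observed = ["Severe", "Mild", "Moderate", "Mild"]   # hypothetical observations
ranks = [severity_rank(label) for label in observed]
worst = max(observed, key=severity_rank)

weight_kg = 61.5    # continuous numeric: stored as a float
visit_count = 3     # discrete numeric: stored as an int
```

An ordinal variable is a categorical variable plus an agreed order. The list supplies both: it restricts the allowed values and lets us compare them by rank.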

In SQL database systems, terminology becomes even more intricate, because types are declared with explicit sizes. Text and floating-point columns take a length or size argument, such as varchar(50) or float(50), while fixed-point types such as decimal(10, 2) are defined by precision (total digits) and scale (digits after the decimal point).
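Precision and scale are easier to see with a worked check. The sketch below uses Python's `decimal` module to test whether a value would fit a fixed-point column of a given precision and scale. The function `fits_decimal` is hypothetical and deliberately strict. Real database systems differ in detail, and some round extra fractional digits instead of rejecting the value.

```python
from decimal import Decimal

def fits_decimal(value, precision, scale):
    """Check whether a value fits a fixed-point column with this precision and scale.

    Pass the value as a string so that no binary floating-point error creeps in.
    """
    sign, digits, exponent = Decimal(value).as_tuple()
    if not isinstance(exponent, int):   # NaN or Infinity
        return False
    fractional_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    # The scale caps the fractional digits; the remaining digits are for the integer part.
    return fractional_digits <= scale and integer_digits <= precision - scale
```

With precision 7 and scale 2, `fits_decimal("12345.67", 7, 2)` passes (five integer digits, two fractional digits), while `"123456.7"` has too many integer digits and `"1.234"` has too many fractional digits.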

The bottom line is to understand how your software processes data and then select the correct data type. Doing so helps when cleaning, visualizing, or modelling your data, and it saves a lot of frustration later in the process.
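A small sketch of the kind of frustration the wrong type can cause: a numeric column that arrives as text sorts character by character, not by value. The column `ages_as_text` is hypothetical sample data.

```python
# Hypothetical column read from a file where every value arrived as text.
ages_as_text = ["9", "10", "2", "35"]

sorted_as_text = sorted(ages_as_text)     # compares character by character
ages = [int(a) for a in ages_as_text]     # convert to a discrete numeric type
sorted_as_numbers = sorted(ages)
mean_age = sum(ages) / len(ages)          # only meaningful once the values are numbers
```

Here `sorted_as_text` places "10" before "2", whereas `sorted_as_numbers` gives the order we would expect, and the mean can only be computed after the conversion.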