Semi-Structured Data Processing

What is semi-structured data?

Semi-structured data does not follow the rigid schema of a traditional relational database, but it still carries structural markers, such as keys, tags, and nesting, that give it some level of organization. Unlike completely unstructured data, it contains patterns that make it possible to parse and analyze it without enforcing a strict schema upfront.

The most common semi-structured formats are JSON (JavaScript Object Notation) and XML (eXtensible Markup Language). Both are widely used to represent information as a flexible hierarchy: JSON documents nest lists and objects whose attributes can vary from one record to the next, and XML files nest tags inside other tags.

Examples of semi-structured data

When working with APIs, it is very common to receive nested data. In other words, information is organized as a hierarchy, like boxes inside other boxes. A practical JSON example is shown below:

{
  "user": {
    "id": "123456",
    "name": "Alice",
    "email": "alice@example.com",
    "birthday": "1990-05-15",
    "purchases": [
      {
        "productId": "789",
        "productName": "Interesting Book",
        "price": 29.99,
        "purchaseDate": "2023-02-23"
      },
      {
        "productId": "456",
        "productName": "Fun Mug",
        "price": 12.99,
        "purchaseDate": "2023-01-10"
      }
    ]
  }
}
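
To make the hierarchy concrete, here is a minimal sketch that loads a trimmed copy of the document above with Python's standard json module and walks down the nesting. The variable names (raw, doc, first_product, total_spent) are chosen for illustration only.

```python
import json

# Trimmed copy of the example document above (illustrative variable name).
raw = """
{
  "user": {
    "id": "123456",
    "name": "Alice",
    "purchases": [
      {"productId": "789", "productName": "Interesting Book", "price": 29.99},
      {"productId": "456", "productName": "Fun Mug", "price": 12.99}
    ]
  }
}
"""

doc = json.loads(raw)          # JSON objects become dicts, arrays become lists
user = doc["user"]             # first level of nesting
purchases = user["purchases"]  # a list nested inside the "user" object

# Each step down the hierarchy is one more key or index lookup.
first_product = purchases[0]["productName"]
total_spent = round(sum(p["price"] for p in purchases), 2)

print(user["name"], first_product, total_spent)
```

Every level of nesting adds one more lookup, which is why deeply nested documents quickly become awkward to query directly.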

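An XML example is shown below. The tag names are in Portuguese: empresa (company), nome (name), departamentos (departments), and funcionarios (employees).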

<empresa>
  <nome>ABC Ltda.</nome>
  <departamentos>
    <departamento>
      <nome>Vendas</nome>
      <funcionarios>20</funcionarios>
    </departamento>
    <departamento>
      <nome>TI</nome>
      <funcionarios>15</funcionarios>
    </departamento>
  </departamentos>
</empresa>
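
As a rough illustration of how the nested tags above can be read, the sketch below parses the same XML with Python's standard xml.etree.ElementTree module and turns each departamento element into one row. The output keys (company, department, employees) are English names chosen for this example and are not part of the original file.

```python
import xml.etree.ElementTree as ET

# Same XML as the example above.
xml_text = """
<empresa>
  <nome>ABC Ltda.</nome>
  <departamentos>
    <departamento>
      <nome>Vendas</nome>
      <funcionarios>20</funcionarios>
    </departamento>
    <departamento>
      <nome>TI</nome>
      <funcionarios>15</funcionarios>
    </departamento>
  </departamentos>
</empresa>
""".strip()

root = ET.fromstring(xml_text)    # root is the <empresa> element
company = root.findtext("nome")   # direct child <nome> of <empresa>

# One row per <departamento>, with the company name repeated on each row.
rows = [
    {
        "company": company,
        "department": dep.findtext("nome"),
        "employees": int(dep.findtext("funcionarios")),
    }
    for dep in root.iter("departamento")
]

for row in rows:
    print(row)
```

Note that the funcionarios values arrive as text and are converted to integers explicitly, since XML itself carries no type information.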

Analyzing data in these formats can be challenging because most analytical tools expect flat tables of rows and columns, which is why unnesting (also called flattening) is a common preparation step. In Dadosfera, you can do this with Python or SQL by using the Intelligence module or the Query module.
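
The idea behind flattening can be sketched in plain Python, independent of any platform. The example below assumes a document shaped like the JSON example earlier on this page and produces one flat row per purchase, repeating the user fields on every row. The function name flatten_purchases and the user_ prefix are hypothetical choices for this sketch; it does not use any Dadosfera-specific API.

```python
def flatten_purchases(doc):
    """Return one flat dict per purchase, with the user fields repeated."""
    user = doc["user"]
    # Scalar user attributes, prefixed to avoid clashing with purchase keys.
    parent = {f"user_{key}": value
              for key, value in user.items() if key != "purchases"}
    # Merge the parent fields into each nested purchase record.
    return [{**parent, **purchase} for purchase in user.get("purchases", [])]


# Trimmed copy of the JSON example, already parsed into Python objects.
doc = {
    "user": {
        "id": "123456",
        "name": "Alice",
        "purchases": [
            {"productId": "789", "productName": "Interesting Book",
             "price": 29.99, "purchaseDate": "2023-02-23"},
            {"productId": "456", "productName": "Fun Mug",
             "price": 12.99, "purchaseDate": "2023-01-10"},
        ],
    }
}

rows = flatten_purchases(doc)
for row in rows:
    print(row)
```

Each resulting row has the same set of keys, so the list can be loaded into a table or a DataFrame without further reshaping.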