Dataset Structure

Schema Overview

The recommended types below are chosen to suit analysis in a dataframe or SQL database.

| Field name | Recommended type | Description | Sample values |
|---|---|---|---|
| Gender | Categorical / String | The biological sex of the patient. | Male, Female |
| Age | String (Mixed) | The age of the patient at the time of the study. Note: Requires cleaning (see Data Fields). | 000Y, 27Y, – |
| Modality | String | The imaging method used. Currently, all visible entries are X-ray. | XRAY |
| Description | String | A descriptive label of the body part and view/projection. | Chest AP, L-spine LAT, right, BERIUM MEAL FT |
| Size_raw | String | The file size as displayed in the UI. | 7.10 MB, 86.85 MB |
| Size_bytes | Float / Int (Derived) | The file size converted to a standard numerical unit for analysis. | 7100000, 86850000 |

Data fields & quality notes

A detailed breakdown of the fields in the dataset: 

Gender

  • Type: Categorical
  • Observations: Only two values (Male, Female) appear in the sample so far. Check the full dataset for additional values such as Other or Unknown.

Age

  • Type: String (requires parsing)
  • Data Quality Issues:
    • Nulls: Represented as a dash (-). The sample values in the schema show an en dash (–), so check for both. Convert these to NaN or None.
    • Formatting: Values include a unit suffix (e.g., 000Y). Strip the “Y” and cast the remainder to an integer before numerical analysis.
    • Ambiguous values: 000Y may indicate an infant (less than 1 year old) or a placeholder for an unknown birth date. The two cases cannot be told apart from this field alone.
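
The Age cleaning steps above can be sketched in plain Python. This is a minimal sketch, not a definitive implementation: the function name parse_age is hypothetical, and it assumes that missing values appear as either a hyphen or an en dash and that every valid age is a run of digits followed by "Y".

```python
import re
from typing import Optional

# Assumption: missing ages appear as an empty string, a hyphen, or an en dash.
_NULL_MARKERS = {"", "-", "\u2013"}
# Assumption: valid ages are digits followed by a "Y" suffix, e.g. "000Y" or "27Y".
_AGE_PATTERN = re.compile(r"^(\d+)\s*Y$", re.IGNORECASE)


def parse_age(raw: Optional[str]) -> Optional[int]:
    """Return the age in whole years, or None if missing or unparseable."""
    if raw is None:
        return None
    text = raw.strip()
    if text in _NULL_MARKERS:
        return None
    match = _AGE_PATTERN.match(text)
    if match is None:
        # Unknown format: treat as missing rather than guessing.
        return None
    return int(match.group(1))
```

With a pandas dataframe, the function could be applied as df["Age"].map(parse_age). Note that 000Y becomes 0, so the ambiguity between infants and placeholder values remains and may need a separate flag.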

Modality

  • Type: Categorical
  • Observations: The sample shows only XRAY. If this dataset expands to include MRI or CT, this field will be critical for filtering.

Description

  • Type: Text
  • Observations: Contains both the anatomical region (e.g., “Chest”, “Pelvis”, “L-spine”) and the projection/view (e.g., “AP” for Anteroposterior, “LAT” for Lateral).
  • Standardization: Spacing is inconsistent (e.g., Chest AP vs ChestSupine AP vs C-SpineAP), and typos occur (e.g., BERIUM MEAL FT in the sample values, apparently a misspelling of Barium). Normalize the text before grouping or NLP tasks.
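
One possible way to handle the spacing variations is sketched below. The function name normalize_description is hypothetical, and the rule it applies (insert a space wherever a lowercase letter is directly followed by an uppercase letter, then collapse repeated whitespace) is an assumption based only on the examples above. It does not correct spelling errors.

```python
import re


def normalize_description(raw: str) -> str:
    """Insert missing spaces at lowercase-to-uppercase boundaries and collapse whitespace."""
    # "C-SpineAP" -> "C-Spine AP", "ChestSupine AP" -> "Chest Supine AP"
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", raw)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(spaced.split())
```

Values that are already well spaced, such as Chest AP or L-spine LAT, pass through unchanged.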

Size

  • Type: String
  • Observations: Includes the unit (e.g., “MB”). For analysis, split the value into a number and a unit, or normalize it to a single unit such as bytes (the derived Size_bytes field). The sample values imply decimal units (7.10 MB corresponds to 7100000 bytes).
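
A minimal sketch of the conversion to bytes follows. The function name size_to_bytes is hypothetical. The decimal factor for MB (1 MB = 1,000,000 bytes) is inferred from the sample values in the schema; the B, KB, and GB entries are assumptions, since only MB appears in the sample.

```python
import re
from decimal import Decimal
from typing import Optional

# MB factor inferred from the sample (7.10 MB -> 7100000).
# Assumption: other units, if present, follow the same decimal convention.
_UNIT_FACTORS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}
_SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def size_to_bytes(raw: str) -> Optional[int]:
    """Convert a UI size string such as '7.10 MB' to bytes, or None if unparseable."""
    match = _SIZE_PATTERN.match(raw)
    if match is None:
        return None
    value, unit = match.groups()
    factor = _UNIT_FACTORS.get(unit.upper())
    if factor is None:
        return None
    # Decimal avoids binary floating-point rounding on values like 86.85.
    return int(Decimal(value) * factor)
```

Returning None for unrecognized input keeps bad rows visible as nulls instead of silently converting them.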

Usage & considerations

Primary use cases

  • Training computer vision models for specific anatomical regions (e.g., "Chest X-ray classification"). 
  • Analysis of data storage requirements (using the Size column). 
  • Demographic distribution analysis of patient population. 

Privacy & ethics

  • While names are not visible, the combination of Age (if specific), Gender, and specific timestamps (if added later) can be quasi-identifying. Ensure HIPAA/GDPR compliance before public release. 

Preprocessing needs

  • Age Normalization: Strip the Y suffix and convert to an integer (e.g., 000Y to 0, 27Y to 27), and convert - to null.
  • Text Cleaning: Separate "Body Part" and "View" from the description column (e.g., split "Chest AP" into BodyPart: Chest, View: AP). 
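
The body part and view split described above might look like the sketch below. The function name split_description and the KNOWN_VIEWS set are hypothetical. The set contains only the two view codes mentioned in this document (AP and LAT) and would need extending for the full dataset. The sketch also assumes the view code is the last whitespace-separated token.

```python
from typing import Optional, Tuple

# Only the view codes named in this document; extend for the full dataset.
KNOWN_VIEWS = {"AP", "LAT"}


def split_description(description: str) -> Tuple[str, Optional[str]]:
    """Split a description into (body_part, view); view is None if not recognized."""
    tokens = description.split()
    # Assumption: the view, when present, is the final token.
    if len(tokens) > 1 and tokens[-1].upper() in KNOWN_VIEWS:
        return " ".join(tokens[:-1]), tokens[-1].upper()
    return " ".join(tokens), None
```

Descriptions without a recognized view, such as BERIUM MEAL FT, are returned whole with a view of None so they can be reviewed manually.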
