Dataset Structure

Schema Overview

The data types are selected to best suit a dataframe or SQL database for analysis.

Field name | Recommended type | Description | Sample values
Gender | Categorical / String | The biological sex of the patient. | Male, Female
Age | String (Mixed) | The age of the patient at the time of the study. Note: Requires cleaning (see Data Fields). | 027Y, 050Y, 007M
Modality | String | The imaging method used. Currently, all visible entries are CT (Computed Tomography). | CT
Description | String | A descriptive label including anatomical region, protocol details, contrast phases, and patient category. | Abdomen^AV_Abd_Arterial_Venous_CE (Adult), Head^AV_Head_Plain_Trauma (Adult)
Size_raw | String | The file size as displayed in the UI. | 264.56 MB, 33.23 MB
Size_bytes | Float / Int (Derived) | The file size converted to a standard numerical unit for analysis. | 264560000, 33230000

Data fields & quality notes

A detailed breakdown of the fields in the dataset: 

Gender

  • Type: Categorical
  • Observations: Only two values (Male, Female) appear in the visible sample. Check the full set for additional values such as Other or Unknown.

Age

  • Type: String (needs parsing)
  • Data Quality Issues:
    • Mixed Units: Age values use different units – years (e.g., 027Y, 050Y) and months (e.g., 007M). Standardization required to convert all values to a common unit for analysis.
    • Leading Zeros: Age values are zero-padded (e.g., 027Y for 27 years, 007M for 7 months). Convert to integers after parsing.
    • Age Range: The visible sample spans ages from 7 months to 82 years, which suggests the dataset covers both pediatric and adult patients. Confirm the age distribution on the full set before relying on it for either group.
    • Pediatric Consideration: Presence of month-based ages indicates inclusion of pediatric/infant cases, which may require separate analysis due to different anatomical characteristics and radiation dose considerations.
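
The parsing described above can be sketched as follows. This is a minimal example, assuming every age is up to three digits followed by Y or M as in the visible sample; the function names are illustrative and not part of the dataset. Other suffixes, such as the day and week units that DICOM age strings allow, are returned as None so they can be reviewed by hand.

```python
import re
from typing import Optional, Tuple

# Assumed format: up to three digits plus a unit suffix, e.g. "027Y" or "007M".
AGE_PATTERN = re.compile(r"^(\d{1,3})([YM])$")


def parse_age(raw: str) -> Optional[Tuple[int, str]]:
    """Return (value, unit) for a string like '027Y', or None if it does not match."""
    match = AGE_PATTERN.match(raw.strip().upper())
    if match is None:
        return None
    # int() drops the zero padding: "027" -> 27.
    return int(match.group(1)), match.group(2)


def age_in_months(raw: str) -> Optional[int]:
    """Normalize an age string to whole months."""
    parsed = parse_age(raw)
    if parsed is None:
        return None
    value, unit = parsed
    return value * 12 if unit == "Y" else value


def age_in_years(raw: str) -> Optional[float]:
    """Normalize an age string to years, rounded to two decimals."""
    months = age_in_months(raw)
    return None if months is None else round(months / 12, 2)
```

Keeping months as the base unit avoids fractional values for infants; years can be derived when a coarser scale is enough.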

Modality

  • Type: Categorical
  • Observations: The sample shows only CT (Computed Tomography). CT combines X-ray imaging with computer processing to create detailed cross-sectional images, providing superior anatomical detail compared to plain radiography.

Description

  • Type: Text
  • Observations: Contains highly structured information including anatomical region (Abdomen, Head, Thorax), imaging protocol details (Plain, Arterial, Venous, CE for Contrast Enhanced), scan type (Trauma, HRCT for High Resolution CT), and patient category (Adult). Uses caret (^) as delimiter between anatomical region and protocol details.
  • Protocol Information: Descriptions indicate multi-phase studies (e.g., “AV_Abd_Arterial_Venous_CE” indicates both arterial and venous contrast phases), which are essential for comprehensive vascular and organ assessment. HRCT (High Resolution Computed Tomography) indicates specialized lung parenchyma imaging.
  • Standardization: The naming convention follows an anatomical_region^protocol_details pattern. Split on the “^” delimiter to separate anatomy from protocol, then split the protocol details on underscores. Extract the contrast phase tokens (Plain, Arterial, Venous, CE) to analyze imaging techniques.
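
A possible way to apply this parsing is sketched below. It assumes descriptions look like the two sample values, with an optional trailing patient category in parentheses; the token sets and output keys are illustrative choices, and tokens such as AV or Abd are left uninterpreted because the source does not define them.

```python
import re

# Tokens named in the field notes; extend these sets after inspecting the full data.
PHASE_TOKENS = {"Plain", "Arterial", "Venous", "CE"}
SCAN_TYPE_TOKENS = {"Trauma", "HRCT"}


def parse_description(description: str) -> dict:
    """Split a description into region, protocol tokens, and patient category."""
    text = description.strip()

    # Pull off a trailing "(Adult)"-style patient category, if present.
    category = None
    match = re.search(r"\(([^)]*)\)\s*$", text)
    if match:
        category = match.group(1).strip()
        text = text[: match.start()].strip()

    # The caret separates the anatomical region from the protocol details.
    region, _, protocol = text.partition("^")
    tokens = [token for token in protocol.split("_") if token]

    return {
        "anatomical_region": region.strip(),
        "protocol_details": protocol.strip(),
        "contrast_phases": [t for t in tokens if t in PHASE_TOKENS],
        "scan_type": [t for t in tokens if t in SCAN_TYPE_TOKENS],
        "patient_category": category,
    }
```

For example, parse_description("Head^AV_Head_Plain_Trauma (Adult)") yields the region Head, the phase list ["Plain"], the scan type ["Trauma"], and the category Adult.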

Size

  • Type: String
  • Observations: File sizes in the sample range from 33.23 MB to 264.56 MB. Each value includes its unit (e.g., “MB”). For analysis, split the value into a number and a unit, or normalize everything to a single unit (e.g., bytes).
  • File Size Distribution: CT datasets show significantly larger file sizes compared to plain radiography (DX/CR) and even MRI in some cases. This reflects the volumetric nature of CT (typically 200-800 slices per study) and higher spatial resolution. Multi-phase contrast studies (arterial + venous) will have proportionally larger file sizes.
  • Storage Implications: Abdominal CT with multi-phase contrast imaging can generate 200-400+ MB per study. Storage and bandwidth requirements for CT are substantially higher than other modalities, important for PACS infrastructure planning.
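
The size normalization mentioned above might look like the sketch below. It assumes the UI reports decimal units (1 MB = 1,000,000 bytes), which matches the Size_bytes sample values in the schema table; if the UI actually uses binary units (1 MiB = 1,048,576 bytes), the factors need to change. The function and constant names are illustrative.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumption: decimal (SI) units, consistent with 264.56 MB -> 264560000.
UNIT_FACTORS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}
SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def size_to_bytes(raw: str) -> Optional[int]:
    """Convert a string like '264.56 MB' to an integer byte count, or None."""
    match = SIZE_PATTERN.match(raw)
    if match is None:
        return None
    factor = UNIT_FACTORS.get(match.group(2).upper())
    if factor is None:
        return None
    # Decimal avoids binary floating-point error in the multiplication.
    return int(Decimal(match.group(1)) * factor)
```

Returning None for unrecognized units keeps unexpected values visible instead of silently converting them.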

Usage & considerations

Technical characteristics of computed tomography (CT)

Image acquisition technology

Modern multi-detector CT (MDCT) systems use rotating X-ray tubes and detector arrays to acquire volumetric data. Current generation scanners feature 64-320 detector rows, enabling rapid whole-body imaging in 5-20 seconds. The X-ray tube rotates 360° around the patient while the table moves continuously through the gantry, creating a helical/spiral acquisition pattern.

Cross-sectional imaging

CT generates true axial slices (typically 0.5-5 mm thick) through the body, eliminating the superimposition problem inherent in plain radiography. Data can be reconstructed in any imaging plane (axial, coronal, sagittal, oblique) and at various slice thicknesses post-acquisition without additional radiation exposure.

Contrast enhancement protocols

Intravenous iodinated contrast agents enable multi-phase imaging to assess vasculature and organ perfusion. Arterial phase (25-35 seconds post-injection) optimizes arterial visualization and hypervascular lesion detection. Venous/portal venous phase (60-80 seconds) provides optimal solid organ parenchymal enhancement. Delayed phases (3-10 minutes) assess urinary tract and lesion washout patterns.

Hounsfield units (HU)

CT images are quantitative: each voxel stores X-ray attenuation (radiodensity) on the Hounsfield scale. Air is about -1000 HU, water is 0 HU, and bone typically falls between +400 and +1000 HU, with dense cortical bone reaching higher values. This standardized scale enables automated segmentation, lesion characterization, and bone density assessment. Window/level settings optimize visualization of specific tissues (lung window, soft tissue window, bone window).
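
Window/level mapping can be illustrated with a short sketch. It assumes the input is already in Hounsfield Units and rescales a chosen window to the range 0 to 1 for display or model input. The function name and the preset values are illustrative; window presets vary between institutions and are not taken from this dataset.

```python
from typing import Iterable, List

# Illustrative (level, width) presets in HU; actual presets vary by site.
EXAMPLE_WINDOWS = {
    "soft_tissue": (40, 400),
    "lung": (-600, 1500),
    "bone": (400, 1800),
}


def apply_window(hu_values: Iterable[float], level: float, width: float) -> List[float]:
    """Clip HU values to [level - width/2, level + width/2] and scale to 0..1."""
    if width <= 0:
        raise ValueError("width must be positive")
    lower = level - width / 2
    upper = level + width / 2
    scaled = []
    for hu in hu_values:
        clipped = min(max(hu, lower), upper)
        scaled.append((clipped - lower) / (upper - lower))
    return scaled
```

With the soft tissue preset, air (-1000 HU) maps to 0.0 and dense bone maps to 1.0, so the available contrast is spent on the soft tissue range in between.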

Radiation dose considerations

CT delivers higher radiation doses than plain radiography but provides vastly more diagnostic information. Typical effective doses: head CT 1-2 mSv, chest CT 5-7 mSv, abdomen/pelvis CT 10-15 mSv. Modern dose reduction techniques include automatic exposure control, iterative reconstruction algorithms, and low-dose protocols for specific indications.

Clinical applications

CT is the primary modality for trauma evaluation (head, chest, abdomen), acute abdominal pain, pulmonary embolism, stroke assessment, cancer staging, and vascular imaging. Head CT without contrast is the first-line study for acute trauma and stroke. Chest HRCT is the gold standard for interstitial lung disease. Multi-phase abdominal CT evaluates liver lesions, renal masses, and pancreatic pathology.

Primary use cases

  • Training 3D convolutional neural networks for volumetric organ segmentation (liver, kidneys, lungs, pancreas) and lesion detection across multiple anatomical regions.
  • Developing AI models for trauma triage, including automated detection of intracranial hemorrhage, pneumothorax, solid organ injury, and skeletal fractures.
  • Multi-phase contrast enhancement analysis for tumor characterization, vascular mapping, and perfusion assessment requiring temporal sequence modeling.
  • Quantitative imaging biomarker development leveraging Hounsfield Unit measurements for bone density, liver fat quantification, emphysema scoring, and coronary calcium scoring.
  • Pediatric-specific model development using age-stratified data, accounting for different anatomical proportions, contrast protocols, and radiation dose optimization in children.
  • Cross-modality learning combining CT with MRI, PET, or ultrasound for complementary diagnostic information and model generalization studies.

Privacy & ethics

  • While names are not visible, the combination of Age, Gender, specific anatomical protocols (especially trauma studies), and timestamps could be quasi-identifying. Ensure HIPAA/GDPR compliance before public release.
  • CT images contain extensive identifiable anatomical features, particularly facial structures in head CT. Defacing algorithms should be applied to cranial CT datasets. Dental patterns visible in head/neck CT can be identifying and may require masking.
  • Embedded metadata in DICOM files may contain patient identifiers, technologist notes, or other PHI. Comprehensive DICOM header de-identification beyond basic demographic fields is essential.
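
As a rough screen for the quasi-identifier risk described above, one option is a k-anonymity style count: group records by the quasi-identifying fields and flag combinations shared by fewer than k records. The sketch below assumes records are dictionaries keyed by the schema field names; the function name, the chosen fields, and the threshold k=5 are illustrative assumptions, and such a count does not by itself establish HIPAA or GDPR compliance.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def rare_combinations(
    records: Iterable[dict],
    keys: Sequence[str] = ("Gender", "Age", "Description"),
    k: int = 5,
) -> Dict[Tuple, int]:
    """Return quasi-identifier combinations that occur in fewer than k records."""
    counts = Counter(tuple(record.get(key) for key in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}
```

Combinations returned by this check are candidates for generalization, for example binning ages or coarsening the description, before any release.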

Preprocessing needs

  • Age Handling: Parse each age value into a number and a unit, and store them in age_numeric and age_unit columns. To handle the mixed formats, strip the "Y" or "M" suffix and remove leading zeros. Then convert all ages to a single consistent unit (months or years). For analysis involving age groups, convert months to years (007M → 0.58 years) or vice versa.
  • Description Parsing: Split description field using "^" delimiter into anatomical_region and protocol_details. Further parse protocol details to extract: imaging_technique (Plain, CE), vascular_phase (Arterial, Venous), scan_type (Trauma, HRCT), patient_category (Adult, Pediatric if present). This structured parsing enables filtering by specific protocols for targeted model training.
  • Contrast Phase Extraction: Identify contrast enhancement phases from protocol details (Plain/Non-contrast, Arterial, Venous/Portal Venous, Delayed). This is critical for models learning phase-specific pathology appearance and for temporal sequence analysis in multi-phase studies.
  • Size Normalization: Parse size strings to separate numerical values from units (MB). Convert to Bytes for consistent computational analysis. Note that CT file sizes correlate with number of images/slices and reconstruction algorithms used.
  • Pediatric/Adult Stratification: Use age data to stratify into pediatric (<18 years) and adult (≥18 years) cohorts. Consider further pediatric subdivisions (infant <1yr, child 1-12yr, adolescent 13-17yr) as anatomical characteristics, contrast doses, and radiation protocols differ significantly by age group.
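
The stratification step can be sketched as below, assuming ages have already been converted to years. The group boundaries follow the subdivisions listed above; the function names and labels are illustrative.

```python
def age_group(age_years: float) -> str:
    """Map an age in years to infant, child, adolescent, or adult."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years < 1:
        return "infant"
    if age_years < 13:
        return "child"
    if age_years < 18:
        return "adolescent"
    return "adult"


def cohort(age_years: float) -> str:
    """Map an age in years to the coarse pediatric/adult split."""
    return "adult" if age_group(age_years) == "adult" else "pediatric"
```

A patient recorded as 007M (about 0.58 years) falls in the infant group and the pediatric cohort, while 027Y falls in the adult group.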
