Dataset Structure

Schema Overview

The recommended types below are chosen to suit analysis in a dataframe or SQL database.

Field name | Recommended type | Description | Sample values
--- | --- | --- | ---
Gender | Categorical / String | The biological sex of the patient. | Male, Female
Age | String (Mixed) | The age of the patient at the time of the study. Note: All values are missing in CR Set (see Data Fields). | -
Modality | String | The imaging method used. Currently, all visible entries are X-ray. | XRAY
Description | String | A descriptive label of the body part and view/projection. | HEAD
Size_raw | String | The file size as displayed in the UI. | 26.13 MB, 728.24 KB
Size_bytes | Float / Int (Derived) | The file size converted to a standard numerical unit for analysis. | 26130000, 728240

Data fields & quality notes

A detailed breakdown of the fields in the dataset: 

Gender

  • Type: Categorical
  • Observations: Only two values (Male, Female) appear in the sample so far. Check the full set for additional values such as Other or Unknown.

Age

  • Type: String (requires parsing)
  • Data Quality Issues:
    • Completely Missing Data: All age values in the CR Set are represented as dashes (-). Convert these to NaN or None for analysis.
    • Comparison with DX Set: Unlike the Digital Radiography (DX) dataset, which includes age values with a year suffix (e.g., 27Y, 000Y), the CR Set has no age information.
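
The age handling described above could be sketched as follows. This is a minimal illustration, not the dataset's official preprocessing: the function name `parse_age`, the set of missing-value tokens, and the assumption that DX-style ages are always digits followed by `Y` are all hypothetical.

```python
import re
from typing import Optional

# Assumed placeholders for a missing age; only "-" is documented for the CR Set.
MISSING_TOKENS = {"", "-"}

# Assumed DX-style format: one or more digits followed by "Y" (e.g. "27Y", "000Y").
_AGE_YEARS = re.compile(r"^(\d+)Y$")


def parse_age(raw: Optional[str]) -> Optional[int]:
    """Return the age in years, or None when the value is missing or unparseable."""
    if raw is None:
        return None
    text = raw.strip().upper()
    if text in MISSING_TOKENS:
        return None
    match = _AGE_YEARS.match(text)
    if match is None:
        # Unknown format: treat as missing rather than guess.
        return None
    return int(match.group(1))
```

With this sketch, `parse_age("-")` yields `None` and `parse_age("27Y")` yields `27`. A value such as `000Y` parses to `0`; whether that means an infant or a placeholder is not stated in the source and should be checked before analysis.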

Modality

  • Type: Categorical
  • Observations: The sample shows only XRAY. Computed radiography (CR) is a digital imaging method for plain X-rays using photostimulable phosphor plates, distinct from traditional film-based radiography.

Description

  • Type: Text
  • Observations: In the CR Set sample, all visible entries show “HEAD” as the anatomical region. This suggests a more homogeneous dataset compared to the DX Set, which contains varied anatomical regions (e.g., Chest, Pelvis, L-spine).
  • Standardization: Although the sample is consistent, verify whether other anatomical regions exist in the full dataset. If naming conventions vary, normalization may be required for NLP tasks.
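
One way to run the verification suggested above is to count the distinct labels after light normalization. The sketch below assumes the descriptions are available as plain strings; `normalize_label` and `region_counts` are hypothetical helper names, and upper-casing plus whitespace collapsing is only one possible normalization.

```python
from collections import Counter
from typing import Iterable


def normalize_label(raw: str) -> str:
    """Collapse runs of whitespace and upper-case, so 'Head ' and 'HEAD' match."""
    return " ".join(raw.split()).upper()


def region_counts(descriptions: Iterable[str]) -> Counter:
    """Count normalized Description labels, skipping empty or blank entries."""
    return Counter(
        normalize_label(d) for d in descriptions if d and d.strip()
    )
```

If `region_counts` returns a single key (`HEAD`) for the full CR Set, the homogeneity seen in the sample holds; any additional keys point to regions or spelling variants that need review.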

Size

  • Type: String
  • Observations: File sizes in the sample range from approximately 728 KB to 26.13 MB. Each value includes its unit (e.g., “MB”, “KB”). For analysis, split the value into a number and a unit, or normalize everything to a single unit (e.g., bytes).
  • File Size Distribution: Files in the CR Set sample fall in the KB to low-MB range, which may indicate different imaging parameters or compression compared to the DX Set.
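
A minimal sketch of the size normalization follows. It assumes decimal (SI) multipliers, because the schema's sample conversion maps 26.13 MB to 26130000; if the UI actually reports binary units (1 MB = 1,048,576 bytes), the factors would need to change. The name `size_to_bytes` and the list of supported units are assumptions.

```python
import re

# Assumption: decimal multipliers, matching the sample value 26.13 MB -> 26130000.
UNIT_FACTORS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}

# A number (optionally with a decimal part), optional spaces, then a unit.
_SIZE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)\s*$")


def size_to_bytes(raw: str) -> int:
    """Convert a UI size string such as '26.13 MB' to an integer byte count."""
    match = _SIZE.match(raw)
    if match is None:
        raise ValueError(f"unrecognized size string: {raw!r}")
    unit = match.group(2).upper()
    if unit not in UNIT_FACTORS:
        raise ValueError(f"unsupported unit: {unit!r}")
    # round() absorbs the small floating-point error in the multiplication.
    return round(float(match.group(1)) * UNIT_FACTORS[unit])
```

Raising on unrecognized input, rather than returning a default, keeps malformed rows visible during cleaning.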

Usage & considerations

Technical characteristics of computed radiography (CR)

Photostimulable phosphor technology

CR systems use imaging plates coated with photostimulable phosphor (typically barium fluorohalide compounds) that store X-ray energy as a latent image. These plates are then scanned by a laser to release stored energy as visible light, which is captured by photomultiplier tubes and converted to digital signals.

Workflow process

The CR workflow involves multiple steps: exposure → cassette transport to reader → laser scanning → image processing → plate erasure for reuse. This process typically takes 30-90 seconds from exposure to image display, which is slower than DX but faster than film processing.

Image quality

Spatial resolution is 2.5-5 line pairs per mm, generally lower than that of DX flat-panel detectors. A dynamic range of 10,000:1 allows good visualization of both dense and soft tissue structures. Image quality can degrade with plate wear or incomplete erasure.
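
To put these figures in pixel terms, the sketch below applies two standard relations: the Nyquist limit (one line pair needs at least two pixels) and the base-2 logarithm of the dynamic range. Both function names are hypothetical, and the bit-depth estimate assumes linear encoding, which real CR readers may not use.

```python
import math


def nyquist_pixel_pitch_mm(lp_per_mm: float) -> float:
    """Largest pixel pitch (mm) that can still resolve the given line pairs per mm."""
    return 1.0 / (2.0 * lp_per_mm)


def min_bits_for_dynamic_range(ratio: float) -> int:
    """Smallest whole number of bits covering the ratio under linear encoding."""
    return math.ceil(math.log2(ratio))
```

Under these assumptions, 2.5-5 line pairs per mm corresponds to a pixel pitch of roughly 0.2 mm down to 0.1 mm, and a 10,000:1 range needs at least 14 bits if stored linearly.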

Advantages

Flexible cassette sizes and portability make CR well suited to bedside and operating room imaging. CR is also easier to retrofit into existing film-based infrastructure and has lower initial equipment costs than DX systems.

Common applications

Common applications include portable radiography, emergency department imaging, orthopedic imaging, and facilities transitioning from film. CR is particularly suitable for lower-volume settings or departments requiring flexible cassette systems.

Primary use cases

  • Training computer vision models for head/cranial X-ray analysis (given the predominance of HEAD in the sample).
  • Developing AI models robust to CR-specific image characteristics (noise patterns, spatial resolution limitations).
  • Analysis of data storage requirements for computed radiography systems.
  • Demographic distribution analysis of patient population (gender only, as age is unavailable).
  • Comparative studies between CR and DX modalities for image quality, AI model generalization, and workflow efficiency.

Privacy & ethics

  • Although names and ages are not visible, the combination of Gender and specific timestamps (if added later) could be quasi-identifying. Ensure HIPAA/GDPR compliance before public release.
  • The absence of age data reduces re-identification risk but limits demographic analysis capabilities.

Preprocessing needs

  • Age Handling: Convert all dashes (-) to null/NaN. If age data becomes available later, implement the same normalization as the DX Set.
  • Size Normalization: Parse size strings to separate numerical values from units (KB, MB). Convert to consistent units (Bytes) for computational analysis.
  • Description Verification: Verify if the full dataset contains additional anatomical regions beyond HEAD. If so, implement text cleaning to separate "Body Part" and "View" (similar to DX Set preprocessing).
  • Modality Consistency: Confirm all entries are XRAY/CR. This field can be used to filter or combine with other modality datasets (DX, MRI, CT) in multi-modal studies.
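
The modality consistency check from the list above could look like the following sketch. It assumes each record is a dict-like row with a `Modality` key; the set of accepted labels is an assumption, since only `XRAY` has been observed in the sample, and `unexpected_modalities` is a hypothetical helper name.

```python
from typing import AbstractSet, Iterable, List, Mapping

# Assumption: labels treated as CR-compatible; only "XRAY" appears in the sample.
EXPECTED_MODALITIES = frozenset({"XRAY", "CR"})


def unexpected_modalities(
    rows: Iterable[Mapping[str, object]],
    expected: AbstractSet[str] = EXPECTED_MODALITIES,
) -> List[int]:
    """Return the indices of rows whose Modality is missing or not expected."""
    bad = []
    for index, row in enumerate(rows):
        # Normalize case and whitespace before comparing.
        label = str(row.get("Modality", "")).strip().upper()
        if label not in expected:
            bad.append(index)
    return bad
```

An empty result means every row carries an expected label. Any returned indices identify rows to inspect before filtering the set or combining it with other modality datasets.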
