About Data Formats
Each piece of data in an event must be written in a supported data format. A data format is essentially a C++ class, where a class defines a data structure (a data type with data members). The term data format can refer either to the layout of the data written using the class (the format as a sort of template) or to the instantiated class object itself. The DataFormats package and the SimDataFormats package (for simulated data) in the CMSSW CVS repository contain all the supported data formats that can be written to an Event file. So, for example, if you wish to add data to an Event, your EDProducer module must instantiate one or more of these data format classes.
Data formats (classes) for reconstructed data include, for example, Reco.Track, Reco.TrackExtra, and many more. See the Reference Manual section RECO data tier for the full listing.
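To make the idea of "a data format is a class" concrete, here is a minimal, self-contained sketch. It deliberately does not use the real CMSSW headers: ToyTrackFormat, ToyEvent and produceToyTracks are hypothetical stand-ins for a data format class, the Event, and the produce step of an EDProducer, and the member names and values are invented for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical data format: a plain class whose data members define
// what is stored for each object of this type.
struct ToyTrackFormat {
  double pt;      // transverse momentum (GeV)
  double eta;     // pseudorapidity
  int    charge;  // +1 or -1
};

// Hypothetical stand-in for an Event: collections stored under a label.
class ToyEvent {
public:
  void put(const std::string& label, std::vector<ToyTrackFormat> tracks) {
    assert(!label.empty());  // every product needs a label
    products_[label] = std::move(tracks);
  }
  const std::vector<ToyTrackFormat>& get(const std::string& label) const {
    return products_.at(label);  // throws std::out_of_range if missing
  }
private:
  std::map<std::string, std::vector<ToyTrackFormat>> products_;
};

// Hypothetical producer step: instantiate the data format class and
// add the resulting collection to the event.
inline void produceToyTracks(ToyEvent& event) {
  std::vector<ToyTrackFormat> tracks;
  tracks.push_back(ToyTrackFormat{12.5, 0.3, +1});
  tracks.push_back(ToyTrackFormat{4.2, -1.1, -1});
  event.put("toyTracks", std::move(tracks));
}
```

Calling produceToyTracks on a ToyEvent and then reading back get("toyTracks") returns the two objects that were stored. A real EDProducer follows the same pattern, but with the framework's own Event interface and one of the supported data format classes.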
About Data Tiers
Event information from each step in the simulation and reconstruction chain is logically grouped into what we call a data tier. Examples of data tiers include RAW and RECO, and for MC, GEN, SIM and DIGI. For example, the RAW data tier collects detector data after online formatting plus some trigger results, while the RECO tier collects reconstructed objects. A data tier may contain multiple data formats, as mentioned above for reconstructed data. A given dataset may consist of multiple data tiers; e.g., the term GenSimDigi covers the generation (MC), simulation (Geant) and digitization steps. The most important tiers from a physicist's point of view are probably RECO (all reconstructed objects and hits) and AOD (a smaller subset of RECO). The following table gives an overview.
Data Tier Listing
| Event Format | Contents | Purpose | Data Type Ref | Event Size (MB) |
| DAQ-RAW | Detector data from front end electronics + L1 trigger result. | Primary record of physics event. Input to online HLT | | 1-1.5 |
| RAW | Detector data after online formatting, the L1 trigger result, the result of the HLT selections (HLT trigger bits), potentially some of the higher level quantities calculated during HLT processing. | Input to Tier-0 reconstruction. Primary archive of events at CERN. | | 1.5 |
| RECO | Reconstructed objects (tracks, vertices, jets, electrons, muons, etc.) and reconstructed hits/clusters | Output of Tier-0 reconstruction and subsequent rereconstruction passes. Supports re-finding of tracks, etc. | RECO & AOD | 0.25 |
| AOD | Subset of RECO. Reconstructed objects (tracks, vertices, jets, electrons, muons, etc.). Possible small quantities of very localised hit information. | Physics analysis, limited refitting of tracks and clusters | RECO & AOD | 0.05 |
| TAG | Run/event number, high-level physics objects, e.g. used to index events. | Rapid identification of events for further study (event directory). | | 0.01 |
| FEVT | Full Event: Term used to refer to RAW+RECO together (not a distinct format). | multiple | | 1.75 |
| GEN | Generated Monte Carlo event | - | | - |
| SIM | Energy depositions of MC particles in detector (sim hits). | - | | - |
| DIGI | Sim hits converted into detector response. Basically the same as the RAW output of the detector. | - | | 1.5 |
The Data Type Ref column entries point to the CMSSW Reference Manual, which is not complete.
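The per-event sizes in the table can be combined with simple arithmetic, for instance to check that the FEVT size is the sum of RAW and RECO, or to estimate the storage a sample would need. The following is a rough sketch: the constants are the nominal values from the table, while the helper names are hypothetical and any event count you pass in is your own assumption, not a CMS figure.

```cpp
#include <cassert>

// Nominal per-event sizes in MB, copied from the Data Tier Listing table.
constexpr double kRawMB  = 1.5;
constexpr double kRecoMB = 0.25;
constexpr double kAodMB  = 0.05;

// FEVT is RAW and RECO kept together, so its size is their sum.
constexpr double fevtSizeMB() { return kRawMB + kRecoMB; }

// Storage in TB for nEvents events of the given per-event size,
// using 1 TB = 1e6 MB.
inline double storageTB(double eventSizeMB, double nEvents) {
  assert(eventSizeMB >= 0.0 && nEvents >= 0.0);
  return eventSizeMB * nEvents / 1.0e6;
}
```

With these numbers, fevtSizeMB() gives 1.75 MB, matching the FEVT row, and the RECO to AOD size ratio is about 5. For an arbitrary illustrative sample of 10^9 events, storageTB(kAodMB, 1.0e9) comes to roughly 50 TB, against roughly 250 TB for RECO.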
Data Tiers: Reconstructed (RECO) Data and Analysis Object Data (AOD)
RECO data contains objects from all stages of reconstruction. AOD are derived from the RECO information to provide data for physics analyses in a convenient, compact format. Typically, physics analyses don't require you to rerun the reconstruction process on the data. Most physics analyses can run on AOD data.
RECO
RECO is the name of the data-tier which contains objects created by the event reconstruction program. It is derived from RAW data and provides access to reconstructed physics objects for physics analysis in a convenient format. Event reconstruction is structured in several hierarchical steps:
- Detector-specific processing: Starting from detector data unpacking and decoding, detector calibration constants are applied and cluster or hit objects are reconstructed.
- Tracking: Hits in the silicon and muon detectors are used to reconstruct global tracks. Pattern recognition in the tracker is the most CPU-intensive task.
- Vertexing: Reconstructs primary and secondary vertex candidates.
- Particle identification: Produces the objects most associated with physics analyses. Using a wide variety of sophisticated algorithms, standard physics object candidates are created (electrons, photons, muons, missing transverse energy and jets, as well as heavy-quark and tau-decay candidates).
The normal completion of the reconstruction task will result in a full set of these reconstructed objects usable by CMS physicists in their analyses. You would only need to rerun these algorithms if your analysis requires you to take account of such things as trial calibrations, novel algorithms etc.
Reconstruction is expensive in terms of CPU and is dominated by tracking. The RECO data tier will provide compact information for analysis, so that most analyses do not need to access the RAW data. Following the hierarchy of event reconstruction, RECO will contain objects from all stages of reconstruction. At the lowest level these will be reconstructed hits, clusters and segments. Reconstructed tracks and vertices, which are based on these objects, are stored at the next level. At the highest level, reconstructed jets, muons, electrons, b-jets, etc. are stored. Direct references from high-level objects to low-level objects will be possible, to avoid duplication of information. In addition, the RECO format will preserve links to the RAW information.
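The idea of referring from a high-level object to low-level objects, rather than copying them, can be sketched with indices into a shared collection. This is only an illustration of the principle and not the CMSSW reference mechanism itself; ToyRecHit, ToyTrackWithRefs and hitsOf are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical low-level object: a reconstructed hit position.
struct ToyRecHit {
  double x, y, z;
};

// Hypothetical high-level object: instead of holding copies of its hits,
// the track stores indices into the event's single hit collection.
struct ToyTrackWithRefs {
  double pt;
  std::vector<std::size_t> hitIndices;
};

// Resolve a track's references against the hit collection they point into.
inline std::vector<ToyRecHit> hitsOf(const ToyTrackWithRefs& track,
                                     const std::vector<ToyRecHit>& allHits) {
  std::vector<ToyRecHit> result;
  result.reserve(track.hitIndices.size());
  for (std::size_t i : track.hitIndices) {
    assert(i < allHits.size());  // a dangling reference would be a bug
    result.push_back(allHits[i]);
  }
  return result;
}
```

Each hit is stored once in the event, and any number of tracks can point at it. The price is that a reference is only meaningful together with the collection it indexes, which is why the hit collection is passed in explicitly.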
The RECO data includes quantities required for typical analysis usage patterns such as track re-finding, calorimeter reclustering, and jet energy calibration. The RECO event content is documented in the Reference Manual at RECO Event Content, RECO data tier.
AOD
AOD are derived from the RECO information to provide data for physics analysis in a convenient, compact format. AOD data are usable directly by physics analyses. AOD data will be produced by the same processing steps that produce the RECO data, or by subsequent ones, and will be made easily available at multiple sites to CMS members. The AOD will contain enough information about the event to support all the typical usage patterns of a physics analysis. Thus, it will contain a copy of all the high-level physics objects (such as muons, electrons, taus, etc.), plus a summary of the RECO information sufficient to support typical analysis actions such as track refitting with improved alignment or kinematic constraints, or re-evaluation of the energy and/or position of ECAL clusters based on analysis-specific corrections. Because its limited size does not allow it to contain all the hits, the AOD will typically not support the application of novel pattern recognition techniques or of new calibration constants, which would typically require the use of RECO or RAW information.
The AOD data tier will contain physics objects: tracks with associated Hits, calorimetric clusters with associated Hits, vertices, jets and high-level physics objects (electrons, muons, Z boson candidates, and so on).
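As a toy illustration of how a compact tier can keep objects with their associated hits without keeping every hit, the sketch below copies all tracks but only the hits that some track refers to, renumbering the references as it goes. It is an assumption-laden example, not the actual AOD definition or production code; ToySlimHit, ToySlimTrack, ToyFullEvent and slimEvent are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ToySlimHit {
  double energy;
};

struct ToySlimTrack {
  double pt;
  std::vector<std::size_t> hitIndices;  // indices into the event's hits
};

struct ToyFullEvent {
  std::vector<ToySlimHit> hits;
  std::vector<ToySlimTrack> tracks;
};

// Build a compact copy: keep every track, but only the hits that at
// least one track refers to. References are remapped to the new,
// shorter hit collection.
inline ToyFullEvent slimEvent(const ToyFullEvent& full) {
  // New indices are always smaller than full.hits.size(), so that value
  // is safe to use as a "not copied yet" marker.
  const std::size_t kNotCopied = full.hits.size();
  std::vector<std::size_t> newIndex(full.hits.size(), kNotCopied);

  ToyFullEvent out;
  for (const ToySlimTrack& track : full.tracks) {
    ToySlimTrack kept{track.pt, {}};
    for (std::size_t i : track.hitIndices) {
      assert(i < full.hits.size());
      if (newIndex[i] == kNotCopied) {
        newIndex[i] = out.hits.size();
        out.hits.push_back(full.hits[i]);
      }
      kept.hitIndices.push_back(newIndex[i]);
    }
    out.tracks.push_back(kept);
  }
  return out;
}
```

Hits that no track uses are simply absent from the result, which is why an event slimmed this way cannot be used to rerun pattern recognition from scratch.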
Because the AOD data tier is relatively compact, all Tier-1 computing centres are able to keep a full copy of the AOD, while they will hold only a subset of the RAW and RECO data tiers.
Reference Documentation for RECO and AOD Data Format Packages
Doxygen-generated reference documentation on data format packages is provided for every release. It is accessible at the following links:
These links provide a list of all packages related to the RECO and AOD data formats within the CMSSW repository. Links there point to the package documentation. Starting from the CMSSW Documentation Main Page, you should in principle be able to research packages using any of the views presented (Data, Functional, Detector).
-- Reference: WorkBookCMSSW, CMSSWWorkBook
-- DongHoMoon - 13 Nov 2007