# YADM dataset

## Intro / scope

The YADM dataset is a standardized export of repeated “You and Decision Making” sessions (cohorts), used to validate analytics automation and power the analytics dashboard.
## Exact reproduction spec

Read this alongside `Desktop/YADM_Data/Analytics/_build-exports.mjs`. The single source of truth for byte-identical output is the script itself; this block lists the rules in pipeline order so the current state can be re-derived without spelunking.
- Inputs: every `*.json` in `YADM_Data/` (not dotfiles), sorted by filename → one logical session per file.
- Constants: `ORIGIN = https://imd.sylva.ac`; CSV = UTF‑8 BOM + CRLF; `escapeCsvCell` strips/replaces embedded newlines in all cells.
- IDs: `userID` / `projectID` = first 6 chars of URL‑safe base64(SHA‑256(`kind + ':' + raw`)).
- “Latest” session (canonical poll titles + layout): the session with the greatest `activityMaxMs`; tie‑break by greater `sessionDateMs`.
- `getMeta(nameKey)`: prefer the latest session’s `answerNameToMeta`; else the longest `header` among sessions for that key.
- `pollSeq`: `latest.indices.pollOrder`, then append polls that appear only in older files; if an extra poll is Calories or Butterfly and Trees exists, insert it immediately after Trees, then append the remaining extras.
- Raw answer column order: walk `pollSeq`; for each poll, take ordered keys from the latest session, then from the other sessions; the Estimate poll keeps only keys whose prompts belong to the first 15 distinct prompts (unique prompt text); drop keys never answered in any session; always append butterfly calories keys from `collectButterflyNameKeysForExport` if missing (may add empty columns).
- Reorder answers: `reorderAnswersTreesButterflyTrust` → non‑trust columns split trees vs butterfly via `classifyTreesButterfly(prompt)`; then trust columns; output order is trees | butterfly | trust.
- Enrichment (mutates meta): `enrichEstimatePromptsPerKey` → `enrichEstimatePromptsBySlot` → `enrichRailroadFerrariPrompts` → `applyEstimateBoundsOverrides` → `applyTreesButterflyOverrides` (needs the butterfly linked‑number map).
- Tail moves: `reorderAnswersTailMoves` — “Compared to your peers…” → second‑last answer column; “Decision types…” → last.
- Trust columns: the first trust poll’s first two answer widgets are moved to answer indices 2–3 (after column 1).
- Layout tweak: answer columns 28–32 (1‑based, answer columns only) are moved to after column 7 (`reorderAnswersMoveCols28Through32AfterCol7`).
- Short names: `buildPollShortNames(pollSeq)`; `buildAnswerShortNames` → `Estimate-N`, `Trees-N` (title Trees), `Butterfly-N` (title Calories/Butterfly), else `PollShort-k` / suffixed; append `-legacy` if `nameKey` ∉ latest `answerOrder`.
- Scores column order: `Trust`, `Trust 2`, `DeltaTrust`, then metrics in fixed metric key order mapped to latest poll titles: Calibration (`score_ab`), Ferrari, Cards, Railroad, Anchoring, Estimate, then `TotalScore`, `Performance`. The Trust score column uses the parsed first‑trust answer; Trust 2 / DeltaTrust come from the last‑trust answer keys; other cells come from `scoresByMetric`.
- Rounding: numeric scores and derived fields → 2 decimal places; empty string if missing / non‑finite.
- Row filters: Scores / Participants — named user (first+last) and ≥1 non‑empty score after fill; Answers — same users, but a row is emitted only if ≥1 non‑empty answer; History — all sessions; `sessionDate` = calendar day of the earliest activity timestamp (never `publishedAt`); `participantCount` = roster `namedUserCount` if set, else counted named participants with any answer.
- Outputs: `answers.csv` uses five header rows + data from row 6; `meta.csv` = scalar row + blank line + `answersQuestionCatalog` table; `YADM_Data.xlsx` renames the reserved sheet History → `Session_history`; zip `YAMD_Data.zip` = the five CSVs.
- Participants columns = `firstName`, `lastName`, then `scoreColsOrdered` identical to Scores, then `reportLink`, `sessionLabel`.
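The ID rule above can be sketched in Node (a minimal sketch; the `kind + ':' + raw` concatenation follows the spec, but `shortId` is an illustrative name, not necessarily the script’s helper):

```javascript
import { createHash } from 'node:crypto';

// First 6 chars of URL-safe base64 of SHA-256(kind + ':' + raw).
// 'base64url' is Node's URL-safe base64 alphabet (- and _ instead of + and /).
function shortId(kind, raw) {
  return createHash('sha256')
    .update(`${kind}:${raw}`)
    .digest('base64url')
    .slice(0, 6);
}

// Deterministic per input: the same (kind, raw) always yields the same ID.
console.log(shortId('user', 'alice@example.com'));
```

Because the hash is keyed by `kind`, a user and a project derived from the same raw string still get different IDs.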
## Processing stages in code

The exporter (`_build-exports.mjs`) runs `main()` in twelve commented stages: load sessions → pick latest → `pollSeq` → build answer column order → trees/butterfly/trust reorder → enrich prompts → tail moves → trust columns + block move → short names & catalog → score column plan → emit rows → write files / XLSX / zip. Match those stage comments to the numbered rules above.
- Source: manually downloaded Results JSON exports (one JSON per cohort session) from the Sylva admin Results view.
- Output location on the local machine: `YADM_Data/Analytics/` (CSV/JSON/JS artifacts and the `YAMD_Data.zip` bundle).
- Sessions: 30 sessions in the current folder set.
- Participants (filtered): 994 (named participants with at least one score; see the rules below).
- Polls / scores (current canonical set): `Trust`, `Trust 2`, `Calibration`, `Ferrari`, `Cards`, `Railroad`, `Anchoring`, `Estimate`.
- Derived score fields: `DeltaTrust` and `TotalScore` (see the Scores section).
- Identifiers: `userID` and `projectID` in Answers and Scores are exported as stable 6‑character hashes (deterministic per input).
## Data structure (the five exports)

We produce five datasets for the same source sessions. Each dataset is exported as:
- CSV: for spreadsheet use and dashboard ingestion
- JSON: canonical structured form for programmatic checks
The five datasets are:
- Answers
- Scores
- Participants
- History
- Meta
## Answers

### What it is

One row per participant per session, containing the normalized answers to the YADM flow questions.
- Rows: participants who have at least one answer recorded for the session.
- Dropped columns: any question column that has no non-empty answers in any session is omitted entirely (including polls such as Decision-style when unused).
- Columns (prefix): `Index`, `userID`, `projectID`.
- Columns (answers): ordered chiefly by the latest-session flow (with the tweaks below). The Answers sheet uses five header rows — participant rows begin on row 6 (`Index`, `userID`, `projectID`, then values under each `Short_name` from row 2):
  - Column A labels each metadata row (`Index`, `Short_name`, `Prompt`, `Answer_type`); columns B–C are left blank on rows 1–4 so participant metadata does not sit under the wrong row labels.
  - Row 1 — numeric index: `1…N` for answer columns only (after the three prefix columns).
  - Row 2 — `Short_name`: compact names; Estimate, Trees, and Butterfly use `Estimate-1…`, `Trees-1…`, `Butterfly-1…` in export order (no redundant middle segment like `Trees-1-1`). Other polls use `Poll-1`, `Poll-2`, … when a poll has multiple widgets. Columns only present in older cohort JSON get a `-legacy` suffix (e.g. `Estimate-41-legacy`).
  - Row 3 — `Prompt`: full question/task text; for Estimate number inputs, `(Lower)` or `(Upper)` is appended to the paired statement (first column of a pair = Lower, second = Upper).
  - Row 4 — `Answer_type`: best-effort widget type + options/range (sliders omit step size — it varies by client and can mislead readers).
  - Row 5 — column keys: `Index`, `userID`, `projectID` in A–C only; columns D onward are left blank (question keys stay on row 2). This separates the multi-row question headers from the data block.
- Trust placement (answers): the first trust poll’s first two answer widgets (e.g. the short names under that poll’s numbering) are forced to answer column indices 2 and 3 (immediately after the first answer column, typically traps). Separate trust polls (e.g. closing trust) keep their own columns later in the sheet.
- Stable layout tweak: answer columns 28–32 (1-based among answer columns only) are moved to sit immediately after column 7 so Trees / Butterfly sit next to the early exercises for this export (column indices in `answersQuestionCatalog` follow this order).
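The five-row header layout can be illustrated with a small sketch (`buildHeaderRows` and the catalog shape are hypothetical; the real exporter derives these rows from session metadata):

```javascript
// Build the five Answers header rows from a (hypothetical) column catalog.
// Row 1: numeric index, row 2: Short_name, row 3: Prompt, row 4: Answer_type,
// row 5: column keys (Index/userID/projectID in A–C only).
// Columns B–C stay blank on rows 1–4; column A carries the row label.
function buildHeaderRows(catalog) {
  return [
    ['Index', '', '', ...catalog.map((_, i) => String(i + 1))],
    ['Short_name', '', '', ...catalog.map((c) => c.short_name)],
    ['Prompt', '', '', ...catalog.map((c) => c.prompt)],
    ['Answer_type', '', '', ...catalog.map((c) => c.answer_type)],
    ['Index', 'userID', 'projectID', ...catalog.map(() => '')],
  ];
}
```

Participant data rows then start on row 6, aligned under the same columns.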
### Notable normalization rules

- Flow alignment: we use the latest session as the canonical ordering, and append older/extra question variants if they are distinct.
- Layout `name` collisions: answer fields are keyed internally by `(item.name)` + poll card title, because the same `name` (e.g. a reused `…_numberinput_1`) can appear on more than one poll; without this, a later card would steal the label from an earlier one.
- Cards answer translation: selector choices are translated using the image filename mapping: `choice1 → 5`, `choice2 → 8`, `choice3 → red`, `choice4 → blue`.
- Estimate bounds: estimate questions are represented as two columns per question, `(Lower)` and `(Upper)`:
  - We include the current 5 questions, plus older distinct questions (when titles differ), capped at 15 total estimate questions.
  - Placeholders such as “Q9” are replaced when possible with the longest non-placeholder prompt for that answer id across exports; if that still fails, we fill from the same ordinal Lower/Upper slot (paired estimate index) using the richest prompt found in any session for that slot.
  - Number inputs are paired in flow order; an unmatched input at the end of one segment can pair with the next estimate input (carry), so prompts such as “Nile river” align with Lower/Upper pairs where possible.
- Trees vs Butterfly (four column roles: two for Trees, at least two for Butterfly):
  - Trees: “Taller or shorter” and “Height estimate” — classification looks for taller / shorter in the prompt.
  - Butterfly (legacy Calories card or Butterfly title): “More or less” and “Number estimate” — uses word-boundary more / less (so a substring match inside another word is never treated as “more/less”), or calories / apple, the Calories/Butterfly poll title, or a number input on the same poll card that follows a Butterfly direction question.
  - Trees columns are output first, then the Butterfly blocks. You may see more than two Butterfly headers when different cohorts used different direction-question wordings; each direction + number pair stays distinct.
- Railroad / Ferrari: labels prefer the long question string from any session; if only stubs like `Q6:` exist, we scrape the longest instructional text block from the poll card in the JSON and fall back to the longest prompt seen anywhere for that poll.
- No empty rows: participants with no answers in a session are excluded from `answers.csv`.
- CSV safety: all headers/cells are sanitized to remove embedded newlines and written with a UTF‑8 BOM and CRLF line endings for Excel/Sheets compatibility.
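The Trees/Butterfly direction heuristics can be sketched as ordered word-boundary checks (a simplified sketch over prompt text only; the real `classifyTreesButterfly` also consults poll titles and linked number inputs):

```javascript
// Simplified sketch of the direction-question heuristics. Checking Trees
// first, and using word boundaries for more/less, keeps substrings inside
// other words from being misread as a Butterfly direction.
function classifyPrompt(prompt) {
  const p = String(prompt).toLowerCase();
  if (/taller|shorter/.test(p)) return 'trees';
  if (/\bmore\b|\bless\b|calories|apple/.test(p)) return 'butterfly';
  return 'other';
}
```

Unmatched prompts fall through to `'other'` so unrelated polls (e.g. Ferrari) are never pulled into the Trees/Butterfly blocks.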
## Scores

### What it is

One row per participant per session, containing the computed scores for the canonical YADM polls.
- Rows: named participants with at least one score present.
- Columns: `userID`, `projectID`, then score columns titled by poll title (exact order): `Trust`, `Trust 2`, `DeltaTrust`, `Calibration`, `Ferrari`, `Cards`, `Railroad`, `Anchoring`, `Estimate`, `TotalScore`, `Performance`. `userID` and `projectID` are stable 6‑character hashes (deterministic per input).
### Normalization rules

- Column titles: derived from the latest session’s poll titles, mapping older naming changes onto the latest names.
- Order (after `Trust` / `Trust 2` / `DeltaTrust`): fixed metric key order — Calibration, Ferrari, Cards, Railroad, Anchoring, Estimate — then derived columns (see Shape); poll titles come from the latest session’s cards, but the column sequence is not the raw poll order.
- At most one score per poll (per participant).
- Rounding: scores are rounded to 2 decimal places.
- Trust: numeric answer parsed from the first trust poll’s answer widget (if missing, leave empty).
- Trust 2: numeric answer parsed from the last trust poll’s answer widget, so opening and closing trust stay distinct (if missing, leave empty).
- DeltaTrust: `Trust 2 − Trust` (empty if either is missing), placed immediately after `Trust 2`.
- TotalScore: sum of all non-trust scores: `Calibration + Ferrari + Cards + Railroad + Anchoring + Estimate` (empty if none are present; order matches the score columns).
- Performance: average of the present values among only `Calibration`, `Ferrari`, `Cards`, `Railroad`, `Anchoring`, `Estimate` (the same six as `TotalScore`; Trust and Trust 2 are not included). Empty if none of those six are present.
- No empty rows: participants without any score are excluded from `scores.csv`.
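The derived fields above can be sketched as follows (assuming metric scores arrive as numbers or are absent; `deriveScores` and `round2` are illustrative names, not the exporter’s helpers):

```javascript
// Sketch of the derived score fields. Missing / non-numeric cells are
// skipped, and a fully empty metric set yields '' (empty cell), per the spec.
const METRICS = ['Calibration', 'Ferrari', 'Cards', 'Railroad', 'Anchoring', 'Estimate'];

const round2 = (n) => Math.round(n * 100) / 100;

function deriveScores(row) {
  const present = METRICS.map((m) => row[m]).filter((v) => Number.isFinite(v));
  const sum = present.reduce((a, b) => a + b, 0);
  return {
    // DeltaTrust only when both trust values parsed.
    DeltaTrust:
      Number.isFinite(row.Trust) && Number.isFinite(row['Trust 2'])
        ? round2(row['Trust 2'] - row.Trust)
        : '',
    TotalScore: present.length ? round2(sum) : '',
    // Average over present metrics only; Trust / Trust 2 excluded.
    Performance: present.length ? round2(sum / present.length) : '',
  };
}
```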
## Participants

### What it is

A participant list for dashboard display: names, scores, and a report link placeholder, per session.
- Rows: named participants with at least one score present.
- Columns (order): `firstName`, `lastName`, the score columns (same titles and order as `scores.csv` — use `scoreColsOrdered` in the exporter), then `reportLink`, `sessionLabel`.
- Exclude missing names: participants without first+last name are excluded.
- Exclude missing scores: participants with all scores empty are excluded.
- Report link: a placeholder based on the staff Results URL with a `#participant=<userId>` anchor.
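A minimal sketch of the placeholder link; only the `ORIGIN` constant and the `#participant=<userId>` anchor come from the spec, while the `/results/<projectId>` path segment is a hypothetical stand-in for the staff Results URL:

```javascript
// ORIGIN matches the spec constant; the path is illustrative only — the real
// exporter takes the staff Results URL from the admin view.
const ORIGIN = 'https://imd.sylva.ac';

function reportLink(projectId, userId) {
  return `${ORIGIN}/results/${projectId}#participant=${userId}`;
}
```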
## History

### What it is

A per-session list for dashboards and rollups.
- Rows: one per session JSON file.
- Columns: `sessionName`, `sessionDate`, `participantCount`, `projectLink`.
- Session date: the calendar day of the earliest activity timestamp found in that session (answer/load/submit timestamps). We explicitly avoid using content `publishedAt` timestamps.
- Participant count: uses the named roster count for the session, so sessions with missing runtime data still have correct counts.
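The session-date rule can be sketched as below (assuming activity timestamps in epoch milliseconds and a UTC calendar day; the script’s exact timezone handling may differ):

```javascript
// Sketch: session date = calendar day (UTC) of the earliest activity
// timestamp. publishedAt is deliberately never consulted.
function sessionDate(activityTimestampsMs) {
  const earliest = Math.min(...activityTimestampsMs);
  return new Date(earliest).toISOString().slice(0, 10); // YYYY-MM-DD
}
```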
## Meta

### What it is

A compact summary for dashboards.

### Fields

- `totalParticipants`: number of rows in the Participants dataset
- `firstSessionDate`, `lastSessionDate`
- `numberOfAnswersRows`, `numberOfAnswersColumns`
- `numberOfScoresRows`, `numberOfScoresColumns`
- `answersQuestionCatalog`: array describing each answer column (the same content as Answers sheet rows 1–4 for columns D onward, transposed): `column_index`, `short_name`, `prompt`, `answer_type`. In `meta.csv` this appears after a blank line as a second table (`column_index`, …).
## XLSX workbook

- `YADM_Data.xlsx` is built with ExcelJS (install: `npm i exceljs` in the Analytics folder).
- Answers sheet: freeze panes after column C and row 5 (first data cell D6); rows 1–5 use light grid borders with a stronger bottom border under row 5; top-aligned, wrapped text; column widths tuned for labels vs answers.
- Meta sheet: scalar metrics as a two-column table; then a merged title row “Answers overview”; then the same `answersQuestionCatalog` columns as the second table in `meta.csv` (`column_index`, `short_name`, `prompt`, `answer_type`).
- Other sheets: freeze the header row, bold header, bottom border, wrapped top-aligned cells.
- Excel reserves the worksheet name History — that sheet is named `Session_history` in the workbook only (CSV/JSON filenames are unchanged).
## Building / refreshing the exports

The exporter lives at `Desktop/YADM_Data/Analytics/_build-exports.mjs` and reads session JSON files from the parent `YADM_Data/` folder.

Run it from the Analytics directory: `cd ~/Desktop/YADM_Data/Analytics`, then `node ./_build-exports.mjs`. This regenerates:

- `answers.csv|json|js`
- `scores.csv|json|js` (includes `TotalScore` and `Performance`)
- `participants.csv|json|js`
- `history.csv|json|js`
- `meta.csv|json|js`
- `~/Desktop/YADM_Data/Analytics/YADM_Data.xlsx` (styled workbook; Answers freeze C×5, five header rows incl. column keys, then data)
- `~/Desktop/YADM_Data/Analytics/YAMD_Data.zip` (CSV bundle, same folder as the CSVs)
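The CSV conventions used throughout (newline-free cells, UTF‑8 BOM, CRLF line endings) can be sketched as follows (a minimal sketch; the real `escapeCsvCell` may handle more cases):

```javascript
// Sketch of the CSV conventions: strip embedded newlines from every cell,
// quote and double embedded quotes when needed, join rows with CRLF, and
// prefix a UTF-8 BOM so Excel/Sheets detect the encoding.
function escapeCsvCell(value) {
  const s = String(value ?? '').replace(/\r?\n/g, ' '); // no embedded newlines
  return /[",]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

function toCsv(rows) {
  const body = rows.map((r) => r.map(escapeCsvCell).join(',')).join('\r\n');
  return '\ufeff' + body + '\r\n'; // UTF-8 BOM + CRLF endings
}
```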
Notable behaviors implemented in the exporter:
- Trust / Trust 2 (scores): `Trust` comes from the first trust poll’s answer widget; `Trust 2` comes from the last trust poll’s answer widget. If a participant didn’t answer, the cell is left empty (no computed fallback).
- Trust answers: the first trust poll’s first two widgets are moved to indices 2 and 3 among answer columns (after the first question column).
- DeltaTrust: only computed when both trust values are present.
- Performance: average of the non-trust poll scores only (`Calibration`, `Ferrari`, `Cards`, `Railroad`, `Anchoring`, `Estimate`), over the cells that are present.
- Answer column ordering (before trust placement): Trees → Butterfly → Trust polls → everything else, grouped using `classifyTreesButterfly` on each column’s prompt/title (legacy behavior). `Short_name` still uses poll card titles for `Trees-*` / `Butterfly-*` only (Trees, Calories/Butterfly), so e.g. Ferrari stays `Ferrari-*` even if a prompt matches the butterfly heuristics.
- Two explicit tail moves: “Trust - Compared to your peers, …” is forced to second last; “Decision types - How much of your job …” is forced to the very last column.
- Ferrari label: prefers the short “how much / cost / price” instruction text from the Ferrari card over longer unrelated intro text when building the header.