<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Learning, Thinking, and Coding</title>
<link>https://brianrepko.github.io/blog/</link>
<atom:link href="https://brianrepko.github.io/blog/index.xml" rel="self" type="application/rss+xml"/>
<description>Brian Repko&#39;s blog</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Fri, 08 May 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>The duality of development</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2026-05-08-duality-of-dev/</link>
  <description><![CDATA[ 





<section id="oldies-and-goodies" class="level2">
<h2 class="anchored" data-anchor-id="oldies-and-goodies">Oldies and Goodies</h2>
<p>I started this blog back in 2010 by capturing a bunch of stuff that I was saying over and over during my consulting gigs. Most of them have aged pretty well - you’re here already so check them out later and let me know.</p>
<p>This is another of those that I just realized I’ve not actually ever posted. It recently came up at <a href="https://minnestar.org/minnebar/">Minnebar20</a> in a <a href="https://sessions.minnestar.org/sessions/1922">session</a> from <a href="https://www.jlryanconsulting.com">Jamie Ryan</a> on risk management via smart small steps - so thanks to her for getting me in action here.</p>
<p>I also feel this more in an age of “AI” coding tools.</p>
</section>
<section id="two-hard-problems-in-computer-science" class="level2">
<h2 class="anchored" data-anchor-id="two-hard-problems-in-computer-science">Two Hard Problems in Computer Science</h2>
<p>So… <a href="https://martinfowler.com/bliki/TwoHardThings.html">naming things is hard</a> and I don’t know what to call these two aspects of software development. So let me start with an analogy from art.</p>
</section>
<section id="art-school" class="level2">
<h2 class="anchored" data-anchor-id="art-school">Art School</h2>
<div class="callout callout-style-default callout-caution callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Caution</span>musical reference because that is how my brain works
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/PzWAO9Xmx9Y?si=ZK5kXnqeT51ks3si" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</div>
</div>
</div>
<p>Some art is <em>additive</em> - painting for example - in that the artist keeps adding to an empty canvas.</p>
<p>Some art is <em>subtractive</em> - sculpting is the classic example - in that the artist keeps subtracting to get to the finished piece.</p>
<p>In the same sense, I think some parts of coding are “building” and some parts of coding are “engineering”.</p>
</section>
<section id="no-question-so-many-questions" class="level2">
<h2 class="anchored" data-anchor-id="no-question-so-many-questions">No Question (So Many Questions)</h2>
<div class="callout callout-style-default callout-caution callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-2-contents" aria-controls="callout-2" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Caution</span>another musical reference, because it’s all about the songs of my youth
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-2" class="callout-2-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/bKyVQid8Ch4?si=DCanoGjVAW4m6yhe" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</div>
</div>
</div>
<p>In the end, instead of defining these two aspects (I can’t even name them well) - I feel like each aspect has its set of questions.</p>
<p>On the building side, we have:</p>
<ul>
<li>What are we building?</li>
<li>How do we solve this problem / implement this?</li>
<li>How can we make this better - how to optimize?</li>
<li>Is this meeting the customer needs?</li>
<li>Is this getting used?</li>
<li>Are we doing the right job? (QA - see <a href="../2010-01-25-qa-versus-qc/">another old blog post</a>)</li>
<li>How do we teach the system - to users, to developers, to administrators, to installers?</li>
</ul>
<p>and on the engineering side, we have:</p>
<ul>
<li>What could go wrong here?</li>
<li>What are we missing here?</li>
<li>What are the risks? How to mitigate those?</li>
<li>Are we doing the job right? (QC)</li>
<li>How do we make sure the code is correct? stays correct?</li>
<li>How do we know the system as a whole works?</li>
<li>How do we make this secure?</li>
<li>How do we make this performant?</li>
</ul>
<p>There are more questions (so many questions) that could go into this. In the end, the building side is the “let’s succeed” side. The engineering side is a “let’s not fail” side.</p>
<p>Perhaps those are the names I was looking for all along.</p>
</section>
<section id="which-side-are-you-on" class="level2">
<h2 class="anchored" data-anchor-id="which-side-are-you-on">Which side are you on?</h2>
<div class="callout callout-style-default callout-caution callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-3-contents" aria-controls="callout-3" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Caution</span>last musical reference, I swear
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-3" class="callout-3-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/pl9kcJ6RRmU?si=u8meZdb3u-7seX6U" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</div>
</div>
</div>
<p>I think everyone has a natural tendency to default to one side or the other - their “strong” side. I know that mine is the “engineering” side. It’s why I’ve been so involved in testing frameworks, I think. But it’s important to grow, in any computing vocation, in both sides. Particularly, for computational biology, moving from developing scripts to developing packages (that others use) is adding to your engineering side.</p>
<p>This also feels like it applies to science in general - at least I see it in biology. And I see my default side come out in the questions I ask when reviewing a biology paper. I’m fortunate enough to have worked with some exceptional biologists that designed sets of experiments on both sides of the key scientific question (how many shRNAs are needed to reliably knock-down a gene? does it depend on transcript lengths?).</p>
<p>The other cross-cutting aspect to this is <strong>how</strong> you execute those building and engineering skills. That is “the process” for each side - and there is probably a light-weight vs heavy-weight scale to it (or just a recognition that you add weight as risk and / or team-size go up).</p>
<p>But that might be my “let’s not fail” side speaking.</p>


</section>

 ]]></description>
  <category>agile</category>
  <category>development</category>
  <guid>https://brianrepko.github.io/blog/posts/2026-05-08-duality-of-dev/</guid>
  <pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2026-05-08-duality-of-dev/yin-yang.png" medium="image" type="image/png" height="71" width="144"/>
</item>
<item>
  <title>Dataset-JSON</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2026-03-11-datasetjson-part3/</link>
  <description><![CDATA[ 





<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Series
</div>
</div>
<div class="callout-body-container callout-body">
<p>This is part 3 of a 3 part series on Dataset-JSON.</p>
<p>You can find part 1 <a href="https://brianrepko.github.io/blog/posts/2025-12-31-datasetjson-part1/" target="_blank">here</a> for an overview of Dataset-JSON.</p>
<p>You can find part 2 <a href="https://brianrepko.github.io/blog/posts/2026-01-15-datasetjson-part2/" target="_blank">here</a> for a list of concerns / issues with the Dataset-JSON specification.</p>
</div>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>CDISC workshop
</div>
</div>
<div class="callout-body-container callout-body">
<p>There is a planned Dataset-JSON Hands-On Implementation workshop on 18 May 2026 from 0900-1300 CET at the 2026 CDISC Europe Interchange conference in Milan. See <a href="https://www.cdisc.org/events/interchange/2026-cdisc-europe-interchange">the main website</a> or <a href="https://web.cvent.com/event/b32aaba1-214b-486e-8b5a-fdc8342f9794/websitePage:645d57e4-75eb-4769-b2c0-f201a0bfc6ce">here</a> for details.</p>
</div>
</div>
<section id="overview" class="level2">
<h2 class="anchored" data-anchor-id="overview">Overview</h2>
<p>As we saw in Part 1, Dataset-JSON includes multiple deliverables and one of those is a REST API. The intention for this deliverable is that, in addition to file transfer, we can also do data transfer via APIs. These types of transfers could be between sponsors and EDCs or sponsors and CROs - really many different ideas here and I’m glad CDISC is doing this forward-thinking. This kind of data transfer is similar to other specifications in the healthcare space - such as FHIR.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Long post ahead
</div>
</div>
<div class="callout-body-container callout-body">
<p>This will be a long post - partly because there are various aspects to creating a REST API that aren’t addressed in the specification or leave me with concerns. So I’m going to write up various aspects of architecting a REST API (particularly one that transfers lots of data) as it’s needed for the Dataset-JSON API discussion below.</p>
</div>
</div>
</section>
<section id="architecting-a-rest-api" class="level2">
<h2 class="anchored" data-anchor-id="architecting-a-rest-api">Architecting a REST API</h2>
<p>You can find the <a href="https://github.com/cdisc-org/DataExchange-DatasetJson-API">Dataset-JSON API repository on GitHub</a> and this includes a User’s Guide as well as the API specification as an OpenAPI specification (and an HTML rendering of that specification).</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>RFC9110 as a guide
</div>
</div>
<div class="callout-body-container callout-body">
<p>The User Guide includes a reference to the <a href="https://roy.gbiv.com/pubs/dissertation/top.htm">Fielding dissertation on REST</a>. I would also point folks to <a href="https://www.rfc-editor.org/rfc/rfc9110">RFC 9110</a> for HTTP Semantics as we go through this. The <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP">Mozilla docs</a> are good too.</p>
</div>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>FHIR as an example
</div>
</div>
<div class="callout-body-container callout-body">
<p>FHIR is a healthcare data transfer REST API - an international standard. Here are some links for a <a href="https://fhirblog.com/2026/03/09/fhir-the-important-bits/amp/">primer</a> as well as the <a href="https://hl7.org/fhir/R4/http.html">official guide</a>. The OpenAPI specs (as yaml) can be found <a href="https://healthcare.docs.wso2.com/en/latest/resources/utils/fhir-api-oas-list/">here</a>.</p>
</div>
</div>
<section id="environments" class="level3">
<h3 class="anchored" data-anchor-id="environments">Environments</h3>
<p>Not many folks think about this so I was glad to see it in the User’s Guide under “Optional Endpoints and Features”. In general, different environments can be distinguished with any of</p>
<ul>
<li>hostname (eg. dev.example.com, qa.example.com)</li>
<li>port (eg. 80 for prod, 8080 for qa, 18080 for dev)</li>
<li>path (eg. /dev, /qa, /prod)</li>
</ul>
<p>The user guide only highlights the path option - my experience has been that hostname is more common. The API specification is then based under the environment’s base URL (FHIR calls this the “Service Base URL”).</p>
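<p>As a minimal sketch of those three options, here is how a client might build the service base URL in each case. The hostnames, port, and helper names are all illustrative assumptions - none of them come from the User’s Guide.</p>
<pre><code>from urllib.parse import urlunsplit


def service_base_url(host, port=None, path=""):
    """Build a service base URL from host, optional port, and optional path prefix."""
    netloc = host if port is None else f"{host}:{port}"
    return urlunsplit(("https", netloc, path.rstrip("/"), "", ""))


def resource_url(base_url, resource):
    """Join an API resource path (such as /studies) onto a service base URL."""
    return base_url.rstrip("/") + "/" + resource.lstrip("/")


# The same QA environment expressed three ways (all values are illustrative).
by_hostname = service_base_url("qa.example.com")
by_port = service_base_url("api.example.com", port=8080)
by_path = service_base_url("api.example.com", path="/qa")

for base in (by_hostname, by_port, by_path):
    print(resource_url(base, "/studies"))</code></pre>
<p>Whichever option an implementation picks, the resource paths from the API specification stay the same - only the base URL changes.</p>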
</section>
<section id="api-versioning-vs-content-versioning" class="level3">
<h3 class="anchored" data-anchor-id="api-versioning-vs-content-versioning">API Versioning (vs content versioning)</h3>
<p>This topic is a bit more subtle and more important when APIs are not relying on HATEOAS. When an API (aka the URLs / VERBs for the resources) is “defined” (OpenAPI) vs “discovered” (HATEOAS) - you need to think about how the API will evolve. It is important to have a mechanism to either report on the version of the API (with links to other versions) and/or potentially to support multiple versions of the API in a given implementation (so that clients can move from v&lt;N&gt; to v&lt;N+1&gt;).</p>
<p>It’s also important to note that content versioning (the content of the requests and responses) can have different versions from the API itself.</p>
<p>Here as well, there are multiple ways to do this</p>
<ul>
<li>path (eg. /v1, /v2)</li>
<li>header (eg. <code>API-Version</code>)</li>
</ul>
<p>I often see path used vs header (easier to route) and again, this becomes part of the “Service (for that version) Base URL”. Also, version number for an API should just be an integer - this is not related to semantic versioning.</p>
<p>This aspect is not addressed in the User’s Guide and different implementations will need to document this as we move from version 1 to version 2 of this API.</p>
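<p>To make the path vs header choice concrete, here is a rough sketch of how a server might resolve the API version for a request. The supported versions, the default, and the function name are my assumptions for illustration - the User’s Guide doesn’t define any of this.</p>
<pre><code>import re

SUPPORTED_VERSIONS = {1, 2}  # hypothetical: this server hosts v1 and v2 side by side
DEFAULT_VERSION = 1          # hypothetical: used when the client states no version


def resolve_api_version(path, headers):
    """Return (version, remaining_path), preferring a /v{N} path prefix over a header."""
    match = re.match(r"^/v(\d+)(/.*)?$", path)
    if match:
        version, rest = int(match.group(1)), match.group(2) or "/"
    else:
        # Header names are case-insensitive, so normalize before looking up.
        lowered = {name.lower(): value for name, value in headers.items()}
        version, rest = int(lowered.get("api-version", DEFAULT_VERSION)), path
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported API version: {version}")
    return version, rest


print(resolve_api_version("/v2/studies", {}))
print(resolve_api_version("/studies", {"API-Version": "2"}))</code></pre>
<p>The path form is the easier one to route on, since a proxy can dispatch on the URL prefix without inspecting headers.</p>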
</section>
<section id="authentication-authorization-audit" class="level3">
<h3 class="anchored" data-anchor-id="authentication-authorization-audit">Authentication / Authorization / Audit</h3>
<p>The main security aspects - Authentication, Authorization, and Auditing - should be addressed by the implementations of the API. Note that authentication mechanisms might need to be separate from authorization mechanisms.</p>
<p>The Dataset-JSON API specifies a header of <code>api-key</code> on all the requests.</p>
<p>I <em>think</em> that this is for authorization - what operations is this key-holder allowed to perform (and on what studies / datasets). But this isn’t clear in the User Guide.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>TIL “X-” for custom headers is deprecated
</div>
</div>
<div class="callout-body-container callout-body">
<p>I have to admit that, when I first saw the header <code>api-key</code>, I thought that was a huge mistake. That isn’t a defined header for HTTP and I assumed it would need an <code>X-</code> in front of that. That is not the case - the leading <code>X-</code> was deprecated as a requirement - it can still be used but isn’t required.</p>
</div>
</div>
<p>There are some style concerns that I have with <code>api-key</code>. HTTP headers are case-insensitive but tend to be “defined” in Pascal case - except with hyphens between words. I would suspect that most APIs use <code>API-Key</code> officially (and have seen <code>Api-Key</code> as well).</p>
<p>The one potential issue is that many APIs will prefix the key or token with a product name. That way there won’t be an issue with multi-API coordination. Something like <code>Dataset-JSON-API-Key</code> might be better.</p>
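<p>Since header names are case-insensitive, a server should accept any capitalization. Here is a small sketch of that lookup, which also accepts the prefixed name suggested above. The prefixed header and both helper names are hypothetical - only <code>api-key</code> is in the spec.</p>
<pre><code>import hmac

# Only api-key is defined by the spec; the prefixed name is the suggestion above.
KEY_HEADER_NAMES = ("dataset-json-api-key", "api-key")


def extract_api_key(headers):
    """Find the API key regardless of how the client capitalized the header name."""
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in KEY_HEADER_NAMES:
        if name in lowered:
            return lowered[name]
    return None


def key_matches(presented, expected):
    """Compare keys in constant time to avoid leaking information through timing."""
    return presented is not None and hmac.compare_digest(presented, expected)


print(extract_api_key({"API-Key": "abc123"}))
print(extract_api_key({"api-key": "abc123"}))</code></pre>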
</section>
<section id="error-handling" class="level3">
<h3 class="anchored" data-anchor-id="error-handling">Error Handling</h3>
<p>The spec and User Guide make use of the 422 HTTP response code for all business logic errors and while that is appropriate for some situations, it isn’t for others. The FHIR guide (section 3.1.0.4.2) does a great job of listing the various HTTP status codes and when to use what. Particularly, for conditional requests or range requests - there are better, more specific status codes that could be used.</p>
<p>The User Guide has a section of various HTTP Status codes but doesn’t really address when these are to be used (other than a list of various notes at the end). It could be worth collecting this in the User Guide in a more concise manner, as well as comparing it to other APIs.</p>
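<p>To illustrate what “more specific” could look like, here is one possible mapping from situations to status codes. The situation names are made up for this sketch and the mapping is my reading of RFC 9110, not something the spec or User Guide states.</p>
<pre><code>from http import HTTPStatus

# One possible mapping, offered as an assumption; the situation names are hypothetical.
STATUS_FOR_SITUATION = {
    "study_not_found": HTTPStatus.NOT_FOUND,
    "study_already_exists": HTTPStatus.CONFLICT,
    "cached_copy_still_fresh": HTTPStatus.NOT_MODIFIED,
    "etag_mismatch_on_update": HTTPStatus.PRECONDITION_FAILED,
    "range_out_of_bounds": HTTPStatus.REQUESTED_RANGE_NOT_SATISFIABLE,
    "unsupported_accept_type": HTTPStatus.NOT_ACCEPTABLE,
    "unsupported_request_body_type": HTTPStatus.UNSUPPORTED_MEDIA_TYPE,
    "payload_fails_validation": HTTPStatus.UNPROCESSABLE_ENTITY,
}


def status_for(situation):
    """Look up a status code, falling back to 422 as the spec does today."""
    return STATUS_FOR_SITUATION.get(situation, HTTPStatus.UNPROCESSABLE_ENTITY)


for name in STATUS_FOR_SITUATION:
    status = status_for(name)
    print(name, int(status), status.phrase)</code></pre>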
</section>
<section id="mime-types-and-content-negotiation" class="level3">
<h3 class="anchored" data-anchor-id="mime-types-and-content-negotiation">MIME types and Content Negotiation</h3>
<p>The Dataset-JSON specification details three representations for a dataset - JSON, NDJSON, and DSJC - with DSJC (Dataset-JSON Compressed) defined as zlib compression on the NDJSON representation. The spec defines a MIME type only for DSJC - <code>application/vnd.cdisc.dataset-json.compressed</code>.</p>
<p>This seems wholly backwards to me - I would have suggested that JSON and NDJSON could use the MIME types</p>
<ul>
<li><code>application/vnd.cdisc.dataset-json+json</code></li>
<li><code>application/vnd.cdisc.dataset-json+ndjson</code></li>
</ul>
<p>And DSJC would be a zlib compressed NDJSON.</p>
<p>When requesting a dataset - one can make use of the <code>Accept</code> and <code>Accept-Encoding</code> headers to ask for JSON or NDJSON and potentially with various compression encodings. Something like this:</p>
<pre><code>Accept: application/vnd.cdisc.dataset-json+ndjson
Accept-Encoding: deflate, gzip</code></pre>
<p>and</p>
<pre><code>Content-Type: application/vnd.cdisc.dataset-json+ndjson
Content-Encoding: deflate</code></pre>
<p>Many APIs also allow for the use of extensions on resource URIs in order to do content negotiation. A request for a dataset with <code>.json</code>, <code>.ndjson</code>, or <code>.dsjc</code> could basically operate the same as above.</p>
<p>The API makes use of the generic <code>application/json</code> and <code>application/x-ndjson</code>. This is also where one could add Dataset-JSON version information (when we have more than just 1.1).</p>
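<p>Here is a rough sketch of that kind of negotiation on the server side. The vendor MIME types are the ones I suggested above (they are not defined by the spec), the function names are hypothetical, and the parsing is deliberately simplified. Note that the HTTP <code>deflate</code> content coding is the zlib format, so it lines up with how DSJC is defined.</p>
<pre><code>import zlib

# Hypothetical vendor MIME types, following the suggestion above (not defined by the spec).
JSON_TYPE = "application/vnd.cdisc.dataset-json+json"
NDJSON_TYPE = "application/vnd.cdisc.dataset-json+ndjson"
SUPPORTED_TYPES = (JSON_TYPE, NDJSON_TYPE)


def choose_media_type(accept_header):
    """Pick the supported type with the highest q-value, or None if nothing matches."""
    best_type, best_q = None, 0.0
    for item in accept_header.split(","):
        parts = [part.strip() for part in item.split(";")]
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        for candidate in SUPPORTED_TYPES:
            if parts[0] in (candidate, "*/*", "application/*") and q > best_q:
                best_type, best_q = candidate, q
    return best_type


def encode_body(ndjson_text, accept_encoding):
    """Return (body, content_encoding); the deflate coding is the zlib format."""
    codings = [c.split(";")[0].strip() for c in accept_encoding.split(",")]
    raw = ndjson_text.encode("utf-8")
    if "deflate" in codings:
        return zlib.compress(raw), "deflate"
    return raw, "identity"


print(choose_media_type(JSON_TYPE + ";q=0.5, " + NDJSON_TYPE))
body, coding = encode_body('{"itemGroupOID": "IG.DM"}\n', "deflate, gzip")
print(coding, len(body))</code></pre>
<p>A request that matches none of the supported types gets <code>None</code> back, which is where a 406 response would fit.</p>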
</section>
<section id="conditional-requests-and-concurrency" class="level3">
<h3 class="anchored" data-anchor-id="conditional-requests-and-concurrency">Conditional Requests and Concurrency</h3>
<p>Conditional headers include <code>If-Match</code>, <code>If-None-Match</code>, <code>If-Modified-Since</code>, <code>If-Unmodified-Since</code>. The first two make use of <code>ETag</code> values and the second two make use of <code>Last-Modified</code> values.</p>
<p>The reason for these is to (1) limit bandwidth on GETs where the client might already have a cached copy and (2) optimistic locking for creation, update, and potentially delete operations.</p>
<p>There are some references in the User Guide to the use of <code>If-Modified-Since</code>, and some of that usage is on collections (list of datasets), where the header acts as a query criterion. That’s not the intent of that header.</p>
<p>Concurrency issues (and HTTP status codes for them) are not addressed in the API User’s Guide but the read-write portion of the API will need to address this - hopefully with <code>ETag</code> and <code>If-Match</code> headers.</p>
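<p>As a sketch of how <code>ETag</code> based checks could work for both purposes, here is a simplified evaluation of <code>If-None-Match</code> on reads and <code>If-Match</code> on writes. The function names are hypothetical, hashing the body is just one common way to produce an ETag, weak validators are ignored, and requiring <code>If-Match</code> on every write (428 otherwise) is my design assumption rather than anything in the User’s Guide.</p>
<pre><code>import hashlib
from http import HTTPStatus


def make_etag(body):
    """Derive a strong ETag from the response bytes (one common approach, not mandated)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def _tags(header_value):
    """Split a comma-separated list of entity tags (weak validators are not handled)."""
    return [tag.strip() for tag in header_value.split(",")]


def evaluate_get(current_etag, if_none_match):
    """Conditional GET: skip the body when the client already holds the current version."""
    if if_none_match is not None:
        if if_none_match.strip() == "*" or current_etag in _tags(if_none_match):
            return HTTPStatus.NOT_MODIFIED
    return HTTPStatus.OK


def evaluate_update(current_etag, if_match):
    """Optimistic locking: only apply the write if the client saw the current version."""
    if if_match is None:
        return HTTPStatus.PRECONDITION_REQUIRED  # assumption: writes must be conditional
    if if_match.strip() == "*" or current_etag in _tags(if_match):
        return HTTPStatus.OK
    return HTTPStatus.PRECONDITION_FAILED


current = make_etag(b"dataset contents, version 2")
stale = make_etag(b"dataset contents, version 1")
print(int(evaluate_get(current, current)), int(evaluate_update(current, stale)))</code></pre>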
</section>
</section>
<section id="dataset-json-api" class="level2">
<h2 class="anchored" data-anchor-id="dataset-json-api">Dataset-JSON API</h2>
<p>OK! Now on to the API itself…</p>
<p>The User Guide makes clear that for conformance (still being worked on), one can implement a read-only version or a read-write version. In addition, there are optional components, which are:</p>
<ul>
<li>NDJSON (and thus DSJC?) support</li>
<li>Snapshots and Dataset versioning</li>
<li>Study “Define” data (define.xml)</li>
</ul>
<section id="about" class="level3">
<h3 class="anchored" data-anchor-id="about">About</h3>
<p>The <code>/about</code> endpoint returns (GET) information about the service. The “About” object has the following:</p>
<table class="caption-top table">
<caption>About properties</caption>
<colgroup>
<col style="width: 16%">
<col style="width: 25%">
<col style="width: 18%">
<col style="width: 40%">
</colgroup>
<thead>
<tr class="header">
<th>property</th>
<th>definition</th>
<th>attributes</th>
<th>intent</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>lastUpdated</td>
<td>string (date-time)</td>
<td>required</td>
<td>timestamp for last release??</td>
</tr>
<tr class="even">
<td>author</td>
<td>string (uri)</td>
<td>required</td>
<td>website for organization??</td>
</tr>
<tr class="odd">
<td>repo</td>
<td>string (uri)</td>
<td>required</td>
<td>why would I share this??</td>
</tr>
<tr class="even">
<td>links</td>
<td>array-of-objects</td>
<td>required (0+)</td>
<td>links to various features??</td>
</tr>
<tr class="odd">
<td>link.name</td>
<td>string</td>
<td>required</td>
<td>name of link (any stds?)??</td>
</tr>
<tr class="even">
<td>link.href</td>
<td>string</td>
<td>required</td>
<td>URL under service base URL</td>
</tr>
</tbody>
</table>
<p>I struggle with what the intent of this object and its properties is. The <code>links</code> property feels like HATEOAS.</p>
<p>The version of the API (as mentioned above) will be in the <code>/docs</code> resource generated from the OpenAPI spec. That said, the About object could represent a machine-readable version of that information. The list of capabilities (read-write, snapshots, define, ndjson) could be here but again are available from the spec. That said, NDJSON support should be handled via content negotiation and as an optional component, could be specified here as supported or not.</p>
</section>
<section id="studies" class="level3">
<h3 class="anchored" data-anchor-id="studies">Studies</h3>
<p>There are three study-related objects - <code>Studies</code>, <code>Study</code>, <code>StudyRequest</code>. Studies are identified by the <code>studyOID</code> value.</p>
<table class="caption-top table">
<caption>Study operations</caption>
<colgroup>
<col style="width: 9%">
<col style="width: 25%">
<col style="width: 17%">
<col style="width: 21%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Verb</th>
<th>URL</th>
<th>Request</th>
<th>Response</th>
<th>Operation</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>GET</td>
<td>/studies</td>
<td>-</td>
<td>array-of-Studies</td>
<td>Get list of studies</td>
</tr>
<tr class="even">
<td>POST</td>
<td>/studies</td>
<td>StudyRequest</td>
<td>Study</td>
<td>Create a study</td>
</tr>
<tr class="odd">
<td>GET</td>
<td>/studies/{studyOID}</td>
<td>-</td>
<td>Study</td>
<td>Get a single study</td>
</tr>
<tr class="even">
<td>PUT</td>
<td>/studies/{studyOID}</td>
<td>StudyRequest</td>
<td>Study</td>
<td>Update a study</td>
</tr>
<tr class="odd">
<td>DELETE</td>
<td>/studies/{studyOID}</td>
<td>-</td>
<td>-</td>
<td>Delete a study</td>
</tr>
</tbody>
</table>
<p>As for the study-related objects - here are the properties and where they are used:</p>
<table class="caption-top table">
<caption>Study-related properties</caption>
<colgroup>
<col style="width: 23%">
<col style="width: 25%">
<col style="width: 19%">
<col style="width: 9%">
<col style="width: 7%">
<col style="width: 14%">
</colgroup>
<thead>
<tr class="header">
<th>property</th>
<th>definition</th>
<th>attributes</th>
<th>Studies</th>
<th>Study</th>
<th>StudyRequest</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>studyOID</td>
<td>string</td>
<td>required</td>
<td>X</td>
<td>X</td>
<td>X</td>
</tr>
<tr class="even">
<td>name</td>
<td>string</td>
<td>required</td>
<td>X</td>
<td>X</td>
<td>X</td>
</tr>
<tr class="odd">
<td>label</td>
<td>string</td>
<td>required</td>
<td>X</td>
<td>X</td>
<td>X</td>
</tr>
<tr class="even">
<td>href</td>
<td>string</td>
<td>required</td>
<td>X</td>
<td>X</td>
<td>?</td>
</tr>
<tr class="odd">
<td>standards</td>
<td>array-of-enums</td>
<td>optional/nullable</td>
<td>X</td>
<td>X</td>
<td>X</td>
</tr>
<tr class="even">
<td>studyCreationDateTime</td>
<td>string (date-time)</td>
<td>optional</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
<tr class="odd">
<td>datasets</td>
<td>array-of-StudyDataset</td>
<td>optional/nullable</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr class="even">
<td>metaDataRef</td>
<td>array-of-MetaDataRef</td>
<td>optional/nullable</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
<tr class="odd">
<td>snapshots</td>
<td>array-of-Snapshot</td>
<td>optional/nullable</td>
<td></td>
<td>X</td>
<td></td>
</tr>
</tbody>
</table>
<section id="standards" class="level4">
<h4 class="anchored" data-anchor-id="standards">Standards</h4>
<p>The <code>standards</code> values are listed as <code>sendig</code>, <code>sdtmig</code>, <code>adamig</code>, and <code>other</code>. For a pre-clinical study you may have datasets using SEND (ADaM is not always used for pre-clinical studies). For a clinical (drug / biologic) study you may have datasets for both SDTM and ADaM. For a clinical (device) study you may only have ADaM datasets (device submissions don’t have to use SDTM). I am assuming this list of values on the Study will limit the values used on the datasets under the study.</p>
</section>
<section id="studydataset-metadataref-and-snapshot" class="level4">
<h4 class="anchored" data-anchor-id="studydataset-metadataref-and-snapshot">StudyDataset, MetaDataRef, and Snapshot</h4>
<p>StudyDatasets are a core part of the API and we cover that next. Snapshot and MetaDataRef are both part of the optional “Study-Snapshots” and “Define” capabilities.</p>
<p>For the most part, this is a standard REST API for a simple object. There are some obvious business rules to consider - creating a study with an existing <code>studyOID</code>, updating a study where the <code>studyOID</code> in the body doesn’t match the one in the URL, deleting a study that doesn’t exist, or what to do if you can’t delete all the datasets for a given study (assuming cascading deletes). This information would add value to the User’s Guide.</p>
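<p>To show the kind of answers I’d hope to find in the User’s Guide, here is a toy in-memory version of those rules. The status code choices are my assumptions (the spec currently leans on 422 for business logic errors), the function names are hypothetical, and cascading deletes are left out.</p>
<pre><code>from http import HTTPStatus

studies = {}  # in-memory stand-in for a real study store, keyed by studyOID


def create_study(body):
    """POST /studies: reject a studyOID that already exists."""
    if body["studyOID"] in studies:
        return HTTPStatus.CONFLICT
    studies[body["studyOID"]] = body
    return HTTPStatus.CREATED


def update_study(url_oid, body):
    """PUT /studies/{studyOID}: the body and the URL must name the same study."""
    if url_oid not in studies:
        return HTTPStatus.NOT_FOUND
    if body["studyOID"] != url_oid:
        return HTTPStatus.UNPROCESSABLE_ENTITY
    studies[url_oid] = body
    return HTTPStatus.OK


def delete_study(url_oid):
    """DELETE /studies/{studyOID}: deleting a missing study is reported, not ignored."""
    if url_oid not in studies:
        return HTTPStatus.NOT_FOUND
    del studies[url_oid]
    return HTTPStatus.NO_CONTENT


study = {"studyOID": "ST.001", "name": "Study 1", "label": "Example study"}
print(int(create_study(study)), int(create_study(study)))
print(int(update_study("ST.002", study)), int(delete_study("ST.001")))</code></pre>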
<p>I also should look at the ODM-XML definition of a Study or some of the other CDISC REST APIs to verify this object definition (or the intended content of these properties). In particular, the <code>href</code> property feels like HATEOAS (does this point to the same URL - is this like “_id”?).</p>
<p>There are a few issues when looking at the JSON OpenAPI file.</p>
<ul>
<li>There is a default value on <code>studyCreationDateTime</code> of <code>2025-11-17T20:09:33.999933Z</code> which seems pretty magical</li>
<li>There is a missing <code>href</code> definition on StudyRequest even though it is marked as required</li>
<li>There is an empty string value option in the <code>standards</code> property for StudyRequest</li>
</ul>
<p>This is also missing a property for study last modified timestamp - which is used in various other operations within the API. This also raises the question of why the <code>standard</code> isn’t part of the Dataset-JSON dataset metadata.</p>
</section>
</section>
<section id="study-datasets" class="level3">
<h3 class="anchored" data-anchor-id="study-datasets">Study-Datasets</h3>
<p>This is the core part of the API - how to get Study Datasets as Dataset-JSON objects. Part of the challenge with designing a REST API for a Dataset-JSON object is that a Dataset-JSON object contains multiple concepts - dataset metadata, column metadata, and rows of data. The REST API could have considered each of these as their own sub-resources but instead makes use of query parameters.</p>
<p>There are three Study-Dataset-related objects - StudyDataset, DatasetJson, RowData. Study-Datasets are identified by the <code>studyOID</code> and <code>itemGroupOID</code> (called <code>datasetOID</code> below) values. This concept always relates to the <strong>LATEST</strong> version of the dataset.</p>
<table class="caption-top table">
<caption>Study-Dataset operations</caption>
<colgroup>
<col style="width: 6%">
<col style="width: 32%">
<col style="width: 17%">
<col style="width: 18%">
<col style="width: 26%">
</colgroup>
<thead>
<tr class="header">
<th>Verb</th>
<th>URL</th>
<th>Request</th>
<th>Response</th>
<th>Operation</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>GET</td>
<td>/studies/{studyOID}/datasets</td>
<td>-</td>
<td>array-of-StudyDatasets</td>
<td>Get list of datasets for a study</td>
</tr>
<tr class="even">
<td>POST</td>
<td>/studies/{studyOID}/datasets</td>
<td>standard, DatasetJson</td>
<td>StudyDataset</td>
<td>Create a dataset within the study</td>
</tr>
<tr class="odd">
<td>GET</td>
<td>/studies/{studyOID}/datasets/{datasetOID}</td>
<td>-</td>
<td>DatasetJson</td>
<td>Get a single study-dataset</td>
</tr>
<tr class="even">
<td>PUT</td>
<td>/studies/{studyOID}/datasets/{datasetOID}</td>
<td>standard, DatasetJson</td>
<td>StudyDataset</td>
<td>Update a study-dataset</td>
</tr>
<tr class="odd">
<td>PATCH</td>
<td>/studies/{studyOID}/datasets/{datasetOID}</td>
<td>RowData</td>
<td>StudyDataset</td>
<td>Append records to a study-dataset</td>
</tr>
<tr class="even">
<td>DELETE</td>
<td>/studies/{studyOID}/datasets/{datasetOID}</td>
<td>-</td>
<td>-</td>
<td>Delete a study-dataset</td>
</tr>
</tbody>
</table>
<p>Adding to the operations above</p>
<ul>
<li>Getting the list of datasets for a study can be filtered by <code>standard</code> and last modified date (with <code>If-Modified-Since</code> header)</li>
<li>Getting a single dataset can be limited with use of <code>metadataonly</code> or <code>dataonly</code> query parameters</li>
<li>Getting a single dataset can be limited with use of <code>offset</code> and <code>limit</code> query parameters</li>
</ul>
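<p>As a rough sketch of how a client might put those filters together - the base URL, the default page size, and the <code>"true"</code> encoding of the boolean flag are all my assumptions and not something the specification states:</p>

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.parse import urlencode

# Hypothetical base URL - not part of the specification.
BASE_URL = "https://example.org/api"


def dataset_list_request(study_oid, standard=None, modified_since=None):
    """Return (url, headers) for listing the datasets of a study."""
    url = f"{BASE_URL}/studies/{study_oid}/datasets"
    if standard is not None:
        url += "?" + urlencode({"standard": standard})
    headers = {}
    if modified_since is not None:
        # HTTP dates are always expressed in GMT.
        headers["If-Modified-Since"] = format_datetime(
            modified_since.astimezone(timezone.utc), usegmt=True
        )
    return url, headers


def dataset_page_url(study_oid, dataset_oid, offset=0, limit=1000):
    """Return the URL for one page of a single study-dataset."""
    query = urlencode({"offset": offset, "limit": limit})
    return f"{BASE_URL}/studies/{study_oid}/datasets/{dataset_oid}?{query}"


def dataset_metadata_url(study_oid, dataset_oid):
    """Return the URL for the metadata-only view (flag encoding is assumed)."""
    query = urlencode({"metadataonly": "true"})
    return f"{BASE_URL}/studies/{study_oid}/datasets/{dataset_oid}?{query}"
```

<p>The OIDs are dropped into the path as-is here, so this only works for OIDs without URL-special characters.</p>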
<p>Again, this would be a much better API if <code>standard</code> was part of the Dataset-JSON metadata.</p>
<p>As for the Study-Dataset-related objects - here are the properties and where they are used:</p>
<table class="caption-top table">
<caption>Study-Dataset-related properties</caption>
<colgroup>
<col style="width: 26%">
<col style="width: 22%">
<col style="width: 17%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 8%">
</colgroup>
<thead>
<tr class="header">
<th>property</th>
<th>definition</th>
<th>attributes</th>
<th>StudyDataset</th>
<th>DatasetJson</th>
<th>RowData</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>itemGroupOID</td>
<td>string</td>
<td>required</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
<tr class="even">
<td>name</td>
<td>string</td>
<td>required</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
<tr class="odd">
<td>label</td>
<td>string</td>
<td>required</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
<tr class="even">
<td>href</td>
<td>string</td>
<td>required</td>
<td>X</td>
<td></td>
<td></td>
</tr>
<tr class="odd">
<td>records</td>
<td>integer</td>
<td>-varies-</td>
<td>X/required</td>
<td>X (null=0)</td>
<td></td>
</tr>
<tr class="even">
<td>standard</td>
<td>string - enum</td>
<td>optional/nullable</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
<tr class="odd">
<td>datasetJSONCreationDateTime</td>
<td>string (date-time)</td>
<td>optional/nullable</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
<tr class="even">
<td>datasetJSONVersion</td>
<td>string - enum</td>
<td>required</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr class="odd">
<td>studyOID</td>
<td>string</td>
<td>required</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr class="even">
<td>fileOID</td>
<td>string</td>
<td>optional/nullable</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr class="odd">
<td>dbLastModifiedDateTime</td>
<td>string (date-time)</td>
<td>optional</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr class="even">
<td>originator</td>
<td>string</td>
<td>optional/nullable</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr class="odd">
<td>sourceSystem</td>
<td>SourceSystem</td>
<td>optional</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr class="even">
<td>metaDataVersionOID</td>
<td>string</td>
<td>optional/nullable</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr class="odd">
<td>metaDataRef</td>
<td>string</td>
<td>optional/nullable</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr class="even">
<td>columns</td>
<td>array-of-Columns</td>
<td>required</td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr class="odd">
<td>rows</td>
<td>array-of-arrays</td>
<td>optional</td>
<td></td>
<td>X</td>
<td>X</td>
</tr>
</tbody>
</table>
<p>I won’t go through <code>SourceSystem</code> and <code>Column</code> as those match the Dataset-JSON specification (with the exception of the use of JSON schema <code>format</code> in the API vs <code>pattern</code> in the Dataset-JSON specification).</p>
<p>We’ve discussed the enumerated values for <code>standard</code> already. For <code>datasetJSONVersion</code>, this list is <code>1.1</code>, and <code>1.1.0</code> through <code>1.1.5</code>. Frankly, I don’t know what changed in the various <code>1.1.x</code> releases - and how to deal with versions that my implementation can’t handle.</p>
<p>I don’t get what <code>href</code> is supposed to be here (again, it feels like “_id” in HATEOAS). There are again obvious business rule violations that it would be useful for the User Guide to go over.</p>
<p>I also question the use of both <code>metadataonly</code> and <code>dataonly</code> (what happens if I set both to true or both to false?) vs the use of sub-resource URLs or potentially a <code>format</code> parameter with values of <code>metadata</code> and <code>data</code>, with a default of both. I’m not sure how to represent a <code>dataonly</code> DatasetJson object given the required property attributes.</p>
<p>There is a whole section on “API Identifiers” that cautions against using HTTP/HTML special characters in your <code>studyOID</code> or <code>itemGroupOID</code>, as opposed to URL-encoding them.</p>
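<p>If OIDs with special characters do end up in a URL, percent-encoding each one as a single path segment is one way to handle it. A minimal sketch - the helper name is mine and the example OID in the note is made up:</p>

```python
from urllib.parse import quote, unquote


def dataset_path(study_oid, dataset_oid):
    """Build the Study-Dataset path with each OID percent-encoded as one segment."""
    # safe="" also encodes "/", which would otherwise split the path segment.
    return f"/studies/{quote(study_oid, safe='')}/datasets/{quote(dataset_oid, safe='')}"


def decode_oid(segment):
    """Recover the original OID from an encoded path segment."""
    return unquote(segment)
```

<p>With this, a (made-up) <code>studyOID</code> of <code>ST/01 A</code> becomes the single path segment <code>ST%2F01%20A</code> and decodes back to the original value on the server side.</p>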
<p>From what I can tell, updating the <code>standard</code> (actually any property) - requires getting and resending the entire dataset. It also seems odd that I could send a <code>datasetJSONCreationDateTime</code> in the POST request to create the object. One could argue that creation of objects could be done with a PUT on the resource - update if it exists and create if not.</p>
<p>Also, as mentioned in part 2, you can’t really properly validate the data coming in on a create or update (or append) operation without the full define.xml data. And thus in order to do that - given a <code>metaDataRef</code> - you will need to get that data. These can be costly operations for a synchronous network call - I wonder if a better API could have been a more asynchronous task-based one. You are basically initiating a data pipeline job - which is then a resource of its own (and can include a callback-URL).</p>
<p>I actually wonder if better filters will be needed - even something like data for a given <code>USUBJID</code>.</p>
<p>Similar to the Study object definitions, there are some issues with these objects as well in the OpenAPI json file:</p>
<ul>
<li>The <code>datasetJSONCreationDateTime</code> has a default of <code>2025-11-17T20:09:33.993457Z</code></li>
</ul>
</section>
</section>
<section id="optional-api-capabilities" class="level2">
<h2 class="anchored" data-anchor-id="optional-api-capabilities">Optional API Capabilities</h2>
<p>Again, the “write” part of this API is considered optional. Let’s look at some of the other optional API capabilities.</p>
<section id="ndjson-support" class="level3">
<h3 class="anchored" data-anchor-id="ndjson-support">NDJSON support</h3>
<p>I’ve mentioned the potential use of content negotiation for NDJSON support above. Looking at the OpenAPI file and the User’s Guide, there seem to be two endpoints for getting a dataset as NDJSON.</p>
<p>The sub-resource <code>$export</code> and the sub-resource <code>/ndjson</code> on the Study-Dataset URL can return the Dataset-JSON object as NDJSON. I think the idea here is to call the first, which creates a job and returns a <code>Location</code> header (I’m assuming) to get the data. The second sub-resource is then an example of where the data might be - and if called early returns a 202.</p>
<p>I don’t understand why processing the return of the NDJSON version of a dataset would take any longer than the JSON version of the dataset. This whole section and use case don’t make sense to me.</p>
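<p>To make that point concrete, here is a sketch of turning a Dataset-JSON-like object into NDJSON. I’m assuming the NDJSON layout is the metadata object on the first line followed by one row array per line - the function name is mine:</p>

```python
import json


def to_ndjson(dataset):
    """Serialize a Dataset-JSON-like dict as NDJSON text."""
    # First line: everything except the rows.
    metadata = {key: value for key, value in dataset.items() if key != "rows"}
    lines = [json.dumps(metadata)]
    # Remaining lines: one JSON array per row.
    lines.extend(json.dumps(row) for row in dataset.get("rows", []))
    return "\n".join(lines) + "\n"
```

<p>It is the same data and a single pass over the rows - if anything, NDJSON lends itself to streaming better than one large JSON document.</p>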
</section>
<section id="study-snapshots-and-thus-dataset-versioning" class="level3">
<h3 class="anchored" data-anchor-id="study-snapshots-and-thus-dataset-versioning">Study-Snapshots (and thus dataset versioning)</h3>
<p>I was thrilled to see this added but am worried that this brings up other concerns. Most back-end systems for clinical trial data have both source code <em>AND DATA</em> version control. All data updates create a new version (integer counter or commit hash), and at given points, the data in a study can get tagged. My experience is that this has been used to tag versions of SDTM or ADaM datasets used in various versions of a CSR. The labels will be things like “csr_1” or “csr_final” but again can be anything. There are times that the label will need to be moved to a different version of a dataset - rare but done under GXP processes. It is literally a tag across the dataset versions within a study.</p>
<p>But this also makes me realize that there is nothing in Dataset-JSON or this API that deals with versions of data. There is only a timestamp. Which is important - but also can be related to a version that is an integer or commit hash. I’d want to see what dataset versions the label has been applied to across the study. I’d also want to see - for a given version of the dataset - all the labels it might have.</p>
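<p>Those two lookups are easy to picture as a small index. This is purely a hypothetical in-memory sketch of the idea - none of these names come from the API:</p>

```python
from collections import defaultdict


class SnapshotLabels:
    """Hypothetical in-memory index of labels applied to dataset versions."""

    def __init__(self):
        # label -> {dataset OID: version}; a label points at one version per dataset.
        self._by_label = defaultdict(dict)

    def tag(self, label, dataset_oid, version):
        """Apply (or move) a label to a given version of a dataset."""
        self._by_label[label][dataset_oid] = version

    def versions_for(self, label):
        """All dataset versions a label has been applied to across the study."""
        return dict(self._by_label.get(label, {}))

    def labels_for(self, dataset_oid, version):
        """All labels attached to one specific version of a dataset."""
        return sorted(
            label
            for label, versions in self._by_label.items()
            if versions.get(dataset_oid) == version
        )
```

<p>Moving a label is just calling <code>tag</code> again with a different version - in a real system that move would need an audit trail.</p>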
<p>I’m not going to go through this part of the API as I did above - I don’t think it’s worth it at this point.</p>
</section>
<section id="study-define-define.xml" class="level3">
<h3 class="anchored" data-anchor-id="study-define-define.xml">Study “Define” (define.xml)</h3>
<p>This part of the API allows you to get the define.xml as a string for the SDTM or ADaM subset of your study. I don’t know why they return the XML as a string here - versus using a multi-part response where one part could have a MIME type of <code>application/xml</code>. I also think that we will want to support something like Define-JSON when that is ready.</p>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>I’m not gonna lie - this part of the Dataset-JSON deliverables was pretty underwhelming to me. I get the value in this but really think this needs both more brains on it to go through scenarios and use-cases as well as more software engineering chops to create a production-ready spec. I get that there will be very slow up-take for the API and there will be future changes. As I said in part 2, I’m happy to help - this isn’t just to criticize. I just don’t know how best to do that.</p>
<p>This ends my review of the Dataset-JSON deliverables from CDISC. I hope you find value in this. As always, you can contact me via LinkedIn, Mastodon, email, or phone - it’s all there at the top of this page or on <a href="https://www.learnthinkcode.com">my webpage</a>.</p>


</section>

 ]]></description>
  <category>pharmaverse</category>
  <category>data</category>
  <guid>https://brianrepko.github.io/blog/posts/2026-03-11-datasetjson-part3/</guid>
  <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2026-03-11-datasetjson-part3/datasetjson.png" medium="image" type="image/png" height="72" width="144"/>
</item>
<item>
  <title>edgePython and other AI thoughts</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2026-02-26-edgepython-and-ai/</link>
  <description><![CDATA[ 





<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>I am an “AI” luddite
</div>
</div>
<div class="callout-body-container callout-body">
<p>Much like my father retired just before the advent of personal computers, I retired just before the advent of good GenAI tooling. So I haven’t used it ever - though I’ve made use of transformer models (biological data transformation) and am familiar with data science (as it was my career for 15 years) and software engineering (my whole career).</p>
</div>
</div>
<section id="thoughts-of-ai-dancing-in-my-head" class="level2">
<h2 class="anchored" data-anchor-id="thoughts-of-ai-dancing-in-my-head">Thoughts of “AI” dancing in my head</h2>
<p>This last week I received a link to the <a href="https://shumer.dev/something-big-is-happening">Matt Shumer viral post</a>. I also found some posts that were critical of it - my favorite is <a href="https://medium.com/ai-software-engineer/i-took-matt-shumer-ai-article-with-a-grain-of-salt-here-are-my-reasons-d553a6b1b9e2">here</a> - much of which I agree with. Code, as a subset of text, is so formalized that it makes sense to me that the generation of code will be a sweet spot for generative “AI” tools. This isn’t new - these tools existed prior to making use of LLMs for them. After all, for the youngins, <a href="https://en.wikipedia.org/wiki/Model-driven_architecture">model-driven architecture</a> was a thing 25 years ago.</p>
<p>But with the “AI is amazing” blog dancing in my head - I got hit by <code>{edgePython}</code>.</p>
</section>
<section id="edgepython" class="level2">
<h2 class="anchored" data-anchor-id="edgepython">edgePython</h2>
<p>Within bioinformatics, the <a href="https://bioconductor.org/packages/release/bioc/html/edgeR.html"><code>{edgeR}</code> package</a> is one of the main packages used for differential expression analysis. It is an R package that is part of the <a href="https://bioconductor.org">BioConductor</a> ecosystem - a large collection of R packages for bioinformatics. BioConductor is one of the main reasons that bioinformaticians and computational biologists learn R. There is an ever-ongoing language discussion - R vs Python - that exists in this space and BioC is a <em>HUGE</em> plus on the R side of the scale. (It’s not a scale - you should learn both).</p>
<p>So last week, Lior Pachter posted <a href="https://liorpachter.wordpress.com/2026/02/19/the-quickening/">The Quickening</a> about his work creating a re-implementation or port of <code>{edgeR}</code> from R (and C) to Python - called <code>{edgePython}</code>. You can find links to the <a href="https://github.com/pachterlab/edgePython">code</a> as well as a <a href="https://www.biorxiv.org/content/10.64898/2026.02.16.706223v2">preprint about the work</a> in the blog. In particular, for me, the ability to work with AnnData files and proper kallisto support was like - DAMN! - this is amazing.</p>
<p>In the end, I started wondering if all of BioConductor could be ported to Python - or perhaps Julia or Rust for pure language implementations that also cover the optimized parts written in C, C++, or Fortran.</p>
<p>And then I wondered about who would own it…</p>
</section>
<section id="open-source-copyright-and-licensing" class="level2">
<h2 class="anchored" data-anchor-id="open-source-copyright-and-licensing">Open Source Copyright and Licensing</h2>
<p>When you write code, you own the copyright to it. Most coders work for someone and part of their employment agreement is that they give up that copyright to their employer. For open source projects, they all have (or SHOULD HAVE) a license - which delineates what others can do with the code - but copyright is still owned by the authors. Some open source projects have contributor agreements under which you also give up copyright to the project. Where this comes into play is that <strong>in order to change the license, all the copyright owners need to agree to the change</strong>.</p>
<p>So - <code>{edgeR}</code> - has a GPL v2 (or higher) license - and <code>{edgePython}</code>, as a port, is a derivative work of <code>{edgeR}</code>. But if <code>{edgePython}</code> was generated with a GenAI tool (in this case it was both Claude and Codex), my understanding is that there is no new copyright holder. The <code>{edgeR}</code> copyright owners are still the copyright owners of <code>{edgePython}</code> - technically for method signatures (?) - but not for any of the actual code.</p>
<p>The license for <code>{edgePython}</code> was chosen to be GPL v3 - which is allowed on a derivative work with a GPL v2 (or higher) license. In my experience, Lior Pachter always gets it correct, but I think it will be an interesting scenario - one that will happen often - for copyright lawyers to work through.</p>
<p>And as coincidence would have it, rOpenSci just posted on their take of the use of GenAI tools with their packages.</p>
</section>
<section id="scientific-open-source-groups-are-on-it" class="level2">
<h2 class="anchored" data-anchor-id="scientific-open-source-groups-are-on-it">Scientific Open Source groups are on it</h2>
<p>Today, rOpenSci blogged about their <a href="https://ropensci.org/blog/2026/02/26/ropensci-ai-policy/">draft “AI” policy</a>. In it, they reference both the policies from the <a href="https://blog.joss.theoj.org/2026/01/preparing-joss-for-a-generative-ai-future">Journal of Open Source Software</a> as well as <a href="https://www.pyopensci.org/blog/generative-ai-peer-review-policy.html">pyOpenSci</a>.</p>
<p>These guides do make suggestions about what to document around the use of the GenAI tooling. To his credit, the <code>README.md</code> file of <code>{edgePython}</code> has all of that (with the preprint doing the heavy lifting). Again, Lior Pachter always gets it correct.</p>
<p>The pyOpenSci policy discusses one scenario that I’d not thought about. There will be issues with understanding what code went into training these models, not knowing what the license is for that training data, and whether the generated code might be in violation of the license of the code in the training data (for example, missing attribution).</p>
<p>As much as I love the idea of porting bioinformatics code from one language to another - I’m not sure how to get around the scenario that pyOpenSci highlights. They, and rOpenSci, are correctly pushing this type of review onto the contributors but I’m not sure that tooling exists for finding all the license requirements of the input code related to your output code (when it is the same).</p>
</section>
<section id="teamwork-makes-the-dream-work" class="level2">
<h2 class="anchored" data-anchor-id="teamwork-makes-the-dream-work">Teamwork makes the dream work</h2>
<p>In the <code>{edgePython}</code> blog post, one of the comments made about <code>{edgeR}</code> is that it is a complex code base maintained and improved over 16 years. What would be interesting to me is not only how to make the port to Python but whether the tools could also create a description or documentation of the new package. There isn’t anything in the <code>docs</code> folder of <code>{edgePython}</code> - just an <code>api.yaml</code> file whose usage I’m not sure of. I’m partial to the <a href="https://en.wikipedia.org/wiki/4%2B1_architectural_view_model">4+1 model</a> of describing software - it would be amazing to see if tools can generate that.</p>
<p>Documentation is important for team work - how do we talk about the system and the code (and the product) that we are making. Having that model in your head is key to high performing software teams. Much of the “AI is awesome” work is done by <strong>teams-of-1</strong>. I struggle with the parts of software engineering that are needed for <strong>teams-of-n</strong>. I think that is what JOSS, pyOpenSci, and rOpenSci are starting to discuss - particularly around code review. Code review is essentially a part of - <em>does this change require an update to the model we all have in our heads? and how best to make that happen?</em></p>
<p>And that was my additional criticism of the “Something Big is Happening” post. There is increasing value in these tools - they are getting better at code generation - but that is only a part of the process and skill of product / software development.</p>


</section>

 ]]></description>
  <category>bioinformatics</category>
  <category>ai</category>
  <guid>https://brianrepko.github.io/blog/posts/2026-02-26-edgepython-and-ai/</guid>
  <pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2026-02-26-edgepython-and-ai/ai.png" medium="image" type="image/png" height="74" width="144"/>
</item>
<item>
  <title>Dataset-JSON</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2026-01-15-datasetjson-part2/</link>
  <description><![CDATA[ 





<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Series
</div>
</div>
<div class="callout-body-container callout-body">
<p>This is part 2 of a series on dataset-json, you can find part 1 <a href="https://brianrepko.github.io/blog/posts/2025-12-31-datasetjson-part1/" target="_blank">here</a> for an overview of dataset-json.</p>
</div>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>CDISC workshop
</div>
</div>
<div class="callout-body-container callout-body">
<p>There is a planned Dataset-JSON Hands-On Implementation workshop on 18 May 2026 from 0900-1300 CET at the 2026 CDISC Europe Interchange conference in Milan. See <a href="https://www.cdisc.org/events/interchange/2026-cdisc-europe-interchange">the main website</a> or <a href="https://web.cvent.com/event/b32aaba1-214b-486e-8b5a-fdc8342f9794/websitePage:645d57e4-75eb-4769-b2c0-f201a0bfc6ce">here</a> for details.</p>
</div>
</div>
<section id="intention" class="level2">
<h2 class="anchored" data-anchor-id="intention">Intention</h2>
<p>Similar to how art can be additive (painting) or subtractive (sculpting), computational work can be positive (“how do I get to a result”) or negative (“what could go wrong here”). Most of us develop in both ways of working - and we tend to have a default / stronger side. Mine is the negative - which is why I think I’ve always focused on risk with systems development. I mention this because this blog post falls 100% in the “what could go wrong here” category. This is me going through a specification and calling out anything that feels risky. I hope this spawns conversations and github issues and wiki updates and PRs. This is me also calling out where I’d love to contribute - not just complain. I’ll also add that I am not a SAS programmer so I may well get some items wrong here.</p>
<p>From part 1, here is the <a href="https://cdisc-org.github.io/DataExchange-DatasetJson/doc/dataset-json1-1.html">link to the html version of the spec</a> - so let’s dive in…</p>
</section>
<section id="background---xml-and-json-technologies" class="level2">
<h2 class="anchored" data-anchor-id="background---xml-and-json-technologies">Background - XML and JSON technologies</h2>
<p>Within the XML set of specifications and technologies - we have the following:</p>
<ul>
<li><a href="https://www.w3.org/TR/xml/">XML</a> - data format</li>
<li><a href="https://www.w3.org/TR/xmlschema11-2/">XML Schema</a> - structure and data types with simple validation and constraints</li>
<li><a href="https://schematron.com">Schematron</a> - rule-based content validation, complex relationships, and business logic</li>
</ul>
<p>For JSON we have a similar set of specifications and technologies:</p>
<ul>
<li><a href="https://www.json.org/json-en.html">JSON</a> - data format</li>
<li><a href="https://json-schema.org">JSON schema</a> - structure and data types with simple validation and constraints</li>
<li><a href="https://amer-ali.github.io/jsontron/">jsontron</a> - a port of Schematron for JSON (not widely used)</li>
</ul>
<p>The reason that I want to bring this up is that some of the issues below relate to “where” something belongs and potentially having it in the wrong space.</p>
<p>Also, I think that the use of jsontron would be interesting in this space potentially.</p>
<section id="data-types" class="level3">
<h3 class="anchored" data-anchor-id="data-types">Data Types</h3>
<p>JSON, JSON schema, and dataset-json work with a set of data types. Note that JSON / JSON schema are working on object / property definition and dataset-json is column definition. Here is a table that goes over what is available for each specification in relation to each other:</p>
<table class="caption-top table">
<caption>Data Types</caption>
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>JSON</th>
<th>JSON schema</th>
<th>dataset-json</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>boolean</td>
<td>boolean</td>
<td>boolean</td>
</tr>
<tr class="even">
<td>string</td>
<td>string</td>
<td>string</td>
</tr>
<tr class="odd">
<td></td>
<td></td>
<td>decimal</td>
</tr>
<tr class="even">
<td></td>
<td></td>
<td>datetime</td>
</tr>
<tr class="odd">
<td></td>
<td></td>
<td>date</td>
</tr>
<tr class="even">
<td></td>
<td></td>
<td>time</td>
</tr>
<tr class="odd">
<td></td>
<td></td>
<td>URI</td>
</tr>
<tr class="even">
<td>number</td>
<td>integer</td>
<td>integer</td>
</tr>
<tr class="odd">
<td></td>
<td>number</td>
<td>float</td>
</tr>
<tr class="even">
<td></td>
<td></td>
<td>double</td>
</tr>
<tr class="odd">
<td>array</td>
<td>array</td>
<td>-</td>
</tr>
<tr class="even">
<td>object</td>
<td>object</td>
<td>-</td>
</tr>
</tbody>
</table>
<p>Note that dataset-json has both a JSON schema and LinkML version of the schema. The JSON schema is based on the 2019-09 standard (vs the latest, which is 2020-12).</p>
<p>Also note that JSON schema has various constraints added to the type. In general, objects can also specify, via a <code>required</code> array, which properties are required. There is also the capability to do “required-if” logic (which could also be a jsontron rule).</p>
<p>For string-based data types, we have <code>minLength</code>, <code>maxLength</code>, <code>pattern</code>, and <code>format</code>. JSON schema has known formats for <code>date-time</code>, <code>date</code>, <code>time</code>, and <code>uri</code> (and <code>iri</code>) - but <strong>dataset-json does not use these</strong>. It makes use of <code>pattern</code> for validation of those string-based types. This may be because <code>pattern</code> is actually used when validating and <code>format</code> may not be. That is probably dependent on the validator software. You can find the full list of supported JSON schema formats <a href="https://json-schema.org/understanding-json-schema/reference/type#built-in-formats">here</a>.</p>
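<p>The practical difference between the two can be sketched in plain Python. The regex here is a simplified stand-in that I made up - it is <em>not</em> the pattern from the dataset-json schema - and the second function only approximates what a validator that enforces <code>format: date</code> would do:</p>

```python
import re
from datetime import date

# Simplified, hypothetical date pattern - NOT the regex from the dataset-json schema.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def matches_pattern(value):
    """Shape check only, as a `pattern` keyword would do."""
    return DATE_PATTERN.fullmatch(value) is not None


def is_calendar_date(value):
    """Semantic check, closer to what an enforced `format: date` would do."""
    if not matches_pattern(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True
```

<p>A value like <code>2026-02-30</code> passes the shape check but is not a real calendar date - how much of that a <code>pattern</code> can catch depends on how elaborate the regex is.</p>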
<p>For numeric data, JSON schema separates integers (ℤ) from decimal (𝔻) data. Note that decimals (𝔻) is a subset of the rationals (ℚ) - given by <img src="https://latex.codecogs.com/png.latex?a/10%5En"> (where <img src="https://latex.codecogs.com/png.latex?a"> and <img src="https://latex.codecogs.com/png.latex?n"> are in ℤ). Thus <img src="https://latex.codecogs.com/png.latex?1/3"> is in ℚ but not in 𝔻 (it repeats forever) and thus for some values we have to deal with both precision and rounding.</p>
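<p>One way to check which rationals are decimals: a fraction in lowest terms terminates in base 10 exactly when its denominator has no prime factors other than 2 and 5. A small sketch (the function name is mine):</p>

```python
from fractions import Fraction


def is_decimal(value):
    """True if the rational `value` can be written as a / 10**n."""
    denominator = Fraction(value).denominator  # Fraction keeps lowest terms
    for prime in (2, 5):
        while denominator % prime == 0:
            denominator //= prime
    return denominator == 1
```

<p>So <code>Fraction(1, 8)</code> (0.125) is a decimal while <code>Fraction(1, 3)</code> is not.</p>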
<p>The dataset-json data types of <code>float</code> and <code>double</code> are intended to represent in-memory IEEE 754 single- and double-precision constructs. It is basically saying what data type to use in the environment (SAS, R, python, etc.) that will import this data - or at least the <em>capability</em> of said environment (eg. R only has doubles - not floats). There are the well-known issues of converting values in 𝔻 to a base 2 binary format.</p>
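<p>Here is a quick way to see the single vs double difference from Python, where the built-in <code>float</code> is a double. The helper name is mine; <code>struct</code> is used to force a round-trip through single precision:</p>

```python
import struct
from decimal import Decimal


def as_single(value):
    """Round-trip a Python float (a double) through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", value))[0]


tenth_double = 0.1
tenth_single = as_single(0.1)

# Decimal(float) shows the exact binary value that is actually stored.
print(Decimal(tenth_double))
print(Decimal(tenth_single))
```

<p>Neither printed value is exactly 0.1, and the two differ from each other - which is why a column’s declared type matters when comparing data frames after a round trip.</p>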
<p>Interestingly, there is no JSON schema format for dataset-json’s “decimal” - again that is using a data type to specify the in-memory representation for this column (aka decimal value based on JSON string).</p>
<p>In terms of character encoding for the file, later versions of JSON require UTF-8 encoding. The dataset-json User’s Guide refers to optional usage of UTF-16 or UTF-32 but this is not allowed in <a href="https://datatracker.ietf.org/doc/html/rfc8259">RFC8259</a>.</p>
<p>Also, a general issue with JSON is that a single Unicode escape - <code>\uXXXX</code> - only supports Unicode’s Basic Multilingual Plane (BMP) and not full Unicode. One needs to either embed the codepoints directly (UTF-8 supports all Unicode) or escape them as a surrogate pair.</p>
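<p>Python’s standard <code>json</code> module shows both options. This is just an illustration with an emoji as the non-BMP character:</p>

```python
import json

emoji = "\U0001F600"  # a code point outside the Basic Multilingual Plane

# Default: ASCII-only output, so the character becomes two \uXXXX escapes.
escaped = json.dumps(emoji)
# ensure_ascii=False: the character itself is embedded, ready for a UTF-8 file.
embedded = json.dumps(emoji, ensure_ascii=False)

print(escaped)
print(embedded.encode("utf-8"))

# Both forms parse back to the same single code point.
round_trip_ok = json.loads(escaped) == json.loads(embedded) == emoji
```

<p>The escaped form is the surrogate pair; the embedded form relies on the file encoding being UTF-8.</p>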
<p>And then for moving from US-ASCII to Unicode…</p>
<p><img src="https://brianrepko.github.io/blog/posts/2026-01-15-datasetjson-part2/unicode-meme.jpg" class="img-fluid"></p>
<p>Once you have Unicode strings, there are many issues that one needs to worry about. I’m not going to go over all of these but here is a short list (many from <a href="https://tonsky.me/blog/unicode/">here</a>).</p>
<ul>
<li>Unicode normalization</li>
<li>length of a Unicode string (with grapheme clusters - particularly for Asian languages)</li>
<li>casing (upper/lower)</li>
<li>sorting / collation</li>
<li>right-to-left vs left-to-right languages</li>
<li>regex class membership (eg. <code>:alpha:</code>)</li>
</ul>
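<p>A couple of these are easy to demonstrate with the Python standard library - a small illustration, not an exhaustive treatment:</p>

```python
import unicodedata

composed = "\u00e9"      # "é" as one code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The two strings render the same but compare differently until normalized.
same_without_normalizing = composed == decomposed
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
lengths = (len(composed), len(decomposed))

# Casing can change the length too: the German sharp s upper-cases to "SS".
upper_sharp_s = "\u00df".upper()
```

<p>Anything that compares, sorts, or length-checks string columns needs a decision on these points.</p>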
<p>I get that we are looking to move beyond the current XPT limitations at a later stage. At some point, I’m sure CDISC will have a number of workshops on “Internationalizing Clinical Trial Data”.</p>
</section>
</section>
<section id="our-problem-domain" class="level2">
<h2 class="anchored" data-anchor-id="our-problem-domain">Our Problem Domain</h2>
<p>I think it’s useful to also step back and share what the actual problem we are trying to solve here is. We will have, as part of the data and coding related to a clinical trial, “data frames”. For SAS, these are datasets, for R, these are data.frame(s) and for python, these are pandas.DataFrame(s) (or polars?) and in SQL, these are tables.</p>
<p>We have two main contexts - same-language serialization (export / import) and different-language serialization. In both these contexts, the ultimate question is - is the original in-memory data frame the same as the resulting in-memory data frame. When we are in the same-language context (SAS → JSON → SAS), “sameness” of in-memory data frames can consider deeper properties of the internal data structure. When we are in the different-language context (R → JSON → SAS), “sameness” of the in-memory data frames does not have access to those deeper properties - internal data types used for the columns are in different programming languages and data frame classes have different properties - SAS datasets can have a “key” (hash), multiple indices, display formats etc. R data.frames have row.names which SAS does not (similar to key but SAS keys don’t <em>have to</em> be unique).</p>
<p>The reason that I bring this up is that some parts of the dataset-json specification are only for the <strong>SAS-specific same-language context</strong>.</p>
<p>I actually think that we might not care about a different-language context as clinical trial validation from the regulatory authority will be same-language. However, as clinical trial code becomes more polyglot - some SAS, some R, some python - then we might need to consider different-language contexts.</p>
</section>
<section id="oids-and-the-model-beyond-a-data-frame" class="level2">
<h2 class="anchored" data-anchor-id="oids-and-the-model-beyond-a-data-frame">OIDs and the model beyond a “data frame”</h2>
<p>As I was initially reviewing the dataset-json specification, one quick item that stuck out was that there was no way to say if a column allowed for missing values or not - that is - is a column required.</p>
<p>This led me into the ODM model and the various OID columns in the specification.</p>
<p>The dataset-json specification is not <em>ONLY</em> “serialized data frames” - it is “serialized data frames within a catalog”. That catalog is the <code>define.xml</code> that is sent along with the <code>.json</code> or <code>.ndjson</code> or <code>.dsjc</code> files. This gets a bit too deep and needs its own blog post but in short - you can think of the following connections.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>ODM term</th>
<th>concept</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Study</td>
<td>Study</td>
</tr>
<tr class="even">
<td>ItemGroup</td>
<td>Domain or data frame metadata</td>
</tr>
<tr class="odd">
<td>Item</td>
<td>Variable or column metadata</td>
</tr>
<tr class="even">
<td>ItemGroupData</td>
<td>actual row of values</td>
</tr>
<tr class="odd">
<td>ItemData</td>
<td>actual cell value</td>
</tr>
<tr class="even">
<td>File</td>
<td>actual filename</td>
</tr>
</tbody>
</table>
<p>So an ItemGroupOID might have a value of <code>IG.DM</code> for the SDTM <code>DM</code> domain (Demographics). And there will be an ItemOID of <code>IT.STUDYID</code> or <code>IT.USUBJID</code> in that item group. Those are columns that are used in multiple domains. The DM domain might also have an ItemOID of <code>IT.DM.ETHNIC</code> which is specific to the DM domain.</p>
<p>The reason that I bring this up is that some properties of the column definition are in <code>define.xml</code> as <strong>ItemRef(s)</strong> and some properties as <strong>ItemDef(s)</strong>. <code>Mandatory</code> is part of the ItemRef. The dataset-json specification is only ItemDef data.</p>
<p>The specification is written so that the connection to the catalog is optional, but to fully validate data, there are some column attributes that are only in <code>define.xml</code>.</p>
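<p>To make that gap concrete, here is a minimal Python sketch of a “required column” check. The <code>mandatory</code> lookup is hypothetical - it stands in for whatever you would build from the ItemRef elements in <code>define.xml</code> - and the sample values and flags are made up for illustration.</p>

```python
# Column definitions as they might appear in a Dataset-JSON file (ItemDef-style data).
columns = [
    {"itemOID": "IT.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string"},
    {"itemOID": "IT.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string"},
    {"itemOID": "IT.DM.ETHNIC", "name": "ETHNIC", "label": "Ethnicity", "dataType": "string"},
]
rows = [
    ["STUDY01", "STUDY01-001", "NOT HISPANIC OR LATINO"],
    ["STUDY01", None, None],
]

# Hypothetical lookup built from the ItemRef elements in define.xml:
# (ItemGroupOID, ItemOID) maps to the Mandatory flag. The flags here are invented.
mandatory = {
    ("IG.DM", "IT.STUDYID"): True,
    ("IG.DM", "IT.USUBJID"): True,
    ("IG.DM", "IT.DM.ETHNIC"): False,
}


def missing_mandatory(item_group_oid, columns, rows, mandatory):
    """Return (row_index, column_name) pairs where a mandatory value is null."""
    problems = []
    for row_index, row in enumerate(rows):
        for column, value in zip(columns, row):
            required = mandatory.get((item_group_oid, column["itemOID"]), False)
            if required and value is None:
                problems.append((row_index, column["name"]))
    return problems


problems = missing_mandatory("IG.DM", columns, rows, mandatory)
print(problems)
```

<p>The point of the sketch is that the check cannot be written from the dataset-json file alone - the <code>mandatory</code> input has to come from the catalog.</p>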
</section>
<section id="column-attribute-issues" class="level2">
<h2 class="anchored" data-anchor-id="column-attribute-issues">Column Attribute Issues</h2>
<p>Looking at the column attributes, what is required are the basics</p>
<ul>
<li><code>itemOID</code> - the unique key of this column</li>
<li><code>name</code> - the column name</li>
<li><code>label</code> - the column label</li>
<li><code>dataType</code> - the column type</li>
</ul>
<p>These all make sense - with dataType being an enum (see list above). Technically, there are many other dataTypes in the ODM XML model but in practice the main types are there.</p>
<section id="targetdatatype" class="level3">
<h3 class="anchored" data-anchor-id="targetdatatype">targetDataType</h3>
<p>As for <code>targetDataType</code>, this came about as datetime, date, and time values in SAS are <strong>either</strong> strings or numeric values (with the 0-date being Jan 1, 19<strong>60</strong> and the 0-time being midnight). But the data in the JSON file is always a string, so this column attribute is telling the serialization software what the resulting in-memory data type should be. But R and python have various classes for datetime, date, and time values. Some of those do use a number representation for timestamps under the covers. But again, it seems we are using the specification to specify details of the internal data structure.</p>
<p>This column attribute is also used for the <code>decimal</code> data type in a superfluous manner. It has to be added for string-representations of decimal values.</p>
<p>If we really want to specify the language-specific data structure properties, then it might make more sense to have those under the top-level <code>sourceSystem</code> perhaps - or under language-specific top-level attributes.</p>
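<p>For reference, the SAS numeric representation mentioned above can be sketched in a few lines of Python. This assumes complete ISO 8601 values without a timezone offset and the usual SAS convention of days (for dates) and seconds (for datetimes) since Jan 1, 1960. The function names are mine and are not part of any dataset-json package.</p>

```python
from datetime import date, datetime

# SAS counts dates in days and datetimes in seconds from 1960-01-01.
SAS_EPOCH_DATE = date(1960, 1, 1)
SAS_EPOCH_DATETIME = datetime(1960, 1, 1)


def iso_date_to_sas(value):
    """Days since 1960-01-01 for a complete ISO 8601 date string."""
    return (date.fromisoformat(value) - SAS_EPOCH_DATE).days


def iso_datetime_to_sas(value):
    """Seconds since 1960-01-01T00:00:00 for a complete, timezone-free datetime string."""
    delta = datetime.fromisoformat(value) - SAS_EPOCH_DATETIME
    return int(delta.total_seconds())


print(iso_date_to_sas("1970-01-01"))
print(iso_datetime_to_sas("1960-01-01T00:01:00"))
```

<p>Incomplete dates (for example a year and month with no day) are not handled here - they have no numeric equivalent, which is part of why the string form is the one stored in the JSON.</p>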
</section>
<section id="length" class="level3">
<h3 class="anchored" data-anchor-id="length">length</h3>
<p>Next we have <code>length</code>, which as mentioned above is problematic for Unicode strings based on normalization and grapheme clustering. The sense I get is that the value here is meant to be a <strong>capability</strong> (you have to handle strings at-least this long).</p>
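<p>A small Python illustration of why a single <code>length</code> number is ambiguous for Unicode strings: the same visible text has a different number of code points and UTF-8 bytes depending on normalization form. The example word is arbitrary.</p>

```python
import unicodedata

# The same visible text in two Unicode normalization forms.
composed = unicodedata.normalize("NFC", "caf\u00e9")    # accented e as one code point
decomposed = unicodedata.normalize("NFD", "caf\u00e9")  # plain e plus a combining accent

lengths = {
    "nfc_code_points": len(composed),
    "nfd_code_points": len(decomposed),
    "nfc_utf8_bytes": len(composed.encode("utf-8")),
    "nfd_utf8_bytes": len(decomposed.encode("utf-8")),
}
print(lengths)
```

<p>A reader sees four characters in both cases, so “length” could reasonably mean 4, 5, or 6 here depending on what is being counted.</p>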
</section>
<section id="displayformat" class="level3">
<h3 class="anchored" data-anchor-id="displayformat">displayFormat</h3>
<p>This is clearly a SAS-only attribute, used to make sure the original and resulting datasets look the “same”.</p>
</section>
<section id="keysequence" class="level3">
<h3 class="anchored" data-anchor-id="keysequence">keySequence</h3>
<p>The keySequence attribute can be used to create the SAS dataset key (hash) - so this is clearly for internal data structure. Technically in SAS, the key doesn’t have to be unique (eg. <code>MULTIDATA: 'Y'</code>). For R, <code>row.names</code> could be created from this but <code>row.names</code> must be unique.</p>
<p>I’m not familiar enough with SAS to know if a component of a key implies it is required.</p>
<p>I’d also add that SAS datasets can have multiple indices, so I’m not sure why, if we are specifying internal data structure, we include the key and not the indices. My guess is that these are not part of the ODM XML schema from which we are creating this specification. It could be the ODM XML schema is a bit SAS-specific as well.</p>
</section>
<section id="what-is-not-here" class="level3">
<h3 class="anchored" data-anchor-id="what-is-not-here">What is not here</h3>
<p>As mentioned above, if I needed to specify more rules to validate a dataset, then I would need to know if the column values can be missing or not (required) or conditionally required.</p>
<p>In addition, all those JSON schema properties start to apply - range on numeric values? enumerations? code-lists? string regex?</p>
</section>
</section>
<section id="top-level-metadata-attribute-issues" class="level2">
<h2 class="anchored" data-anchor-id="top-level-metadata-attribute-issues">Top-level Metadata Attribute Issues</h2>
<p>Again, the required attributes for the dataset-json top-level metadata all make sense.</p>
<ul>
<li>itemGroupOID - the key for this dataset</li>
<li>name - the name of the dataset</li>
<li>label - the label for the dataset</li>
<li>columns - the column definitions (see above)</li>
<li>records - the number of rows</li>
<li>datasetJSONCreationDateTime - timestamp for dataset-json file creation</li>
<li>datasetJSONVersion - the version (1.1 with optional third component)</li>
</ul>
<p>In general, it is a pet peeve of mine to allow datetime and time values without timezones but that is allowed here (as well as dealing with incomplete data) - all part of ISO 8601.</p>
<p>And for style, I’d have used <code>rowCount</code> for <code>records</code> but given the context it makes total sense.</p>
<section id="oids" class="level3">
<h3 class="anchored" data-anchor-id="oids">OIDs</h3>
<p>The <code>fileOID</code>, <code>studyOID</code>, <code>metaDataVersionOID</code> and <code>metaDataRef</code> attributes are all related to connecting this data to the <code>define.xml</code> catalog - technically optional but I’m not sure how to properly validate a dataset without it.</p>
<p>In the end, <code>itemGroupOID</code> and <code>itemOID</code> are the only required OID values.</p>
</section>
<section id="dblastmodifieddatetime" class="level3">
<h3 class="anchored" data-anchor-id="dblastmodifieddatetime">dbLastModifiedDateTime</h3>
<p>The <code>dbLastModifiedDateTime</code> attribute is the last modified timestamp for the source of this JSON data.</p>
<p>There is a jsontron type rule for this being earlier than the creation timestamp required attribute.</p>
<p>Again as a style thing, I would have used “source” vs “db” - and perhaps made this part of the <code>sourceSystem</code> object. And it should have a timezone.</p>
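<p>Here is a short Python sketch of why the missing timezone matters for that ordering rule. The metadata values are invented, and the check is my reading of the rule, not the official validation logic.</p>

```python
from datetime import datetime

# Invented metadata: one timestamp carries a UTC offset, the other does not.
metadata = {
    "datasetJSONCreationDateTime": "2025-12-01T10:30:00+00:00",
    "dbLastModifiedDateTime": "2025-11-30T08:00:00",
}

created = datetime.fromisoformat(metadata["datasetJSONCreationDateTime"])
modified = datetime.fromisoformat(metadata["dbLastModifiedDateTime"])

try:
    # The ordering rule: the file was created after the source was last modified.
    rule_passes = created > modified
except TypeError as error:
    # Python refuses to order offset-naive and offset-aware datetimes.
    rule_passes = None
    print("cannot compare:", error)

print(rule_passes)
```

<p>With one value carrying an offset and the other not, the comparison cannot be evaluated at all without guessing a timezone for the naive value.</p>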
<p>The mention of “db” however, makes me wonder about creating a DuckDB extension to read and write dataset-json files (each file as a table and then include the catalog data as well).</p>
</section>
<section id="source-information" class="level3">
<h3 class="anchored" data-anchor-id="source-information">Source information</h3>
<p>The <code>originator</code> and <code>sourceSystem</code> (with <code>name</code> and <code>version</code>) attributes are used to share the organization and “system” that generated this dataset-json file. It’s not clear to me if <code>sourceSystem</code> should be a package name, programming language, internal SCE name? I think more clarity on how this could be used is warranted. This could also be a place where SAS, R, python-specific data could be recorded.</p>
</section>
</section>
<section id="extendability-issues" class="level2">
<h2 class="anchored" data-anchor-id="extendability-issues">Extendability Issues</h2>
<p>There is discussion of extendability to dataset-json - this uses a new <code>sourceSystem.systemExtensions</code> attribute but also seems to imply a custom JSON schema file. I do think JSON schema has built-in schema extension capabilities and perhaps those could be looked at in the examples. Perhaps this is a place to extend the data to internal data structure details.</p>
</section>
<section id="next-article" class="level2">
<h2 class="anchored" data-anchor-id="next-article">Next article</h2>
<p>I hope this is useful and I am more than open to creating github issues / updates to the User Guide etc. Happy to receive feedback - my contact info is at <a href="https://www.learnthinkcode.com">my website</a>.</p>
<p>In part 3, I’ll review the API and potential issues there.</p>


</section>

 ]]></description>
  <category>pharmaverse</category>
  <category>data</category>
  <guid>https://brianrepko.github.io/blog/posts/2026-01-15-datasetjson-part2/</guid>
  <pubDate>Thu, 15 Jan 2026 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2026-01-15-datasetjson-part2/datasetjson.png" medium="image" type="image/png" height="72" width="144"/>
</item>
<item>
  <title>Dataset-JSON</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2025-12-31-datasetjson-part1/</link>
  <description><![CDATA[ 





<p>As part of R/Pharma this year, I attended a <a href="https://atorus-research.github.io/datasetjson_workshop/" target="_blank">workshop</a> on the CDISC Dataset-JSON standard. This was a great workshop and props/kudos/flowers to all the folks that put that together. More recently, CDISC <a href="https://mailchi.mp/cdisc.org/2025bod-12872980?e=d16406719d" target="_blank">announced</a> that v1.0 of the API standard is released as well as the supplementary details on compression (aka Compressed Dataset-JSON v1.1).</p>
<p>This latter announcement (and a review of Pilot 5 updates) gives me a chance to dive into this important set of specifications and updates - thus this series.</p>
<p>I’m planning to do an overview in this part 1, go over some potential issues and clarifications in part 2, and then go over the REST API in part 3 (which TBH I’ve not looked at in detail yet, so it goes last).</p>
<section id="background" class="level2">
<h2 class="anchored" data-anchor-id="background">Background</h2>
<p>The data sent to regulatory authorities (eg. FDA for USA, EMA for EU, PMDA for Japan, etc.) typically conforms to <a href="https://www.cdisc.org" target="_blank">CDISC</a> international standards. When transferring this (tabular) data - from sponsor to RA - the <a href="https://www.loc.gov/preservation/digital/formats/fdd/fdd000464.shtml" target="_blank">SAS V5 XPORT format</a> (aka “XPT”) is currently used (required for the FDA). XPT was defined in 1989 and its use by the FDA was made official in 1999. It has many disadvantages - some of which we will highlight below.</p>
<p>Also in 1999, CDISC created the Operational Data Model (ODM). ODM is a foundational data standard that underlies many of the other CDISC standards and is XML-based. Between 2000 and 2005, ODM versions 1.0 through 1.3 were released. Two point releases, ODM v1.3.1 and v1.3.2, were released in 2010 and 2013 respectively. So from 1999 to 2013, data was defined based on ODM XML foundational standards and still transferred in XPT.</p>
<p>As industry is exploring coding alternatives to SAS - and potentially with sponsors creating all submission materials with R (or without SAS) - other data transfer format alternatives are needed. Moving to new data transfer standards will also allow CDISC to move beyond the limitations imposed by the XPT format.</p>
<p>Given that ODM was XML-based, in 2012, work was done to create Dataset-XML, with version 1.0 released in 2014. This is an XML-based data transfer format based on ODM v1.3.2. However, it was found that Dataset-XML created data files that were quite large (already an issue for XPT) and it was not widely adopted by industry.</p>
<p>Skip ahead ten years, and in 2023, ODM v2.0 was released. Dataset-XML v1.0 was not updated - it is still an extension of the ODM XML schema (as far as I can tell with a cursory review) and may not need an update.</p>
<p>As part of the ODM v2.0 work, on the data transfer side, Dataset-JSON v1.0 was also released in 2023. One of the best articles on the need for, development of, and plans for Dataset-JSON is <a href="https://www.clinicalleader.com/doc/no-more-xpt-piloting-new-dataset-json-for-fda-submissions-0001" target="_blank">this one from Sam Hume</a>.</p>
<p>A number of hackathons and pilot projects were done - some issues found - and then Dataset-JSON v1.1 was released in Dec 2024 to address those issues. Some issues required changes to the specification - some were addressed in the User Guide. <a href="https://phuse.s3.eu-central-1.amazonaws.com/Deliverables/Optimizing+the+Use+of+Data+Standards/WP-88+Dataset-JSON+Report.pdf" target="_blank">Here</a> is the report, from PHUSE, of those pilot projects.</p>
<p>Lastly, as mentioned above, the Dataset-JSON API v1.0 and Compressed Dataset-JSON v1.1 standards were released a few weeks ago in Dec 2025.</p>
</section>
<section id="dataset-json" class="level2">
<h2 class="anchored" data-anchor-id="dataset-json">Dataset-JSON</h2>
<p>The Dataset-JSON work actually has multiple components - with a view towards transfer of data via APIs instead of via files. In particular, this could become part of how sponsors can collect data from various other systems and collaborators - EDCs, CROs, etc.</p>
<p>These components are:</p>
<ul>
<li>Dataset-JSON specification and schema (<a href="https://github.com/cdisc-org/DataExchange-DatasetJson" target="_blank">github repository</a>)
<ul>
<li>Dataset-JSON v1.1</li>
<li>The NDJSON representation of Dataset-JSON</li>
<li>Compressed Dataset-JSON v1.1 (aka DSJC)</li>
<li>Dataset-JSON schema</li>
<li>User’s Guide (<a href="https://wiki.cdisc.org/display/PUB/Dataset-JSON+v1.1+User%27s+Guide" target="_blank">cdisc wiki</a>)</li>
</ul></li>
</ul>
<p>and</p>
<ul>
<li>Dataset-JSON API (<a href="https://github.com/cdisc-org/DataExchange-DatasetJson-API" target="_blank">github repository</a>)
<ul>
<li>Specification as HTML or OpenAPI</li>
<li>User’s Guide</li>
</ul></li>
</ul>
</section>
<section id="issues-with-xpt" class="level2">
<h2 class="anchored" data-anchor-id="issues-with-xpt">Issues with XPT</h2>
<p>Anyone that has looked at clinical trial data has wondered why column names are limited to 8 characters - it’s XPT.</p>
<p>The data is always tabular (rows and columns). Columns have name, label / description, type, and length and formatting options. There is no row identifier (unlike R data.frames) - access, <em>in SAS</em>, is via row number (<code>POINT=</code>) or via indices defined on top of the data (<code>KEY=</code>).</p>
<p>Here is a quick list of issues with XPT v5:</p>
<ul>
<li>Column/variable types - only CHARACTER (string) and DOUBLE (numeric - integer or floating point)</li>
<li>Column names are limited to 8 alphanumeric + <code>_</code> characters</li>
<li>Column labels are limited to 40 characters</li>
<li>Character values are US ASCII only with max length of 200 characters/bytes</li>
<li>Character values are stored with padding (so, larger than they need to be)</li>
<li>Numeric values are stored as IBM hexadecimal floating point (aka HFP, aka IBM-style double) format
<ul>
<li>Which is <strong>NOT</strong> IEEE 754, see <a href="https://en.wikipedia.org/wiki/IBM_hexadecimal_floating-point" target="_blank">wikipedia</a></li>
<li>This is why XPT is technically a binary file format - numeric data encoding in the file</li>
</ul></li>
<li>Inability to compress files - leads to data set splitting</li>
<li>There is no internally stored metadata
<ul>
<li>eg. file metadata, formatting on numerics, padding for characters, date/time formatting, keys</li>
</ul></li>
</ul>
<p>Note that modern versions of SAS use IEEE 754 internally for floating point data - it is only the XPT format that uses the HFP format. Also note that SAS does support date, time, and datetime variables - these are internally stored as numbers - with datetime point 0 being Jan 1, 1960 00:00:00 UTC. Actually, I have no idea how timezones work in SAS (all data is UTC and timezone is a system option?).</p>
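<p>To show what “not IEEE 754” means in practice, here is a minimal Python sketch that decodes an 8-byte IBM hexadecimal floating point value. It follows the layout described on the linked Wikipedia page (1 sign bit, 7-bit base-16 exponent with a bias of 64, 56-bit fraction). The function name is mine, and the special encodings SAS uses for missing values are deliberately not handled.</p>

```python
def ibm_double_to_float(raw):
    """Decode 8 big-endian bytes of IBM hexadecimal floating point.

    SAS missing-value encodings are not handled in this sketch.
    """
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    # First byte: 1 sign bit followed by a 7-bit base-16 exponent (bias 64).
    sign_bit, exponent = divmod(raw[0], 128)
    # Remaining 7 bytes: a 56-bit fraction with no implicit leading bit.
    fraction = int.from_bytes(raw[1:], "big") / 2**56
    magnitude = fraction * 16.0 ** (exponent - 64)
    return -magnitude if sign_bit else magnitude


one = ibm_double_to_float(bytes.fromhex("4110000000000000"))
negative = ibm_double_to_float(bytes.fromhex("c276a00000000000"))
print(one, negative)
```

<p>Every non-SAS reader of an XPT file has to carry a conversion like this one, which is a big part of why a text-based format is attractive.</p>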
<p>There is an XPT v8 format as well which made the following changes:</p>
<ul>
<li>Column names are extended to 32 characters (case sensitive)</li>
<li>Column labels are extended to 256 bytes</li>
<li>Character values are extended to 32767 bytes</li>
<li>It is not clear if US ASCII is still a limitation or not (note the use of bytes vs character limits) - seems like it is limited to US ASCII</li>
</ul>
<p>In the end, there was not much industry uptake of XPT v8 and efforts were put elsewhere.</p>
</section>
<section id="quick-overview-of-dataset-json" class="level2">
<h2 class="anchored" data-anchor-id="quick-overview-of-dataset-json">Quick overview of Dataset-JSON</h2>
<p>Dataset-JSON is similar to XPT in that it is a single-file format for tabular data.</p>
<p>As JSON (<a href="https://cdisc-org.github.io/DataExchange-DatasetJson/doc/dataset-json1-1.html" target="_blank">link to html version of spec here</a>), a dataset is a single object. Dataset metadata are attributes in that object, there is a <code>columns</code> array for column definitions, and then a <code>rows</code> array-of-arrays for data values.</p>
<p>Column definitions have the following attributes - unique ID (<code>itemOID</code>), <code>name</code>, <code>label</code>, <code>dataType</code>, <code>targetDataType</code>, <code>length</code>, <code>displayFormat</code>, and <code>keySequence</code>.</p>
<p>Note that <code>dataType</code> is the “logical” data type - tied to ODM - with values of <code>string</code>, <code>integer</code>, <code>decimal</code>, <code>float</code>, <code>double</code>, <code>boolean</code>, <code>datetime</code>, <code>date</code>, <code>time</code>, and <code>URI</code>.</p>
<p>The column attribute <code>targetDataType</code> is used for some logical types (decimals as strings, date/time/datetime as integers). We’ll discuss this more in part 2. It looks like boolean values do use JSON <code>true</code> and <code>false</code> (in other places I’ve seen these not used in favor of “Y”,“N” strings). Missing values are represented with JSON <code>null</code>. The empty string can be a value.</p>
<p>For the NDJSON representation - it is the same as above with the following changes:</p>
<ul>
<li>Each line is a complete JSON value</li>
<li>Line 1 is a JSON object holding all of the metadata and the <code>columns</code> array</li>
<li>Lines 2-n are each a JSON array holding one row of data
<ul>
<li>This is basically the <code>rows</code> array with “lines” being entries in the array-of-array</li>
</ul></li>
</ul>
<p>NDJSON is very useful for streaming large sets of data.</p>
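<p>As a rough sketch of how the two representations relate, the Python below converts a tiny dataset object to NDJSON text and back using only the standard library. The attribute names come from the specification as described above. The values, the <code>IT.DM.AGE</code> OID, and the helper names <code>to_ndjson</code> and <code>from_ndjson</code> are made up for illustration.</p>

```python
import json

# A tiny illustrative dataset object; all values are invented.
dataset = {
    "datasetJSONCreationDateTime": "2025-12-31T00:00:00",
    "datasetJSONVersion": "1.1.0",
    "itemGroupOID": "IG.DM",
    "name": "DM",
    "label": "Demographics",
    "records": 2,
    "columns": [
        {"itemOID": "IT.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string"},
        {"itemOID": "IT.DM.AGE", "name": "AGE", "label": "Age", "dataType": "integer"},
    ],
    "rows": [["STUDY01-001", 34], ["STUDY01-002", None]],
}


def to_ndjson(dataset):
    """Line 1 is everything except rows; each later line is one row array."""
    metadata = {key: value for key, value in dataset.items() if key != "rows"}
    lines = [json.dumps(metadata)]
    lines.extend(json.dumps(row) for row in dataset["rows"])
    return "\n".join(lines) + "\n"


def from_ndjson(text):
    """Rebuild the single-object form from NDJSON text."""
    first, *rest = text.splitlines()
    rebuilt = json.loads(first)
    rebuilt["rows"] = [json.loads(line) for line in rest if line]
    return rebuilt


ndjson_text = to_ndjson(dataset)
round_trip = from_ndjson(ndjson_text)
print(ndjson_text)
```

<p>A real reader would process the row lines one at a time instead of collecting them into a list - that is what makes the format useful for streaming.</p>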
</section>
<section id="next-article" class="level2">
<h2 class="anchored" data-anchor-id="next-article">Next article</h2>
<p>In part 2, I’ll make a list of some of the issues that come up in reading the specification and how some of the issues found in the pilot were handled.</p>
<p>Overall, I’m thrilled with this effort as it opens up clinical trial data to any programming language AND data transfer via API - but some details need some discussion. My intent is to highlight and/or clarify points related to data representation (from someone with a whole 30-year career in data engineering) and not to only criticize.</p>
<p>I hope those discussions will get added to the User’s Guide or maybe more <a href="https://pharmaverse.github.io/blog/" target="_blank">pharmaverse blog</a> posts (like <a href="https://pharmaverse.github.io/blog/posts/2023-10-30_floating_point/floating_point.html" target="_blank">here</a>, <a href="https://pharmaverse.github.io/blog/posts/2023-09-26_date_functions_and_imputation/date_functions_and_imputation.html" target="_blank">here</a>, and <a href="https://pharmaverse.github.io/blog/posts/2023-07-24_rounding/rounding.html" target="_blank">here</a>).</p>


</section>

 ]]></description>
  <category>pharmaverse</category>
  <category>data</category>
  <guid>https://brianrepko.github.io/blog/posts/2025-12-31-datasetjson-part1/</guid>
  <pubDate>Wed, 31 Dec 2025 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2025-12-31-datasetjson-part1/datasetjson.png" medium="image" type="image/png" height="72" width="144"/>
</item>
<item>
  <title>TileDB and Snowflake integration</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2025-12-20-tiledb-snowflake/</link>
  <description><![CDATA[ 





<p>I attended a webinar / demo of TileDB Carrara and its integration with Snowflake a few weeks ago. You can find the <a href="https://www.youtube.com/watch?v=cwcUdzkhVm0" target="_blank">recording on YouTube</a>.</p>
<p>I am familiar with TileDB as a tensor (aka n-dimensional) data format - a format often used for biomedical data. In fact, they have special formats, beyond general Arrays, for single-cell, VCF, and image data. You can find details at the <a href="https://cloud.tiledb.com/academy/structure/life-sciences/index.html" target="_blank">TileDB Academy website</a>.</p>
<p>But Carrara was new to me - it is a combination of “data” catalog (vs files) for a given project but also includes a notebook / compute environment. I found this <a href="https://www.youtube.com/watch?v=ic8EhcRStq0" target="_blank">short demo on YouTube</a> as a good overview of Carrara. What was cool to see was how these special formats can be rendered directly in the tool but also the notion of multi-file datasets rendered as a single entry in the project.</p>
<p>For the integration, you can see your Snowflake-based tabular data in your Carrara environment. And you could also see your TileDB-based multi-dimensional data in your Snowflake data - as tabular data. This basically allows you to merge multi-dimensional and tabular data with notebooks. On the Snowflake side, this could then be used for models or other computation that is added to your Snowflake environment.</p>
<p>This made me wonder if something similar wasn’t possible for DuckDB - could I see multi-dimensional TileDB data in DuckDB as tables? This doesn’t seem to be supported and, for me as an open source advocate, solidifies the need for open source alternatives to TileDB (eg. hdf5, Zarr, COG). I’ll have a later post on Zarr vs hdf5 (and more) since hdf5 doesn’t work so great on cloud-based storage.</p>
<p>That said, the ability to integrate TileDB multi-dimensional data and Snowflake together is a great expansion for multi-omic data analysis.</p>



 ]]></description>
  <category>bioinformatics</category>
  <category>data</category>
  <guid>https://brianrepko.github.io/blog/posts/2025-12-20-tiledb-snowflake/</guid>
  <pubDate>Sat, 20 Dec 2025 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2025-12-20-tiledb-snowflake/logos-snowflake-tiledb.png" medium="image" type="image/png" height="56" width="144"/>
</item>
<item>
  <title>Validation of R libraries for FDA/EMA submissions</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2025-11-06-quarto-gherkin/</link>
  <description><![CDATA[ 





<p>Since R/Pharma 2023 in Chicago, I’ve wanted to look into using the techniques of ATDD (or BDD… or Specification by Example) for testing sets of R packages (aka - an R library). This was a topic that came up as part of the <a href="https://pharmar.org" target="_blank">R Validation Hub</a> work that was presented at that conference. Particularly around the discussion on how can multiple pharma companies share their R library validation tests. It was a chance to bring a technique from software engineering that I was familiar with (in Java and Python) to the world of R and in particular the <a href="https://pharmaverse.org" target="_blank">pharmaverse</a>.</p>
<p>It’s now 2 years later and I will have a pre-recorded talk at the (<strong>free!</strong>) virtual <a href="https://rinpharma.com" target="_blank">R/Pharma</a> 2025 conference today (Nov 6, 2025). R/Pharma makes all the sessions available in about a month on <a href="https://www.youtube.com/rinpharma" target="_blank">YouTube</a> - this is a community that shares and grows together after all - but here are the early links to where I host things.</p>
<ul>
<li>The pre-recorded video is hosted <a href="https://www.learnthinkcode.com/files/rinpharma-2025/rinpharma-2025-atdd.mp4" target="_blank">here</a></li>
<li>The slides are hosted <a href="https://www.learnthinkcode.com/files/rinpharma-2025/presentation.html" target="_blank">here</a></li>
</ul>
<p>While putting this together, I found <a href="https://gojko.net/2020/03/17/sbe-10-years.html" target="_blank">this link</a> from Gojko Adzic, author of “Specification by Example”, on where this technique sits in our bag of software engineering tools 10 years - now 15 years - later. It validates that software quality is enhanced by the use of this technique - even if used for requirements only - versus being used for both requirements and automated testing. It also validates that the use of the Gherkin-based frameworks - Given/When/Then - is the norm - and fortunately, for R, we now have Jakub Sobolewski’s <a href="https://jakubsobolewski.com/cucumber/" target="_blank"><code>{cucumber}</code></a> package.</p>
<p>The more I think about what this could look like - and potentially integrated into Quarto - I see the following:</p>
<ul>
<li>The “living documentation” is a Quarto website with potentially additional descriptive explanation around gherkin text (along with <code>sessionInfo()</code> output)</li>
<li>The website can be organized into pages - Quarto documents with <code>```gherkin</code> code-blocks that contain the actual gherkin text</li>
<li>Quarto could be configured with a gherkin engine - for R, that could be <code>knitr+cucumber</code> (and I can imagine a python engine as well)</li>
<li>Quarto rendering invokes knitr and cucumber to execute the gherkin tests and formats the output for <code>pandoc</code> to render the results as part of the report</li>
<li>Steps could be registered on the Quarto page in an <code>```{r}</code> block or in a package and discovered on package load</li>
<li>The report could be included as part of the validation report</li>
</ul>
<p>I’m super interested in people’s thoughts on this - please reach out via links at the end of the slides.</p>



 ]]></description>
  <category>pharmaverse</category>
  <category>agile</category>
  <guid>https://brianrepko.github.io/blog/posts/2025-11-06-quarto-gherkin/</guid>
  <pubDate>Thu, 06 Nov 2025 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2025-11-06-quarto-gherkin/logo-rinpharma.png" medium="image" type="image/png" height="50" width="144"/>
</item>
<item>
  <title>Blogging again…for real</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2025-10-31-blogging-for-real/</link>
  <description><![CDATA[ 





<p>The <a href="../2014-09-28-blogging-again-with-a-change-of-focus/">2014 post</a> that I wrote was about my move into bioinformatics as well as a move to Switzerland. Specifically, working in the Oncology disease area of Novartis Biomedical Research.</p>
<p>The <a href="../2019-04-09-snowflake-for-biomedical-research/">2019 post</a> that I wrote had me back in Minneapolis and working for Carrot Health. Still related to health - data engineering for predictive models in health care (social determinants of health, healthcare quality, etc.).</p>
<p>So, you can tell I did <em>NOT</em> start blogging again - and now it’s been 6 years.</p>
<p>Based on COVID work protocols, Novartis allowed for remote work in the disease areas and a position opened up with the Oncology team. I was fortunate to return in Oct 2020. I was even more fortunate to be able to retire, just before my 60th birthday, in June 2025.</p>
<p>Thus I can say that I truly will be blogging more - for real. I just converted my old Wordpress blog into this Quarto-based one.</p>
<p>And I have a bunch of projects - some are based in Minneapolis. I’m serving as the Twin Cities city captain for <a href="https://www.bitsinbio.org">Bits-in-Bio</a>. I’m also spending more time with <a href="https://www.isaiahmn.org">ISAIAH</a> - a pro-democracy multi-racial, multi-faith group here in Minnesota.</p>
<p>On the bioinformatics side, there are also a ton of projects.</p>
<ul>
<li>A system for bringing <a href="https://en.wikipedia.org/wiki/Specification_by_example">Specification by Example</a> to R libraries - starting with the <a href="https://pharmaverse.org">pharmaverse</a></li>
<li>A system for collecting genome and gene references in parquet / duckdb</li>
<li>Single-cell / Spatial dataset conversion with duckdb (using the hdf5 extension)</li>
<li>Improvements to the OMOP CDM data model</li>
<li>VCFs at scale (potentially with Zarr)</li>
<li>Potentially some algorithms using simplicial topology with multi-omic datasets (spatial transcriptomics + H&amp;E stain images)</li>
<li>and lots more (literally, I have a page of potential projects)</li>
</ul>
<p>Here in Minnesota, residents at age 62 can attend classes at the UMN for free. In a few years, I’ll be going back to school but I’m not sure what classes I’ll be pursuing.</p>
<p>In short, stay tuned - there will be more getting published here. I’ll also add notifications to new blog posts on Mastodon and LinkedIn - links you can find at the top of this page.</p>



 ]]></description>
  <category>bioinformatics</category>
  <guid>https://brianrepko.github.io/blog/posts/2025-10-31-blogging-for-real/</guid>
  <pubDate>Fri, 31 Oct 2025 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2025-10-31-blogging-for-real/blogging.png" medium="image" type="image/png" height="123" width="144"/>
</item>
<item>
  <title>Snowflake for Biomedical Research</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2019-04-09-snowflake-for-biomedical-research/</link>
  <description><![CDATA[ 





<p>I’ve since left biomedical research (the Genomics team at Novartis Institutes for Biomedical Research - NIBR) and am now doing health analytics with Carrot Health. At Carrot Health, we are making use of Snowflake Computing as our data storage and query system. I love Snowflake and there are so many features that make it better than what we had at NIBR. In the spirit of <a href="https://www.snowflake.com/blog/top-10-cool-things-i-like-about-snowflake/">“Top 10 Cool Things I Like About Snowflake”,</a> I bring you my 10 reasons that Snowflake works for biomedical research.</p>
<p>Reason 1 - All in the Cloud (AWS or Azure) - no DBA, no hardware, no tuning.</p>
<p>You can be up and running instantly in your cloud of choice. There are even safety features like UNDROP, should you DROP a table or view by mistake. For bio folks - if you are a small lab and don’t have DBAs and hardware admins, etc., then no problem. And if you do, then the security features (below) should be enough to get your IT department to buy-in.</p>
<p>Reason 2 - Persistent (cached) results (with a visual execution plan)</p>
<p>In Snowflake, if you (or anyone) execute a query with a result set, that result is cached and used either in other query plans or as a direct result with the same SQL. This might mean that the first time you execute a query it might need some horsepower, but then for 24 hours after that, you’ll get that data back instantly. This is great for bio folks that are typically working on a particular project’s data - joined with large public datasets. Those can be cached for a while and then when you are on to the next project, they will just get removed from cache. And if you aren’t sure, you can easily go into Query History and look at the visual execution plan.</p>
<p>Reason 3 - The functions you need - pivot / unpivot, analytic functions and UDFs</p>
<p>Let’s face it - bio folks like their data in matrix form sometimes - pivot and unpivot in the database is great. Having the ability to do a wide variety of analytic functions can help with basic statistics. And having the ability to add your own functions is great too - but ECMAScript only.</p>
<p>Reason 4 - JSON in SQL</p>
<p>Snowflake supports a VARIANT column type that can hold JSON data and it has the SQL extensions to query that data. This is super useful for mixing structured and semi-structured data together. And that is key for aggregating bio data - because we can almost agree on most of the structure but then everyone has their extra data that they want to keep.</p>
<p>Reason 5 - Can connect from anything</p>
<p>Snowflake supports ODBC, JDBC, Python, Spark, a web console, and its own snowsql command. You can basically use any tool to get connected. We were able to easily add support for SchemaSpy and Flyway (JDBC-based tools) for Snowflake - and I typically use DbVisualizer (JDBC) to access it.</p>
<p>Reason 6 - Like a data lake but with SQL</p>
<p>Snowflake has amazing capabilities to both load and unload data to and from S3 (we are in AWS and now Azure) and it’s fast. You can regularly point it at a folder and it knows which files have already been loaded. And you can define that process to happen automagically. There are some file formats that it doesn’t support out of the box - I’ve had to convert some fixed width data to separated-values - but that is minor compared to the built-in infrastructure. For bio folks, I think that this is awesome - getting scientists to put their data in S3 is far easier than helping them get it into the database.</p>
<p>Reason 7 - Data Sharing, Data Cloning, and Time Travel</p>
<p>Snowflake has the ability to share databases between accounts. This means that someday, we could have reference data already loaded (once) in Snowflake and have everyone share it. Or results of a consortium’s work could be shared once. Sharing WITH SQL access. There is also the ability to quickly clone data which is another way that one can share data / parts of data (or promote data from QA to PROD). Snowflake, like Datomic, also has the ability to return results based on the data at a given time. For bio folks, this is exactly what is needed for reproducible research - and/or - for data that changes over time but you don’t want to deal with formal versioning.</p>
<p>Reason 8 - Multiple Databases and Schemas</p>
<p>Snowflake is one of the few systems that supports multiple databases with multiple schemas in them. And all SQL can cross databases and schemas. This helps tremendously with data organization and potentially with role-based sharing rights. And security doesn’t stop there - data is encrypted at rest and in transit and you can even lock down access to your own AWS PrivateLink so traffic never leaves your combined data center / AWS cloud. Snowflake is HIPAA and SOC2 compliant as well.</p>
<p>Reason 9 - Scaling compute vs scaling storage</p>
<p>With Snowflake, the SQL execution “cluster” is called a “warehouse” (horrible name, I know, but there you are). One can size (and resize) a warehouse for the queries at hand - thus having the ability to scale at will as needed (there are even warehouse clusters to get you even more compute if you need it). You pay separately for storage and compute but you have tremendous control over it (and access to the accounting). You can even have department-only warehouses to enable chargeback policies.</p>
<p>Reason 10 - the bleeding edge is available</p>
<p>Snowflake supports parquet files as well. It would be awesome to try and use ADAM-formatted data - or heck, run a whole Big Data Genomics variant calling pipeline directly on the database. Or could be fun to try a version of hail.is directly on the database. This is something I’d love to see people try - and is only do-able in Snowflake.</p>
<p>So there are my 10 - please feel free to comment or email me at brian -dot- repko -at- learnthinkcode -dot- com. I should add that I’m writing this as a way to share my experience with my past co-workers; Snowflake has not asked for this and is not supporting it in any way. My thoughts and opinions only.</p>



 ]]></description>
  <category>architecture</category>
  <category>bioinformatics</category>
  <guid>https://brianrepko.github.io/blog/posts/2019-04-09-snowflake-for-biomedical-research/</guid>
  <pubDate>Tue, 09 Apr 2019 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2019-04-09-snowflake-for-biomedical-research/snowflake-logo.png" medium="image" type="image/png" height="34" width="144"/>
</item>
<item>
  <title>Blogging again</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2014-09-28-blogging-again-with-a-change-of-focus/</link>
  <description><![CDATA[ 





<p>A few years back, I noticed a friend’s LinkedIn update that he was working for Entagen, a company doing life sciences work in the Twin Cities. And I knew that I had to jump at the chance to get into life science programming. I was subcontracted to do some work at Novartis - the research branch in particular (NIBR) and then quickly found myself working on a project to manage and merge all public genomic data. I’m so thankful to the people that made that all happen.</p>
<p>And I’ve never looked back. It’s been 3 years, a move to a new country (Switzerland) and a new disease area (Oncology) and I love my work. I love learning the science and seeing where good software engineering can make a difference. If “bioinformatician” means a “software engineer that uses their knowledge of biology” - then I now call myself a bioinformatician.</p>
<p>It’s been a while since I was blogging - but I’m going to start up again. However, the blog will be more focused on the state of software engineering in the life sciences (from my one perspective) and where technology is going in this space. Stay tuned…</p>



 ]]></description>
  <category>bioinformatics</category>
  <guid>https://brianrepko.github.io/blog/posts/2014-09-28-blogging-again-with-a-change-of-focus/</guid>
  <pubDate>Sun, 28 Sep 2014 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2014-09-28-blogging-again-with-a-change-of-focus/blogging.png" medium="image" type="image/png" height="123" width="144"/>
</item>
<item>
  <title>Slides from Practical Agility</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2010-09-22-slides-from-practical-agility-jbehave-and-fit-goodbadugly/</link>
  <description><![CDATA[ 





<p>I presented a “Lightning Talk” (6 minutes) at the last Practical Agility meeting on “JBehave and FIT - the Good, the Bad and the Ugly”. For the talk, everything was on NEON cards (neon-green, neon-yellow and neon-red! - it’s all I could find at Walgreens), and before throwing them out I thought that I would put them into PowerPoint (minus the neon) and share. Slides are up on <a href="http://www.slideshare.net/brianrepko/fit-and-j-behave">SlideShare</a> - but you might want to download them, as all the notes are in the notes section. Enjoy!</p>



 ]]></description>
  <category>agile</category>
  <category>java</category>
  <guid>https://brianrepko.github.io/blog/posts/2010-09-22-slides-from-practical-agility-jbehave-and-fit-goodbadugly/</guid>
  <pubDate>Wed, 22 Sep 2010 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2010-09-22-slides-from-practical-agility-jbehave-and-fit-goodbadugly/talk-2.png" medium="image" type="image/png" height="98" width="144"/>
</item>
<item>
  <title>JBehave 3.0 released!!</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2010-09-02-jbehave-3-0-released/</link>
  <description><![CDATA[ 





<p>JBehave 3.0 was <a href="http://jbehave.org/2010/08/31/jbehave-3-0-released/">released</a> yesterday (finally!!). I’m thrilled to have donated the Multi-tenant Spring Security example (which has been updated to JBehave 3). That example is now part of the many examples that are included in JBehave. I’m looking to update the presentation on my website to explain some of the new features that make up JBehave 3.0. Congrats to Mauro and Paul on all their hard work!</p>



 ]]></description>
  <category>agile</category>
  <category>java</category>
  <guid>https://brianrepko.github.io/blog/posts/2010-09-02-jbehave-3-0-released/</guid>
  <pubDate>Thu, 02 Sep 2010 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2010-09-02-jbehave-3-0-released/jbehave-logo.png" medium="image" type="image/png" height="58" width="144"/>
</item>
<item>
  <title>Extending Jasypt - AES and Blowfish support</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2010-07-21-extending-jasypt-aes-and-blowfish-support/</link>
  <description><![CDATA[ 





<p>I recently had to code Java / Perl interoperable encryption - in Perl it was using the <code>Crypt::CBC</code> and <code>Crypt::Blowfish</code> modules. The Perl code was meant to be as simple as possible:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode perl code-with-copy"><code class="sourceCode perl"><span id="cb1-1"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">$cipher</span> = <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Crypt::CBC</span>-&gt;new( -cipher =&gt; <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">'</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Blowfish</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">'</span>, -key =&gt; <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">'</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">password</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">'</span> ); </span>
<span id="cb1-2"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">$ciphertext</span> = <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">$cipher</span>-&gt;<span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">encrypt_hex</span>(<span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">This data is super secret hush hush</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"</span>);</span></code></pre></div></div>
<p>The key is really a passphrase that is then generated into a key and IV for use with the underlying CBC/cipher. These modules are by default compatible with OpenSSL. I thought that since this is password-based encryption, that I could use Jasypt, one of my favorite libraries. Unfortunately, Jasypt only supports the PBE* cipher algorithms and none of them are the OpenSSL standards. So then I thought that I could at least get Jasypt to support Blowfish. No luck…the algorithm is just hard-coded to PBE-based encryption. Even the IV work would be impossible.</p>
<p>So for the project, I created my own mini-framework that includes converters (hex/base64 string to byte[] or String to byte[] based on a character-set) and ciphers defined via generics with the ability to create a string-to-string “encryptor” by combining a String-to-byte[] (utf-8) converter with a byte[]-to-byte[] cipher with a byte[]-to-String (hex) converter. This is sort of what Jasypt does but it’s not very pluggable in that fashion.</p>
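<p>A minimal sketch of that composition, assuming hypothetical names (<code>TextToBytes</code>, <code>BytesToBytes</code>, <code>BytesToText</code>, <code>StringEncryptor</code>) that are not from the original project. It uses plain interfaces instead of generics to stay short, and the “cipher” is a stand-in that just reverses the bytes, not real encryption - a JCE-backed cipher would plug into the same slot.</p>

```java
import java.nio.charset.StandardCharsets;

// Hypothetical names throughout; this is an illustration, not the project's code.
final class PipelineDemo {
    interface TextToBytes { byte[] convert(String text); }
    interface BytesToBytes { byte[] apply(byte[] input); }
    interface BytesToText { String convert(byte[] bytes); }

    // A string-to-string "encryptor" built from converter + cipher + converter.
    static final class StringEncryptor {
        private final TextToBytes decoder;
        private final BytesToBytes cipher;
        private final BytesToText encoder;

        StringEncryptor(TextToBytes decoder, BytesToBytes cipher, BytesToText encoder) {
            this.decoder = decoder;
            this.cipher = cipher;
            this.encoder = encoder;
        }

        String encrypt(String message) {
            return encoder.convert(cipher.apply(decoder.convert(message)));
        }
    }

    // String-to-byte[] converter based on a character set (UTF-8).
    static byte[] utf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // byte[]-to-String converter producing lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Placeholder byte[]-to-byte[] step that reverses the input; NOT encryption.
    static byte[] reverse(byte[] input) {
        byte[] out = new byte[input.length];
        int j = input.length;
        for (byte b : input) {
            j--;
            out[j] = b;
        }
        return out;
    }

    public static void main(String[] args) {
        StringEncryptor encryptor =
            new StringEncryptor(PipelineDemo::utf8, PipelineDemo::reverse, PipelineDemo::hex);
        System.out.println(encryptor.encrypt("abc")); // prints 636261
    }
}
```

<p>Swapping the hex converter for a base64 one, or the placeholder for a real cipher, only changes the constructor arguments.</p>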
<p>Then to write the byte[]-to-byte[] cipher, I started with a generalized algorithm that works not only for the PBE* algorithms but also for AES and Blowfish with the key and IV generation handled in the process. Plug in BouncyCastle’s <code>OpenSSLPBEParametersGenerator</code> for key/IV generation and write my own decorator for dealing with sharing the salt as “Salted__XXXXXXXX” in front of the ciphertext and voila! Perl-Java encryption interoperability based on passwords and random salts!!</p>
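<p>For readers without BouncyCastle at hand, here is a rough sketch of the legacy OpenSSL-style derivation as I understand it: MD5 over the previous block, the password and the salt, with a single iteration, repeated until there are enough bytes for the key and the IV. The class name <code>OpenSslKdfSketch</code> and the key/IV sizes in <code>main</code> are assumptions for illustration; the sizes that <code>Crypt::CBC</code> actually uses for a given cipher should be checked against its documentation.</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper; a sketch of the MD5-based, single-iteration derivation.
final class OpenSslKdfSketch {
    // D_1 = MD5(password || salt), D_i = MD5(D_(i-1) || password || salt).
    // The concatenated blocks are split into key bytes followed by IV bytes.
    static byte[][] deriveKeyAndIv(byte[] password, byte[] salt, int keyLen, int ivLen) {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        int needed = keyLen + ivLen;
        int rounds = (needed + 15) / 16; // MD5 yields 16 bytes per round
        byte[] material = new byte[rounds * 16];
        byte[] block = new byte[0];
        for (int i = 0; i != rounds; i++) {
            md5.update(block);
            md5.update(password);
            md5.update(salt);
            block = md5.digest(); // digest() also resets md5 for the next round
            System.arraycopy(block, 0, material, i * 16, 16);
        }
        byte[] key = Arrays.copyOfRange(material, 0, keyLen);
        byte[] iv = Arrays.copyOfRange(material, keyLen, needed);
        return new byte[][] { key, iv };
    }

    public static void main(String[] args) {
        byte[] password = "password".getBytes(StandardCharsets.UTF_8);
        byte[] salt = { 1, 2, 3, 4, 5, 6, 7, 8 }; // fixed salt for the demo only
        byte[][] keyAndIv = deriveKeyAndIv(password, salt, 16, 8); // assumed sizes
        System.out.println("key bytes: " + keyAndIv[0].length
            + ", iv bytes: " + keyAndIv[1].length);
    }
}
```

<p>MD5 with one iteration is weak by modern standards; it is shown here only because interoperability with the existing Perl side requires matching what that side does.</p>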
<p>That project has ended and I’m now in-between gigs so I worked that code into Jasypt - just added a feature request (with a patch) to Jasypt. Not specifically the Perl stuff but the generalized algorithm. That allows users to finally extend Jasypt - still for password-based encryption but not limited to the PBE* algorithms. Support is finally in there for AES and Blowfish with key and IV generation based on PBKDF2 or whatever else you want to add. Changes to Jasypt to support the configuration of the whole “pipeline” are not in there - that would require some serious changes to Jasypt.</p>
<p>As for the algorithms - they look like this:</p>
<p>For encryption,</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode java code-with-copy"><code class="sourceCode java"><span id="cb2-1">EncryptionData data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">buildEncryptionData</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="cb2-2">data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">setMethodInput</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>message<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb2-3">dataProcessor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">preProcess</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Cipher</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ENCRYPT_MODE</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb2-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">SecretKey</span> key <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> keyGenerator<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">generateSecretKey</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb2-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">AlgorithmParameterSpec</span> parameterSpec <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> paramGenerator<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">generateParameterSpec</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb2-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">synchronized</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>encryptCipher<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span> </span>
<span id="cb2-7">  encryptCipher<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">init</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Cipher</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ENCRYPT_MODE</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> parameterSpec<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb2-8">  data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">setCipherOutput</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>encryptCipher<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">doFinal</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">getCipherInput</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()));</span> </span>
<span id="cb2-9"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-10">dataProcessor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">postProcess</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Cipher</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ENCRYPT_MODE</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb2-11"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">getMethodOutput</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span></code></pre></div></div>
<p>For decryption,</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode java code-with-copy"><code class="sourceCode java"><span id="cb3-1">EncryptionData data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">buildEncryptionData</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span> </span>
<span id="cb3-2">data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">setMethodInput</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>encryptedMessage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb3-3">dataProcessor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">preProcess</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Cipher</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">DECRYPT_MODE</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb3-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">SecretKey</span> key <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> keyGenerator<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">generateSecretKey</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb3-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">AlgorithmParameterSpec</span> parameterSpec <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> paramGenerator<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">generateParameterSpec</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb3-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">synchronized</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>decryptCipher<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span> </span>
<span id="cb3-7">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">this</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">decryptCipher</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">init</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Cipher</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">DECRYPT_MODE</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> parameterSpec<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb3-8">  data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">setCipherOutput</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>decryptCipher<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">doFinal</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">getCipherInput</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">()));</span> </span>
<span id="cb3-9"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> </span>
<span id="cb3-10">dataProcessor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">postProcess</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Cipher</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">DECRYPT_MODE</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="cb3-11"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">getMethodOutput</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span></code></pre></div></div>
<p>And all the real work is done in the <code>SecretKeyGenerator</code> (which actually generates more than just the key - it just returns the key), the <code>AlgorithmParamsGenerator</code> and the <code>EncryptionDataProcessor</code> - all of which are just interfaces. All the transient data for the method is kept in the <code>EncryptionData</code> class or subclass. So that is the patch just submitted.</p>
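<p>To make the shape of those pieces concrete, here is a small runnable sketch. The interface method names are inferred from the call sites in the snippets above; the field layout of <code>EncryptionData</code>, the <code>PassThrough</code> processor and the <code>GeneralizedCipherSketch</code> wrapper are my assumptions and not the submitted patch. It also creates a fresh <code>Cipher</code> per call instead of synchronizing on a shared one, and uses an all-zero AES key and IV purely for demonstration.</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.spec.AlgorithmParameterSpec;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Illustrative only; shapes are inferred from the post, details are assumptions.
final class GeneralizedCipherSketch {
    // Transient per-call state (field names are guesses based on the accessors).
    static final class EncryptionData {
        byte[] methodInput;
        byte[] cipherInput;
        byte[] cipherOutput;
        byte[] methodOutput;
    }

    interface SecretKeyGenerator {
        SecretKey generateSecretKey(EncryptionData data);
    }

    interface AlgorithmParamsGenerator {
        AlgorithmParameterSpec generateParameterSpec(EncryptionData data);
    }

    interface EncryptionDataProcessor {
        void preProcess(int mode, EncryptionData data);
        void postProcess(int mode, EncryptionData data);
    }

    // Simplest processor: no salt header, input and output pass straight through.
    static final class PassThrough implements EncryptionDataProcessor {
        public void preProcess(int mode, EncryptionData data) {
            data.cipherInput = data.methodInput;
        }
        public void postProcess(int mode, EncryptionData data) {
            data.methodOutput = data.cipherOutput;
        }
    }

    // Same sequence of steps as the encryption/decryption snippets above.
    static byte[] run(int mode, byte[] input, SecretKeyGenerator keys,
                      AlgorithmParamsGenerator params, EncryptionDataProcessor processor) {
        EncryptionData data = new EncryptionData();
        data.methodInput = input;
        processor.preProcess(mode, data);
        SecretKey key = keys.generateSecretKey(data);
        AlgorithmParameterSpec spec = params.generateParameterSpec(data);
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(mode, key, spec);
            data.cipherOutput = cipher.doFinal(data.cipherInput);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("cipher operation failed", e);
        }
        processor.postProcess(mode, data);
        return data.methodOutput;
    }

    public static void main(String[] args) {
        // All-zero key and IV: for demonstration only, never for real data.
        SecretKeyGenerator keys = data -> new SecretKeySpec(new byte[16], "AES");
        AlgorithmParamsGenerator params = data -> new IvParameterSpec(new byte[16]);
        EncryptionDataProcessor processor = new PassThrough();

        byte[] plain = "hush hush".getBytes(StandardCharsets.UTF_8);
        byte[] encrypted = run(Cipher.ENCRYPT_MODE, plain, keys, params, processor);
        byte[] decrypted = run(Cipher.DECRYPT_MODE, encrypted, keys, params, processor);
        System.out.println(new String(decrypted, StandardCharsets.UTF_8));
    }
}
```

<p>A password-based key generator would read the password and salt from the data object instead of returning a constant, which is where the real work mentioned above would live.</p>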
<p>And Perl interoperability could be added with an <code>OpenSSLSecretKeyGenerator</code> and an <code>OpenSSLEncryptionDataProcessor</code> to handle the “Salted__XXXXXXXX” format - the rest is all in there (once the patch is approved, committed and released). The <code>OpenSSLSecretKeyGenerator</code> would work like the <code>PBKDF2SecretKeyGenerator</code> in that it would produce the key and IV based on a password and fixed or random salt. It’s just that OpenSSL does a funky key and IV generation mechanism that I’m not sure is in the default JCE providers. And the <code>OpenSSLEncryptionDataProcessor</code> is just an extension of the existing one with the hardcoded “Salted__” thrown in for good measure.</p>
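<p>The header handling itself is small. A sketch, assuming the layout is the 8 ASCII bytes of “Salted__”, then an 8-byte salt, then the ciphertext; the class and method names (<code>SaltedHeaderSketch</code>, <code>wrap</code>, <code>saltOf</code>, <code>ciphertextOf</code>) are hypothetical and not part of Jasypt or the patch:</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for the "Salted__" + salt + ciphertext layout.
final class SaltedHeaderSketch {
    static final byte[] MAGIC = "Salted__".getBytes(StandardCharsets.US_ASCII);
    static final int SALT_LEN = 8;

    // After encryption: put the magic string and the salt in front of the ciphertext.
    static byte[] wrap(byte[] salt, byte[] ciphertext) {
        if (salt.length != SALT_LEN) {
            throw new IllegalArgumentException("salt must be 8 bytes");
        }
        byte[] out = new byte[MAGIC.length + SALT_LEN + ciphertext.length];
        System.arraycopy(MAGIC, 0, out, 0, MAGIC.length);
        System.arraycopy(salt, 0, out, MAGIC.length, SALT_LEN);
        System.arraycopy(ciphertext, 0, out, MAGIC.length + SALT_LEN, ciphertext.length);
        return out;
    }

    // Before decryption: recover the salt so the key and IV can be re-derived.
    static byte[] saltOf(byte[] message) {
        requireHeader(message);
        return Arrays.copyOfRange(message, MAGIC.length, MAGIC.length + SALT_LEN);
    }

    // Before decryption: strip the header to get the bytes to feed the cipher.
    static byte[] ciphertextOf(byte[] message) {
        requireHeader(message);
        return Arrays.copyOfRange(message, MAGIC.length + SALT_LEN, message.length);
    }

    private static void requireHeader(byte[] message) {
        if (MAGIC.length + SALT_LEN > message.length) {
            throw new IllegalArgumentException("message too short for a salt header");
        }
        byte[] prefix = Arrays.copyOf(message, MAGIC.length);
        if (!Arrays.equals(prefix, MAGIC)) {
            throw new IllegalArgumentException("missing Salted__ header");
        }
    }

    public static void main(String[] args) {
        byte[] salt = { 1, 2, 3, 4, 5, 6, 7, 8 };
        byte[] message = wrap(salt, new byte[] { 42, 43, 44 });
        System.out.println("total bytes: " + message.length); // prints total bytes: 19
    }
}
```

<p>In the terms used above, <code>wrap</code> would sit in the post-processing step for encryption, and <code>saltOf</code> / <code>ciphertextOf</code> in the pre-processing step for decryption.</p>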
<p>Here’s hoping that gets added by the next project. Jasypt team - I’m more than willing to help!</p>



 ]]></description>
  <category>java</category>
  <guid>https://brianrepko.github.io/blog/posts/2010-07-21-extending-jasypt-aes-and-blowfish-support/</guid>
  <pubDate>Wed, 21 Jul 2010 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2010-07-21-extending-jasypt-aes-and-blowfish-support/encryption-9.png" medium="image" type="image/png" height="128" width="128"/>
</item>
<item>
  <title>JBehave presentation for Twin Cities JUG</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2010-04-12-jbehave-presentation-for-twin-cities-java-users-group/</link>
  <description><![CDATA[ 





<p>Finally finished the JBehave presentation for tonight’s Twin Cities Java Users Group. You can find the PPT and source code at the LearnThinkCode <a href="http://www.learnthinkcode.com">website</a>. Any and all feedback is welcome.</p>
<p>In the end, I really think that this is the year that Agile testing via executable requirements will take off and I do think that JBehave can be a part of that. There are some key things to work out however (library bundling issues, integration with JUnit for Spring) that need to be looked at before it’s really ready for prime time. And getting the Pico Ajax Email / Selenium example working was painful. Please don’t release versioned software that depends on snapshot releases of other code!</p>
<p>So, it would be usable on projects if you are willing to spend some time on getting your base class / infrastructure stuff set up. Personally, I like it more than FIT/Fitnesse.</p>



 ]]></description>
  <category>agile</category>
  <category>java</category>
  <guid>https://brianrepko.github.io/blog/posts/2010-04-12-jbehave-presentation-for-twin-cities-java-users-group/</guid>
  <pubDate>Mon, 12 Apr 2010 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2010-04-12-jbehave-presentation-for-twin-cities-java-users-group/talk-1.png" medium="image" type="image/png" height="65" width="144"/>
</item>
<item>
  <title>Extreme distributed scrum - daily standup</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2010-02-26-extreme-distributed-scrum-daily-standup/</link>
  <description><![CDATA[ 





<p>I’ve worked on Scrum teams where 1/2 the team is in one location and 1/2 in another (both offshore and onshore) and every now and then we would use an IM conference in order to have a “standup” (except that we are sitting and on IM). We tried video and phone conferencing as well but given the lag in the network as well as lack of equipment and network availability, IM seemed to just work better. IM allowed for give-and-take (with some lag <em>but a lag we were familiar with</em>) and was always available. In addition, IM allowed the conversation to be sent via email for later reading and sharing (to the 1/2 of the team that wasn’t in yet). Since then, I’ve wondered about what technology tools one would need if the team was completely separated (think rock stars all working from home).</p>
<p>If all the team members were located in their own location - how would I set this up? The kicker here is the mess of timezones that might be in the mix. Obviously, I’d have a wiki and perhaps an agile PM (kanban/scrum) web app running somewhere that we could all access in our timezone. When folks are distributed and have an overlapping time, they can use VNC (or other free solutions like that) to “pair-up” as needed. Likewise, VoIP/IM conferences or just VoIP/IM for issues and/or questions.</p>
<p>But how to do “standups” when there isn’t a time that everyone can stand up? How to let someone know you are stuck on something, and how to hand off a potential solution to someone that will get it hours later? My insight was that a team could host an <strong>internal blog/Twitter</strong> to share what they did yesterday, what they are doing today and what issues are blocking them. Status updates (“working on X” or “can’t figure out Y”) can then really be done at any time and those folks that are online can step in to help. Some IM systems have a status but I’m not sure that that is very visible. <strong>Add an RSS feed on top</strong> of the team’s blogs (like Twitter) and you’ll start to see team collaboration. Start your day by reading all the updates from folks since you were last on. The whole project life could be read if you really wanted to - like emailing the IM discussion. This still doesn’t help with the “I have an issue with X” case and handing off a potential solution (hours later). I could see the wiki or issue tracking system kind of working in that space. In this scenario, the world of “standup” starts to look like “status” - but that might be ok just given the realities of multiple people in multiple timezones. I thought that it would be an interesting way to run a project without standups.</p>
<p>Iteration and release planning would be difficult in this situation - that might just have to be done together.</p>
<p>I’ve not had this situation - have you? What worked or didn’t work?</p>



 ]]></description>
  <category>agile</category>
  <guid>https://brianrepko.github.io/blog/posts/2010-02-26-extreme-distributed-scrum-daily-standup/</guid>
  <pubDate>Fri, 26 Feb 2010 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2010-02-26-extreme-distributed-scrum-daily-standup/twitterrssfeed1-main_full.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Making Ant 1.8 work like Maven - not so much</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2010-02-19-making-ant-1-8-work-like-maven-not-so-much/</link>
  <description><![CDATA[ 





<p>I’ve done Ant build files in the past that ended up working like Maven2. Mostly since it was a non-Maven shop but also because it was a way to get folks into Maven-think but by using Ant.</p>
<p>Now, <a href="http://ant.apache.org/">Ant 1.8</a> has been released and with it some new features that could potentially make it possible to have very modular Ant builds that would be even better than Maven2. One of the main concepts within Maven2 is the various <a href="http://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html">lifecycles</a> (clean, build/default and site) and that build tasks from plugins are bound to various parts of the lifecycle. Ant 1.8 introduced the notion of <a href="http://ant.apache.org/manual/targets.html">extension-points and extensionOf</a> as well as <a href="http://ant.apache.org/manual/CoreTasks/import.html">imports</a> and local properties - these could all be used to both create plugins (macrodefs) and our own lifecycles (sets of extension-points) and then bind them all up together in a build.xml and just import what you need - potentially from an http URL.</p>
<p>Well, that was the thought…</p>
<p>Turns out that imports are processed after the build.xml is parsed. That’s all well and good, but when an extensionOf attribute is parsed, Ant looks for the target(s) named in the extensionOf value in order to add the current target as a dependency. That requires that the target has to exist in the project and if that target is part of an import (as the documentation seems to suggest), then the target doesn’t exist (yet) at the time of parsing and you get a nice error message to that effect.</p>
<p>I think that this is a design flaw in how extension-point / extensionOf is supposed to work and contradicts the example cited in the documentation - which doesn’t work.</p>
<p>It’s too bad because with these features, I could define my own lifecycles or even change/modify the existing ones from Maven2 to do things related to database SQL modules (create the database from all the SQL scripts and some data files) or be able to mix the SQL and java files together in the same module and add phases to the lifecycle related to database setup. This has always been something that I have to hack up the pom for anyway - which is part of why I like going back to Ant - I can change it more easily when I need to.</p>
<p>Work-arounds? Change the ProjectHelper/TargetHelper to deal with extensionOf attributes after the import stack is popped (and all the targets are resolved) or import the extensionOfs (the bindings or which macros get called for each step) after the extension-points are imported. I’m not a fan of the latter as I really think that the bindings are the build - execute these steps for these lifecycle stages - but if my build is just a bunch of imports, that’s not the worst of it. Or screw the use of extension-points/extensionOf and just use imports with empty targets (which is kind of what extension-points are - except that I could then create a target that gets bound to multiple extension-points with extensionOf=“target1,target2”).</p>
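<p>The first work-around is easier to see in a toy model. The Python sketch below is <em>not</em> Ant’s ProjectHelper code - the <code>Project</code> class, its methods and the target names are all hypothetical. It only contrasts binding an extensionOf as soon as it is read with remembering it and binding once the imports have been resolved.</p>

```python
class Project:
    """Toy stand-in for a build project: target name -> bound targets."""

    def __init__(self):
        self.targets = {}
        self.pending = []  # (target, extension point) pairs waiting to be bound

    def add_target(self, name, extension_of=None, defer=False):
        """Register a target, optionally binding it to an extension point."""
        self.targets.setdefault(name, [])
        if extension_of is None:
            return
        if defer:
            self.pending.append((name, extension_of))
        else:
            self.bind(name, extension_of)

    def bind(self, name, extension_point):
        """Add name as a dependency of extension_point, which must exist."""
        if extension_point not in self.targets:
            raise LookupError(f"extension point {extension_point!r} not defined yet")
        self.targets[extension_point].append(name)

    def finish_imports(self):
        """Bind everything that was deferred, now that all targets exist."""
        for name, extension_point in self.pending:
            self.bind(name, extension_point)
        self.pending.clear()


# Eager binding: the extension point only shows up with a later import,
# so the lookup fails while the build file is still being read.
eager = Project()
try:
    eager.add_target("create-database", extension_of="ready-to-package")
    eager_error = None
except LookupError as error:
    eager_error = str(error)
eager.add_target("ready-to-package")  # the "import" lands too late

# Deferred binding: remember the request, bind after the imports are done.
deferred = Project()
deferred.add_target("create-database", extension_of="ready-to-package", defer=True)
deferred.add_target("ready-to-package")  # target supplied by the import
deferred.finish_imports()

print(eager_error)
print(deferred.targets["ready-to-package"])
```

<p>In the deferred case the order of the build file and its imports stops mattering, which is what I would want if the bindings are going to live in the build.xml itself.</p>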
<p>It does sadden me that the example cited doesn’t even work however. If I get this working, I’ll post the example.</p>



 ]]></description>
  <category>java</category>
  <guid>https://brianrepko.github.io/blog/posts/2010-02-19-making-ant-1-8-work-like-maven-not-so-much/</guid>
  <pubDate>Fri, 19 Feb 2010 00:00:00 GMT</pubDate>
  <media:content url="https://www.apache.org/logos/res/ant/default.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Scrum and Kanban together</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2010-01-27-scrum-and-kanban-together/</link>
  <description><![CDATA[ 





<p>One of my favorite links is Henrik Kniberg’s “<a href="http://www.crisp.se/henrik.kniberg/Kanban-vs-Scrum.pdf">mini-book</a>” on Kanban and Scrum and how they work (and how they are similar and different). Given that description of Kanban and <a href="http://brianrepko.github.io/blog/posts/2010-01-25-when-is-a-story-prepared/">my thoughts on story prep and story release work</a>, I would really love to try a Kanban board for story prep and release work with a Scrum board for implementation work. Definitely for prep and implementation - release probably depends how that process looks - perhaps one release board for a program (multiple projects but one solution).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://brianrepko.github.io/blog/posts/2010-01-27-scrum-and-kanban-together/blog-kanban-scrum.jpg" class="quarto-figure quarto-figure-center figure-img" height="300"></p>
</figure>
</div>
<p>Backlog grooming would actively work the story prep board. The team could see what stories are getting ready for planning, and a release team (or management team) could see what is getting ready for release to production. I think that it would actually engage those team members that are at the daily standup but are not developers or testers - they can point to what they are working on - it’s just on the story prep kanban board. I think that it could make for a good information radiator for a wider “team”.</p>
<p>The other way to look at this, from a metrics standpoint, is to see that the whole Scrum board is just one column of a larger Kanban board and that you could measure and reduce the throughput time of a story from backlog to released to production on that larger Kanban board.</p>
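<p>As a rough sketch of that measurement in Python: assume we note the date a story enters each column of the larger board. The column names, the <code>lead_time_days</code> and <code>days_per_column</code> helpers and the sample dates are all invented for illustration, with the whole Scrum board collapsed into a single “implementation” column.</p>

```python
from datetime import date

# Hypothetical columns of the larger Kanban board; the whole Scrum board
# is treated as the single "implementation" column.
COLUMNS = ["backlog", "story prep", "implementation", "release", "in production"]


def lead_time_days(entered):
    """Days from entering the backlog to being released to production."""
    return (entered["in production"] - entered["backlog"]).days


def days_per_column(entered):
    """Days spent in each column the story has already left."""
    days = {}
    for current, following in zip(COLUMNS, COLUMNS[1:]):
        if current in entered and following in entered:
            days[current] = (entered[following] - entered[current]).days
    return days


# Invented dates on which one story entered each column.
story = {
    "backlog": date(2010, 1, 4),
    "story prep": date(2010, 1, 11),
    "implementation": date(2010, 1, 18),
    "release": date(2010, 2, 1),
    "in production": date(2010, 2, 5),
}

print(lead_time_days(story))   # total days, backlog to production
print(days_per_column(story))  # where those days went
```

<p>The per-column breakdown is the interesting part - it shows whether the time is going into prep, the iteration itself, or waiting for a release.</p>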
<p>Has anyone ever done anything like this? Did it work? Things to improve about it?</p>
<p>Putting this post together, I just noticed Henrik’s <a href="https://www.infoq.com/minibooks/kanban-scrum-minibook/">Kanban and Scrum - Making the Most of Both</a> - something new to read!</p>



 ]]></description>
  <category>agile</category>
  <guid>https://brianrepko.github.io/blog/posts/2010-01-27-scrum-and-kanban-together/</guid>
  <pubDate>Wed, 27 Jan 2010 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2010-01-27-scrum-and-kanban-together/blog-kanban-scrum.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Connecting Agile Teams - one pink post-it at a time</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2010-01-27-connecting-agile-teams-one-pink-post-it-at-a-time/</link>
  <description><![CDATA[ 





<p>I’ve used color coded cards on Scrum boards - green for user story, blue for system story, yellow task cards for design or review work, blue task card for “technical architecture work”, etc. - lots of variations. Sometimes I suggest it, sometimes I don’t. Just part of the box of tools. The one thing that I’ve noticed with this however was the use of pink cards and post-its and how they can be used to connect agile teams and help build an agile enterprise.</p>
<p>My original use of pink cards was for a Scrum board for developers to fix a critical or blocker bug. This was for a team that was just developers - the testers were a completely different team. And the pink card was basically a request for the development team to help un-block the testing team (and dev team would estimate it and decide if they would need to take something else off the board).</p>
<p>I’ve also used pink post-its on Scrum boards to report a blocking issue on a task - just as a way to remind people working on the issue that it is important and to bring it up in standup until the issue is resolved.</p>
<p>On another team that I worked on we sort-of had a Kanban board for release or operations tasks related to the program (multiple projects - one operations team) and if the implementation team had a request to make of them (e.g.&nbsp;database to setup), then we would create a pink card for their board. Basically a pink card is a Please Do This ASAP request. Maybe pink should stand for Please Implement Now, Kind (Sir/Madam).</p>
<p>What I realized is that one way to connect these teams, with their own boards and tasks and stories is that <strong>the issue (post-it) is tied to request(s) to resolve the issue (cards) and you could track and connect those issues/tasks that way</strong>. So basically, for that first scenario (separate dev and test teams), if the test team had had a board, their pink post-it (the blocking issue) was tied to the pink card for the development team (the issue resolvers). And a board with a lot of pink is a conversation waiting to happen.</p>
<p>Simple and easy way to handle and track issues that need to get done now that I think helps build an agile enterprise.</p>



 ]]></description>
  <category>agile</category>
  <guid>https://brianrepko.github.io/blog/posts/2010-01-27-connecting-agile-teams-one-pink-post-it-at-a-time/</guid>
  <pubDate>Wed, 27 Jan 2010 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2010-01-27-connecting-agile-teams-one-pink-post-it-at-a-time/pink-post-it.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>QA versus QC</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2010-01-25-qa-versus-qc/</link>
  <description><![CDATA[ 





<p>For Agile projects, I often coach about the need for a QA (quality assurance) role in addition to just testers (or QC / quality control).</p>
<p>For me, QA answers the question <strong>“are we doing the right job?”</strong> and QC answers the question <strong>“are we doing the job right?”</strong>.</p>
<p>I see QA working with the Customer/Product Owner on coverage for acceptance and functional testing. A great QA person will be able to answer the architecture (“how do I - ?”) questions for the QC team as well as, like a great Business Analyst, be able to hold the domain model in their head. Could even be the same head (BA/QA)…</p>



 ]]></description>
  <category>agile</category>
  <guid>https://brianrepko.github.io/blog/posts/2010-01-25-qa-versus-qc/</guid>
  <pubDate>Mon, 25 Jan 2010 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2010-01-25-qa-versus-qc/checkbox.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>When is a story prepared?</title>
  <dc:creator>Brian Repko</dc:creator>
  <link>https://brianrepko.github.io/blog/posts/2010-01-25-when-is-a-story-prepared/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://brianrepko.github.io/blog/posts/2010-01-25-when-is-a-story-prepared/blog-circles.jpg" class="quarto-figure quarto-figure-center figure-img" height="400"></p>
</figure>
</div>
<p>This is another drawing that I use a lot while coaching agile projects and actually is part of multiple discussions around agile methods. Most agile methods talk about stories being created by the customer and put on a backlog. For some iteration, at iteration planning, the story gets explained to developers and testers. They work on the story until it is complete. And we have lots of conversations, as agile coaches and teams, about “when is a story complete”.</p>
<p>This misses a lot of the work that needs to be done in order to make teams effective. I like to have regular backlog grooming meetings with part of the team and ask the question - <em><strong>is this story prepared?</strong></em> <em><strong>What is needed in order to bring this story to iteration planning?</strong></em> That needed work might involve QA (quality assurance) for acceptance tests or functional tests. That might involve UX (user experience) for wireframes or drawings. That might involve some TA (technical architecture) work or IA (information architecture, or domain modeling) work depending on the story. It might require a BA to work out the business value of this story or how to break it up into what needs to be done now versus later (breaking up stories into smaller, potentially optional pieces). It’s only when a story is prepared that it should be brought to iteration planning.</p>
<p>I also use this picture to help explain why some work is “on the board” for the iteration (meaning we are tracking velocity and burndown charts - it’s the developer/tester circle) vs work that needs to get done but we aren’t measuring velocity for it. The first is working towards completing the story. The latter is working towards getting the story prepared.</p>
<p>It also helps explain the roles of the non-customer, non-developer and non-tester folks…though I’m pretty careful to explain that that 2nd circle is optional work (story by story) and that that work can be done by anyone with those skills. It’s really about what would make the communication of this story effective and doing that in as lightweight a fashion as you need.</p>
<p>The last part of this drawing is that the story doesn’t stop because development is done. It’s really done when it’s deployed (some would say deployed to production) and supported. This means that the story needs to be shared with operations and support teams. I’ve seen this done as part of a release process and actually made it the responsibility of the whole team to figure out how to communicate the stories that are being released. Really each circle needs to figure out when it’s done with the story and how to communicate to the next circle (and then there are feedback loops!).</p>
<p>It’s really about effective communication and community. I didn’t get this last part until attending a session with <a href="https://nonodename.com/post/davidhussman/">David Hussman</a>, who talks about building community around a story.</p>



 ]]></description>
  <category>agile</category>
  <guid>https://brianrepko.github.io/blog/posts/2010-01-25-when-is-a-story-prepared/</guid>
  <pubDate>Mon, 25 Jan 2010 00:00:00 GMT</pubDate>
  <media:content url="https://brianrepko.github.io/blog/posts/2010-01-25-when-is-a-story-prepared/blog-circles.jpg" medium="image" type="image/jpeg"/>
</item>
</channel>
</rss>
