<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Should the Data be Public?</title>
	<atom:link href="http://cosmicvariance.com/2006/06/23/should-the-data-be-public/feed/" rel="self" type="application/rss+xml" />
	<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/</link>
	<description>Random samplings from a universe of ideas</description>
	<pubDate>Thu, 28 Aug 2008 20:20:26 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: Particle Physics 2.0? &#171; Charm &#38;c.</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-239696</link>
		<dc:creator>Particle Physics 2.0? &#171; Charm &#38;c.</dc:creator>
		<pubDate>Wed, 04 Apr 2007 02:20:28 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-239696</guid>
		<description>[...] A couple of issues are raised. One is whether the data should be made available to the public (in ASCII four-vectors or whatever); after all the taxpayers fund us, shouldn&#8217;t they get their money&#8217;s worth? I certainly agree that this is desirable, although extremely complicated. Our experimental architectures have not been designed to enable this in a simple manner (it can take literally months for a new collaboration member to learn to access data!), but if this was specified as a requirement from the beginning, as I believe it is for NASA projects, it could probably be done at the expense of a lot of physicist-years. However what is in question is not the data, but the analyses that follow, and even projects that release their data allow that what you extract from the data is your work. [...]</description>
		<content:encoded><![CDATA[<p>[...] A couple of issues are raised. One is whether the data should be made available to the public (in ASCII four-vectors or whatever); after all the taxpayers fund us, shouldn&#8217;t they get their money&#8217;s worth? I certainly agree that this is desirable, although extremely complicated. Our experimental architectures have not been designed to enable this in a simple manner (it can take literally months for a new collaboration member to learn to access data!), but if this was specified as a requirement from the beginning, as I believe it is for NASA projects, it could probably be done at the expense of a lot of physicist-years. However what is in question is not the data, but the analyses that follow, and even projects that release their data allow that what you extract from the data is your work. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tony Smith</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-103993</link>
		<dc:creator>Tony Smith</dc:creator>
		<pubDate>Sun, 09 Jul 2006 20:00:35 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-103993</guid>
		<description>Nathaniel, an "experimentalist", said that he "... disagree[s] with public data. ..." because he will only get "... After working for six years, Three papers. Out of two hundred authors ...".

Nathaniel goes on to say that "... there's a simple solution that should satisfy you theorists nicely: JOIN THE EXPERIMENT! ...".

A flaw in Nathaniel's solution is that not every theorist/analyst will get to be affiliated with the experiment collaboration.

It seems to me that a more comprehensive, even simpler, solution would be to make the data public, in a format that is the work-product of Nathaniel and his fellow experimenters, by a paper authored by Nathaniel and his fellow experimenters.
Then, any theorist/analyst (whether or not affiliated) should cite that paper, so that Nathaniel et al would have a very high citation rating.

Further, if any theorist/analyst might ask Nathaniel et al for help in understanding the data, Nathaniel et al should be listed as coauthors for providing such help.

I have tried to follow that spirit in stuff that I have written. For example, in my writings about Fermilab T-quark data, I give explicit credit to Erich Ward Varnes whose 1997 UC Berkeley PhD thesis contained data that I found very useful.

Tony Smith
http://www.valdostamuseum.org/hamsmith/</description>
		<content:encoded><![CDATA[<p>Nathaniel, an &#8220;experimentalist&#8221;, said that he &#8220;&#8230; disagree[s] with public data. &#8230;&#8221; because he will only get &#8220;&#8230; After working for six years, Three papers. Out of two hundred authors &#8230;&#8221;.</p>
<p>Nathaniel goes on to say that &#8220;&#8230; there&#8217;s a simple solution that should satisfy you theorists nicely: JOIN THE EXPERIMENT! &#8230;&#8221;.</p>
<p>A flaw in Nathaniel&#8217;s solution is that not every theorist/analyst will get to be affiliated with the experiment collaboration.</p>
<p>It seems to me that a more comprehensive, even simpler, solution would be to make the data public, in a format that is the work-product of Nathaniel and his fellow experimenters, by a paper authored by Nathaniel and his fellow experimenters.<br />
Then, any theorist/analyst (whether or not affiliated) should cite that paper, so that Nathaniel et al would have a very high citation rating.</p>
<p>Further, if any theorist/analyst might ask Nathaniel et al for help in understanding the data, Nathaniel et al should be listed as coauthors for providing such help.</p>
<p>I have tried to follow that spirit in stuff that I have written. For example, in my writings about Fermilab T-quark data, I give explicit credit to Erich Ward Varnes whose 1997 UC Berkeley PhD thesis contained data that I found very useful.</p>
<p>Tony Smith<br />
<a href="http://www.valdostamuseum.org/hamsmith/" rel="nofollow">http://www.valdostamuseum.org/hamsmith/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Nathaniel</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-100490</link>
		<dc:creator>Nathaniel</dc:creator>
		<pubDate>Fri, 30 Jun 2006 14:31:45 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-100490</guid>
		<description>I too, am an experimentalist (neutrinos) and I disagree with public data.  Technical issues have already been discussed, but here's the rub:

After working for six years on MINOS, I will get ONE (count 'em) paper.  OK, I'll be fair. Three papers. Out of two hundred authors.  If the data were made available publicly, then this paper wouldn't even get cited... some theorist would come along, do a slightly more sophisticated analysis, and the paper wouldn't even get cited.

Even worse, to make the data public we now have to publish the methods and documentation on how to use the data (which will NEVER just be a list of 4-vectors; there are correlations and resolution functions on every experiment) and that will take the experimentalists even more work.

Don't get me wrong.. I love what I do.  But I slave over computer code, measure crosstalk, invent calibration sources, crawl under dusty machines, travel, travel, travel, sit on interminable phone calls (every day) so that I can get those few weeks of analysing the data before everyone else.  Now I can't even do that?

Happily, there's a simple solution that should satisfy you theorists nicely: JOIN THE EXPERIMENT!   I need three more people in my calibration group to measure attenuation curves.  I need two more to get automated processing running and document things.   We need people to think deeply about statistics, and to make sure our MC models are good. We need people who understand the theory well to suggest what fits to make and the best way of presenting the data.  But, of course, that's a lot of work, so not many of you take us up on the offer.

---Nathaniel</description>
		<content:encoded><![CDATA[<p>I too, am an experimentalist (neutrinos) and I disagree with public data.  Technical issues have already been discussed, but here&#8217;s the rub:</p>
<p>After working for six years on MINOS, I will get ONE (count &#8216;em) paper.  OK, I&#8217;ll be fair. Three papers. Out of two hundred authors.  If the data were made available publicly, then this paper wouldn&#8217;t even get cited&#8230; some theorist would come along, do a slightly more sophisticated analysis, and the paper wouldn&#8217;t even get cited.</p>
<p>Even worse, to make the data public we now have to publish the methods and documentation on how to use the data (which will NEVER just be a list of 4-vectors; there are correlations and resolution functions on every experiment) and that will take the experimentalists even more work.</p>
<p>Don&#8217;t get me wrong.. I love what I do.  But I slave over computer code, measure crosstalk, invent calibration sources, crawl under dusty machines, travel, travel, travel, sit on interminable phone calls (every day) so that I can get those few weeks of analysing the data before everyone else.  Now I can&#8217;t even do that?  </p>
<p>Happily, there&#8217;s a simple solution that should satisfy you theorists nicely: JOIN THE EXPERIMENT!   I need three more people in my calibration group to measure attenuation curves.  I need two more to get automated processing running and document things.   We need people to think deeply about statistics, and to make sure our MC models are good. We need people who understand the theory well to suggest what fits to make and the best way of presenting the data.  But, of course, that&#8217;s a lot of work, so not many of you take us up on the offer.</p>
<p>&#8212;Nathaniel</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ars Mathematica &#187; Blog Archive &#187; Releasing LHC Data</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-100065</link>
		<dc:creator>Ars Mathematica &#187; Blog Archive &#187; Releasing LHC Data</dc:creator>
		<pubDate>Wed, 28 Jun 2006 06:15:51 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-100065</guid>
		<description>[...] I saw a story on Cosmic Variance that I found vaguely shocking. At the SUSY06 conference, there was a rancorous discussion about whether the data from the Large Hadron Collider should be made public. This is probably my ignorance about how high-energy physics works, but I have trouble believing that the answer is anything other than &#8220;of course&#8221; (perhaps after an embargo period to reward the people actually working on the detector). Some good news that comes out of the comment thread is that in astronomy such public data is readily available. [...]</description>
		<content:encoded><![CDATA[<p>[...] I saw a story on Cosmic Variance that I found vaguely shocking. At the SUSY06 conference, there was a rancorous discussion about whether the data from the Large Hadron Collider should be made public. This is probably my ignorance about how high-energy physics works, but I have trouble believing that the answer is anything other than &ldquo;of course&rdquo; (perhaps after an embargo period to reward the people actually working on the detector). Some good news that comes out of the comment thread is that in astronomy such public data is readily available. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Heffernan</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-88096</link>
		<dc:creator>David Heffernan</dc:creator>
		<pubDate>Sun, 25 Jun 2006 13:50:30 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-88096</guid>
		<description>On Belle we do make small amounts of data available on request, but only a fraction of the total data set. Students use it for high school science projects, for example.  Are there any other HEP experiments that do this?

I think the biggest problem with releasing data from the LHC experiments would be the sheer volume.  How much would CMS or ATLAS record in a day?  What kind of background reduction are people expecting here?</description>
		<content:encoded><![CDATA[<p>On Belle we do make small amounts of data available on request, but only a fraction of the total data set. Students use it for high school science projects, for example.  Are there any other HEP experiments that do this?</p>
<p>I think the biggest problem with releasing data from the LHC experiments would be the sheer volume.  How much would CMS or ATLAS record in a day?  What kind of background reduction are people expecting here?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Paolo Bizzarri</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-82003</link>
		<dc:creator>Paolo Bizzarri</dc:creator>
		<pubDate>Sat, 24 Jun 2006 20:24:04 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-82003</guid>
		<description>Sean, 

let me comment on this phrase of yours:

"In principle I'm in favor of releasing the data, in practice I doubt that it would work. Without an intimate knowledge of the idiosyncrasies of the detector, too many spurious results would be hard to resist."

My idea is that the release of data is really similar to release of software code in open source projects (my professional field).

For example, for software products like Netscape/Mozilla/Firefox, the code was originally proprietary and secret; then, there was the decision to release the code of the product itself. The idea was to create a large community of developers, able to contribute to the improvement of the product.

However, for more than one year after the release of the code, the contribution from the developers outside the original development team was minimal. There was a lot of interest from other developers, but they were not able to provide any significant change to the code.

The reason was understood shortly after. The code itself was only part of the knowledge that had built the product. Each line of code was the result of several decisions made by the developers, and contained assumptions that were not easy to make explicit.

In short, the code was the result of a long and complex process, but in order to contribute to the code, you first had to become part of the process. Only after the assumptions became clearer was it possible for other people to make significant contributions.

What is the relation I see with LHC data?

Data are the result of complex processes, where there is a lot of hidden knowledge that is necessary in order to understand what a number really means in a certain context. People outside the process cannot understand what the raw data really mean without a proper understanding of the process itself.

However, if the parallel I have made is at all significant, IT IS useful to make data available, as long as you understand that you have to make clear the process through which they are produced and elaborated.

Then, other people can make useful proposals on how to improve the understanding of the data. In fact, making the process public has significantly improved the process itself.</description>
		<content:encoded><![CDATA[<p>Sean, </p>
<p>let me comment on this phrase of yours:</p>
<p>&#8220;In principle I&#8217;m in favor of releasing the data, in practice I doubt that it would work. Without an intimate knowledge of the idiosyncrasies of the detector, too many spurious results would be hard to resist.&#8221;</p>
<p>My idea is that the release of data is really similar to release of software code in open source projects (my professional field).</p>
<p>For example, for software products like Netscape/Mozilla/Firefox, the code was originally proprietary and secret; then, there was the decision to release the code of the product itself. The idea was to create a large community of developers, able to contribute to the improvement of the product.</p>
<p>However, for more than one year after the release of the code, the contribution from the developers outside the original development team was minimal. There was a lot of interest from other developers, but they were not able to provide any significant change to the code.</p>
<p>The reason was understood shortly after. The code itself was only part of the knowledge that had built the product. Each line of code was the result of several decisions made by the developers, and contained assumptions that were not easy to make explicit.</p>
<p>In short, the code was the result of a long and complex process, but in order to contribute to the code, you first had to become part of the process. Only after the assumptions became clearer was it possible for other people to make significant contributions.</p>
<p>What is the relation I see with LHC data?</p>
<p>Data are the result of complex processes, where there is a lot of hidden knowledge that is necessary in order to understand what a number really means in a certain context. People outside the process cannot understand what the raw data really mean without a proper understanding of the process itself.</p>
<p>However, if the parallel I have made is at all significant, IT IS useful to make data available, as long as you understand that you have to make clear the process through which they are produced and elaborated. </p>
<p>Then, other people can make useful proposals on how to improve the understanding of the data. In fact, making the process public has significantly improved the process itself.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Amara</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-80083</link>
		<dc:creator>Amara</dc:creator>
		<pubDate>Sat, 24 Jun 2006 19:25:27 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-80083</guid>
		<description>No one has yet mentioned the &lt;a href="http://marsrovers.nasa.gov/gallery/all/spirit.html" rel="nofollow"&gt;Mars Rover data&lt;/a&gt; (link for Spirit). That is one of the most visible and successful open planetary science databases that exist now.</description>
		<content:encoded><![CDATA[<p>No one has yet mentioned the <a href="http://marsrovers.nasa.gov/gallery/all/spirit.html" rel="nofollow">Mars Rover data</a> (link for Spirit). That is one of the most visible and successful open planetary science databases that exist now.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Richard E.</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43388</link>
		<dc:creator>Richard E.</dc:creator>
		<pubDate>Sat, 24 Jun 2006 02:30:58 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43388</guid>
		<description>I have been thinking more about this, and I think the argument that "theorists will always make a hash of data analysis" is bogus.  

Again turning to the analogy with cosmology/astrophysics, I suspect many theorists in cosmology (myself included) are learning more about Bayesian statistics, priors, Markov Chains and all the rest of it than we would have ever dreamed. And I can certainly point to some deeply flawed papers in the literature that might never have seen the light if the raw data was not freely available.   However, theorists *can* learn this stuff, and don't like to look silly in public, so they have plenty of motivation for doing so.  

In the end the theorists will either learn enough of the subtleties to do it themselves, or work with experimentalists who know how to perform the relevant analyses. Many (most?) papers are flawed in some way, and the community would react to the flurry of theorist-written data-driven papers by hiking its overall level of skepticism a notch or two. Just as it did with the arrival of the Arxiv, which does an end-run around peer review (for what that is worth, but don't get me started)

As Sean pointed out above, one side-effect of the present system is that it is very hard for particle theorists to collaborate with experimentalists, if the whole collaboration needs to sign off on papers that any single member writes (and this is *after* the data is in the public domain).  Again speaking from my own experience, my foray into the world of data-analysis has largely been conducted in collaboration with someone who understood the issues involved at the outset (although not an "experimentalist" in the strict sense of the term), and it is a singularly productive mode of collaboration. To the extent that the "rules" of experimental particle physics discourage this sort of collaboration they are clearly counter-productive.

Secondly, the cosmological community has benefitted greatly from the development of the Cosmomc package which greatly simplifies the Monte Carlo Markov Chain analyses of cosmological data. (It is not theorist-proof however, as I have seen several publicly displayed figures that showed chains which, to my now practiced eye, were clearly unconverged).  My guess is that if more experimental particle physics data was made publicly available it would seed a small industry in the development of software tools that facilitated its analysis.</description>
		<content:encoded><![CDATA[<p>I have been thinking more about this, and I think the argument that &#8220;theorists will always make a hash of data analysis&#8221; is bogus.  </p>
<p>Again turning to the analogy with cosmology/astrophysics, I suspect many theorists in cosmology (myself included) are learning more about Bayesian statistics, priors, Markov Chains and all the rest of it than we would have ever dreamed. And I can certainly point to some deeply flawed papers in the literature that might never have seen the light if the raw data was not freely available.   However, theorists *can* learn this stuff, and don&#8217;t like to look silly in public, so they have plenty of motivation for doing so.  </p>
<p>In the end the theorists will either learn enough of the subtleties to do it themselves, or work with experimentalists who know how to perform the relevant analyses. Many (most?) papers are flawed in some way, and the community would react to the flurry of theorist-written data-driven papers by hiking its overall level of skepticism a notch or two. Just as it did with the arrival of the Arxiv, which does an end-run around peer review (for what that is worth, but don&#8217;t get me started)</p>
<p>As Sean pointed out above, one side-effect of the present system is that it is very hard for particle theorists to collaborate with experimentalists, if the whole collaboration needs to sign off on papers that any single member writes (and this is *after* the data is in the public domain).  Again speaking from my own experience, my foray into the world of data-analysis has largely been conducted in collaboration with someone who understood the issues involved at the outset (although not an &#8220;experimentalist&#8221; in the strict sense of the term), and it is a singularly productive mode of collaboration. To the extent that the &#8220;rules&#8221; of experimental particle physics discourage this sort of collaboration they are clearly counter-productive.</p>
<p>Secondly, the cosmological community has benefitted greatly from the development of the Cosmomc package which greatly simplifies the Monte Carlo Markov Chain analyses of cosmological data. (It is not theorist-proof however, as I have seen several publicly displayed figures that showed chains which, to my now practiced eye, were clearly unconverged).  My guess is that if more experimental particle physics data was made publicly available it would seed a small industry in the development of software tools that facilitated its analysis.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: adam</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43359</link>
		<dc:creator>adam</dc:creator>
		<pubDate>Sat, 24 Jun 2006 00:55:23 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43359</guid>
		<description>I'm on the side of the 'proprietary period then fully open' model for data distribution (so that the team get to use the data in the short term, then all the raw data and data products get released; the problem here is serving potentially large amounts of raw data, of course, so there might be some fees for getting the raw data).

The data belongs to taxpayers, so far as I'm concerned.</description>
		<content:encoded><![CDATA[<p>I&#8217;m on the side of the &#8216;proprietary period then fully open&#8217; model for data distribution (so that the team get to use the data in the short term, then all the raw data and data products get released; the problem here is serving potentially large amounts of raw data, of course, so there might be some fees for getting the raw data).</p>
<p>The data belongs to taxpayers, so far as I&#8217;m concerned.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: superweak</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43267</link>
		<dc:creator>superweak</dc:creator>
		<pubDate>Fri, 23 Jun 2006 20:05:49 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43267</guid>
		<description>There's a catch here: by the time a dataset is understood enough to be ready for release, the collaborations will, as Sean puts it, have swept up the low-hanging fruit.  It takes &lt;em&gt;years&lt;/em&gt; for complex detectors to be fully understood, and the calibrations, systematics checks, and corrections are on the whole done by people using that information to do an analysis.  (Even the Quaero public interface for testing hypotheses against D0's data restricts you to a few well-understood samples.)  Any early public release of all the data would most likely result in lots of junk preprints as people saw badly-understood detector effects and called them new physics -- if CDF II had just gone and immediately published the four-vectors rolling out of its reconstruction software, I'm sure someone would have noticed a huge excess of monojet+missing energy events.  Certainly there are theorists who are conversant with issues of triggers, fake rate, and such.  However they are not paid to sit around all day thinking about how the information from the detector could be &lt;em&gt;wrong&lt;/em&gt;, and experimentalists are.  

In my experience experimentalists are extremely suspicious of results, since they know what actually goes into making them -- hence the tradition of requiring confirmation from an independent experiment for discovery claims.  Without some kind of (at least short-term) data encapsulation, problems will arise: imagine experiments looking at each other's data! (Related reasons are behind the rise of "blind analyses," where a collaboration hides its data from &lt;em&gt;itself&lt;/em&gt;, for fear that it will find what it wants to find.)  Even if only a small fraction of a collaboration reads a paper thoroughly, that's still an awful lot of experience-years.

And finally, a point that vaguely amuses me: we are used to a feedback system where either (a) theorists predict something, experiment finds it, theorists claim vindication because the prediction was ante hoc instead of post hoc, or (b) experiment finds something unexpected, everyone scrambles to see how models could accommodate this result, some things can't and are excluded.  What happens to this waltz if the theorists get to look at the data at the same time the experimentalists do?

[Note: none of this implies that I don't think processed HEP data should be released after a (longish) while, or that short-term data release might not be a good thing if we find ourselves with a one-detector ILC.]</description>
		<content:encoded><![CDATA[<p>There&#8217;s a catch here: by the time a dataset is understood enough to be ready for release, the collaborations will, as Sean puts it, have swept up the low-hanging fruit.  It takes <em>years</em> for complex detectors to be fully understood, and the calibrations, systematics checks, and corrections are on the whole done by people using that information to do an analysis.  (Even the Quaero public interface for testing hypotheses against D0&#8217;s data restricts you to a few well-understood samples.)  Any early public release of all the data would most likely result in lots of junk preprints as people saw badly-understood detector effects and called them new physics &#8212; if CDF II had just gone and immediately published the four-vectors rolling out of its reconstruction software, I&#8217;m sure someone would have noticed a huge excess of monojet+missing energy events.  Certainly there are theorists who are conversant with issues of triggers, fake rate, and such.  However they are not paid to sit around all day thinking about how the information from the detector could be <em>wrong</em>, and experimentalists are.  </p>
<p>In my experience experimentalists are extremely suspicious of results, since they know what actually goes into making them &#8212; hence the tradition of requiring confirmation from an independent experiment for discovery claims.  Without some kind of (at least short-term) data encapsulation, problems will arise: imagine experiments looking at each other&#8217;s data! (Related reasons are behind the rise of &#8220;blind analyses,&#8221; where a collaboration hides its data from <em>itself</em>, for fear that it will find what it wants to find.)  Even if only a small fraction of a collaboration reads a paper thoroughly, that&#8217;s still an awful lot of experience-years.</p>
<p>And finally, a point that vaguely amuses me: we are used to a feedback system where either (a) theorists predict something, experiment finds it, theorists claim vindication because the prediction was ante hoc instead of post hoc, or (b) experiment finds something unexpected, everyone scrambles to see how models could accommodate this result, some things can&#8217;t and are excluded.  What happens to this waltz if the theorists get to look at the data at the same time the experimentalists do?</p>
<p>[Note: none of this implies that I don't think processed HEP data should be released after a (longish) while, or that short-term data release might not be a good thing if we find ourselves with a one-detector ILC.]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: The Story So Far&#8230; &#187; Blog Archive &#187; Swear To God. I Am Going To Start A Separate Science Blog Roll</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43261</link>
		<dc:creator>The Story So Far&#8230; &#187; Blog Archive &#187; Swear To God. I Am Going To Start A Separate Science Blog Roll</dc:creator>
		<pubDate>Fri, 23 Jun 2006 19:50:37 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43261</guid>
		<description>[...] Daily Kos diarist Darksyde give me a link to another good one, Cosmic Variance.&#160; Jo Ann blogs on the topic of who owns the data, and should it be made public... [...]</description>
		<content:encoded><![CDATA[<p>[...] Daily Kos diarist Darksyde give me a link to another good one, Cosmic Variance.&nbsp; Jo Ann blogs on the topic of who owns the data, and should it be made public&#8230; [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: anonymous</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43244</link>
		<dc:creator>anonymous</dc:creator>
		<pubDate>Fri, 23 Jun 2006 19:15:29 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43244</guid>
		<description>Collin writes:

&lt;i&gt;The main problem is that in order to make all the data public, you have to understand all the data in a very global way.&lt;/i&gt;

Of course, you know at least one experimentalist who claims to take such a global, um, vista.</description>
		<content:encoded><![CDATA[<p>Collin writes:</p>
<p><i>The main problem is that in order to make all the data public, you have to understand all the data in a very global way.</i></p>
<p>Of course, you know at least one experimentalist who claims to take such a global, um, vista.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lmot</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43229</link>
		<dc:creator>lmot</dc:creator>
		<pubDate>Fri, 23 Jun 2006 18:58:19 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43229</guid>
		<description>The fact is that theorists who cultivate relationships with the right experimentalists will often get a heads-up on upcoming discoveries months before they are made public.  This seems unfair and opens possibilities for corruption, but it is understandable why experimentalists would want to hold on to this power.</description>
		<content:encoded><![CDATA[<p>The fact is that theorists who cultivate relationships with the right experimentalists will often get a heads-up on upcoming discoveries months before they are made public.  This seems unfair and opens possibilities for corruption, but it is understandable why experimentalists would want to hold on to this power.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Peter Erwin</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43226</link>
		<dc:creator>Peter Erwin</dc:creator>
		<pubDate>Fri, 23 Jun 2006 18:46:30 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43226</guid>
		<description>&lt;i&gt;Finally, this is all a moving target. Algorithms change. Object ID changes as parts of the detector are better understood. At what point do you publish the data? When the experiment is over and no new changes are going to be made? As soon as it's well understood? Once you publish it, do you get to go back and change it? What happens if someone produces a bogus result on old data?&lt;/i&gt;

Some of these issues exist for astronomical archives.  For example, some of the instrumental idiosyncrasies of the Hubble Space Telescope become better understood over time, and the post-processing and corrections get updated (or new correction stages are introduced).  The archive implements an "on-the-fly" recalibration system, so if you request data from the archive, they are always processed with the latest approved algorithms and calibration files.  In a sense, the data are continually "re-published", and it's up to the archive users to make sure that the data they retrieved a year ago haven't been made obsolete by significant improvements in the calibration since then (this is quite rare in practice).

Of course, the fact that the HST data come in small, discrete chunks (be they images or spectra) makes it easier to implement such a system.  But "publishing" data from an instrument does not have to be a one-time, can't-go-back-and-fix-it affair.</description>
		<content:encoded><![CDATA[<p><i>Finally, this is all a moving target. Algorithms change. Object ID changes as parts of the detector are better understood. At what point do you publish the data? When the experiment is over and no new changes are going to be made? As soon as it&#8217;s well understood? Once you publish it, do you get to go back and change it? What happens if someone produces a bogus result on old data?</i></p>
<p>Some of these issues exist for astronomical archives.  For example, some of the instrumental idiosyncrasies of the Hubble Space Telescope become better understood over time, and the post-processing and corrections get updated (or new correction stages are introduced).  The archive implements an &#8220;on-the-fly&#8221; recalibration system, so if you request data from the archive, they are always processed with the latest approved algorithms and calibration files.  In a sense, the data are continually &#8220;re-published&#8221;, and it&#8217;s up to the archive users to make sure that the data they retrieved a year ago haven&#8217;t been made obsolete by significant improvements in the calibration since then (this is quite rare in practice).</p>
<p>Of course, the fact that the HST data come in small, discrete chunks (be they images or spectra) makes it easier to implement such a system.  But &#8220;publishing&#8221; data from an instrument does not have to be a one-time, can&#8217;t-go-back-and-fix-it affair.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Peter Erwin</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43225</link>
		<dc:creator>Peter Erwin</dc:creator>
		<pubDate>Fri, 23 Jun 2006 18:31:19 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43225</guid>
		<description>One of the things that makes astronomical archives so useful is the fact that many observations made for one purpose will contain data useful for other purposes.  A simple example would be a long-exposure image made to study a quasar, which will automatically also have data on galaxies and foreground stars in the same field of view.  These are irrelevant for the quasar study, but may turn out to be quite useful later for other projects.

So an interesting question is whether something like this might be true for LHC datasets.  Not being a high-energy physicist in any way, shape, or form, I have no idea.  (Thomas Dent's comment suggests that perhaps there could be such cases.)</description>
		<content:encoded><![CDATA[<p>One of the things that makes astronomical archives so useful is the fact that many observations made for one purpose will contain data useful for other purposes.  A simple example would be a long-exposure image made to study a quasar, which will automatically also have data on galaxies and foreground stars in the same field of view.  These are irrelevant for the quasar study, but may turn out to be quite useful later for other projects.</p>
<p>So an interesting question is whether something like this might be true for LHC datasets.  Not being a high-energy physicist in any way, shape, or form, I have no idea.  (Thomas Dent&#8217;s comment suggests that perhaps there could be such cases.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: collin</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43222</link>
		<dc:creator>collin</dc:creator>
		<pubDate>Fri, 23 Jun 2006 18:30:18 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43222</guid>
		<description>Fun topic... There are HEP experimentalists who believe the data should be made public. But it isn't as easy for the experimentalists as just writing out an ASCII file of four vectors of the objects in an event.

The main problem is that in order to make all the data public, you have to understand all the data in a very global way. This isn't the way particle physics of this scale is traditionally done. On any given analysis, you have some touchstones to ensure your sanity and some control regions to ensure your methods and some signal regions to measure or search for something. And all this is on just a small portion of the data. Generally, any given analysis will have some object ID requirements and some fudge factors, such as the probability for a jet to fake an electron or the k-factor applied to the leading order Z+3jet x-section. While these are somewhat standardized across analyses, not everybody working on every analysis will use the same thing. For example, not everybody on an experiment will agree on what objects are in a given event. If I'm measuring the W mass in the W-&#62;e nu channel, my definition of an electron is going to be much tighter (rightly so) than if I'm looking for ZZ-&#62;eeee where not only do I have four electrons, but I also have two mass constraints.

Another problem is that a list of four vectors from ATLAS will not mean the same thing as a list of four vectors from CMS. So then, not only do you have to have the data for each experiment, but also the Monte Carlo.

Finally, this is all a moving target. Algorithms change. Object ID changes as parts of the detector are better understood. At what point do you publish the data? When the experiment is over and no new changes are going to be made?  As soon as it's well understood?  Once you publish it, do you get to go back and change it? What happens if someone produces a bogus result on old data?</description>
		<content:encoded><![CDATA[<p>Fun topic&#8230; There are HEP experimentalists who believe the data should be made public. But it isn&#8217;t as easy for the experimentalists as just writing out an ASCII file of four vectors of the objects in an event.</p>
<p>The main problem is that in order to make all the data public, you have to understand all the data in a very global way. This isn&#8217;t the way particle physics of this scale is traditionally done. On any given analysis, you have some touchstones to ensure your sanity and some control regions to ensure your methods and some signal regions to measure or search for something. And all this is on just a small portion of the data. Generally, any given analysis will have some object ID requirements and some fudge factors, such as the probability for a jet to fake an electron or the k-factor applied to the leading order Z+3jet x-section. While these are somewhat standardized across analyses, not everybody working on every analysis will use the same thing. For example, not everybody on an experiment will agree on what objects are in a given event. If I&#8217;m measuring the W mass in the W-&gt;e nu channel, my definition of an electron is going to be much tighter (rightly so) than if I&#8217;m looking for ZZ-&gt;eeee where not only do I have four electrons, but I also have two mass constraints.</p>
<p>Another problem is that a list of four vectors from ATLAS will not mean the same thing as a list of four vectors from CMS. So then, not only do you have to have the data for each experiment, but also the Monte Carlo.</p>
<p>Finally, this is all a moving target. Algorithms change. Object ID changes as parts of the detector are better understood. At what point do you publish the data? When the experiment is over and no new changes are going to be made?  As soon as it&#8217;s well understood?  Once you publish it, do you get to go back and change it? What happens if someone produces a bogus result on old data?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Thomas Dent</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43210</link>
		<dc:creator>Thomas Dent</dc:creator>
		<pubDate>Fri, 23 Jun 2006 18:08:12 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43210</guid>
		<description>I spoke to one theorist at the conference (Dermisek... DE Kaplan is doing similar things) who has a model with a light Higgs decaying in non-standard channels and can match up to the largest excess of 'Higgs-like' events from LEP II ... now to really compare the model with data you need to look in more detail at what these events are. But that is still locked up inside what is left of LEP collaborations. So a model is de facto "untestable" if the data are being kept secret and none of the experimentalists feels like making an analysis of it. Maybe LEP data are being 'kept safe', but it makes no difference if they will never be seen or used again.

A few more points. "Experimental papers written by theorists will never be believed" - maybe true, but this doesn't prove that data should be secret. In fact the more true it is, the bigger disincentive there is for theorists to encounter raw data and the more likely that experimenters will have the field to themselves.

"Theorists might scoop experimental grad students" ... this seems in contradiction with the previous point. If experimentalists are so much better at dealing properly with detectors, backgrounds, etc. then there is no way they can get scooped because you need to control those aspects before making any credible claim of discovery. Something is a bit odd if a major scientific discovery has to be delayed until one student has the time to write up a thesis. But I don't think experiments really work like that.

There is no real theory-experiment conflict. The experimental work is for example designing and implementing a trigger and doing detector simulations for some type of signal and calibrating the detector and background once the machine is running, and without this there would be nothing at all for theorists to use. Morally, any 'discovery' paper should credit those experimentalists who were crucial to the existence of the data. But traditionally, individuals are *not* credited, the experimental collaboration publishes collectively and is cited collectively and no-one knows exactly who did what. The question is, how is the experiment run (democracy? benevolent dictator?), who decides when and how and by whom data analysis is done, how do individual experimenters get credit for their own work apart from word-of-mouth? These are not questions which involve theorists unless they are really competent to do data analysis. I don't think experimental secrecy will solve any problems.

How about if the equations of GR or supersymmetry were kept secret from anyone who wasn't a theorist...</description>
		<content:encoded><![CDATA[<p>I spoke to one theorist at the conference (Dermisek&#8230; DE Kaplan is doing similar things) who has a model with a light Higgs decaying in non-standard channels and can match up to the largest excess of &#8216;Higgs-like&#8217; events from LEP II &#8230; now to really compare the model with data you need to look in more detail at what these events are. But that is still locked up inside what is left of LEP collaborations. So a model is de facto &#8220;untestable&#8221; if the data are being kept secret and none of the experimentalists feels like making an analysis of it. Maybe LEP data are being &#8216;kept safe&#8217;, but it makes no difference if they will never be seen or used again.</p>
<p>A few more points. &#8220;Experimental papers written by theorists will never be believed&#8221; - maybe true, but this doesn&#8217;t prove that data should be secret. In fact the more true it is, the bigger disincentive there is for theorists to encounter raw data and the more likely that experimenters will have the field to themselves.</p>
<p>&#8220;Theorists might scoop experimental grad students&#8221; &#8230; this seems in contradiction with the previous point. If experimentalists are so much better at dealing properly with detectors, backgrounds, etc. then there is no way they can get scooped because you need to control those aspects before making any credible claim of discovery. Something is a bit odd if a major scientific discovery has to be delayed until one student has the time to write up a thesis. But I don&#8217;t think experiments really work like that.</p>
<p>There is no real theory-experiment conflict. The experimental work is for example designing and implementing a trigger and doing detector simulations for some type of signal and calibrating the detector and background once the machine is running, and without this there would be nothing at all for theorists to use. Morally, any &#8216;discovery&#8217; paper should credit those experimentalists who were crucial to the existence of the data. But traditionally, individuals are *not* credited, the experimental collaboration publishes collectively and is cited collectively and no-one knows exactly who did what. The question is, how is the experiment run (democracy? benevolent dictator?), who decides when and how and by whom data analysis is done, how do individual experimenters get credit for their own work apart from word-of-mouth? These are not questions which involve theorists unless they are really competent to do data analysis. I don&#8217;t think experimental secrecy will solve any problems.</p>
<p>How about if the equations of GR or supersymmetry were kept secret from anyone who wasn&#8217;t a theorist&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: anonymous</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43190</link>
		<dc:creator>anonymous</dc:creator>
		<pubDate>Fri, 23 Jun 2006 17:20:21 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43190</guid>
		<description>"As regards four-vectors: the processed data isn't a collection of particles with well-defined momenta, energies or even identities. It's a collection of tracks recorded in various parts of the detector."

Not really. The final production data consists of photons, electrons, muons, taus, and jets, constructed out of all of the things recorded in the detector.

Of course, this means some set of tracks and energy deposits and so on that have passed certain identification criteria, and one must understand fake rates and so on.

Nonetheless, the final objects of an analysis are, at some approximate level, just four-vectors labelled as a certain particle type, with extra supplementary information.

Of course, outsiders analyzing data are always troublesome: see e.g. de Boer's claims of dark matter discovery in EGRET data. On the other hand, after some reasonable length of time, it seems wise to make the data public. (How many interesting LEP analyses remain that aren't happening because experimenters have [reasonably] moved on?)</description>
		<content:encoded><![CDATA[<p>&#8220;As regards four-vectors: the processed data isn&#8217;t a collection of particles with well-defined momenta, energies or even identities. It&#8217;s a collection of tracks recorded in various parts of the detector.&#8221;</p>
<p>Not really. The final production data consists of photons, electrons, muons, taus, and jets, constructed out of all of the things recorded in the detector.</p>
<p>Of course, this means some set of tracks and energy deposits and so on that have passed certain identification criteria, and one must understand fake rates and so on.</p>
<p>Nonetheless, the final objects of an analysis are, at some approximate level, just four-vectors labelled as a certain particle type, with extra supplementary information.</p>
<p>Of course, outsiders analyzing data are always troublesome: see e.g. de Boer&#8217;s claims of dark matter discovery in EGRET data. On the other hand, after some reasonable length of time, it seems wise to make the data public. (How many interesting LEP analyses remain that aren&#8217;t happening because experimenters have [reasonably] moved on?)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Troublemaker</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43188</link>
		<dc:creator>Troublemaker</dc:creator>
		<pubDate>Fri, 23 Jun 2006 17:18:58 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43188</guid>
		<description>&lt;b&gt;Peter Woit&lt;/b&gt; said: &lt;i&gt;...the data is not readily comparable to astronomy data, which I gather is often the output of some conceptually rather simple device at a focal plane.&lt;/i&gt;

This is not always true.  Radio interferometry is conceptually quite intricate and requires that a fair amount of thinking be done in Fourier space.  GLAST, scheduled to be launched next year, is a stack of particle detectors and calorimeters.</description>
		<content:encoded><![CDATA[<p><b>Peter Woit</b> said: <i>&#8230;the data is not readily comparable to astronomy data, which I gather is often the output of some conceptually rather simple device at a focal plane.</i></p>
<p>This is not always true.  Radio interferometry is conceptually quite intricate and requires that a fair amount of thinking be done in Fourier space.  GLAST, scheduled to be launched next year, is a stack of particle detectors and calorimeters.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sean</title>
		<link>http://cosmicvariance.com/2006/06/23/should-the-data-be-public/#comment-43175</link>
		<dc:creator>Sean</dc:creator>
		<pubDate>Fri, 23 Jun 2006 16:19:10 +0000</pubDate>
		<guid isPermaLink="false">http://cosmicvariance.com/?p=889#comment-43175</guid>
		<description>In principle I'm in favor of releasing the data; in practice I doubt that it would work.  Without an intimate knowledge of the idiosyncrasies of the detector, too many spurious results would be hard to resist.

In fact I think that people tend to underestimate the extent to which the experimental collaborations will sweep up all the low-hanging fruit, when it comes to interpreting surprises in the data.  These folks know about all the major models, and they'll definitely put a lot of work into matching the data to the theories before they ever release any results.  Which is okay -- theorists, instead of ambulance-chasing, will be left to do the hard work of puzzling out the results that &lt;em&gt;don't&lt;/em&gt; fit into any of the popular models lying around.

One related (and, I would think, simpler to solve) problem is the closed nature of the collaboration-based publication process.  Given all the blessing and godparenting and so on that must take place before an analysis sees the light of day, it's hard-to-impossible for an experimentalist to actually collaborate directly with a theorist on attacking some particularly interesting puzzle.  And that is just a shame.</description>
		<content:encoded><![CDATA[<p>In principle I&#8217;m in favor of releasing the data; in practice I doubt that it would work.  Without an intimate knowledge of the idiosyncrasies of the detector, too many spurious results would be hard to resist.</p>
<p>In fact I think that people tend to underestimate the extent to which the experimental collaborations will sweep up all the low-hanging fruit, when it comes to interpreting surprises in the data.  These folks know about all the major models, and they&#8217;ll definitely put a lot of work into matching the data to the theories before they ever release any results.  Which is okay &#8212; theorists, instead of ambulance-chasing, will be left to do the hard work of puzzling out the results that <em>don&#8217;t</em> fit into any of the popular models lying around.</p>
<p>One related (and, I would think, simpler to solve) problem is the closed nature of the collaboration-based publication process.  Given all the blessing and godparenting and so on that must take place before an analysis sees the light of day, it&#8217;s hard-to-impossible for an experimentalist to actually collaborate directly with a theorist on attacking some particularly interesting puzzle.  And that is just a shame.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
