International Federation of Digital Seismograph Networks

Thread: FW: data quality metrics

Started: Nov. 24, 2015, 3:44 p.m.
Last activity: Nov. 24, 2015, 6:29 p.m.
Reinoud Sleeman
Nov. 24, 2015, 3:44 p.m.


Dear WG-II members,

I believe this email (early November) never reached you. Please find it again.

Please find the proposal for the definition of FDSN waveform quality metrics as suggested during
the Prague meeting. After exploring the systems in place and/or in development at IRIS DMC and
ORFEUS EIDA there are a number of basic metrics in common. The attached document describes the
proposed metrics; the green highlighted text marks what are, in my opinion, differences between the two
systems and/or definitions that require agreement.

I believe the 2 systems are pretty close but some details must be defined slightly better and agreed upon.

Looking forward to your feedback.

Cheers,
Reinoud

  • Rick Benson
    Nov. 24, 2015, 9:29 a.m.
    Hello WG-II
    I have only 1 comment/suggestion, namely adding one additional field to properly distinguish “unique” time series records
    in the SEED domain.

    A “time series” is defined on the opening page as being:
    to belong to a data stream uniquely identified by a SEED network code, station code, channel code and location code.

    However, I think that the Quality code should be included from field 2 of the FSDH that’s been around since April 2004, so that there is NO ambiguity.

    Thank you,
    Rick
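To illustrate the suggestion above, here is a minimal sketch of a stream identifier that carries the quality code alongside the four SEED codes. The class name `StreamKey`, the helper `group_records` and the sample codes are hypothetical and are not part of the proposal.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamKey:
    """Hypothetical identifier for one time series, quality code included."""
    network: str
    station: str
    location: str
    channel: str
    quality: str  # data quality indicator (FSDH field 2): D, R, Q or M


def group_records(records):
    """Group (key, payload) pairs so that metrics can be computed per key."""
    groups = defaultdict(list)
    for key, payload in records:
        groups[key].append(payload)
    return dict(groups)


# Made-up example: the same channel archived once as raw (R) data
# and once as quality-controlled (Q) data.
records = [
    (StreamKey("XX", "ABC", "00", "BHZ", "R"), "record 1"),
    (StreamKey("XX", "ABC", "00", "BHZ", "Q"), "record 2"),
    (StreamKey("XX", "ABC", "00", "BHZ", "R"), "record 3"),
]
groups = group_records(records)
print(len(groups))  # the R and Q versions end up in separate groups
```

Without the `quality` field, all three records would fall into a single group and the metrics would mix the R and Q versions.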


    <Proposal definition QC metrics.pdf>
    ----------------------
    FDSN Working Group II (http://www.fdsn.org/message-center/topic/fdsn-wg2-data/)

    Sent via IRIS Message Center (http://www.fdsn.org/message-center/)
    Update subscription preferences at http://www.fdsn.org/account/profile/
    ================
    Rick Benson
    Director of Data Management
    IRIS DMC
    (206)547-0393 ext. 119(office)
    rick<at>iris.washington.edu





    • Florian Haslinger
      Nov. 24, 2015, 6:29 p.m.
      Hi all,

      it took me a little while to figure out (I think) what Rick meant: the Quality Code (D/R/Q/M) would make it possible to distinguish between versions of the same time-series stream that are quality checked and those that are not. Along those lines, any stream for which the quality metrics under discussion were computed would qualify as ‘checked’ (‘Q’), correct? (Or would it be ‘M’, because the data center had added the quality metrics?)
      Could this interpretation be extended such that a Quality Code of Q (or M, see above) would mean that the FDSN-agreed quality metrics are available for that data stream (segment)?

      One further comment / question to the proposal: In the definition of ‘Continuous time series’ (p7 bottom)
      a) there is a typo 'ε is the time tolerance is s'
      should probably read 'ε is the time tolerance in s'

      b) the addition of the ‘0’ default tolerance might be impractical? Given that we usually qualify digitizers to have a timing accuracy of ~10e-4 / 10e-5, this in principle translates into a possible deviation of the actual timing from the nominal (epsilon not zero), even though it may be practically impossible to determine precisely. But if one takes the ‘default 0’ seriously, almost all adjacent time series should have a ‘Gap’ between them?

      Further, there seems to be no way of defining a (non-default) value for epsilon?
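The effect of the tolerance can be sketched as follows. The rule used here is an assumption made for illustration and is not taken from the proposal: the next segment is expected to start one sample interval after the last sample of the previous one, and a deviation larger than epsilon counts as a gap or an overlap. The function name and the integer-microsecond units are likewise hypothetical.

```python
def classify_boundary(prev_last_us, next_first_us, interval_us, epsilon_us=0):
    """Classify the boundary between two adjacent segments.

    All times are integer microseconds. Assumed rule: the next segment
    should start one sample interval after the previous segment's last
    sample; deviations beyond epsilon_us are reported.
    """
    deviation = next_first_us - (prev_last_us + interval_us)
    if deviation > epsilon_us:
        return "gap"
    if deviation < -epsilon_us:
        return "overlap"
    return "continuous"


# 100 Hz data (10,000 microsecond interval); the next segment starts 2 microseconds late.
strict = classify_boundary(1_000_000, 1_010_002, 10_000)
tolerant = classify_boundary(1_000_000, 1_010_002, 10_000, epsilon_us=50)
print(strict, tolerant)
```

With the default of 0 the 2-microsecond deviation is reported as a gap, while a caller-supplied tolerance of 50 microseconds treats the same boundary as continuous.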

      kind regards,
      florian





      • Dan Auerbach
        Nov. 24, 2015, 11:14 a.m.
        Not sure I follow all of the potential subtleties, but I support the idea that the Quality Code of a time-series for which metrics have been computed should be known. While an end-user/consumer may not care (much) about the Quality Code if they are using the metrics to evaluate data quality, for Data Centers that perform QC and produce Q data, it is certainly helpful to know whether metrics have been computed using Q versus R.

        Best,
        Dan


        Dan Auerbach, Application Developer
        Project IDA Data Coordinating Center, Rm 2120
        Institute of Geophysics & Planetary Physics, MS 0225
        Scripps Institution of Oceanography, UC San Diego
        La Jolla, CA 92093
        858-822-0797





        • Doug Neuhauser
          Nov. 24, 2015, 5:08 p.m.
          Comments on the FDSN Proposed Data Quality Metrics:

          1. In several places the time window is stated as
          [t0,t1]
          but the description then goes on to define this as a half-open
          (or half-closed) interval that does not include t1.
          The proper international notation (ISO 31-11) for this half-open interval is:
          [t0,t1)
          See: https://en.wikipedia.org/wiki/ISO_31-11
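A small sketch of why the half-open convention matters when consecutive windows are evaluated: a timestamp that falls exactly on a shared boundary is counted once. The function name and the sample times are made up for illustration.

```python
def in_window(t, t0, t1):
    """True if t lies in the half-open interval [t0, t1)."""
    return t0 <= t < t1


# Made-up record start times (seconds) and two consecutive windows.
times = [0, 10, 20, 30]
first = [t for t in times if in_window(t, 0, 20)]
second = [t for t in times if in_window(t, 20, 40)]
print(first, second)

# With closed intervals [t0, t1], the time 20 would be counted in both windows.
closed_first = [t for t in times if 0 <= t <= 20]
```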

          2. I believe that the metric
          ms_timing_correction
          should be defined as:
          Number of records fitting time window [t0,t1) in which the MiniSEED
          "time correction" field is non-zero.

          The reason for this is that a time correction can be made to the record
          by having a non-zero value in the timing correction field, but NOT added
          into the "Record start time", in which case bit 1 of the Activity Flag
          will be set to 0. I believe my proposed definition handles both cases.
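The difference between the two definitions can be sketched as below. The `Record` class is a hypothetical, already-decoded subset of the MiniSEED fixed header, and the sketch assumes that a record "fits" the window when its start time lies in [t0,t1); neither is taken from the proposal.

```python
from dataclasses import dataclass

TIME_CORRECTION_APPLIED = 0x02  # bit 1 of the activity flags


@dataclass
class Record:
    """Hypothetical decoded subset of a MiniSEED fixed header."""
    start: float          # record start time in seconds
    time_correction: int  # time correction field, in units of 0.0001 s
    activity_flags: int   # activity flags byte


def ms_timing_correction(records, t0, t1):
    """Count records in [t0, t1) whose time correction field is non-zero."""
    return sum(1 for r in records
               if t0 <= r.start < t1 and r.time_correction != 0)


def flag_based_count(records, t0, t1):
    """For comparison: count records in [t0, t1) with activity flag bit 1 set."""
    return sum(1 for r in records
               if t0 <= r.start < t1
               and r.activity_flags & TIME_CORRECTION_APPLIED)


records = [
    Record(0.0, 0, 0x00),   # no correction
    Record(10.0, 5, 0x02),  # correction present and applied to the start time
    Record(20.0, 5, 0x00),  # correction present but not applied
]
by_field = ms_timing_correction(records, 0.0, 30.0)
by_flag = flag_based_count(records, 0.0, 30.0)
print(by_field, by_flag)
```

The field-based count includes the third record, which a definition based only on the activity flag would miss.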

          3. I suspect that computing metrics for gaps and overlaps using a time
          tolerance epsilon of 0 is overkill and may not convey useful information,
          depending on the timestamp resolution. "Standard" MiniSEED has a time
          resolution of 0.0001 seconds, but the addition of blockette 1001
          adds resolution to 1 microsecond, 0.000001 seconds. Since ms_timing_quality
          uses values from blockette 1001, I assume that the timestamp
          of the time series will also use the microsecond resolution (usec99) of
          blockette 1001 if it is available.

          The number of gaps and overlaps will vary depending on the precision
          of the timestamp. Do users really care if there is dithering in the
          1 microsecond timestamp that may occur when maintaining a phase lock loop?
          Even in a closely temperature controlled environment I see the microseconds
          field changing by +- 2 microseconds during a 24 hour period.

          Data streams that have timestamps of 1 microsecond resolution
          will certainly have a larger number of reported gaps and overlaps
          than streams that use the standard 0.0001 MiniSEED clock resolution.

          However, if we do NOT use an epsilon of 0, we need to standardize
          on what value to use, and the value may be sample-rate dependent.
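The dependence on timestamp precision can be illustrated with a few made-up record start times that wander by a couple of microseconds. The function name, the record duration and the rounding scheme are assumptions chosen for the sketch, not part of the proposal.

```python
def count_discontinuities(starts_us, duration_us, resolution_us, epsilon_us=0):
    """Count record boundaries that deviate from the expected start time.

    Start times (integer microseconds) are first rounded to the given
    timestamp resolution; a boundary is counted when the next start differs
    from previous start + duration by more than epsilon_us.
    """
    half = resolution_us // 2
    quantized = [(t + half) // resolution_us * resolution_us for t in starts_us]
    count = 0
    for prev, nxt in zip(quantized, quantized[1:]):
        if abs(nxt - (prev + duration_us)) > epsilon_us:
            count += 1
    return count


# Made-up records of 5.12 s each, with start times dithering by a few microseconds.
starts = [0, 5_120_002, 10_239_999, 15_360_001]
duration = 5_120_000

at_1_us = count_discontinuities(starts, duration, resolution_us=1)
at_100_us = count_discontinuities(starts, duration, resolution_us=100)
at_1_us_eps = count_discontinuities(starts, duration, resolution_us=1, epsilon_us=5)
print(at_1_us, at_100_us, at_1_us_eps)
```

With an epsilon of 0, every boundary is flagged at 1-microsecond resolution and none at 0.0001-second resolution; a small non-zero epsilon removes the difference for this data.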

          4. Many of the metrics are defined as "the number of records"
          that match some criterion. Is this a meaningful value, given that the
          record sizes of data streams are not identical? Common MiniSEED record
          sizes are 512, 1024, 2048, and 4096 bytes, and even with a
          given record size, the number of samples may vary by a factor of
          6 or more depending on the type of compression. Would we do better
          to compute these metrics based on the number of seconds of data
          or the number of data samples rather than the number of MiniSEED records?
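A rough sketch of the comparison, using made-up numbers: two streams at the same sample rate with the same number of flagged records, but very different samples per record. The tuple layout and the function name are hypothetical.

```python
def flagged_totals(records):
    """Summarize flagged records as (record count, sample count, seconds).

    Each record is a tuple (num_samples, sample_rate_hz, flagged).
    """
    n_records = sum(1 for n, rate, flagged in records if flagged)
    n_samples = sum(n for n, rate, flagged in records if flagged)
    seconds = sum(n / rate for n, rate, flagged in records if flagged)
    return n_records, n_samples, seconds


# Made-up streams, both at 100 Hz: small records holding 400 samples each
# versus large records holding 3500 samples each.
stream_a = [(400, 100.0, True)] * 10
stream_b = [(3500, 100.0, True)] * 10

totals_a = flagged_totals(stream_a)
totals_b = flagged_totals(stream_b)
print(totals_a, totals_b)
```

Both streams report 10 flagged records, yet the flagged data spans 40 seconds in one case and 350 seconds in the other, which is the ambiguity raised in the question.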

          - Doug N



          --
          ------------------------------------------------------------------------
          Doug Neuhauser University of California, Berkeley
          doug<at>seismo.berkeley.edu Berkeley Seismological Laboratory
          Office: 510-642-0931 221 McCone Hall # 4760
          Fax: 510-643-5811 Berkeley, CA 94720-4760
          Remote: 530-752-5615 (Wed,Fri)