Thread: FW: data quality metrics

Started: 2015-11-24 23:44:19

Last activity: 2015-11-25 02:29:09

Topics: FDSN Working Group II

This thread is from a mailing list that has moved to Google Groups. Use the following links to browse the updated archives.

FDSN Working Group II

Reinoud Sleeman

FW: data quality metrics

2015-11-24 23:44:19

Dear WG-II members,

I believe this email (early November) never reached you. Please find it again.

Please find attached the proposal for the definition of FDSN waveform quality metrics, as suggested during
the Prague meeting. After exploring the systems in place and/or in development at IRIS DMC and
ORFEUS EIDA, I found a number of basic metrics in common. The attached document describes the
proposed metrics; the green highlighted text marks what are, in my opinion, differences between the two
systems and/or definitions that require agreement.

I believe the two systems are pretty close, but some details must be defined slightly better and agreed upon.

Looking forward to your feedback.

Cheers,
Reinoud

View this message in Google Groups at https://groups.google.com/a/fdsn.org/d/msgid/fdsn-wg2-data/B59AB912AEFF914CB3F804834D5DAFCC012A02228D%40BCSML1.knmi.nl.

Rick Benson

Re: data quality metrics

2015-11-24 17:29:42

Hello WG-II
I have only one comment/suggestion: add one field to properly distinguish “unique” time series records
in the SEED domain.

On the opening page, a “time series” is defined as belonging to a data stream uniquely identified by a SEED network code, station code, channel code and location code.

However, I think that the Quality code, field 2 of the FSDH, should be included as well, so that there is NO ambiguity. It has been around since April 2004.
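
A minimal sketch of this suggestion, assuming a hypothetical `StreamKey` type and a hypothetical `make_key` helper (neither is part of the proposal or of any SEED library): once the quality code is part of the identifier, a raw ('R') and a quality-controlled ('Q') copy of the same channel no longer collapse into one time series. The codes "XX" and "STA01" are placeholders.

```python
from dataclasses import dataclass

# Data quality indicator values mentioned in this thread (D/R/Q/M).
VALID_QUALITY = {"D", "R", "Q", "M"}


@dataclass(frozen=True)
class StreamKey:
    """Hypothetical identifier of a time series; field names are assumptions."""
    network: str
    station: str
    location: str
    channel: str
    quality: str  # the additional field suggested above


def make_key(network, station, location, channel, quality):
    """Build a key, rejecting quality codes outside D/R/Q/M."""
    if quality not in VALID_QUALITY:
        raise ValueError(f"unknown quality code: {quality!r}")
    return StreamKey(network, station, location, channel, quality)


# Two copies of the same channel that differ only in quality code.
raw = make_key("XX", "STA01", "00", "BHZ", "R")
checked = make_key("XX", "STA01", "00", "BHZ", "Q")

# Metrics stored per key stay separate for the two copies.
# The values are placeholders, not real results.
metrics_by_stream = {raw: "metrics for raw data", checked: "metrics for QC'd data"}
```

Without the `quality` field the two keys would compare equal, and the second entry would silently overwrite the first.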

Thank you,
Rick

<Proposal definition QC metrics.pdf>
----------------------
FDSN Working Group II (http://www.fdsn.org/message-center/topic/fdsn-wg2-data/)

Sent via IRIS Message Center (http://www.fdsn.org/message-center/)
Update subscription preferences at http://www.fdsn.org/account/profile/

================
Rick Benson
Director of Data Management
IRIS DMC
(206) 547-0393 ext. 119 (office)
rick<at>iris.washington.edu

View this message in Google Groups at https://groups.google.com/a/fdsn.org/d/msgid/fdsn-wg2-data/3E9D92F3-2756-44CE-8630-952CF1497FDF%40iris.washington.edu.
- Florian Haslinger
  
  Re: data quality metrics
  
  2015-11-25 02:29:09
  
  Hi all,
  
  it took me a little while to figure out (I think) what Rick meant: the Quality Code (D/R/Q/M) would allow one to distinguish between copies of the same time-series stream that are quality checked and those that are not. Along those lines, any stream for which the quality metrics under discussion were computed would qualify as ‘checked’ (‘Q’), correct? (Or would it be ‘M’, because the data center had added the quality metrics?)
  Could this interpretation be extended such that a Quality Code of Q (or M, see above) would mean that the FDSN-agreed quality metrics are available for that data stream (segment)?
  
  One further comment / question on the proposal: in the definition of ‘Continuous time series’ (p7 bottom)
  a) there is a typo: 'ε is the time tolerance is s'
  should probably read 'ε is the time tolerance in s'
  
  b) the addition of the ‘0’ default tolerance might be impractical? Given that we usually qualify digitizers to have a timing accuracy of ~10e-4 / 10e-5, this in principle translates into a possible deviation of the actual from the nominal time (epsilon not zero), even though it may be practically quite impossible to determine precisely. But if one takes the ‘default 0’ seriously, almost all adjacent time series should have a ‘Gap’ between them?
  
  Further, there seems to be no way of defining a (non-default) value for epsilon?
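
To illustrate point b), here is a minimal sketch of how a tolerance ε changes the classification of the boundary between two adjacent segments. The function name, the integer-microsecond time representation and the rule "continuous if |Δ| ≤ ε" are assumptions made for illustration, not the wording of the proposal.

```python
def classify_boundary(expected_start_us: int, actual_start_us: int, epsilon_us: int = 0) -> str:
    """Classify the join between two adjacent segments of one stream.

    expected_start_us: time at which the next sample is expected, i.e. the
        time of the last sample of the previous segment plus one sample
        interval (integer microseconds).
    actual_start_us: start time of the following segment (integer microseconds).
    epsilon_us: tolerance in microseconds; 0 mimics the proposed default.
    """
    if epsilon_us < 0:
        raise ValueError("epsilon_us must be non-negative")
    delta_us = actual_start_us - expected_start_us
    if delta_us > epsilon_us:
        return "gap"
    if delta_us < -epsilon_us:
        return "overlap"
    return "continuous"


# The following segment starts 50 microseconds late.
strict = classify_boundary(10_000_000, 10_000_050)                    # default epsilon of 0
tolerant = classify_boundary(10_000_000, 10_000_050, epsilon_us=100)  # 100 microsecond tolerance

# The following segment starts 50 microseconds early.
early = classify_boundary(10_000_000, 9_999_950)
```

With the default of 0 the 50 microsecond deviation is reported as a gap, while a 100 microsecond tolerance treats the same boundary as continuous. Making `epsilon_us` a parameter is one possible way to allow a non-default value.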
  
  kind regards,
  florian
  
  
  View this message in Google Groups at https://groups.google.com/a/fdsn.org/d/msgid/fdsn-wg2-data/D5ACD243-9FBA-4653-A10E-D339F4492793%40sed.ethz.ch.
  - Dan Auerbach
    
    Re: data quality metrics
    
    2015-11-24 19:14:49
    
    Not sure I follow all of the potential subtleties, but I support the idea that the Quality Code of a time series for which metrics have been computed should be known. While an end user/consumer may not care (much) about the Quality Code if they are using the metrics to evaluate data quality, for Data Centers that perform QC and produce Q data, it is certainly helpful to know whether metrics have been computed using Q versus R data.
    
    Best,
    Dan
    
    Dan Auerbach, Application Developer
    Project IDA Data Coordinating Center, Rm 2120
    Institute of Geophysics & Planetary Physics, MS 0225
    Scripps Institution of Oceanography, UC San Diego
    La Jolla, CA 92093
    858-822-0797
    
    
    View this message in Google Groups at https://groups.google.com/a/fdsn.org/d/msgid/fdsn-wg2-data/0129F3C4-F58F-48CF-9F0D-F14395535176%40ucsd.edu.
    
    Doug Neuhauser
    
    Re: data quality metrics
    
    2015-11-25 01:08:26
    
    Comments on the FDSN Proposed Data Quality Metrics:
    
    1. In several places the time window is stated as
    [t0,t1]
    but the description then goes on to define this as a half-open
    (or half-closed) interval that does not include t1.
    The proper international notation (ISO 31-11) for this half-open interval is:
    [t0,t1)
    See: https://en.wikipedia.org/wiki/ISO_31-11
    
    2. I believe that the metric
    ms_timing_correction
    should be defined as:
    Number of records fitting time window [t0,t1) in which the MiniSEED
    "time correction" field is non-zero.
    
    The reason for this is that a time correction can be made to the record
    by having a non-zero value in the timing correction field, but NOT added
    into the "Record start time", in which case bit 1 of the Activity Flag
    will be set to 0. I believe my proposed definition handles both cases.
    
    3. I suspect that computing metrics for gaps and overlaps using a time
    tolerance epsilon of 0 is overkill and may not convey useful information,
    depending on the timestamp resolution. "Standard" MiniSEED has a time
    resolution of 0.0001 seconds, but the addition of blockette 1001
    adds resolution to 1 microsecond (0.000001 seconds). Since the
    ms_timing_quality metric uses values from blockette 1001, I assume that the
    timestamp of the time series will also use the microsecond resolution
    (usec99) of blockette 1001 if it is available.
    
    The number of gaps and overlaps will vary depending on the precision
    of the timestamp. Do users really care if there is dithering in the
    1 microsecond timestamp that may occur when maintaining a phase-locked loop?
    Even in a closely temperature-controlled environment I see the microseconds
    field changing by ±2 microseconds during a 24-hour period.
    
    Data streams that have timestamps of 1 microsecond resolution
    will certainly have a larger number of reported gaps and overlaps
    than streams that use the standard 0.0001 MiniSEED clock resolution.
    
    However, if we do NOT use an epsilon of 0, we need to standardize
    on what value to use, and the value may be sample-rate dependent.
    
    4. Many of the metrics are defined as "the number of records"
    that match some criteria. Is this a meaningful value, given that the
    record sizes of data streams are not identical? Common MiniSEED record
    sizes are 512, 1024, 2048, and 4096 bytes, and even with a
    given record size, the number of samples may vary by a factor of
    6 or more depending on the type of compression. Would we do better
    to compute these metrics based on the number of seconds of data
    or the number of data samples rather than the number of MiniSEED records?
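
Points 1, 2 and 4 can be sketched together. This is an illustration under stated assumptions, not the proposal's definition: the `Record` type and its field names are hypothetical, a record is taken to "fit" a window when its start time falls in the half-open interval [t0,t1), and the example values are made up.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a MiniSEED record; field names are assumptions."""
    start: float          # record start time in seconds
    num_samples: int      # number of samples in the record
    sample_rate: float    # samples per second
    time_correction: int  # value of the "time correction" header field


def in_window(rec: Record, t0: float, t1: float) -> bool:
    """Half-open window [t0, t1): a record starting exactly at t1 is excluded."""
    return t0 <= rec.start < t1


def timing_correction_summary(records, t0, t1):
    """Summarise records in [t0, t1) whose time correction field is non-zero.

    Only the field value is tested, so the result does not depend on whether
    the correction has already been applied to the record start time.
    """
    hits = [r for r in records if in_window(r, t0, t1) and r.time_correction != 0]
    return {
        "records": len(hits),
        "samples": sum(r.num_samples for r in hits),
        "seconds": sum(r.num_samples / r.sample_rate for r in hits),
    }


# Made-up records with different lengths and sample rates.
records = [
    Record(start=0.0, num_samples=400, sample_rate=40.0, time_correction=0),
    Record(start=10.0, num_samples=200, sample_rate=40.0, time_correction=5),
    Record(start=15.0, num_samples=100, sample_rate=20.0, time_correction=-3),
    Record(start=20.0, num_samples=400, sample_rate=40.0, time_correction=7),
]

first_window = timing_correction_summary(records, 0.0, 20.0)
second_window = timing_correction_summary(records, 20.0, 40.0)
```

The record that starts at exactly 20.0 is counted only in the second window, so adjacent windows never count a record twice. Reporting samples and seconds next to the record count shows how differently the same condition can be weighted when record lengths vary.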
    
    - Doug N
    
    
    --
    Doug Neuhauser
    University of California, Berkeley
    Berkeley Seismological Laboratory
    221 McCone Hall # 4760
    Berkeley, CA 94720-4760
    doug<at>seismo.berkeley.edu
    Office: 510-642-0931
    Fax: 510-643-5811
    Remote: 530-752-5615 (Wed,Fri)
    
    View this message in Google Groups at https://groups.google.com/a/fdsn.org/d/msgid/fdsn-wg2-data/56550A0A.3020202%40seismo.berkeley.edu.
