#### Assumptions about the data

- All data are scaled the same way: they all overlap, and no normalization or rotation is required
- All data have the same x values

The specific use case was categorizing results from repeated tests on the same family of hardware. Each test run uses exactly the same parameters, but the results vary due to noise, operator mistakes, etc.

#### Basic algorithm

- Make a training set of curves
- Classify each one in the set
- Get the average value of the good curves at each x value
- Take curve 1 from your training set
- At each x, get the distance from curve 1 to the average of the good curves
- Repeat for all curves, and store those distances by x value
- Take curve 1 from your test set
- At each x, get the distance from curve 1 to the average of the good curves
- Compare this with the distance at that x value from each curve in the training set
- Sum the errors across all x values for each curve in the training set
- The category of the training curves with the smallest summed error is the category of the test curve, provided the error is under some threshold
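The steps above can be sketched in a few lines of NumPy. This is my own minimal reading of the algorithm, not the notebook's code: it assumes the curves are rows of a 2D array sharing the same x values, category 0 is 'good', and the threshold is a placeholder.

```python
import numpy as np

def fit(train_curves, train_labels):
    """Precompute the good average and each training curve's per-x distance from it."""
    good_avg = train_curves[train_labels == 0].mean(axis=0)  # average good curve at each x
    train_dists = np.abs(train_curves - good_avg)            # per-x distance for every training curve
    return good_avg, train_dists

def classify(test_curve, good_avg, train_dists, train_labels, threshold=50.0):
    """Return the category of the closest training curve, or None if nothing is close enough."""
    test_dist = np.abs(test_curve - good_avg)            # per-x distance for the test curve
    errors = np.abs(train_dists - test_dist).sum(axis=1) # summed error vs. each training curve
    best = np.argmin(errors)
    if errors[best] > threshold:
        return None                                      # no training curve is close enough
    return train_labels[best]
```

Comparing distances-from-average rather than the raw curves is what makes +1 noise and -1 noise score the same.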

This is very similar to just comparing a test curve with each curve in the training set and finding the one with the smallest error. The only difference is that by comparing errors off of average, you get the same result for +1 noise and -1 noise.

This also allows for finer tuning of the algorithm. An obvious thing that you might want to do is say 'all points within some distance of the good average are OK', so you can set an offset above and below the good average and take distance from there.

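A minimal sketch of that tolerance band, with the band half-width as an assumed parameter: points inside the band contribute zero distance, and only the excess beyond it counts.

```python
import numpy as np

def banded_distance(curve, good_avg, band):
    """Per-x distance from the good average, ignoring anything within +/- band of it."""
    return np.maximum(np.abs(curve - good_avg) - band, 0.0)
```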
#### Simple example

To show this in action, I made a training set of 100 roughly sinusoidal curves. You can play with the code here:

https://colab.research.google.com/drive/1WDbzjyYgVDNcxPj_UEPliD6G4NVyO4pX

It has 25 curves per type and 4 different types:

- good/category 0: sin(x/10) + gaussian noise
- bad/category 1: sin(x/10) + sin(x/8) + gaussian noise
- bad/category 2: round(sin(x/10) + gaussian noise)
- bad/category 3: 0.25*sin(x/10) + gaussian noise
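Generating one curve per category might look like this. The x range and noise level here are guesses on my part; the notebook linked above has the actual values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(100)
noise = lambda: rng.normal(0, 0.1, x.size)  # assumed noise level

# One curve per category, following the formulas above
cat0 = np.sin(x / 10) + noise()                   # good
cat1 = np.sin(x / 10) + np.sin(x / 8) + noise()   # extra frequency
cat2 = np.round(np.sin(x / 10) + noise())         # quantized
cat3 = 0.25 * np.sin(x / 10) + noise()            # attenuated
```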

A sample training data set is plotted in the notebook linked above.

From there, I just followed the algorithm above. To determine a match, I used the closest 12 training curves. If more than 6 of the matches are of the same category, that's the apparent category of the test curve. The % of matches from that category is the confidence.
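That matching rule can be sketched like so, assuming `errors` already holds the summed error against each training curve (as computed earlier) and `train_labels` is a NumPy array:

```python
import numpy as np
from collections import Counter

def match(errors, train_labels, k=12, majority=6):
    """Vote among the k closest training curves; return (category, confidence) or (None, 0.0)."""
    nearest = np.argsort(errors)[:k]          # indices of the k smallest errors
    counts = Counter(train_labels[nearest])   # votes per category among the neighbors
    category, votes = counts.most_common(1)[0]
    if votes <= majority:
        return None, 0.0                      # no category has a clear majority
    return category, votes / k                # fraction of matches = confidence
```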

#### Speeding it up

This is a slow algorithm. For L training curves, M test curves, and N points per curve, the algorithm is O(LMN). That blows up quickly. There are many ways to make this faster. The ones most likely to apply for the specific type of data I was working with are:

- Exclude some ranges. For example, the signal that matters might be from 5,000 < x < 15,000, but data were collected from 0 to 50,000. Restricting the algorithm to 5,000 < x < 15,000 would reduce N and speed things up.
- Downsample the data. Simply take every 2nd, 5th, 10th, or whatever point from each curve to compare.
- Reduce the size of the training set. Do 20 curves per category for example.
- Stop early. If you know that a 'good' curve typically has a total error of < 50 and your test curve error is 22, you probably don't need to continue.
- Look for max errors. If the test curve's total error is 50, and two of the 1,000 x values gave an error of 48 combined, quickly scan to see if a category typically fails at those 2 x values. Consider also just testing x values that identify a category. If category 1 always has large errors at x = 5 and x = 37 and nowhere else, there's not much point in comparing the test curve with category 1 training curves at any x values other than 5 and 37.
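The first two speed-ups, plus the early stop, could look roughly like this; the range bounds, downsampling step, and cutoff are all placeholders:

```python
import numpy as np

def preprocess(x, curve, lo=5_000, hi=15_000, step=10):
    """Restrict to the x range of interest, then take every `step`-th point."""
    mask = (x > lo) & (x < hi)                  # exclude ranges outside the signal
    return x[mask][::step], curve[mask][::step]

def summed_error(test_dist, train_dist, stop=None):
    """Sum |test - train| across x, bailing out once the total exceeds `stop`."""
    total = 0.0
    for t, r in zip(test_dist, train_dist):
        total += abs(t - r)
        if stop is not None and total > stop:
            return total  # already worse than the cutoff; no need to continue
    return total
```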

#### Conclusion

So this was really crude, but it worked pretty well for my use case, and I hadn't seen it laid out anywhere else. Hopefully this helps someone.

I have a problem I need you to solve. Consider the following fact:

"By the end of this year [2014], more than half of all industrial emissions of carbon dioxide since the dawn of the Industrial Revolution will have been released since 1988 — the year it became widely known that these emissions are warming the climate.

I recently learned this startling fact from my colleague Richard Heede at the Climate Accountability Institute. Heede drew upon historic estimates of annual global carbon emissions from fossil fuel burning and cement manufacturing by the U.S. Department of Energy’s Carbon Dioxide Information Analysis Center (CDIAC) and the 2014 annual update on the global carbon budget and trends published by the Global Carbon Project (GCP), an international scientific research consortium studying the global carbon cycle.

https://blog.ucsusa.org/peter-frumhoff/global-warming-fact-co2-emissions-since-1988-764"

If you add subsequent CO2 equivalent greenhouse gas emissions since that statement was made, then I imagine the date after which "more than half of all CO2 emissions ever emitted" must move forward, say perhaps to the early 90's. What exact year would it be?

Using the numbers here:

https://www.icos-cp.eu/GCP/2018

and adding 2018, I get that half of all emissions have occurred since:

- 1990 including only emissions from fossil fuels and industry

- 1981 including emissions from fossil fuels, industry, and land use

1990 is the updated version of the 'fossil fuel burning and cement manufacturing' one that you referenced.
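For anyone who wants to redo this as new numbers come out: given any annual emissions series, the split year is the latest year such that emissions from that year onward are at least half the total. The figures below are made up purely to show the mechanics — use the GCP data linked above for real answers.

```python
def half_emissions_year(annual):
    """Return the latest year Y such that emissions from Y onward are >= half the total.

    `annual` maps year -> emissions; units cancel out, so any consistent unit works.
    """
    total = sum(annual.values())
    running = 0.0
    for year in sorted(annual, reverse=True):  # accumulate backward from the latest year
        running += annual[year]
        if running >= total / 2:
            return year
```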