Simple Algorithm for Classifying Curves ~ Random Problems

I needed to classify curves recently and the tutorials I found for classification algorithms all used single points instead of curves. I settled on something crude but effective for my use case.

Assumptions about the data

All data are scaled the same way...they all overlap, and no normalization and/or rotations are required
All data have the same x values

The specific use case was categorizing results from repeated tests on the same family of hardware. Each test run is exactly the same parameters, but the results vary from noise, operator mistakes, etc.

Basic algorithm

Make a training set of curves
Classify each one in the set
Get the average value of the good curves at each x value
Take curve 1 from your training set
At each x, get the distance from curve 1 to the average of the good curves
Repeat for all curves, and store those distances by x value
Take curve 1 from your test set
At each x, get distance the from curve 1 to the average of the good curves
Compare this with the distance at that x value from each curve in the training set
Sum the errors across all x values for each curve in the training set
The classification of curves that have the smallest error here are the classification of the test curve if the distance is under some threshold

This is very similar to just comparing a test curve with each curve in the training set and finding the one with the smallest error. The only difference is that by comparing errors off of average, you get the same result for +1 noise and -1 noise.

This also allows for finer tuning of the algorithm. An obvious thing that you might want to do is say 'all points within x distance from the good average are ok', so you can set an offset above and below the good average and take distance from there.

Simple example

To show this in action, I made a training set of 100 roughly sinusoidal curves. You can play with the code here:

https://colab.research.google.com/drive/1WDbzjyYgVDNcxPj_UEPliD6G4NVyO4pX

It has 25 curves per type and 4 different types:

good/category 0: sin(x/10) + gaussian noise
bad/category 1: sin(x/10) + sin(x/8) + gaussian noise
bad/category 2: round(sin(x/10) + gaussian noise
bad category 3: 0.25*sin(x/10) + gaussian noise

A sample training data set looks like:

From there, I just followed the algorithm above. To determine a match, I used the closest 12 training curves. If more than 6 of the matches are of the same category, that's the apparent category of the test curve. The % of matches from that category is the confidence.

Speeding it up

This is a slow algorithm. For L training curves, M test curves, and N points per curve, the algorithm is O(LMN). That blows up quickly. There are many ways to make this faster. The ones most likely to apply for the specific type of data I was working with are:

Exclude some ranges. For example, the signal that matters might be from 5,000 < x < 15,000, but data were collected from 0 to 50,000. Restricting the algorithm to 5,000 < x < 15,000 would reduce N and speed things up.
Downsample the data. Simply take every 2nd, 5th, 10th, or whatever point from each curve to compare.
Reduce the size of the training set. Do 20 curves per category for example.
Stop early. If you know that a 'good' curve typically has a total error of < 50 and your test curve error is 22, you probably don't need to continue.
Look for max errors. If the test curve's total error is 50, and two of the 1,000 x values gave an error of 48 combined, quickly scan to see if a category typically fails at those 2 x values. Consider also just testing x values that identify a category. If category 1 always has large errors at x = 5 and x = 37 and nowhere else, there's not much point in comparing the test curve with category 1 training curves at any x values other than 5 and 37.

Conclusion

So this was really crude but it worked pretty well for my use case and I hadn't seen it laid out anywhere else. Hopefully this helps someone.

3 comments:

MikeMarch 17, 2019 at 1:56 PM
I have a problem I need you to solve. Consider the following fact:

"By the end of this year [2014], more than half of all industrial emissions of carbon dioxide since the dawn of the Industrial Revolution will have been released since 1988 — the year it became widely known that these emissions are warming the climate.

I recently learned this startling fact from my colleague Richard Heede at the Climate Accountability Institute. Heede drew upon historic estimates of annual global carbon emissions from fossil fuel burning and cement manufacturing by the U.S. Department of Energy’s Carbon Dioxide Information Analysis Center (CDIAC) and the 2014 annual update on the global carbon budget and trends published by the Global Carbon Project (GCP), an international scientific research consortium studying the global carbon cycle.

https://blog.ucsusa.org/peter-frumhoff/global-warming-fact-co2-emissions-since-1988-764"

If you add subsequent CO2 equivalent greenhouse gas emissions since that statement was made, then I imagine the date after which "more than half of all CO2 emissions ever emitted" must move forward, say perhaps to the early 90's. What exact year would it be?
เกมสล็อตAugust 12, 2021 at 3:28 PM
🔥 IPRO356 📌 ไม่มีที่ไหนกล้า ให้ได้มากเท่านี้อีกแล้ว มีเพียง USER เดียว เล่นจบ ครบทุกอย่าง
🎰สล็อต♥️คาสิโน♠️บาคาร่า
⚽️กีฬา💥ไฮโล🎲หวย🔢

✨สุ่มแจกรางวัล Jackpot ให้กับผู้เล่นทุกชั่วโมง
💸 รางวัลละ 5,000-10,000บาท กว่า100รางวัล

Random Problems

Here are solutions to some random problems

Saturday, March 16, 2019

Simple Algorithm for Classifying Curves

Assumptions about the data

Basic algorithm

Simple example

Speeding it up

Conclusion

3 comments:

Popular Posts

Categories

Search