Aerial Image Segmentation with PyTorch About Guided Projects In this Guided Project, you will get assigned a cloud desktop that has all the required software pre-installed. This will allow you to follow along with the instructor to complete the above mentioned tasks. After all, we learn best with active, hands-on learning.
# Computer Vision course materials
## Module 1: Getting Started with OpenCV in Python
## Module 2: Video IO and GUI
## Module 3: Binary Image Processing
## Module 4: Image Enhancement and Filtering
## Module 5: Advanced Image Processing and Computational Photography
## Module 6: Geometric Transforms and Image Features
## Module 7: Image Segmentation and Recognition
## Module 8: Video Analysis
A fully functioning object detection system
that runs on an embedded system.
Computer vision, or the idea of having computers
interpret images and videos,
has been around since the 1960s.
A lot of advancements in technology
are possible because of computer vision.
One of the most notable applications
in recent years is self-driving cars,
which rely on a variety of
sensors to survey their surroundings.
A few of these sensors often involve cameras and require
complex vision models in order to detect
various objects and road signs
to assist the driving algorithms.
Without such models, the car would not be able to figure
out what's a pedestrian or another vehicle to avoid.
Computer vision has plenty of uses in manufacturing too.
You could use anomaly detection
on images to look for things
like rust or mechanical parts not moving correctly.
Many robots rely on cameras and
vision algorithms to figure out where to place an object,
turn a screw, or weld two pieces of metal together.
For example, this engineer built
a robotic arm in his garage that
automatically looks for the charging port on
his Tesla and attaches the charging plug.
He does this with a Raspberry Pi 4.
Vision systems can also be
used to identify different types
of components that need to be sorted
or picked for an assembly process.
This is something we will explore in this course.
Computer vision can be used for analyzing
satellite images to look for things
like wildfires and deforestation.
My Wyze camera here has
a built-in person detection model that can
alert me whenever it sees a person in the frame.
This has a lot of
potential security applications where you
may not want to have someone,
say, watching a screen all the time.
As machine learning and
computer vision technologies get
better and more efficient,
we can start running these algorithms on
embedded systems. This allows us to create smart sensors
that are capable of making
decisions without needing to stream
raw video data out to
a more powerful computer all the time.
In this course, we'll start by going
over what makes up a digital image
and how we can use that information
as input to a neural network.
We'll also give a brief overview
of how neural networks operate.
But I highly recommend taking my introduction to
embedded Machine Learning course
first if you have not done so already,
to get more information about
general machine learning on embedded systems.
We'll start by going over image classification,
where we'll construct a simple neural network that
attempts to predict the main subject of the image.
Note that it won't be able to identify
multiple objects in that image
or tell us where they're located.
We'll use Edge Impulse to
create and train the model and then I'll
show you how to deploy it to
the OpenMV camera as well as a Raspberry Pi.
Note that in this course,
I recommend having some experience
with Python as I'd like to
have you do some work in
Google Colab to examine and manipulate images,
curate your datasets, and analyze your models.
My goal is to give you enough examples and reference
material so that you can
successfully complete any of the projects.
But knowing some Python
ahead of time will definitely help.
For the initial release of this course,
I plan to show the Raspberry Pi 4
and OpenMV camera H7 Plus.
You can get most of the examples from
this course working on either of these boards.
Note that deploying a model to
the Raspberry Pi requires writing code in Python,
and deploying a model to
the OpenMV requires writing MicroPython.
The OpenMV IDE with MicroPython also supports
the Arduino Portenta which
should work for most of this course.
However, note that the camera on
the Vision Shield only does grayscale,
which will limit the capabilities of some vision models.
Once we have a trained model,
I'll show you how to write some MicroPython code for
the OpenMV or Python code for
the Raspberry Pi to use the model.
You're welcome to try using other boards for this course,
but I likely won't be able to help you
troubleshoot any issues you might run across.
I recommend using the discussion forum
in this course to ask me
and fellow students about the content
and projects found in this course.
If you run into technical issues with Edge Impulse,
I highly recommend posting something to
their forums at forums.edgeimpulse.com.
If you ask in the discussion forum for this course,
there's a good chance I'm just going to copy
your message to the Edge Impulse forum anyway.
You will likely get a faster answer from
the Edge Impulse team if you post
there for technical help using their tool.
It should be possible to write
C++ programs to perform inference.
However, I found it a lot easier just to stick
with Python so we can focus on the concepts.
That being said, if I happen to
find some examples or create some that are
in C++ that run on something like
the Arduino Nano 33 or the Arduino Portenta,
I'll make sure to include them
in the recommended reading sections.
From there, we'll dive into
convolutional neural networks to see how they work and
why they make for better image classification models
than regular dense neural networks.
Finally, we'll look at
several popular object detection models that can be
used to locate and
identify more than one object in an image.
I'll show you how to train one such model to identify
objects of your choice and then
deploy it to an embedded system.
Note that at this time,
object detection models are still quite slow,
even on something like a Raspberry Pi 4.
At the initial release of this course,
the object detection model only runs
on single-board computers like the Pi.
Once there is support for them on microcontrollers,
I will update the course to hopefully show it
working on something like the OpenMV camera.
My goal is to give you
enough tools and knowledge so you can get started
creating your own embedded vision systems
with Machine Learning. Let's get started.
While I plan to cover most of
the topics in this course myself,
I've invited some guests to talk about
their active areas of research
and to showcase some projects.
Computer vision and machine learning
are very popular topics right now.
I think it'll be worthwhile to
see what some other people are working on.
Mat Kelcey is
an applied machine learning research engineer
at Edge Impulse.
He has previously worked at Amazon and Google,
and he has advised a number of
startup companies when it comes
to implementing machine learning.
His interests include
deep reinforcement learning for robotics,
information extraction and search ranking.
We'll hear from him about how features from models can be
reused to create self-supervised learning systems.
We'll also hear from Dmitry Maslov,
who is a computer vision engineer
with a background in machine learning and
robotics and who works for
the Seeed Studio single-board computer department.
He also runs the Hardware.ai YouTube channel,
which talks about various applications
of artificial intelligence in robots.
He'll give us a demonstration of
his latest project that uses
multi-stage inference to detect cars in a video,
and then identify the type of car.
This is me. I run
my own freelance and consulting business where I
make courses like this and help
companies create technical content.
I used to work at SparkFun Electronics
first as an engineer designing products,
and then as a content creator,
making videos and writing blogs.
I'm currently enjoying working with
embedded systems and machine learning to
teach these concepts as well
as make fun projects like this.
This is an OpenMV camera running
a machine learning model that looks
for a particular Lego brick.
It's a bit slow,
but the idea is to have something that would
save you time searching through such a pile.
It uses a convolutional neural network,
which is something we will cover in this course.
I hope that this project,
along with the projects of our guest instructors,
will inspire you to use
embedded computer vision in your next project.
Please note that a large amount of foundational math goes into training neural networks and using them for inference. This course does not assume a background in such knowledge, and as such, it is meant as a course in applying machine learning (using various tools and libraries) in embedded systems without needing to understand the finer details of neural networks (and other machine learning algorithms).
In this course, we will focus on applying machine learning tools, techniques, and models to computer vision problems.
We first review several concepts around neural networks, including training, evaluation, and deployment. We then dive into how convolutional neural networks operate in order to classify digital images. Finally, we cover object detection systems. You will have the opportunity to train, test, and deploy your own deep learning models to a microcontroller and/or single board computer to perform live image classification and object detection.
Syllabus
Here is a broad outline of the topics that will be covered in the course:
• What is computer vision (CV)?
• How can machine learning (ML) be used to accomplish CV tasks?
• Ethics and limitations of CV
• How digital images are created and stored
• How digital images can be manipulated and transformed in code
• Using embedded ML to solve CV problems
• Data collection and curation
• Using the Edge Impulse tool to create and train an embedded ML model
• Convolution as a way to filter digital images
• Pooling as a way to downsample digital images
• Using convolution, pooling, and dense neural networks to create a convolutional neural network (CNN)
• How CNNs can be used to classify digital images
• Training a CNN
• Deploying a CNN to an embedded system (microcontroller and/or single board computer)
• Performing continuous image classification using a CNN
• Data augmentation to increase the accuracy of an image classification model
• Transfer learning
• Object detection
• Evaluating an object detection model
• Image segmentation
Required Hardware
You are welcome to take this course without attempting the projects, as they are not graded. However, I highly recommend doing the projects (or at least running the provided solutions on your embedded system(s)) to get the most out of the course. In my experience, challenging yourself with hands-on projects is where the real learning occurs.
Here are your options for hardware (you only need to choose one of these options):
• None: you can use your computer and smartphone to capture images to complete some of the projects. You will not be able to complete any projects that require deploying machine learning models to an embedded system.
• OpenMV Camera: the best option for using a microcontroller for this course. It runs MicroPython, which makes completing the projects easier (as the syntax is the same as Python). I recommend the OpenMV H7 Plus model, but the OpenMV H7 should work for most projects (you will likely need a micro SD card). Important: at this time, the OpenMV Camera does NOT run object detection models, so you will not be able to complete the final project in the course.
• Raspberry Pi 4 with Pi Camera: the Raspberry Pi 4 is a single board computer that will work for all projects in the course. It supports full Python. Some webcams may work for the projects, but due to the variety of such cameras, I will not be able to help troubleshoot issues with them. As a result, I recommend using the official Pi Camera Module v2, and project solutions are written for the Pi Camera. Note that you will need a micro SD card, USB-C power cable and likely a keyboard, mouse, and monitor to use the Raspberry Pi.
Note that it might be possible to accomplish some or all projects in the course using hardware not listed above. However, I will likely not be able to help you troubleshoot issues if you use other hardware.
I chose to use primarily Python and MicroPython for the course so that we can focus on the concepts of computer vision and machine learning using a single language. Translating a Python (or MicroPython) program to C/C++ is possible, but it usually requires effort outside the scope of this course. You are welcome to try implementing some of the projects in C/C++, but I doubt I will be able to assist with any issues you run into.
If I am able to get the projects in the course to run on other boards (such as the Arduino Portenta, Arduino Nano 33 BLE Sense, ESP32 Cam, etc.), I will list them here and update the project descriptions.
I recommend searching on the following sites for the recommended hardware:
Global
• Seeed Studio
• Digi-Key Electronics
• Mouser Electronics
Australia
• Pakronics
India
• Fab.to.Lab
United Kingdom (UK)
• Cool Components LTD
United States (US)
• Adafruit
• SparkFun Electronics
• If you have any questions regarding the material and quizzes or you run into technical problems with the projects, I recommend searching in the Discussion Forums first to see if other students had the same question. If you do not find a satisfactory answer, please create a new post. I will try to answer within a few days.
• I also encourage you to help other students if you see an unanswered question in the forums and you know the answer!
• If you run into technical issues with the Edge Impulse tool, I recommend posting your question or issue to the Edge Impulse forum. There is a good chance that I will not be able to replicate your exact issue, as I do not have administrative access to the Edge Impulse tool (i.e. I cannot see your project). Additionally, I will likely copy-and-paste your question to that forum anyway, and the Edge Impulse staff is much faster (and more experienced) than I am at assisting people with such problems.
• Computer vision is the science and engineering of
• teaching computers to assign meaning to images and video.
• The idea of capturing
• an image has been around for a long time,
• and digital cameras have been around since the 1970s.
• To capture an image,
• we need some sensor.
• Most modern digital cameras have
• a complementary metal-oxide-semiconductor
• or CMOS image sensor.
• You'll sometimes find charge-coupled device sensors,
• but these are usually found in older digital cameras.
• Light entering the camera can be bent or
• refracted through lenses and
• possibly bounced off a mirror,
• as in the case of my DSLR camera.
• Either way, as light strikes the sensor,
• tiny sections of the sensor respond to
• the amount and color of that portion of light.
• Each of these tiny portions, known as pixels,
• generates an electrical signal
• proportional to the amount of color and light hitting it.
• A computer or microcontroller reads
• these electrical signals and stores them
• as numerical values in an array.
• Often, you'll find three arrays: one for red,
• one for green, and one for blue.
• These arrays, which represent the full-color photo
• that we just took, are saved to some non-volatile memory,
• such as an SD card.
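As a minimal sketch of that three-array layout (the tiny 2x2 image, the variable names, and the `pixel_at` helper are made up for illustration and are not from the lecture), each color channel can be held in its own two-dimensional array, and a pixel's full color is found by reading the same position in all three:

```python
# A tiny 2x2 color image stored as three separate channel arrays.
# Each value is the intensity of that color at one pixel
# (0 = none, 255 = full intensity). Values are made up for illustration.
red = [[255, 0], [0, 255]]
green = [[0, 255], [0, 255]]
blue = [[0, 0], [255, 255]]


def pixel_at(row, col):
    """Return the (R, G, B) tuple for the pixel at the given position."""
    return (red[row][col], green[row][col], blue[row][col])


height = len(red)
width = len(red[0])

# Combine the three channel arrays into one grid of (R, G, B) pixels.
pixels = [[pixel_at(r, c) for c in range(width)] for r in range(height)]

for row in pixels:
    print(row)
```

Here the top-left pixel is pure red, the top-right is pure green, the bottom-left is pure blue, and the bottom-right is white because all three channels are at full intensity.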
• Sometimes you'll find that these arrays are
• compressed in a way that saves storage space,
• even if it means losing some of the information in them.
• For example, JPEG images are compressed this way.
• We can then plug the SD card or maybe it's
• our phone into our computer and view the image.
• The computer knows how to read those stored arrays of
• numbers and convert their individual values
• into colors on our screen.
• This allows us to view the image we captured.
• The more pixels in the image,
• the greater the detail we can make out.
• This is not the only way to
• construct a digital image however.
• A variety of sensors can be used to create
• a digital representation of the world around us.
• An image sensor in a camera works with visible light,
• so it's similar to how we might see with our eyes.
• However, we can use
• infrared sensors to create an array of infrared values,
• which is great for seeing things at night or
• looking at the relative temperature of objects.
• We can also use things like radar to get an idea
• of how far away things are or map out terrain,
• or maybe we use something like
• ultrasound to get a cross-section inside our bodies.
• In all of these cases,
• we are producing digital images.
• While this hopefully gives you an idea of
• how digital images are captured and stored,
• simply recording something isn't computer vision.
• In all these cases,
• it requires a human to interpret the digital images.
• Computer vision is when we have a computer automatically
• interpret and assign meaning
• to images or parts of an image.
• For example, we can use
• computer vision to locate the trees in this photo.
• Or maybe we automatically identify
• potential energy leaks in
• a home from a captured infrared image.
• Computer vision might be able to identify
• dangerous lava flow routes from a terrain map.
• Health care workers could rely on computer vision to
• automatically identify
• potential issues when taking X-rays,
• ultrasounds, or CAT scans.
• I'm not a doctor or ultrasound tech,
• so I don't actually know what I'm
• looking at in this particular image,
• but you get the idea.
• Most people credit Larry Roberts as
• being the founder of the field of computer vision.
• His 1963 PhD thesis,
• "Machine Perception of Three-Dimensional Solids,"
• proposes methods for extracting information about
• 3D objects from a simple two-dimensional image or photo.
• From here, a whole field of study was born that attempts
• to automate the process of
• extracting meaning from images.
• Throughout the 1970s,
• the British neuroscientist, David Marr,
• published a number of papers that describe
• how images captured by two eyes can
• be constructed into three-dimensional representations
• of scenes in the brains of living creatures.
• From there, researchers have worked
• to automate this process in computers.
• For example, we can use two cameras mounted
• a fixed distance from each other
• to take photos of the same scene.
• These photos will be ever so slightly different
• from each other thanks to how the cameras are separated,
• much like human eyes.
• Here is an example taken from
• this ArduCam stereo HAT for Raspberry Pi.
• The two grayscale images are from each of the cameras.
• With some math, it produces the image on the left.
• Greens and blues are objects that are farther
• away and the orange and red blobs
• are objects that are closer to the cameras.
• This is known as a depth map and
• it helps us figure out where objects are in
• relation to the cameras without relying on
• distance sensors like ultrasound or LiDAR.
• The process of extracting
• three-dimensional information using a pair of
• cameras set at a fixed distance from
• each other is known as stereoscopic vision.
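The "math" behind a depth map can be sketched with the standard pinhole stereo relationship: for two aligned cameras, distance is roughly focal length times baseline divided by disparity, where disparity is how many pixels a point shifts between the two images. The focal length and baseline below are hypothetical numbers chosen for illustration, not the specifications of any particular stereo camera, and real systems also need calibration and pixel matching that this sketch leaves out.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Estimate the distance (in meters) to a point seen by two aligned cameras.

    disparity_px: horizontal shift of the point between the two images, in pixels
    focal_length_px: camera focal length expressed in pixels
    baseline_m: distance between the two cameras, in meters
    """
    if disparity_px <= 0:
        # No measurable shift: the point is too far away to estimate.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera parameters (assumptions for illustration only).
FOCAL_LENGTH_PX = 700.0
BASELINE_M = 0.06

# A large shift between the two images means the object is close.
near = depth_from_disparity(42.0, FOCAL_LENGTH_PX, BASELINE_M)
# A small shift means the object is far away.
far = depth_from_disparity(7.0, FOCAL_LENGTH_PX, BASELINE_M)

print(f"near object: about {near:.2f} m, far object: about {far:.2f} m")
```

Repeating this calculation for every matched pixel is what produces the colored depth map: each pixel's color encodes its estimated distance.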
• Another common objective in
• computer vision is to find
• the boundaries between objects.
• This is often accomplished
• using edge detection algorithms,
• which filter an image and
• output one or more images such as these.
• You can see how only the edges of
• the objects in the photo are shown,
• much like someone drawing a sketch of the scene.
• You can choose to pick up more or less detail in
• the edges depending on
• the particular algorithm and parameters used.
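To make the idea of filtering an image for edges concrete, here is a minimal sketch using the Sobel operator, which is one common edge detection filter. I am not claiming it is the algorithm used to produce the images in the lecture, and the `sobel_edges` function name, the test image, and the threshold value are all made up for illustration.

```python
def sobel_edges(image, threshold):
    """Return a binary edge map (1 = edge) for a 2D list of grayscale values.

    Border pixels are left as 0 because the 3x3 filter does not fit there.
    """
    # Sobel kernels: kx responds to left-right changes, ky to up-down changes.
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = 0
            gy = 0
            for i in range(3):
                for j in range(3):
                    value = image[r + i - 1][c + j - 1]
                    gx += kx[i][j] * value
                    gy += ky[i][j] * value
            # The gradient magnitude is large where brightness changes quickly.
            magnitude = (gx * gx + gy * gy) ** 0.5
            edges[r][c] = 1 if magnitude >= threshold else 0
    return edges


# A 5x6 test image: dark on the left half, bright on the right half.
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel_edges(image, threshold=200)

for row in edges:
    print(row)
```

Only the two columns on either side of the dark-to-bright boundary are marked as edges. Lowering the threshold picks up fainter changes in brightness, which is one way the parameters control how much detail ends up in the edge image.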
• Image segmentation is another popular area
• of study in computer vision.
• Various algorithms exist to help divide a picture into
• various parts or objects to
• assist in providing meaning to that image.
• The goal of most image
• segmentation algorithms is to assign
• a value to each pixel
• and group associated pixels together.
• These groupings can be colored and redrawn as shown in
• the right image to help detect or
• classify objects in that image.
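One very simple way to "assign a value to each pixel and group associated pixels together" is to threshold the image and then label each connected group of bright pixels. This is only a sketch of the idea under that assumption. Practical segmentation algorithms are much more sophisticated, and the `label_regions` function and test image are invented for illustration.

```python
from collections import deque


def label_regions(image, threshold):
    """Label every pixel: 0 for background, 1..N for each connected bright group.

    Pixels are grouped if they are brighter than the threshold and touch
    each other horizontally or vertically.
    """
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if image[r][c] <= threshold or labels[r][c] != 0:
                continue  # background pixel, or already part of a group
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            # Spread the new label to all connected bright pixels.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (
                        0 <= ny < height
                        and 0 <= nx < width
                        and image[ny][nx] > threshold
                        and labels[ny][nx] == 0
                    ):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# A small grayscale image with two separate bright blobs.
image = [
    [200, 200, 0, 0, 0],
    [200, 200, 0, 180, 180],
    [0, 0, 0, 180, 180],
]
labels, count = label_regions(image, threshold=128)

print(f"found {count} regions")
for row in labels:
    print(row)
```

Each label could then be drawn in its own color to produce the kind of redrawn, color-coded image described above.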
• While all of these are great examples of computer vision,
• we haven't yet seen how
• machine learning fits into the picture.
• As I just showed,
• computer vision is not
• the same thing as machine learning.
• However, machine learning can be
• a very useful tool for computer vision,
• and computer vision can be
• a very useful tool for machine learning.
• Both fields are usually considered
• to be a part of artificial intelligence.
• They are different from each other,
• but there is some overlap.
• In this course, we will focus on
• using machine learning to accomplish
• computer vision goals but there are plenty of
• things in computer vision that we will not cover.
• Specifically, we will go over
• image classification and
• object detection using neural networks.
• Image classification is the process of
• attempting to comprehend an entire image.
• For example, we might train a classifier to recognize
• the first image as that of a dog
• and the second image as that of a cat.
• It would not be able to tell you
• where in the image each animal was found,
• just that the image contained that animal.
• However, it would likely
• struggle with an image like this,
• which contains instances of both animals.
• It would make a guess based on
• prominent features in the image
• and depend on where the model
• looks in the image for those features.
• Object detection is
• a harder problem than image classification,
• but it allows us to identify things
• in a picture and where they are located.
• It also allows us to
• identify more than one object per image,
• which is a big limitation of image classification.
• OpenMV comes with a person detection example.
• Here, if the camera thinks
• there is a person in the frame,
• it will update the label in
• the output image to show that.
• This could be useful for determining
• if someone is at your front door or
• maybe monitoring a room to
• automatically control the lights and air conditioning.
• Now, let's say you've been
• tasked with designing a new smart lighting,
• heating, ventilation,
• and air conditioning system for an office building.
• Rather than old passive infrared sensors,
• you've decided to deploy person detection cameras,
• which you found to be much more reliable at
• determining when someone is actually in the room.
• However, you need 30 of them
• to cover all the office spaces.
• You could stream all this video data to
• a central server on the network or across the Internet.
• This server would be in charge of doing
• the vision processing to determine if
• a person was in each frame.
• Let's calculate what kind of
• bandwidth you might require to do that.
• We'll assume each camera needs a modest 240 by
• 240 pixel resolution to
• correctly identify people in a frame.
• We don't need color, so each frame is
• a grayscale image where each pixel is an eight-bit value.
• We'll need 30 cameras,
• and we'll say that each camera really only
• needs to take a photo once every second.
• This isn't a live video stream,
• we just want to know if someone is in the room every
• second to make changes to
• the lights and air conditioning systems.
• Under these conditions, we'd need around 13.8 megabits
• per second of network capacity
• devoted entirely to this new sensor system.
• Of course, there are ways to
• compress the images to reduce this sum,
• but you get the idea.
• While modern Wi-Fi can support
• at least 10 times that amount,
• it still seems like a big waste.
• Alternatively, we could move
• that classification problem to the cameras themselves.
• These smart cameras could be just like
• the OpenMV camera demo I showed you a moment ago.
• Each camera would perform
• whatever inference was necessary to determine
• if a person was in the frame
• and just send that result to the server.
• Now, we essentially need one bit for that value.
• Was a person in the frame or not?
• Thirty bits per second is a lot less
• than 13.8 megabits per second.
• I'm making some assumptions about
• minimum packet length and message headers,
• but you get the idea.
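The bandwidth comparison can be checked with a few lines of arithmetic. The numbers come straight from the scenario above (240 by 240 pixels, 8 bits per pixel, 30 cameras, one frame per second); the variable names are my own, and like the lecture, this sketch ignores packet headers and compression.

```python
# Scenario values from the example above.
WIDTH_PX = 240
HEIGHT_PX = 240
BITS_PER_PIXEL = 8  # grayscale, one byte per pixel
NUM_CAMERAS = 30
FRAMES_PER_SECOND = 1

# Option 1: stream every raw frame to a central server.
bits_per_frame = WIDTH_PX * HEIGHT_PX * BITS_PER_PIXEL
streaming_bps = bits_per_frame * NUM_CAMERAS * FRAMES_PER_SECOND

# Option 2: run inference on each camera and send a 1-bit result.
edge_bps = 1 * NUM_CAMERAS * FRAMES_PER_SECOND

print(f"Streaming raw frames: {streaming_bps / 1e6:.1f} Mbit/s")
print(f"Sending results only: {edge_bps} bit/s")
print(f"Reduction factor: {streaming_bps // edge_bps:,}x")
```

Each frame is 460,800 bits, so 30 cameras at one frame per second add up to 13,824,000 bits per second, which is where the 13.8 megabits per second figure comes from.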
• As you can see, using microcontrollers or
• low-power computers can save
• bandwidth and processing power on remote servers.
• This form of embedded computer vision
• offers an alternative to streaming
• raw or even compressed data to
• a centralized location for processing.
• Self-driving cars rely on a variety of sensors
• to help them navigate the roads and avoid collisions.
• Most use cameras with a combination of
• radar or distance sensors to help
• create a clear picture of objects in the distance.
• The car needs to use computer vision and
• likely some machine learning
• to figure out what's around it.
• For example, it needs to be able to
• read road signs and signals,
• watch for other cars and avoid pedestrians.
• Object detection can help
• the car see these things and take
• appropriate actions like turning on
• a light or stopping for pedestrians.
• You can't necessarily guarantee
• an Internet connection in cars,
• so a lot of this has to be
• computed in the car's computer.
• While a car can transport a powerful computer,
• it's still somewhat of an embedded system.
• I hope this helps illustrate the need for using
• embedded machine learning to tackle
• some computer vision problems.
• Alex Fred-Ojala talked about the ethics of
• data acquisition and machine learning
• algorithms in the introductory course.
• Let's revisit that topic
• and see how it applies to computer vision.
• Alex talked about the three pillars that help
• create trust in an artificial intelligence system.
• The system should follow laws,
• be robust against any sort of attack,
• and guarantee high reliability.
• It should also guarantee fairness by
• not promoting any sort of discrimination,
• biases or social injustices.
• For example, here is a tweet that went viral in 2017
• showing how a soap dispenser
• struggles to work with a dark-skinned individual.
• Bias in a soap dispenser is pretty benign and that
• soap dispenser likely wasn't
• using machine learning anyway.
• However, you can see how this might be a problem for
• critical computer vision systems like self-driving cars.
• If you are designing a system to
• work with and for people,
• make sure you take everyone into
• account when training and testing the model,
• not just people who look like you.
• There's also the notion of privacy.
• Are you creating a new type of
• smart security camera that, say,
• records every face that walks by a corner?
• Let's say you then attempt to identify each person by
• matching their face to
• available photos on their social media accounts.
• Even if this is legal,
• you have to consider the privacy implications
• of this type of project.
• Do these people consent to having their faces and
• possibly names recorded whenever they walk by the corner?
• For more information on ethical and trustworthy AI,
• I highly recommend checking out
• the European Union's AI Alliance page.
• They offer some good guidance on
• the various factors that make up an ethical AI system.
• Licenses.ai has some good templates
• for creating end-user license agreements.
• These cover various ethical concerns
• that someone creating
• an AI system might have and hopefully prevent its misuse.
• Edge Impulse uses a similar responsible AI license
• that outlines how you may or may not use their tool.
• I definitely recommend reading
• through this license before getting started.
• I hope this has helped give you
• an understanding of how embedded machine learning
• can fit into computer vision and how you
• can use it to create responsible AI systems.
• Now I think it's time we dive into some technical stuff.
• Before we go over using images with machine learning,
• I'd like to cover some concepts about how
• digital images are made and stored on your computer.
• Some of you may be familiar with these concepts already,
• but I find that it provides
• a useful vocabulary when working with images.
• Let's start with a simple grayscale photo.
• Then let's zoom way
• in on a portion of this elephant's ear.
• Digital photos are made up of a grid of
• simple building blocks known as
• picture elements or pixels for short.
• This grid of pixels can be expressed
• as a simple two-dimensional array of values.
• Let's take an even smaller subset
• of these pixels to examine.
• One way to express these pixel values is by
• using a number between zero and one,
• where zero is black and one is white.
• This could be interpreted as the amount of
• light being given off or reflected by each pixel.
• White is 100 percent or one.
• However, storing and doing math with
• floating point numbers like this
• is often difficult for computers,
• especially low power devices like microcontrollers.
• So one way to handle that is to quantize
• these values to some integer values
• that fit nicely into bytes.
• For example, we can quantize
• those 0-1 percentage values to one byte or eight bits.
• Now, zero is black and 255 is white.
• However, this means that
• only 256 shades of
• gray can be represented by these values.
• This is known as bit depth or color depth.
• Each pixel or element of this array is
• an eight-bit number that
• describes the shade of gray to be displayed.
• Higher bit depth for these grayscale images
• means that more shades of gray can be displayed.
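As a minimal sketch of the quantization step described above, the snippet below maps values on the 0-1 scale to one byte per pixel. The pixel values and the names `pixels_float` and `pixels_u8` are made up for illustration.

```python
import numpy as np

# Hypothetical pixel values on the 0-1 scale (0 = black, 1 = white).
pixels_float = np.array([[0.0, 0.25, 0.5],
                         [0.75, 0.9, 1.0]])

# Scale to 0-255, round to the nearest integer, and store one byte per pixel.
pixels_u8 = np.round(pixels_float * 255).astype(np.uint8)

print(pixels_u8)
print("bytes per pixel:", pixels_u8.itemsize)
print("shades available at 8 bits:", 2 ** 8)
```

Rounding throws away the fine differences between nearby floating point values, which is the price of fitting each pixel into a single byte.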
• Remember that we were only looking at
• a small piece of the whole image.
• The original image contains 2,290
• pixel columns and 1,487 pixel rows.
• In other words, the image is 2,290
• pixels wide and 1,487 pixels high.
• These dimensions are known as
• the resolution of the image.
• You will almost always see
• resolution expressed as width by height.
• You can find how much space this raw photo
• would take up by multiplying the width by the height,
• by the number of bytes per pixel.
• If stored raw, this photo would
• need around 3.4 megabytes.
• However, most image formats need
• some bytes reserved for header information,
• so this might be higher.
• Additionally, many image formats like
• JPEG use one or more algorithms to compress the image,
• resulting in a smaller file size.
• Lossy compression like JPEG
• can lose some information in the data,
• resulting in a slightly imperfect picture.
• We won't get into compression in this course,
• but know that it's what allows us to store
• digital images in smaller files,
• than what we calculated here.
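The raw size calculation above can be checked with a few lines of arithmetic. This sketch assumes one byte per pixel (8-bit grayscale) and ignores any header bytes.

```python
# Resolution of the elephant photo, width by height.
width, height = 2290, 1487

# 8-bit grayscale: one byte per pixel.
bytes_per_pixel = 1

raw_bytes = width * height * bytes_per_pixel
print(raw_bytes)                         # 3405230
print(round(raw_bytes / 1_000_000, 1))   # 3.4 (megabytes)
```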
• Let's examine our five by
• four segment of the elephant's ear using Python.
• We'll use NumPy to store our number arrays.
• NumPy is incredibly popular
• in the machine learning community as
• it's free and offers
• efficient ways to perform matrix operations.
• We'll also need pyplot from
• the matplotlib library to view our array as an image.
• PIL, short for Python Imaging Library,
• is a common Python library used
• to read and write various image files.
• All three of these packages
• should come pre-installed in Colab.
• Remember that you can press "Shift
• Enter" to run a cell in Colab.
• Next, I'll upload the image.
• I'll open the image in an editing program.
• I cropped out the tiny section
• from the elephant photo that we looked at earlier.
• I saved this test image in bitmap or
• BMP format with a bit depth of eight bits.
• The bitmap format is not compressed,
• so it's useful for storing and working with raw images.
• If I zoom in, you can see that it really
• is just a five by four grayscale image.
• In Colab, I can click on the
• "File browser" and click the "Upload" button.
• I then find the five by four
• bitmap image and click "Open".
• If we go up one folder,
• you can see that we're working with
• a Linux instance on a remote computer.
• Colab is limited to
• essentially Python and a few system calls,
• but it's very helpful for working with
• things like TensorFlow for machine learning.
• Our files will be stored in the content directory.
• I save the path to
• the uploaded file in this image path variable.
• Note that you can also right-click on
• the file and select "Copy path".
• Next, I use PIL to open the image.
• PIL attempts to automatically
• scale the values in the pixels.
• We need to call the convert function with
• the L parameter to keep them
• in the eight-bit gray-scale format.
• This image object has
• some extraneous information as
• it's unique to the PIL library.
• However, we can call the NumPy asarray
• function to convert it to a NumPy array.
• All NumPy arrays have a dot shape
• attribute that we can
• print to see the shape of the array.
• Even though image resolution is given as width by height,
• two-dimensional NumPy array shapes
• are given as number of rows first,
• followed by number of columns.
• This means arrays are given as height by width.
• If I talk about image resolution,
• it will be width by height.
• If I talk about two-dimensional NumPy arrays,
• it will be height first, then width.
• When we print the array,
• you can see the pixel values that we saw earlier.
• These go between zero and 255.
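The loading steps above might look roughly like this. It is a sketch, not the course notebook: instead of the uploaded elephant crop, it writes a stand-in five by four bitmap with made-up pixel values to a temporary folder, and `image_path` points at that stand-in file.

```python
import os
import tempfile

import numpy as np
from PIL import Image

# Stand-in for the uploaded file: 4 rows by 5 columns of made-up gray values.
sample = np.arange(20, dtype=np.uint8).reshape(4, 5) * 12
image_path = os.path.join(tempfile.mkdtemp(), "test_5x4.bmp")
Image.fromarray(sample).save(image_path)

# Open the image and keep it in 8-bit grayscale ("L") format.
img = Image.open(image_path).convert("L")
print(img.size)          # PIL reports (width, height): (5, 4)

# Convert the PIL image object to a NumPy array.
img_array = np.asarray(img)
print(img_array.shape)   # NumPy reports (rows, columns): (4, 5)
print(img_array)
```

Note how the same image is 5 by 4 as a resolution but has shape (4, 5) as an array.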
• We can also normalize
• the array by dividing all the values by
• 255 to convert the pixels to that 0-1 scale.
• You normally don't want to store
• an image in this floating point format,
• but this will be helpful later when
• working with some neural network inputs.
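Here is a small sketch of that normalization, using a made-up array in place of the elephant pixels. It also prints the memory used by each version, which hints at why the floating point form is not a good storage format.

```python
import numpy as np

# Hypothetical 8-bit grayscale pixel values.
img_array = np.array([[0, 64, 128],
                      [191, 230, 255]], dtype=np.uint8)

# Divide by 255 to move from the 0-255 scale to the 0-1 scale.
img_norm = img_array / 255.0

print(img_norm.dtype)                     # float64
print(img_norm.min(), img_norm.max())     # 0.0 1.0
print(img_array.nbytes, img_norm.nbytes)  # 6 48
```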
• Finally, we can use pyplot to draw the image for us.
• Note that we want to draw the eight bit grayscale image,
• and we need to tell imshow to
• use the grayscale map for drawing and that it should
• expect a minimum value of zero and
• a maximum value of 255 for each pixel.
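A sketch of the drawing step follows, with made-up pixel values. Without `vmin` and `vmax`, `imshow` stretches the color map between the smallest and largest values in the array, so the darkest pixel would be drawn as pure black even if it is mid-gray. The `Agg` backend line and the output file name are assumptions so the snippet runs without a display; in a notebook you can drop them and the figure appears inline.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: no display available; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 4 rows by 5 columns of 8-bit gray values (60 to 155).
img_array = np.arange(60, 160, 5, dtype=np.uint8).reshape(4, 5)

fig, ax = plt.subplots()
# Use the grayscale map and pin the scale to the full 8-bit range.
im = ax.imshow(img_array, cmap="gray", vmin=0, vmax=255)

# Save the drawing to a temporary file instead of showing a window.
out_path = os.path.join(tempfile.mkdtemp(), "preview.png")
fig.savefig(out_path)
```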
• For the first couple modules in this course,
• we will stick to grayscale images.
• In many computer vision applications,
• you will find that color is probably not necessary.
• However, in some cases,
• it is necessary as it can
• convey extra information about the image.
• Let's zoom in on a section of this sea turtle.
• Here, we have another five by four section of pixels,
• but they're in color this time.
• Instead of a bit depth of eight bits,
• each pixel now contains 24 bits of information.
• Eight bits describe the amount of red in the pixel,
• eight bits are for green,
• and eight bits are for blue.
• Now, each pixel has three bytes needed to describe it.
• You can see that the bluish pixels have
• more of the blue channel than the others,
• and the reddish ones have
• more of the red channel present.
• As with the grayscale images,
• we can use more bits to
• describe colors than what we're showing here,
• but you'll often run into
• three bytes per pixel for many color images.
• Sometimes, you'll see an Alpha channel present.
• This determines the transparency of each pixel,
• and is common in image formats like PNG.
• We won't need to worry about
• the Alpha channel for this course.
• The red, green, blue,
• or RGB color model
• uses additive light to describe colors.
• The higher the value of one of those color channels,
• the more light is emitted in that color.
• We can combine the three different colors
• to produce any other color.
• When all three are at their max,
• they combine to create white.
• Computers use this model to
• interpret RGB images and then light up
• pixels on our monitors to
• display images in a variety of colors.
• I created a Colab script to load a color image,
• just like we did for the grayscale image.
• However, I'm using that five-by-four pixel sample
• from the edge of the turtle shell.
• I use PIL to open the image,
• but I need to convert it to RGB format this time,
• then I convert it to a NumPy array.
• The first three elements are the red,
• green, and blue values of the first pixel.
• The next three elements belong to
• the second pixel in the first row.
• This group describes the five pixels in the first row.
• This continues to the last row of pixels.
• Here, you can see that the final pixel
• has more red than green or blue.
• We'll verify that in a minute.
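To make the layout of the color array concrete, here is a sketch with a made-up 4 row by 5 column RGB array. The two pixel values are invented for illustration and are not the turtle's actual pixels.

```python
import numpy as np

# Hypothetical color image: 4 rows, 5 columns, 3 channels (R, G, B).
rgb = np.zeros((4, 5, 3), dtype=np.uint8)
rgb[0, 0] = (30, 60, 200)   # first pixel, bluish
rgb[3, 4] = (210, 80, 40)   # final pixel, reddish

print(rgb.shape)    # (4, 5, 3): height, width, channels
print(rgb[0, 0])    # red, green, blue values of the first pixel
print(rgb[0])       # the five pixels in the first row
print(rgb[3, 4])    # the final pixel
print(rgb.nbytes)   # 60: three bytes for each of the 20 pixels
```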
• Now, let's draw the channels separately.
• You could extract each channel and plot it,
• but Matplotlib has a habit of
• coloring grayscale images in an odd manner.
• We're going to create three copies of the original array.
• In the first, we'll set all of
• the green and blue values to zero.
• In the second, we set all of
• the red and blue values to zero.
• Then in the third, we set red and green to zero.
• Notice that I can index into the arrays as follows.
• A colon means give me everything from that axis.
• Colon, colon zero is
• a two-dimensional array containing
• all the values in the red channel.
• Colon colon one would be
• all the values in the green channel.
• Finally, we print the channels separately.
• You can see how there's more red in
• the bottom right and more blue in the top left.
• There's a bright stripe of green going
• diagonally from the bottom left to the top right.
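The channel split described above can be sketched as follows. The array here is filled with random values rather than the turtle pixels, and the variable names are my own.

```python
import numpy as np

# Hypothetical 4x5 color image with random 8-bit values.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)

# [:, :, 0] selects every row and column of channel 0 (red) as a 2D array.
red_channel = rgb[:, :, 0]
print(red_channel.shape)   # (4, 5)

# Three copies, each with the other two channels set to zero.
red_only = rgb.copy()
red_only[:, :, 1] = 0
red_only[:, :, 2] = 0

green_only = rgb.copy()
green_only[:, :, 0] = 0
green_only[:, :, 2] = 0

blue_only = rgb.copy()
blue_only[:, :, 0] = 0
blue_only[:, :, 1] = 0

# Adding the three single-channel images gives back the original image.
recombined = red_only + green_only + blue_only
print(np.array_equal(recombined, rgb))   # True
```

Keeping each copy as a full three-channel array means it can be drawn directly as an RGB image, so the red channel shows up in shades of red instead of an arbitrary color map.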
• Now, let's print all of these channels together.
• I hope you can see how
• those channels combine to form this image.
• There's more blue in the top left,
• mostly green in the middle diagonal,
• and a lot of red in the bottom right.
• I find it easiest to think about color images as
• a collection of three different two-dimensional arrays.
• When those arrays get combined,
• the computer is capable of producing
• nearly any color in the visible spectrum.
• As with grayscale images,
• there are lots of ways to compress them,
• but we won't get into that.
• Having the extra information in
• color channels can be useful but I recommend
• seeing if grayscale will meet your needs first as
• it uses less data and less computing power.
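As a rough illustration of the data savings, this sketch converts a made-up RGB array to grayscale with PIL and compares the number of bytes in each. The random pixel values are an assumption for demonstration only.

```python
import numpy as np
from PIL import Image

# Hypothetical 4x5 color image with random 8-bit values.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)

# Convert to 8-bit grayscale ("L") and back to a NumPy array.
gray = np.asarray(Image.fromarray(rgb).convert("L"))

print(rgb.shape, rgb.nbytes)     # (4, 5, 3) 60
print(gray.shape, gray.nbytes)   # (4, 5) 20
```

The grayscale version needs one third of the bytes, and a model that reads it has one third as many input values to process.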
• I hope this helps you get an idea of
• how images are stored on your computer.
In order to create an image classifier,
we first need to collect some data.
There are plenty of pre-made datasets
out there that include thousands of images,
but I encourage you to try collecting your own.
I will show you how to do this using the OpenMV camera,
as well as a smartphone,
but you are welcome to collect
digital images in any way you see fit.
The goal is the same.
You'll want to collect around 50 images of
the same object for each class you want to identify.
I recommend starting with
three or four classes so you can
see how to work with multiple classes.
You can choose to identify anything you want.
Clothing, fruit, animals, and so on.
There should be a large difference between the shapes of
the objects as our model will be fairly simple.
For example, the model might have
trouble classifying breeds of dogs,
but it has a good chance of working if it's trying to
pick between dog and cat classes.
Each photo needs to be scaled and
cropped to 96 by 96 pixels.
They can be colored or grayscale,
but we will ultimately convert everything to
grayscale to make the model
smaller and easier to understand.
We will also resize or scale these images to make
them smaller before feeding them to our neural network.
Additionally, you will want them in bitmap or PNG format,
as those are commonly used formats that store pixel data without loss
(bitmap is uncompressed, and PNG compression is lossless).
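If your photos come from a phone or another camera, one way to prepare them is sketched below with PIL. The helper `crop_and_scale` is a hypothetical name of my own, cropping around the center is an assumption (it only works well if the object is centered), and the gray 320 by 240 stand-in image takes the place of a real photo.

```python
import os
import tempfile

from PIL import Image


def crop_and_scale(img, size=96):
    """Center-crop an image to a square, then scale it to size x size."""
    width, height = img.size
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    square = img.crop((left, top, left + side, top + side))
    return square.resize((size, size))


# Stand-in for a real photo: a plain gray 320x240 color image.
photo = Image.new("RGB", (320, 240), (200, 200, 200))

small = crop_and_scale(photo)
print(small.size)   # (96, 96)

# Save as PNG so no image data is lost.
out_path = os.path.join(tempfile.mkdtemp(), "sample.png")
small.save(out_path)
print(Image.open(out_path).size)   # (96, 96)
```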
The object you're trying to identify
should be mostly centered in the image,
and take up a large portion of the frame.
Multiple photos should have
the same object in a similar position,
with similar lighting,
and the same background every time.
You'll also want to keep the camera at
about the same distance from the subject each time.
The background should be the same among all your classes.
If you want to train a model to identify something
in a variety of situations, lighting conditions,
and positions, you're going
to need a lot more than 50 images,
probably on the order of a few thousand.
You'll also likely need
a more complex model which we'll explore later.
But for now, try to keep everything about the same.
I collected photos of
a few different electronic components: a resistor,
a capacitor, a diode, and an LED.
I have 50 photos of each
stored in a folder named after the class.
Note that I only used one component for each.
I didn't try to use different sizes,
shapes, or colors of LEDs, for example.
Also, I highly recommend
collecting some photos of just the background.
This will be its own class.
Many times, you'll find that you want to
identify when something is in the frame or not,
such as detecting a person in a room.
You'll want photos of the empty room,
or of the white background with
no electronic components, in my case.
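A quick way to check a dataset laid out like this is to count the images in each class folder. The sketch below builds a throwaway folder tree with empty placeholder files so it can run anywhere; the folder names and the `dataset` directory are assumptions, so point `dataset_dir` at your own folder instead.

```python
import tempfile
from pathlib import Path

# Assumed class names: four components plus a background class.
class_names = ["background", "capacitor", "diode", "led", "resistor"]

# Build a placeholder dataset: one folder per class, 50 empty files in each.
dataset_dir = Path(tempfile.mkdtemp()) / "dataset"
for name in class_names:
    class_dir = dataset_dir / name
    class_dir.mkdir(parents=True)
    for i in range(50):
        (class_dir / f"{i}.png").touch()

# Count the PNG files in each class folder.
counts = {
    d.name: len(list(d.glob("*.png")))
    for d in sorted(dataset_dir.iterdir())
    if d.is_dir()
}
for name, count in counts.items():
    print(f"{name}: {count}")
```

If one class ends up with far fewer images than the others, it is worth collecting more before training.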
You are welcome to use my dataset
if you do not want to collect your own.
Head to
github.com/shawnhymel/computer-vision-with-embedded-machine-learning.
Click on the Datasets folder,
and download the electronic-components ZIP file.
BMP files are good for examining raw data,
but you'll ultimately need
the PNG files for uploading to Edge Impulse.
Unzip it somewhere on your computer.
Feel free to look through the folders.
Each folder has 50 color images in it.
I kept them in color in case you
wanted to try working with color images,
but we'll be converting them to grayscale in
a future project to train the actual classifier.
Note that the images are fairly
similar with little variation.
I tried to keep the component body
close to the center of the image.
The leads point either left or right.
To capture photos with the OpenMV or
Portenta, head to openmv.io.
Go to downloads, and download the latest OpenMV IDE.
Run the installer, accepting all the defaults.
If you're working with the OpenMV H7 basic model,
you'll want to use a microSD card as there's
not enough internal storage to store images.
Make sure it has been formatted
with the FAT32 file system.
Plug the SD card into the OpenMV camera,
and plug the board into your computer with a USB cable.
If you're finding that your photos are not in focus,
you can adjust the focus of the lens by
unscrewing the set screw and twisting the top.
This might take some experimentation
to get the images to look great.
In the same GitHub repo,
go to the Data Collection folder, OpenMV,
and view the raw code for
ImageCapture.py. Copy this code.
Paste the code into a new file in the OpenMV IDE.
Feel free to look through this code,
and see how we capture and store images to the SD card.
Note that we initialize the camera with a
320 by 240 QVGA resolution,
but we crop it to 96 by 96,
which is what ultimately gets displayed and stored.
Whenever we run this program,
it will show what the camera sees in the upper right,
count down from three,
then snap a photo,
which it saves to the internal storage or SD card.
Let's run it to collect a couple of samples.
Click the "Connect" button.
If asked, agree to update the firmware on your board.
Click the "Serial Terminal" button to
open a console connected to the board.
Click the "Run" button.
It will take a moment to initialize,
so use that time to frame your object.
You should see it count down from three.
When it reaches one,
it should flash black for a moment to
let you know that the photo is being saved.
It should print the name of