Motivation, Goals, & Thankyou

Motivation

I initially worked through the first half of the fastai course with my wonderful colleagues at work a year or so ago (sometime around the beginning of the pandemic, time is non-linear in my mind now). Whilst it was instructive and I learnt a lot from discussing and teaching each other what we learned each week, I made the cardinal mistake of spending far too much time reading and watching rather than doing and trying. It has left me with a shallower understanding of the concepts than I’d like and too little practical ability to exercise what I know. I am determined to work through the course differently this time, rather than MOOCing myself to death like I usually would.

The main motivation for working through the fastai course is pure enjoyment and love for deep learning and machine learning. I honestly think it’s the magic of the future, and it’s also re-ignited my love for math which was beaten out of me by demonic high school teachers. Secondly, I’d like to move professionally into a role more dedicated to machine learning (deep learning if I can) and putting those models into production to drive excellent products. I love my current role and I owe the mentors I’ve gotten to know over the last ~5 years for my love of software engineering (and the belief that it’s a high craft to hone over a lifetime), my data obsession, my desire (and practice) to continually learn, and my deep enjoyment of collaborating to build great products and services. However, I’ve spent many years deep in the bowels of a very large organisation (for Australia, we are a tiny nation full of spiders, not people) building platforms and tools for others to build products with, and I think I’m finally at a crossroads where I need to find a new challenge and domain to apply the skills I’ve got, as well as learn from & teach wonderful new people I’m yet to collaborate with. I’m keen to be closer to customers and get a better feedback signal on what I’m building, rather than constructing purely internal tooling that only makes sense or is useful in the context of the extremely bespoke organisation it powers.

Goals & Plan of Attack

Goal	Plan of Attack
Enjoy the time & Be Happy	Make sure to do problems and work that I’m excited about. It’s difficult to track, but it’s something I don’t want to forget and want to reflect and focus on, which is why I’m declaring it as the first goal here!
Drastically Reduce Procrastination	Use Pomodoro style apps (Focus Keeper) to use time effectively & do at least 25 minutes of study/work every day
Build a Portfolio of Work	Contribute regularly to the Fastai Discord/Forums/Study Group (fastbunnies, come join us on Discord), use active Kaggle competitions as data and problems to apply lessons to, and blog my work here

Thankyou

I’d also like to give a shoutout to a few people who have both inspired me to begin a journey like this and made a journey like this possible in the first place.

Jeremy Howard | fastai
- You are the creator of the software package, one of the authors of the fastai book (Sylvain, much love to you as well!), an incredible teacher of many topics (in particular the lessons on startups & APL), and the chief motivator of prospective students to just get engaged, learn, post about what they’re doing and write about what they learn (hopefully I’m taking on what you’re preaching by writing here?). None of what I’m doing and learning would be as easy, or even possible, without your work, thankyou.

Rachel Thomas | fastai
- Without your class on data ethics and exposing me to this entire problem space, I don’t know what would have made the penny drop for me. You inspired me and showed me awesome thinkers like Timnit Gebru, and got me interested in books like Weapons of Math Destruction by Cathy O’Neil. Thinking about data ethics and your talks have completely changed the way I view the data I work with at work and personally in my life. I think I will forever be a better data practitioner, and more conscious of what I’m doing as well as what others are doing with data, because of your work. You’ve also connected real problems I see in the world with the data that observes these problems and the models and data that cause these problems to the ethical and emotional experience of being a human. The welding of ethical philosophy and data work is a really powerful connection in my heart and mind that I’m thankful you showed me. Your talks were truly mind blowing and expanding! I’m also really excited to take your fastai linear algebra course once I’ve completed the fastai lectures and written them up here, thankyou.

Zach Mueller | fastai & huggingface
- Without your hand-holding, I wouldn’t have been confident enough to make the leap of contributing to open source. Your help and guidance with my first fastai contribution and PR solidified how great the fastai community is, and you actively participated in an important learning moment for me. I also wouldn’t feel like it’s possible to go from fastai to an incredible role like working at huggingface, and you’ve done it all while being a whole bunch younger than I am! I’m excited to take your walkwithfastai course after I’ve completed Rachel’s course and written about it here as well. You’re an inspiration, thankyou.

Radek Osmulski | NVIDIA & fastai
- Radek, your journey similar to Zach’s from fastai to NVIDIA is equally inspiring and your book Meta Learning feels like a blueprint for the lessons and behaviours I need to work on. I strongly connected to the chapters and personal writing you expressed, your kaggle notebooks are also awesome and I’ve already learnt so much from you despite not consuming much of your content. Similar to Jeremy and Zach, the inspiration to just get out there and build, write, contribute, and share has actively changed the way I see the world and I’m spending my days differently because of what you’ve written, thankyou.

Paul Kennedy | University of Technology Sydney
- Paul, without your Introduction to Data Mining course which I took in 2017 (if I’m honest I picked it on a whim thinking I’d learn something about marketing which I thought would be important, I’m happy I was so wrong) I wouldn’t have even started writing code professionally and I wouldn’t have dived deep into machine learning and data science like I have. Your first lesson explaining your cancer research using ML is still a vivid memory and a constant source of inspiration for me loving what I do and being so fascinated with data science and ML. That single elective completely changed my professional path and graduate program rotations at work. I totally shifted from being a business analyst (extremely useful and practical skills for shipping software in a big org but completely unsatisfying long term) to finding anyone and anywhere doing ‘data’, which led me to the early ‘big data platform’ and some of the best years of my life in learning. I hope one day to be able to pay back the inspiration or collaborate in some way, thankyou.

Tim Spencer | Westpac
- Tim, I probably wouldn’t have found fastai and so many wonderful data science writers and contributors out there without you flagging them to me. Almost all my favourite contributors that I follow are from tweets or blogs you’ve shared. You’ve got an awesomely practical mind and I’ve learnt so much from your working behaviours + passion and interest for doing good work and doing great data work. You’re a cool operator and I hope one day to have the poise and calmness of thought you ooze. So many of the people I’ve mentioned above are here because you put me onto them, and it’s difficult to overstate the impact that exposure has had despite the content not being yours. I genuinely aim to be an engineer and data professional like you over the next ~15 years and if you’re ever trying to build something and want help, please always feel comfortable giving me a buzz! Thankyou.

If anyone mentioned in this list ever reads this, please know you’ve had a demonstrable and incredibly large positive influence on my life. I owe you a debt, and if you’d ever feel comfortable reaching out to me, please do! I’d love to explain in more detail how thankful I am.

Chapter 2

Deploy and Serve Models Straight Away

Check out this notebook in Colab

Gathering Data

I would like to use the Kaggle competition that is currently running called “RSNA Screening Mammography Breast Cancer Detection” through this chapter. Hopefully I can apply what I learn in subsequent chapters to get better within the competition.

The first step is to get our data. I’m pretty sure I can use the Kaggle Python API to do this.

!kaggle competitions list

ref                                                                                 deadline             category            reward  teamCount  userHasEntered  
----------------------------------------------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
https://www.kaggle.com/competitions/nfl-player-contact-detection                    2023-03-01 23:59:00  Featured          $100,000        218           False  
https://www.kaggle.com/competitions/nfl-big-data-bowl-2023                          2023-01-09 23:59:00  Analytics         $100,000          0           False  
https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting       2023-03-14 23:59:00  Featured           $60,000        438           False  
https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations    2023-03-14 23:59:00  Featured           $55,000        152           False  
https://www.kaggle.com/competitions/santa-2022                                      2023-01-17 23:59:00  Featured           $50,000        575           False  
https://www.kaggle.com/competitions/rsna-breast-cancer-detection                    2023-02-27 23:59:00  Featured           $50,000        599            True  
https://www.kaggle.com/competitions/otto-recommender-system                         2023-01-31 23:59:00  Featured           $30,000       1740           False  
https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction           2023-01-03 23:59:00  Featured           $25,000       2409           False  
https://www.kaggle.com/competitions/g2net-detecting-continuous-gravitational-waves  2023-01-03 23:59:00  Research           $25,000        905           False  
https://www.kaggle.com/competitions/titanic                                         2030-01-01 00:00:00  Getting Started  Knowledge      13917           False  
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques     2030-01-01 00:00:00  Getting Started  Knowledge       4516            True  
https://www.kaggle.com/competitions/spaceship-titanic                               2030-01-01 00:00:00  Getting Started  Knowledge       2803           False  
https://www.kaggle.com/competitions/digit-recognizer                                2030-01-01 00:00:00  Getting Started  Knowledge       1287           False  
https://www.kaggle.com/competitions/nlp-getting-started                             2030-01-01 00:00:00  Getting Started  Knowledge        912           False  
https://www.kaggle.com/competitions/connectx                                        2030-01-01 00:00:00  Getting Started  Knowledge        204           False  
https://www.kaggle.com/competitions/tpu-getting-started                             2030-06-03 23:59:00  Getting Started  Knowledge        154           False  
https://www.kaggle.com/competitions/store-sales-time-series-forecasting             2030-06-30 23:59:00  Getting Started  Knowledge        909           False  
https://www.kaggle.com/competitions/gan-getting-started                             2030-07-01 23:59:00  Getting Started     Prizes         91            True  
https://www.kaggle.com/competitions/contradictory-my-dear-watson                    2030-07-01 23:59:00  Getting Started     Prizes         70           False

# If I wanted to download the data via the API I would run the command below, but it's ~300GB and I'd like to keep exploring in this notebook in the meantime,
# so I will set it to download in the background and continue with an example dataset that I build along with the book.

# !kaggle competitions download rsna-breast-cancer-detection -p ../data/rsna_data

Ok, upon realising it will take a day or so to download the competition data, I will work with a smaller dataset in the meantime! Let’s use sharks instead of the bears from the book example whilst I wait for my RSNA data to download.

import nbdev
from fastbook import *
from fastai.vision.widgets import *

results = search_images_ddg(term="great white shark")
results

(#200) ['https://cdn.mos.cms.futurecdn.net/VmUwgUhgCKRBcYx7YUwEMo-1200-80.jpg','https://markophotographer.com/wp-content/uploads/2020/01/IMG_0453-Edit-2-2_underwater.jpg','http://1.bp.blogspot.com/-wtbYOn_PpNw/UhzjGz2tycI/AAAAAAAAAlM/Lw4c8aYOKrI/s1600/Guadalupe+09-15-12+(17).jpg','https://mymodernmet.com/wp/wp-content/uploads/2019/01/cam-grant-ocean-ramsey-sharks-2.jpg','http://a.abcnews.com/images/US/gty_Great_white_shark_mm_150616_12x5_1600.jpg','https://d.ibtimes.co.uk/en/full/1534719/great-white-shark.jpg?w=400','http://www.bolman.nl/cgtalk/greatwhite.jpg','https://nautilusliveaboards.com/wp-content/uploads/2019/08/greatWhiteShark-006.jpg','https://i0.wp.com/techdrive.co/wp-content/uploads/2014/11/http-i.kinja-img.com-gawker-media-image-upload-s-z9723pk3-lknrncsznm2tianhf79x.jpg?ssl=1','https://3.bp.blogspot.com/-hyYrpZWawec/T71fpKSFIgI/AAAAAAAAGHA/nCRAKYPDKXk/s1600/great-white-shark_1200x900.jpg'...]

destination = "../data/sharks/great_white_shark.jpg"
download_url(results[0], destination)

109.08% [65536/60078 00:00<00:00]

Path('../data/sharks/great_white_shark.jpg')

im = Image.open(destination)
im.to_thumb(128,128)

Looks great, just like the book. I’m using the DuckDuckGo search instead of Bing, however, because it’s easier: no key nonsense to worry about.

shark_types = "great white", "bull", "hammerhead", "tiger", "lemon"

path = Path("../data/sharks")

path.resolve()

Path('C:/Users/Nick/Documents/GitHub/blog/data/sharks')

Path looks good, let’s download all our scary sharks.

%%time

# Make the parent folder first (parents=True also creates ../data if it is missing)
path.mkdir(parents=True, exist_ok=True)

for shark in shark_types:
    dest = path/shark
    dest.mkdir(exist_ok=True)
    results = search_images_ddg(f"{shark} shark")
    download_images(dest, urls=results)

CPU times: total: 26.6 s
Wall time: 4min 42s

files = get_image_files(path)
files

(#1901) [Path('../data/sharks/great_white_shark.jpg'),Path('../data/sharks/bull/00cfb435-c74c-4138-ace8-7b2ab0d6cf08.jpg'),Path('../data/sharks/bull/0115797a-c688-4de3-896a-3c7d8c25b1cc.png'),Path('../data/sharks/bull/0245c5d2-3027-4e37-bbaa-927e952ca432.jpg'),Path('../data/sharks/bull/026b853a-f302-4cc9-95e1-4a34373e06b2.jpg'),Path('../data/sharks/bull/02a94619-6eb7-4bc8-8c50-8bb2739b380e.jpg'),Path('../data/sharks/bull/037a15b9-ab9f-4fd1-aaad-fcf9d091978a.jpg'),Path('../data/sharks/bull/04101761-76cc-47d1-a8ad-107a0abb9a59.jpg'),Path('../data/sharks/bull/041d9f98-2b48-4e2e-8212-d0680efbfad4.jpg'),Path('../data/sharks/bull/05f4963c-f13b-40b2-897e-18e604265ff0.jpg')...]

failed = verify_images(files)
failed

(#20) [Path('../data/sharks/bull/08914ba7-3ba2-416c-bd86-b54ef59dc7af.jpg'),Path('../data/sharks/bull/1717e988-5f87-4d20-bc92-66971a27bf17.jpg'),Path('../data/sharks/bull/379b33db-180b-4fb2-bdb8-45038005238b.jpg'),Path('../data/sharks/bull/3c945e68-dc01-4d3d-a033-0183c9a27df1.jpg'),Path('../data/sharks/bull/6eebf1c1-cdcd-4b0a-99dd-ab27a44ab64e.jpg'),Path('../data/sharks/bull/71e9d83f-101b-442b-905c-bab7361c25fe.jpg'),Path('../data/sharks/bull/bd5ddb71-a8bd-4b38-a3ae-17c29101bb7d.jpg'),Path('../data/sharks/bull/c7315fa2-f6e0-4fde-9db9-4130e66652d1.jpg'),Path('../data/sharks/great white/66bf05ed-ca7c-4891-9c21-b92bc2cfa41c.JPG'),Path('../data/sharks/great white/837a7762-16d2-46b4-a087-8b0a31235573.jpg')...]

OK so at this point I can see there are some interesting methods like map being called directly on this failed object rather than using the Python builtin, and I’m curious as to what this failed object’s type actually is and what map does.

type(failed), doc(failed.map)

L.map

L.map(f, *args, gen=False, **kwargs)

Create new `L` with `f` applied to all `items`, passing `args` and `kwargs` to `f`

Show in docs

(fastcore.foundation.L, None)

This is cool, this fastcore ‘L’ type acts like a list but I can just call map on it and apply a given function to all the items in it. I suspect this is only scratching the surface of how it could be used, but that’s a helpful API compared to playing with Python’s builtin map function, although I’m fairly sure the syntax is very similar. Let’s also check out unlink, which we’re about to run over the failed images.

doc(Path.unlink)

Path.unlink

Path.unlink(missing_ok=False)

Remove this file or link. If the path is a directory, use rmdir() instead.

Ok simply a delete, easy as. So in this case, we have a bunch of items contained within failed.items where we will call unlink on each one

failed.items

[Path('../data/sharks/bull/08914ba7-3ba2-416c-bd86-b54ef59dc7af.jpg'),
 Path('../data/sharks/bull/1717e988-5f87-4d20-bc92-66971a27bf17.jpg'),
 Path('../data/sharks/bull/379b33db-180b-4fb2-bdb8-45038005238b.jpg'),
 Path('../data/sharks/bull/3c945e68-dc01-4d3d-a033-0183c9a27df1.jpg'),
 Path('../data/sharks/bull/6eebf1c1-cdcd-4b0a-99dd-ab27a44ab64e.jpg'),
 Path('../data/sharks/bull/71e9d83f-101b-442b-905c-bab7361c25fe.jpg'),
 Path('../data/sharks/bull/bd5ddb71-a8bd-4b38-a3ae-17c29101bb7d.jpg'),
 Path('../data/sharks/bull/c7315fa2-f6e0-4fde-9db9-4130e66652d1.jpg'),
 Path('../data/sharks/great white/66bf05ed-ca7c-4891-9c21-b92bc2cfa41c.JPG'),
 Path('../data/sharks/great white/837a7762-16d2-46b4-a087-8b0a31235573.jpg'),
 Path('../data/sharks/great white/9b5ae7d2-0009-4903-94d8-e63a074f392b.jpg'),
 Path('../data/sharks/hammerhead/28d0436f-e2ca-424c-bf44-a68135481744.jpg'),
 Path('../data/sharks/hammerhead/7015a66d-bbba-4023-84b7-0878fae0a23b.jpg'),
 Path('../data/sharks/hammerhead/788812c7-6c31-460f-9011-21b4dc7dd565.jpg'),
 Path('../data/sharks/hammerhead/87377226-84b6-4ff5-91da-5225f8fd8a00.jpg'),
 Path('../data/sharks/lemon/90dd9fe9-9a34-4fa1-a2e6-00cb3fc48db7.jpg'),
 Path('../data/sharks/tiger/5c77d674-883c-47f0-a57c-10af79ad6649.jpg'),
 Path('../data/sharks/tiger/66984b31-ce9a-4bf4-b5a8-e9a7f21de03e.jpg'),
 Path('../data/sharks/tiger/6f562632-15c2-4ceb-ae9c-0b8dceccca1f.jpg'),
 Path('../data/sharks/tiger/70cc5970-2afd-4c1b-b2b9-5865b385585a.jpg')]

failed.map(Path.unlink)

(#20) [None,None,None,None,None,None,None,None,None,None...]
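To convince myself what failed.map(Path.unlink) is doing, here is a minimal standard-library sketch of the same idea. It is not fastcore’s implementation: it uses Python’s builtin map, and the temp folder and file names are made up purely for illustration. One difference worth noticing is that the builtin map is lazy, so nothing is deleted until the iterator is consumed, whereas the output above shows L.map handing back a filled-in ‘L’ of None values straight away.

```python
import tempfile
from pathlib import Path

# Throwaway files standing in for the failed image paths (hypothetical names).
tmp_dir = Path(tempfile.mkdtemp())
paths = [tmp_dir / f"broken_{i}.jpg" for i in range(3)]
for p in paths:
    p.write_bytes(b"not really an image")

# The builtin map is lazy: creating it does not call Path.unlink yet.
lazy = map(Path.unlink, paths)
still_there = [p.exists() for p in paths]

# Consuming the iterator calls Path.unlink on every path; unlink returns None.
results = list(lazy)
remaining = [p for p in paths if p.exists()]

print(still_there)  # [True, True, True]
print(results)      # [None, None, None]
print(remaining)    # []

# Clean up the now-empty temp folder.
tmp_dir.rmdir()
```

The list of None values mirrors the (#20) [None, None, ...] output above: unlink has nothing to return, so the interesting part is the side effect of the files disappearing.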

Now if I verify the images again, I should turn up with an empty ‘L’. verify_images wants a list of files rather than a folder path, so I pass it get_image_files(path).

failed = verify_images(get_image_files(path))
failed

(#0) []

From Data to DataLoaders

Ok, now that we’ve gathered a bunch of data, we can move it into one of the key concepts and classes of fastai, the DataLoaders object. Let’s have a quick look at the docs.

doc(DataLoaders)

DataLoaders

DataLoaders(*loaders, path:str|pathlib.Path='.', device=None)

Basic wrapper around several `DataLoader`s.

Show in docs

doc(DataLoader)

DataLoader

DataLoader(dataset=None, bs=None, num_workers=0, pin_memory=False, timeout=0, batch_size=None, shuffle=False, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, pin_memory_device='', wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)

API compatible with PyTorch DataLoader, with a lot more callbacks and flexibility

Show in docs

Ok so this is directly a PyTorch concept that has been wrapped with extra functionality, let’s have a quick look at the PyTorch docs.

torch.utils.data.dataloader.DataLoader?

Init signature:
torch.utils.data.dataloader.DataLoader(
    dataset: torch.utils.data.dataset.Dataset[+T_co],
    batch_size: Optional[int] = 1,
    shuffle: Optional[bool] = None,
    sampler: Union[torch.utils.data.sampler.Sampler, Iterable, NoneType] = None,
    batch_sampler: Union[torch.utils.data.sampler.Sampler[Sequence], Iterable[Sequence], NoneType] = None,
    num_workers: int = 0,
    collate_fn: Optional[Callable[[List[~T]], Any]] = None,
    pin_memory: bool = False,
    drop_last: bool = False,
    timeout: float = 0,
    worker_init_fn: Optional[Callable[[int], NoneType]] = None,
    multiprocessing_context=None,
    generator=None,
    *,
    prefetch_factor: int = 2,
    persistent_workers: bool = False,
    pin_memory_device: str = '',
)
Docstring:     
Data loader. Combines a dataset and a sampler, and provides an iterable over
the given dataset.
The :class:`~torch.utils.data.DataLoader` supports both map-style and
iterable-style datasets with single- or multi-process loading, customizing
loading order and optional automatic batching (collation) and memory pinning.
See :py:mod:`torch.utils.data` documentation page for more details.
Args:
    dataset (Dataset): dataset from which to load the data.
    batch_size (int, optional): how many samples per batch to load
        (default: ``1``).
    shuffle (bool, optional): set to ``True`` to have the data reshuffled
        at every epoch (default: ``False``).
    sampler (Sampler or Iterable, optional): defines the strategy to draw
        samples from the dataset. Can be any ``Iterable`` with ``__len__``
        implemented. If specified, :attr:`shuffle` must not be specified.
    batch_sampler (Sampler or Iterable, optional): like :attr:`sampler`, but
        returns a batch of indices at a time. Mutually exclusive with
        :attr:`batch_size`, :attr:`shuffle`, :attr:`sampler`,
        and :attr:`drop_last`.
    num_workers (int, optional): how many subprocesses to use for data
        loading. ``0`` means that the data will be loaded in the main process.
        (default: ``0``)
    collate_fn (Callable, optional): merges a list of samples to form a
        mini-batch of Tensor(s).  Used when using batched loading from a
        map-style dataset.
    pin_memory (bool, optional): If ``True``, the data loader will copy Tensors
        into device/CUDA pinned memory before returning them.  If your data elements
        are a custom type, or your :attr:`collate_fn` returns a batch that is a custom type,
        see the example below.
    drop_last (bool, optional): set to ``True`` to drop the last incomplete batch,
        if the dataset size is not divisible by the batch size. If ``False`` and
        the size of dataset is not divisible by the batch size, then the last batch
        will be smaller. (default: ``False``)
    timeout (numeric, optional): if positive, the timeout value for collecting a batch
        from workers. Should always be non-negative. (default: ``0``)
    worker_init_fn (Callable, optional): If not ``None``, this will be called on each
        worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as
        input, after seeding and before data loading. (default: ``None``)
    generator (torch.Generator, optional): If not ``None``, this RNG will be used
        by RandomSampler to generate random indexes and multiprocessing to generate
        `base_seed` for workers. (default: ``None``)
    prefetch_factor (int, optional, keyword-only arg): Number of batches loaded
        in advance by each worker. ``2`` means there will be a total of
        2 * num_workers batches prefetched across all workers. (default: ``2``)
    persistent_workers (bool, optional): If ``True``, the data loader will not shutdown
        the worker processes after a dataset has been consumed once. This allows to
        maintain the workers `Dataset` instances alive. (default: ``False``)
    pin_memory_device (str, optional): the data loader will copy Tensors
        into device pinned memory before returning them if pin_memory is set to true.
.. warning:: If the ``spawn`` start method is used, :attr:`worker_init_fn`
             cannot be an unpicklable object, e.g., a lambda function. See
             :ref:`multiprocessing-best-practices` on more details related
             to multiprocessing in PyTorch.
.. warning:: ``len(dataloader)`` heuristic is based on the length of the sampler used.
             When :attr:`dataset` is an :class:`~torch.utils.data.IterableDataset`,
             it instead returns an estimate based on ``len(dataset) / batch_size``, with proper
             rounding depending on :attr:`drop_last`, regardless of multi-process loading
             configurations. This represents the best guess PyTorch can make because PyTorch
             trusts user :attr:`dataset` code in correctly handling multi-process
             loading to avoid duplicate data.
             However, if sharding results in multiple workers having incomplete last batches,
             this estimate can still be inaccurate, because (1) an otherwise complete batch can
             be broken into multiple ones and (2) more than one batch worth of samples can be
             dropped when :attr:`drop_last` is set. Unfortunately, PyTorch can not detect such
             cases in general.
             See `Dataset Types`_ for more details on these two types of datasets and how
             :class:`~torch.utils.data.IterableDataset` interacts with
             `Multi-process data loading`_.
.. warning:: See :ref:`reproducibility`, and :ref:`dataloader-workers-random-seed`, and
             :ref:`data-loading-randomness` notes for random seed related questions.
File:           c:\users\nick\anaconda3\envs\fastai\lib\site-packages\torch\utils\data\dataloader.py
Type:           type
Subclasses:

The important takeaway on my first read is that a DataLoader gives you a way to iterate over or sample from the dataset in multiple ways, whether that be in shuffled order, in batches, or one record at a time.
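To make that concrete for myself, here is a toy sketch of the batching part in plain Python. It is nothing like the real PyTorch implementation (no workers, samplers, collation or tensors), and iter_batches is a name I made up; the batch_size, shuffle and drop_last arguments just mirror the ones in the docstring above.

```python
import random
from typing import Iterator, List, Sequence


def iter_batches(dataset: Sequence, batch_size: int = 1, shuffle: bool = False,
                 drop_last: bool = False, seed: int = 0) -> Iterator[List]:
    """Yield lists of samples from `dataset`, one batch at a time."""
    indices = list(range(len(dataset)))
    if shuffle:
        # Shuffle the order the samples are visited in, not the dataset itself.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        if drop_last and len(batch_idx) < batch_size:
            break  # skip the final, incomplete batch
        yield [dataset[i] for i in batch_idx]


toy_dataset = list(range(10))
batches = list(iter_batches(toy_dataset, batch_size=4))
batches_dropped = list(iter_batches(toy_dataset, batch_size=4, drop_last=True))
batches_shuffled = list(iter_batches(toy_dataset, batch_size=4, shuffle=True))

print(batches)          # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(batches_dropped)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With 10 items and a batch size of 4, the last batch only has 2 items, which is exactly the case drop_last is there to handle.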

Then the concept of a DataBlock is introduced which is paraphrased as “a way to fully customise every stage of the creation of your DataLoaders.”

Let’s now look at the docs and make a DataLoaders for the shark dataset we downloaded.

doc(DataBlock)

DataBlock

DataBlock(blocks:list=None, dl_type:TfmdDL=None, getters:list=None, n_inp:int=None, item_tfms:list=None, batch_tfms:list=None, get_items=None, splitter=None, get_y=None, get_x=None)

Generic container to quickly build `Datasets` and `DataLoaders`.

Show in docs

sharks = DataBlock(blocks=(ImageBlock,CategoryBlock),
                  get_items=get_image_files,
                  splitter=RandomSplitter(valid_pct=0.2, seed=42),
                  get_y=parent_label,
                  item_tfms=Resize(128))

Let’s check out a couple more classes and functions we’re running into, like ImageBlock, CategoryBlock, get_image_files(), and parent_label().

doc(get_image_files)

get_image_files

get_image_files(path, recurse=True, folders=None)

Get image files in `path` recursively, only in `folders`, if specified.

Show in docs

Ok, looks like it recursively finds every image file under the given path, and those files become our items.

doc(parent_label)

parent_label

parent_label(o)

Label `item` with the parent folder name.

Show in docs

Ok so the labelling is done by the parent folder to ‘classify’ the images which makes sense in our situation as each shark type is in its own folder
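Here is a rough standard-library sketch of what I understand these two helpers to be doing. It is my own approximation rather than fastai’s code: find_image_files and label_from_parent are made-up names, and deciding what counts as an image via the mimetypes module is an assumption on my part. The folder layout mimics the shark folders, built in a temp directory.

```python
import mimetypes
import tempfile
from pathlib import Path

# Assumption: treat any extension the stdlib maps to an image/* MIME type as an image.
IMAGE_EXTENSIONS = {ext for ext, mime in mimetypes.types_map.items()
                    if mime.startswith("image/")}


def find_image_files(root):
    """Recursively collect files under `root` whose extension looks like an image."""
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS)


def label_from_parent(path):
    """Use the name of the containing folder as the label."""
    return Path(path).parent.name


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: one folder per shark type, plus a non-image file.
    for folder, name in [("bull", "a.jpg"), ("tiger", "b.png")]:
        (root / folder).mkdir()
        (root / folder / name).write_bytes(b"")
    (root / "notes.txt").write_text("not an image")

    found = find_image_files(root)
    names = [p.name for p in found]
    labels = [label_from_parent(p) for p in found]

print(names)   # ['a.jpg', 'b.png']
print(labels)  # ['bull', 'tiger']
```

The text file is ignored and each image picks up its folder name, which is why keeping one folder per shark type is all the labelling we need.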

doc(ImageBlock)

ImageBlock

ImageBlock(cls:fastai.vision.core.PILBase=<class 'fastai.vision.core.PILImage'>)

A `TransformBlock` for images of `cls`

Show in docs

ImageBlock??

Signature: ImageBlock(cls: 'PILBase' = <class 'fastai.vision.core.PILImage'>)
Source:   
def ImageBlock(cls:PILBase=PILImage):
    "A `TransformBlock` for images of `cls`"
    return TransformBlock(type_tfms=cls.create, batch_tfms=IntToFloatTensor)
File:      c:\users\nick\anaconda3\envs\fastai\lib\site-packages\fastai\vision\data.py
Type:      function

It returns a TransformBlock which we also haven’t seen yet

doc(TransformBlock)

TransformBlock

TransformBlock(type_tfms:list=None, item_tfms:list=None, batch_tfms:list=None, dl_type:fastai.data.core.TfmdDL=None, dls_kwargs:dict=None)

A basic wrapper that links defaults transforms for the data block API

Show in docs

Ok looks like this is a more generic type to connect transformations of some type that I’m not sure of yet to the DataBlocks. Not quite sure yet what these mean but march on we must.

doc(CategoryBlock)

CategoryBlock

CategoryBlock(vocab:list|pandas.core.series.Series=None, sort:bool=True, add_na:bool=False)

`TransformBlock` for single-label categorical targets

Show in docs

CategoryBlock?

Signature:
CategoryBlock(
    vocab: 'list | pd.Series' = None,
    sort: 'bool' = True,
    add_na: 'bool' = False,
)
Docstring: `TransformBlock` for single-label categorical targets
File:      c:\users\nick\anaconda3\envs\fastai\lib\site-packages\fastai\data\block.py
Type:      function

Looks like CategoryBlock also returns a TransformBlock, so once we figure that out it’ll be the same logic, just applied to a Category instead of an Image.

The book describes the ‘blocks’ keyword as a tuple of the independent and dependent variables, so we’re inputting an ImageBlock and outputting a CategoryBlock. The importance of the TransformBlock class isn’t obvious at the moment, but I’m hypothesising it’s a standard way to link A–>B. For example, in this case we want to link each ImageBlock–>CategoryBlock.

The splitter is random in this case as we don’t mind how we split our data into training and validation sets, but this isn’t always the case. Rachel Thomas has an awesome blog post on creating good validation sets.
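As a sanity check on what a seeded random split means, here is a small sketch using only Python’s random module. It is a stand-in for the idea rather than fastai’s RandomSplitter, and random_split is a made-up name; the valid_pct and seed arguments just mirror the ones passed to the DataBlock above.

```python
import random


def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the item indices and cut off the first `valid_pct` as validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(valid_pct * n_items)
    return indices[cut:], indices[:cut]  # (train indices, validation indices)


train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=42)
train_again, valid_again = random_split(10, valid_pct=0.2, seed=42)

print(len(train_idx), len(valid_idx))  # 8 2
print(valid_idx == valid_again)        # True
```

Fixing the seed means the same images land in the validation set every time the cell is re-run, so the error rate stays comparable between experiments.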

The item_tfms keyword is to manage transformations we need to apply on our input (images in this case) and we need to make sure all our images are the same size to feed my tiny baby GPU in batches. This blog & 2023’s work will hopefully fry my GeForce GTX 960 so that I can have an excuse for a new PC build, but for now my 2GB of 2015 memory will have to do.

Ok we should now be ready to create our DataLoaders object, using the path object we created our dataset with.

path

Path('../data/sharks')

dls = sharks.dataloaders(path)

doc(sharks.dataloaders)

DataBlock.dataloaders

DataBlock.dataloaders(source, path:str='.', verbose:bool=False, bs:int=64, shuffle:bool=False, num_workers:int=None, do_setup:bool=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, pin_memory_device='', wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)

Create a `DataLoaders` object from `source`

Show in docs

dls.valid.show_batch(max_n=4, nrows=1)

Jeez that was really easy, this API is super nice so far! I’m also pretty chuffed with these shark photos, did you know hammerheads have 360 degree vision and stereo vision ahead and behind because of their eye shape?

Let’s copy the book and change the resize method to squish so we don’t crop away parts of the image.

doc(sharks.new)

DataBlock.new

DataBlock.new(item_tfms:list=None, batch_tfms:list=None)

Create a new `DataBlock` with other `item_tfms` and `batch_tfms`

Show in docs

sharks = sharks.new(item_tfms=Resize(128,ResizeMethod.Squish))
dls = sharks.dataloaders(path)
dls.valid.show_batch(max_n=4,nrows=1)

Ok, our sharks look squishy but we can see more of them at least.

sharks = sharks.new(item_tfms=Resize(128,ResizeMethod.Pad, pad_mode='zeros'))
dls = sharks.dataloaders(path)
dls.valid.show_batch(max_n=4, nrows=1)

Less squishy, easier to interpret here

At this point the book describes the problems with each of these approaches, in particular how we either lose signal with a crop, waste compute and lose resolution with the pad, or create unrealistic images with the squish and stretch. I won’t repeat them here, but it’s worth re-iterating the significance of these steps. Below we will run the RandomResizedCrop method, which randomly resizes and crops each image every epoch, creating the effect of looking at the same image with slightly different framing each time.

sharks = sharks.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))
dls = sharks.dataloaders(path)
dls.valid.show_batch(max_n=4, nrows=1, unique=True)

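This isn’t a great example since our shark appears nearly identical in each photo. As far as I can tell from the fastai docs, RandomResizedCrop takes a centre crop on the validation set rather than a random one, so there is little to vary here. Let’s try another.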

sharks = sharks.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))
dls = sharks.dataloaders(path)
dls.valid.show_batch(max_n=4, nrows=1, unique=False)

OK, we can see the hammerhead and great white get a bit squished and cut up here, which shows the point a bit better.
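The idea behind a random resized crop can be sketched in a few lines of plain Python. This is a simplified illustration and not fastai’s actual code: the function name, the aspect-ratio range and the centre-crop fallback are my own assumptions. Each call picks a box covering at least `min_scale` of the image area (give or take rounding), at a random position, which is then what would get resized to 128x128.

```python
import math
import random


def random_crop_box(w, h, min_scale=0.3, ratio=(3 / 4, 4 / 3), rng=random):
    """Pick a random crop box (left, top, width, height) inside a w x h image."""
    for _ in range(10):
        # Choose how much of the image area to keep and how wide the box is
        area = rng.uniform(min_scale, 1.0) * w * h
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        cw = int(round(math.sqrt(area * aspect)))
        ch = int(round(math.sqrt(area / aspect)))
        if 0 < cw <= w and 0 < ch <= h:
            left = rng.randint(0, w - cw)
            top = rng.randint(0, h - ch)
            return left, top, cw, ch
    # Fallback if no random box fitted: a centre square crop
    side = min(w, h)
    return (w - side) // 2, (h - side) // 2, side, side


rng = random.Random(0)  # seeded so the sketch is repeatable
for epoch in range(3):
    print(epoch, random_crop_box(400, 300, rng=rng))
```

Every epoch gets a different box for the same photo, which is the "slightly different framing" effect.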

Let’s have a look at some data augmentation transformations. These are ‘batch transformations’ and are passed with the keyword ‘batch_tfms’.

Data Augmentation

sharks = sharks.new(item_tfms=RandomResizedCrop(128, min_scale=0.3), batch_tfms=aug_transforms(mult=2))
dls = sharks.dataloaders(path)
dls.train.show_batch(max_n=8, nrows=2, unique=True)

doc(aug_transforms)

aug_transforms

aug_transforms(mult:float=1.0, do_flip:bool=True, flip_vert:bool=False, max_rotate:float=10.0, min_zoom:float=1.0, max_zoom:float=1.1, max_lighting:float=0.2, max_warp:float=0.2, p_affine:float=0.75, p_lighting:float=0.75, xtra_tfms:list=None, size:int|tuple=None, mode:str='bilinear', pad_mode='reflection', align_corners=True, batch=False, min_scale=1.0)

Utility func to easily create a list of flip, rotate, zoom, warp, lighting transforms.

aug_transforms looks like a really nice function for generating a bunch of transformations, how helpful!

Training a Model and Cleaning our Data

sharks = sharks.new(item_tfms=RandomResizedCrop(128, min_scale=0.5),
                   batch_tfms=aug_transforms())
dls = sharks.dataloaders(path)

# Make a learner

learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(4)

C:\Users\Nick\Anaconda3\envs\fastai\lib\site-packages\torchvision\models\_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
C:\Users\Nick\Anaconda3\envs\fastai\lib\site-packages\torchvision\models\_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)

epoch	train_loss	valid_loss	error_rate	time
0	2.417486	1.404847	0.409574	01:29

C:\Users\Nick\Anaconda3\envs\fastai\lib\site-packages\PIL\Image.py:979: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
  warnings.warn(

epoch	train_loss	valid_loss	error_rate	time
0	1.510436	1.147126	0.340426	01:31
1	1.216866	0.997361	0.271277	01:30
2	0.971167	0.774593	0.255319	01:31
3	0.792209	0.765518	0.234043	01:31

OK, time to check out what errors were made.

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

OK, a decent chunk of mistakes, but surprisingly good for how tiny the model is and how small the photographs are. The biggest error is tiger sharks being mistaken for great white sharks, while great white sharks are not mistaken for tigers, which is interesting. I’m sure a lot of this comes down to strange photographs with edits in them such as words, plus other problems like lighting and underwater photography in general being more complicated. Also, who’s getting close enough to sharks to take these kinds of photos? Certainly not me.
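The one-directional confusion I noticed (tigers read as great whites but not the reverse) is easy to see if you think of the confusion matrix as a count of (actual, predicted) pairs. Here is a minimal sketch in plain Python; the labels are made up purely for illustration and are not the notebook’s real results.

```python
from collections import Counter

# Made-up labels for illustration only, not the real validation results
actual = ["tiger", "tiger", "tiger", "great white", "great white", "bull"]
predicted = ["great white", "tiger", "great white", "great white", "great white", "bull"]


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


counts = confusion_counts(actual, predicted)
print("tiger -> great white:", counts[("tiger", "great white")])
print("great white -> tiger:", counts[("great white", "tiger")])
```

Each cell of the plotted matrix is one of these counts, so an asymmetric pair of cells means the mistake only goes one way.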

interp.plot_top_losses(5, nrows=2)

doc(interp.plot_top_losses)

Interpretation.plot_top_losses

Interpretation.plot_top_losses(k:int|list, largest:bool=True, **kwargs)

Show `k` largest(/smallest) preds and losses. Implementation based on type dispatch

Really interesting is that second photo of a bull shark being mistaken for a hammerhead. I think the angle of the photograph means the side fins are being mistaken for the distinctive hammerhead eyes.
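For intuition on how a "top losses" ranking comes about, here is a minimal sketch in plain Python, assuming a cross-entropy loss (the negative log of the probability given to the true class). The image names and probabilities are hypothetical, and this is not fastai’s implementation.

```python
import math


def cross_entropy(probs, target_idx):
    """Loss for one image: negative log of the probability given to the true class."""
    return -math.log(probs[target_idx])


# (name, predicted probabilities, index of the true class), all hypothetical
preds = [
    ("img_a", [0.90, 0.05, 0.05], 0),  # confident and right
    ("img_b", [0.10, 0.85, 0.05], 0),  # confident and wrong
    ("img_c", [0.40, 0.35, 0.25], 0),  # right but unsure
]

ranked = sorted(((cross_entropy(p, t), name) for name, p, t in preds), reverse=True)
top = [name for _, name in ranked[:2]]
print(top)
```

Confidently wrong predictions float to the top, followed by correct but unsure ones, which is why the top losses are a good place to go hunting for mislabelled photos.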

Let’s now use the cleaner object to improve the dataset.

cleaner = ImageClassifierCleaner(learn)
cleaner

No wonder a bunch of bull sharks were misinterpreted as great whites, they bloody well are! I’m not a shark expert, but the distinctive white underbelly, the rippling line between the grey and white, and the pointy nose are features of great white sharks, and I think a few of the bull sharks here are actually great whites, so I’ve changed the labelling to suit. Great white sharks are also known as white pointers, I suspect due to this very pointy nose shape.

# Apply the label changes I made in the cleaner above by moving each image into its new category folder
for idx,cat in cleaner.change():
    shutil.move(str(cleaner.fns[idx]), path/cat)
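Since the labels in this dataset are just folder names, relabelling a photo boils down to moving the file. Below is a small standalone sketch of that step using only the standard library; the `relabel` helper and the file names are hypothetical, and the collision check is an extra safeguard I’ve added rather than something the cleaner does for you.

```python
import shutil
import tempfile
from pathlib import Path


def relabel(image_path, dataset_root, new_category):
    """Move image_path into dataset_root/new_category, creating the folder if needed."""
    target_dir = Path(dataset_root) / new_category
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(image_path).name
    if target.exists():
        # Refuse to clobber a different photo that happens to share the name
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(image_path), str(target))
    return target


# Try it out in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "bull").mkdir()
    src = root / "bull" / "shark.jpg"
    src.write_bytes(b"not a real image")
    moved = relabel(src, root, "great white")
    print(moved.relative_to(root))
```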

Turning our Model into an Online Application

Now for the section where we deploy the model we’ve made. I’m actually keen on using some Azure services, as I’ve had to interact with that ecosystem at work and I’d like to get a little more into the weeds. If I end up spending way too much time deploying the model I’ll retreat to the guides provided within fastai, but I’m willing to step a little outside my comfort zone here.

Using the Model for Inference

Let’s quickly play around with exporting the model pickle and loading it back up to make a prediction.

model_path = Path("../data/models")
model_path.mkdir(parents=True, exist_ok=True)
learn.export(model_path/"shark.pk1")

model_path.ls()

(#1) [Path('../data/models/shark.pk1')]

We can load this model back into a new object, even though the learner is still in memory, to prove we can re-initialise the model we built.

inf_model = load_learner(model_path/"shark.pk1")

inf_model.predict("../data/sharks/great white/8bd3988d-a246-4cf9-9b74-390c1e704521.jpg")

('great white',
 TensorBase(1),
 TensorBase([0.0015, 0.9845, 0.0060, 0.0022, 0.0014, 0.0044]))

As mentioned in the book, we get the predicted category, the index of the predicted category in our learner’s vocab, and the probabilities of each category. We can check out the model vocab on the object and see that the 2nd value (index 1) is ‘great white’.

inf_model.dls.vocab

['bull', 'great white', 'hammerhead', 'lemon', 'sharks', 'tiger']
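To make the link between the three returned values concrete, here is a tiny sketch in plain Python using the vocab and probabilities printed above (copied into ordinary lists, so no fastai or tensors are needed). The predicted index is simply the position of the largest probability, and the predicted category is the vocab entry at that position.

```python
vocab = ['bull', 'great white', 'hammerhead', 'lemon', 'sharks', 'tiger']
probs = [0.0015, 0.9845, 0.0060, 0.0022, 0.0014, 0.0044]

# Index of the highest probability, then look it up in the vocab
pred_idx = max(range(len(probs)), key=probs.__getitem__)
pred = vocab[pred_idx]

print(pred, pred_idx, probs[pred_idx])
```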

btn_upload = widgets.FileUpload()
btn_upload

img = PILImage.create(btn_upload.data[-1])

img

out_pl = widgets.Output()
out_pl.clear_output()
with out_pl: display(img.to_thumb(512,512))
out_pl

pred, pred_idx, probs = inf_model.predict(img)

lbl_pred = widgets.Label()
lbl_pred.value = f"Prediction: {pred}, Probability: {probs[pred_idx]:.04f}"
lbl_pred

Looks great: we have some output widgets and an explanation of the model prediction, all from the widgets library and the learner object out of the box.

btn_run = widgets.Button(description="Classify")
btn_run

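def on_click_classify(change):
    img = PILImage.create(btn_upload.data[-1])
    out_pl.clear_output()
    with out_pl: display(img.to_thumb(128,128))
    # Re-run the prediction on the newly uploaded image rather than reusing the earlier result
    pred, pred_idx, probs = inf_model.predict(img)
    lbl_pred.value = f"Prediction: {pred}, Probability: {probs[pred_idx]:.04f}"

btn_run.on_click(on_click_classify)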

VBox([widgets.Label("Select your Shark"),
     btn_upload, btn_run, out_pl, lbl_pred])

Apologies if You’re Reading on Quarto

Some of the widgets and cleaners that I’ve shown here do not render natively on Quarto. Please check out the source code under my posts folder of notebooks that drives this website; it should be titled Chapter 2.

Moving to a Real App

This is the section where the book describes Voila; I think Gradio is used in the lecture. I’ve also come across Vercel as infrastructure to deploy to, via the fastai study group contributor @ielka. I’ve thought about using Azure myself and will revisit this in the future, as I’d like to get deeper into the course before getting stuck in the details of deployment, but I will keep it in the back of my mind.

Deployed on Gradio and HuggingFace Spaces

I’m going to follow Jeremy’s example and simply use Gradio and HuggingFace for now. Below is the exported script I’m going to use with nbdev to deploy to HuggingFace.

Check out the Shark Classifier here

# import numpy as np
# import gradio as gr
# from pathlib import Path
# from fastai.vision.all import *


# import pathlib
# import platform
# system = platform.system()
# if system == 'Linux': pathlib.WindowsPath = pathlib.PosixPath

# model_path = Path("./shark.pk1")
# inf_model = load_learner(model_path)

# categories = ['Bull', 'Great White', 'Hammerhead', 'Lemon', 'Sharks', 'Tiger']

# def classify_image(img):
#     pred, idx, probs = inf_model.predict(img)
#     return dict(zip(categories, map(float,probs)))

# image = gr.inputs.Image(shape=(192,192))
# label = gr.outputs.Label()

# examples = ["great white.jpg", "hammerhead.jpg", "bull.jpg"]

# demo = gr.Interface(fn=classify_image, inputs=image, outputs=label, examples=examples)
# demo.launch()

# from nbdev.export import nb_export

# nb_export('2022-12-23-Fastai Chapter 2.ipynb', name="app")

from fastai.vision.all import *
from pathlib import Path

data_base = Path("../data/rsna/")
get_image_files(data_base/"train_images")

(#2379) [Path('../data/rsna/train_images/10006/20221225/10006_jpg/UnknownDate_UnknownTime_UnknownModality_UnknownAccNum/Ser_1459541791.jpg'),Path('../data/rsna/train_images/10006/20221225/10006_jpg/UnknownDate_UnknownTime_UnknownModality_UnknownAccNum/Ser_1864590858.jpg'),Path('../data/rsna/train_images/10006/20221225/10006_jpg/UnknownDate_UnknownTime_UnknownModality_UnknownAccNum/Ser_1874946579.jpg'),Path('../data/rsna/train_images/10006/20221225/10006_jpg/UnknownDate_UnknownTime_UnknownModality_UnknownAccNum/Ser_462822612.jpg'),Path('../data/rsna/train_images/10011/20221225/10011_jpg/UnknownDate_UnknownTime_UnknownModality_UnknownAccNum/Ser_1031443799.jpg'),Path('../data/rsna/train_images/10011/20221225/10011_jpg/UnknownDate_UnknownTime_UnknownModality_UnknownAccNum/Ser_220375232.jpg'),Path('../data/rsna/train_images/10011/20221225/10011_jpg/UnknownDate_UnknownTime_UnknownModality_UnknownAccNum/Ser_270344397.jpg'),Path('../data/rsna/train_images/10011/20221225/10011_jpg/UnknownDate_UnknownTime_UnknownModality_UnknownAccNum/Ser_541722628.jpg'),Path('../data/rsna/train_images/10025/20221225/10025_jpg/UnknownDate_UnknownTime_UnknownModality_UnknownAccNum/Ser_1365269360.jpg'),Path('../data/rsna/train_images/10025/20221225/10025_jpg/UnknownDate_UnknownTime_UnknownModality_UnknownAccNum/Ser_288394860.jpg')...]