# Deep Learning Paper Reading Notes

creation date: 2018-01-11, latest update: 2020-06-17

## Background

Based on Awesome Deep Learning Papers, plus my own additions and literature summaries

## Famous Machine Learning Conferences

• NIPS - general machine learning (US)
• ICML - general machine learning (international)
• CVPR - computer vision (US)
• ECCV - computer vision (European)
• ICCV - computer vision (international)
• SIGGRAPH - animation, computer graphics

Human Related

Baby Related

## Grasping

### Dataset

• (2016) ACRV Picking Benchmark (APB)
• (2015) YCB Object Set
• GGCNN paper proposes a set of 20 reproducible items for testing, comprising 8 3D-printed adversarial objects from the Dex-Net paper and 12 items from the APB and YCB object sets, which provide a wide enough range of sizes, shapes and difficulties to effectively compare results while not excluding use by any common robots, grippers or cameras
• (2013) Cornell grasping dataset - 1k RGBD with grasp bbox labels

## Body pose estimation

### Dataset

big list of both body and hand datasets

### Papers

===================================================================

## Hand pose estimation

The most challenging part about this is not the architecture, but the lack of large, clean, public datasets.

### Dataset

list of more datasets here

### Hand Papers

Most of the papers use depth-only or RGB-D data to estimate hand pose... It is probably possible to convert RGB to depth with another model, but that might be even slower.

• List of generally good papers with performance benchmark here --> Awesome hand pose estimation
• List of papers with notes from researcher student's personal wiki --> inria wiki
• Accepted papers from Hands 2017 conference

• (2017) Hand Keypoint Detection in Single Images using Multiview Bootstrapping - openpose
• good accuracy but quite slow. The paper says it can run in real time but never provides any benchmark.
• 2D hand pose estimation from RGB image
• starts by building a multiview dataset with good labels - important - crop each hand image using body pose to estimate the area
• train a detector to predict joint locations on each image
• average & constrain in 3D space from multiple views (of the same hand instance)
• get 3D point labels (used as ground truth for the next iteration)
• continue until all the images are properly labeled
• Detector Architecture: based on CPM with some modifications
• Stage 1:
• Pass input images into a few CNN+Pooling layers to extract feature-maps.
• pass through a few more CNN layers to predict belief maps
• Stage 2:
• Again, pass input images into a few CNN+Pooling layers to extract feature-maps. These layers have different weights from Stage 1
• concatenate with belief maps from Stage 1
• use that to pass through a few more CNN layers to predict more refined belief maps
• Stage 3 and onward: Use stage 2 architecture and repeat.
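The stage-wise refinement above can be sketched conceptually in numpy. The dummy stage functions here stand in for the "few CNN layers" of each stage; shapes and names are illustrative, not the real CPM layers:

```python
import numpy as np

def run_stages(features, stage_fns):
    # stage 1 sees only the extracted feature maps; each later stage sees the
    # feature maps concatenated with the previous stage's belief maps
    beliefs = stage_fns[0](features)
    outputs = [beliefs]
    for fn in stage_fns[1:]:
        stacked = np.concatenate([features, beliefs], axis=0)  # channel-wise concat
        beliefs = fn(stacked)
        outputs.append(beliefs)
    return outputs
```

Intermediate belief maps from every stage are kept, since CPM-style models supervise each stage's output.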
• (2017) Learning to Estimate 3D Hand Pose from Single RGB Images
• This is the Zimmerman paper
• 3 Networks are used sequentially
• hand localization through segmentation
• 21 keypoint (2D) localization in hand
• deduction of 3D hand pose from 2D keypoints
• (2017) SubUNets: End-to-end Hand Shape and Continuous Sign Language Recognition
• (2015) A novel finger and hand pose estimation technique for real-time hand gesture recognition - potential
• several ways to represent the hand model, with varying complexities -- good way to think about feature representation
• This is not a deep learning paper, but there are several techniques for pre-processing the RGB images to make them easier for the architecture to learn hand pose.

## Anomaly Detection (Images / Videos)

• Overview
• currently there are 3 main approaches
1. clustering or nearest neighbor
2. learn from 1-class (normal) data and draw a boundary using SVM etc.
3. feature reconstruction of what is considered "normal" and compared diff against the sample.
• recently DL methods focus on the 3rd approach using autoencoders and GANs
• Awesome list of anomaly detection
• (2017) Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery (AnoGAN), Schlegl. / code
• train normal GAN setup to get D and G (in this case they use DCGAN)
• now get new (potential anomaly) image called x
• back-optimize the input z of G, using x
• we then use 2 kind of losses to measure anomaly score
• residual loss RL(x) = sum(abs(x - G(z)))
• feature discrimination loss DL(x) = sum(abs(Df(x) - Df(G(z))))
• where Df is a function to get mid-level features from D
• total loss A(x) = lambda * DL(x) + (1 - lambda) * RL(x), where they found lambda = 0.1 works best
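The two losses and the combined score can be sketched in numpy, assuming we already have the reconstruction G(z*) for a query x and mid-level discriminator features Df(.) as plain arrays:

```python
import numpy as np

def residual_loss(x, g_z):
    # RL(x) = sum of absolute pixel differences between x and G(z*)
    return float(np.sum(np.abs(x - g_z)))

def discrimination_loss(df_x, df_gz):
    # DL(x) = sum of absolute differences between mid-level D features
    return float(np.sum(np.abs(df_x - df_gz)))

def anomaly_score(x, g_z, df_x, df_gz, lam=0.1):
    # A(x) = lambda * DL(x) + (1 - lambda) * RL(x); the paper found lambda = 0.1 best
    return lam * discrimination_loss(df_x, df_gz) + (1 - lam) * residual_loss(x, g_z)
```

A higher A(x) means the image is harder to reconstruct from the learned manifold, i.e. more anomalous.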
• (2018) Efficient GAN-Based Anomaly Detection, Zenati / open-review / code
• From AnoGAN, replacing DCGAN with BiGAN, so that we can have (E)ncoder as inverse mapping from x to z
• they use the following score function to detect anomalies
• total score A(x) = alpha*LG(x) + (1 - alpha)*LD(x)
• reconstruction loss LG(x) = abs( x - G(E(x)) )
• Discriminator loss LD(x) can be defined in two ways
• cross-entropy (CE): between D(x,E(x)) and 1
• feature-matching (FM): L1 loss (absolute diff) between mid-level logits of D(x,E(x)) and D(G(E(x)),E(x))
• experiments show that performance between CE and FM is data-specific
• This is the follow-up work from the Efficient Anogan paper author
• they added Spectral Normalization and additional Discriminators to get higher accuracy. (All reasonable ideas, however the improvement isn't that clear-cut, looking at the ablation study)
• Dataset Tested: KDD, Arrhythmia, CIFAR10, SVHN
• (2019) [ICLR'19] Do Deep Generative Models Know What They Don't Know?
• (2018) Generative Ensembles for Robust Anomaly Detection
• (2018) An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos, Kiran
• this applies specifically to anomaly detection in videos, with these datasets:
• UCSD Dataset: pedestrians (normal) vs cyclist/wheelchairs (abn) etc.
• CUHK Avenue Dataset: unusual object or behaviors in Subway
• UMN Dataset: unusual crowd activity
• Train Dataset: unusual movement of people on trains
• London U-turn dataset: normal traffic vs jaywalking/firetruck
• Methods categorized as following
• Representation learning: PCA, Autoencoders (AEs) --> monitor deviation
• Predictive modeling: autoregressive models, LSTMs --> predict next frame distributions
• Generative model: VAEs, GANs, adversarial AEs (AAEs) --> likelihood
• evaluation:
• there are two input options: raw images or optical flow. Flow works much better across the board
• no model came out consistently on top, and PCA with flow did surprisingly well.
• (2017) Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks, Liang / open-review
• train a DNN model with class of in-distribution data = 1 and others = 0. (I think at training time, the target is always 1)
• at test time, two transformations are proposed for better detection
• temperature scaling (T) of softmax probabilities (per Hinton's distillation paper); T is within the range [1, 1000]
• small perturbations by a gradient of its own raw image's softmax-score. the scaling factor is in [0,0.004]
• two key insights:
• Temperature scaling makes the network less sure and expands the outlier area (the 90-100% probability region)
• Perturbation mainly affects in-distribution data and has almost no effect on out-of-distribution data
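Temperature scaling alone is easy to sketch; the gradient-based perturbation step is omitted here since it needs access to the network's gradients. Names are illustrative:

```python
import numpy as np

def odin_score(logits, T=1000.0):
    # temperature-scaled max softmax probability; a higher score suggests
    # the input is more likely in-distribution
    z = logits / T
    e = np.exp(z - z.max())  # subtract max for numerical stability
    probs = e / e.sum()
    return float(np.max(probs))
```

At large T the softmax flattens toward uniform, which is exactly the "less sure" effect noted above; detection then thresholds this score.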
• (2018) [NIPS'18] Deep Anomaly Detection Using Geometric Transformations
• using target as "transformation #i" for the labels while training
• for the simple normality score, take the softmaxed prediction for each transformation, then compute the mean. The higher, the more likely it is a normal image.
• for the full Dirichlet normality score, we need to estimate alpha first and the formula is a bit more complex.
• intuition is that:
• while training (which are all normal images), the model will learn to detect types of geometric transformation.
• on testing, if we have abnormal images, the model will be less sure of the type of transformation used.
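A minimal sketch of the simple normality score, assuming the model outputs one row of logits per applied transformation (row i = output for the image under transformation i):

```python
import numpy as np

def row_softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def simple_normality_score(logits):
    # logits: (num_transforms, num_transforms). The score is the mean softmax
    # probability assigned to the transformation that was actually applied,
    # i.e. the diagonal of the softmaxed matrix.
    probs = row_softmax(logits)
    return float(np.mean(np.diag(probs)))
```

A confident model on a normal image puts high probability on the applied transformation (score near 1); on an abnormal image the predictions flatten and the score drops toward 1/num_transforms.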
• (2018) [NIPS'18] A loss framework for calibrated anomaly detection
• (2018) GANomaly: Semi-Supervised Anomaly Detection via Adversarial Training
• (2018) Improving Unsupervised Defect Segmentation by Applying Structural Similarity to Autoencoders
• for reconstruction-type anomaly segmentation, using SSIM instead of L2 loss improved the quality substantially.
• these guys are from a machine vision company, so this idea is probably in actual production.
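A minimal global-SSIM sketch of the comparison used here; the paper computes windowed SSIM maps, whereas this computes a single score, and the constants are the usual defaults for images scaled to [0, 1]:

```python
import numpy as np

def ssim(a, b, c1=0.01**2, c2=0.03**2):
    # structural similarity between two images/patches: compares luminance
    # (means), contrast (variances) and structure (covariance) in one formula
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2))
                 / ((mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)))
```

For defect segmentation, 1 - SSIM computed per window over the autoencoder reconstruction gives the anomaly map; low-SSIM regions mark likely defects.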

## Anomaly Detection (Time Series)

• (2018) (Articles) GAN Series (from the beginning to the end)
• (2014) Generative adversarial nets, I. Goodfellow et al.
• Objective is to get the distribution of generated samples (Pg) as close to the distribution of real data (Px) as possible
• using a minimax game of fight between discriminator (D) and generator (G)
• the learning process is like this: uniform z --> G(z) --> D(G(z))
• we alternate between D(x) and D(G(z)) to learn D
• the objective is: min over G, max over D of V(D,G) = log(D(x)) + log(1 - D(G(z)))
• at the optimal D, this is equivalent to C(G) = -log(4) + 2*JS(Px || Pg)
• JS is Jensen-Shannon Divergence
• a little trick for G to get sizable gradients: instead of minimizing log(1 - D(G(z))), maximize log(D(G(z)))
• note that the theory calls for optimizing Pg directly, but in practice we approximate it with the function G; the better or more powerful G is, the closer we get to Pg
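The two objectives can be written down directly, assuming D already outputs probabilities in (0, 1) (numpy arrays stand in for batches of discriminator outputs):

```python
import numpy as np

def d_loss(d_real, d_fake):
    # D maximizes log D(x) + log(1 - D(G(z))); written as a loss, we minimize
    # the negative of that value
    return float(-(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))))

def g_loss_nonsaturating(d_fake):
    # the "little trick": G maximizes log D(G(z)) instead of minimizing
    # log(1 - D(G(z))), which gives much larger gradients early in training
    return float(-np.mean(np.log(d_fake)))
```

At the theoretical equilibrium D(x) = D(G(z)) = 0.5, so the value inside d_loss is log(1/2) + log(1/2) = -log(4), matching the C(G) constant above.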
• (2016) Adversarial Feature Learning (BiGAN), Donahue
• add an Encoder to do inverse mapping. the setup is like this:
• (G)enerator: G(z) approximates x
• (E)ncoder: E(x) approximates the latent space vector z (200D of [-1,1])
• (D)iscriminator: receives an input tuple of either (z, G(z)) or (E(x), x), then outputs the probability of the input being real
• this paper shows proof that if we have a perfect Discriminator, then G and E must be inverse mappings of each other
• they tried it on MNIST, where it works quite well. Then it failed on ImageNet -- the model fails to generate realistic-looking images, although comparing x and G(E(x)) shows some superficial consistency, like the same structure or color etc.
• (2016) Improved techniques for training GANs, T. Salimans et al.

## Understanding / Generalization / Transfer

• keypoints
• through empirical evidence, researchers noticed that for all CNN models, the first 1-3 layers are similar
• the higher layers (after three) are more specific to the classification task
• we want to test how "general" or "specific" each layer is
• train real-image classification CNNs (7 layers) model-A and model-B, using completely separate classes
• freeze the 3 lowest layers from model A, then put the 4 higher layers on top with random weights, then train with model B's dataset
• the resulting accuracy does not change
• and actually if we don't freeze (let it fine-tune), the accuracy is higher (it generalizes better)
• keypoints
• comparison of state-of-the-art "manual" feature engineering (SIFT etc.) vs "OVERFEAT" CNN
• Summary from the paper:

It’s all about the features! SIFT and HOG descriptors produced big performance gains a decade ago and now deep convolutional features are providing a similar breakthrough for recognition.

Thus, applying the well-established computer vision procedures on CNN representations should potentially push the reported results even further. In any case, if you develop any new algorithm for a recognition task then it must be compared against the strong baseline of generic deep features + simple classifier.

• keypoints
• same idea as the "transferable features in DNN" paper
• use the pre-trained weights from task A (ImageNet) to apply to task B (Pascal)
• they transferred all the weights (all CNN and FC layers), froze them, and added 2 FC layers at the end to adapt to the new output
• for task B (Pascal), the pictures are cropped to specific objects, so they use a sliding window to generate new pics + a "background" class
• keypoints
• Building from 2011 papers, they use deconvnet to analyze the CNN layers.
• (2014) Decaf: A deep convolutional activation feature for generic visual recognition, J. Donahue et al.
• keypoints
• train the complex model first (model-A)
• then train a simpler one using loss function that combines (same dataset) and (model-A prediction)
• divide by a certain constant (lambda) to change how sensitive the difference between classes is
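A hedged numpy sketch of this distillation-style combined loss: cross-entropy against the complex model's softened predictions mixed with cross-entropy against the hard labels. The temperature T and mixing weight alpha here are illustrative values, not the paper's:

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax; larger T gives softer probabilities
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    # soft term: cross-entropy of the student against the teacher's softened output
    soft_teacher = softmax(teacher_logits, T)
    soft_student = softmax(student_logits, T)
    soft_loss = -np.sum(soft_teacher * np.log(soft_student + 1e-12))
    # hard term: ordinary cross-entropy against the dataset label
    hard_loss = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    return float(alpha * soft_loss + (1 - alpha) * hard_loss)
```

The soft targets carry information about how the complex model ranks the wrong classes, which is what the simpler model learns from.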
• keypoints
• use the CNN model's prediction probabilities as input
• use an evolution algorithm to evolve a random image to fool the model
• some images are similar to the "real" thing, some look just like static TV noise
• using the "static" images to retrain still makes it difficult to patch up the weakness
• is this similar to adversarial network?

## Image: Object Detection

• (2020) Object Detection and Tracking in 2020 (medium article)
• (2020) IterDet: Iterative Scheme for ObjectDetection in Crowded Environments
• could be useful for edge devices, similar to progressive gif image loading.
• (2018-09) recent advances in object detection in the age of deep CNNs
• YOLO family
• YOLOv1
• simple network design, one-shot detector
• result (voc 07-12) - mAP(0.5) 63.4 with 45 FPS at 448x448 on Titan X
• YOLOv2
• add batch normalization, able to train deeper network
• double input resolution 224x224 --> 448x448 (also in Imagenet pretraining)
• add anchor box priors, with custom clustering to find the best priors
• result (voc 07-12) - mAP(0.5) 78.6 with 40 FPS at 544x544 on Titan X
• YOLOv3
• predict boxes at 3 different scales (similar to SSD)
• use skip connection (upsampled then concat layers)
• much deeper feature extractors (Darknet-53)
• result (COCO) - mAP(0.5) 57.9 with 20 FPS at 608x608 on Titan X
• R-CNN family
• R-CNN: Selective search → Cropped Image → CNN
• Fast R-CNN: Selective search → Crop feature map of CNN
• Faster R-CNN: CNN → Region-Proposal Network → Crop feature map of CNN
• Best accuracy but slow

## Reinforcement Learning / Robotics

• (2016) End-to-end training of deep visuomotor policies, S. Levine et al.
• (2016) Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection, S. Levine et al.
• (2016) Asynchronous methods for deep reinforcement learning, V. Mnih et al.
• (2016) Deep Reinforcement Learning with Double Q-Learning, H. Hasselt et al.
• (2016) Mastering the game of Go with deep neural networks and tree search, D. Silver et al.
• (2015) Continuous control with deep reinforcement learning, T. Lillicrap et al.
• (2015) Human-level control through deep reinforcement learning, V. Mnih et al.
• (2015) Deep learning for detecting robotic grasps, I. Lenz et al.
• keypoints
• Q-learning is a reinforcement learning algorithm. It is suitable for problems with a finite number of states where we know every state's immediate reward.
• the main idea is to do semi-random exploration to eventually map out the expected reward value of each state. The expected value is the sum of the current and all future rewards (given a discount factor).
• So we will have a big reward matrix (R) where the row is the current state and the column is an action leading to the next state. The values are the rewards for taking that action (and arriving at the new state).
• We will also have a memory matrix (Q), which contains the sum of expected immediate and future rewards. The row is the current state and the column is the next future state.
• the update formula is as follows:
• Q(state,action) = R(currentstate,action) + Gamma * max[ Q(immediatenextstate,allactions) ]
• where...
• R = reward matrix
• Q = memory matrix
• Gamma = discount factor
• This assumes a learning rate of 1. If we want a different learning rate, we can do:
• Qnew = Qold + learningrate * (Qupdate - Qold)
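The update rule above can be exercised on a toy problem. This 5-state corridor environment (states 0..4, actions 0 = left / 1 = right, reward 1 only on reaching terminal state 4) is made up for illustration:

```python
import numpy as np

def q_learning(episodes=500, gamma=0.9, lr=0.5, eps=0.5, max_steps=300, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((5, 2))  # Q[state, action]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            # epsilon-greedy: explore randomly with probability eps, else greedy
            a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == 4 else 0.0
            future = 0.0 if s_next == 4 else np.max(Q[s_next])
            # Qnew = Qold + learningrate * (Qupdate - Qold)
            Q[s, a] += lr * (r + gamma * future - Q[s, a])
            s = s_next
            if s == 4:
                break
    return Q
```

After training, the greedy policy argmax(Q[s]) should point right in every state, and the values decay by the discount factor per step away from the goal (Q[3,1] → 1.0, Q[2,1] → 0.9, ...).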
• (2015) David Silver's excellent reinforcement learning course with video
• Agents, Environments, Actions, Rewards
• Full information game --> Agent state = Environment state
• History = sequences of Observations, Agent States and Actions.
• Markov property means P(St+1 | St) = P(St+1 | S1, ..., St), so previous states don't matter.
• partially observable markovs (POMDP)
• Policy = function that maps from Agent state to Action
• Value function = estimates total future reward given current state St
• keypoints
• In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic.
• General RL concepts
• Reward-Driven Behavior
• the essence of RL is interaction. the interaction loop is simple:
1. given current state --> choose action
2. execute action
3. arrives at new state (received new state data and its rewards)
4. go to 1. until terminal state
• Per the sequence above, we want to derive an "optimal policy" so that the agent asymptotically gets "optimal" rewards --> meaning the highest expected value of aggregated future rewards under a certain discount factor.
• Formally, RL can be described as a Markov decision process (MDP). For (only) partially-observable states like in the real world, there is a generalization of MDP called POMDP.
• Challenges in RL: long sequences until reward (credit assignment problem) and temporal sequence correlation
• Reinforcement Learning Algorithms
• Concept I: estimating Value function (total expected Rewards)
• Dynamic Programming:
• define: V = total expected reward (R); Q|s,a is the conditional V given state s and action a
• define: Y = R(t) + disc * Q|s(t+1),a(t+1)
• define: Temporal difference (TD) error = Y - Q|s,a
• to get Q|s,a, we use the Q-learning method and try to minimize the TD error
• Concept II: sampling -- random walk till the end to get all Rs
• we can use Monte Carlo (MC) to get multiple returns and average them.
• it is easier to learn that one action leads to much better consequences than another (a fork in the road)
• define: relative advantage A = Q - V
• we use an idea of "advantage update" in many recent algorithms
• Concept III: policy search
• instead of estimating the value function, we try to construct the policy directly (so we can sample actions from it).
• try several policies to get the optimal one, using either gradient-based or gradient-free optimization.
• get the approximate V diff between different policies
• iterate the policy parameters to see the diff for each one
• change the params to optimize the policy
• there are several ways to estimate the diff -- Finite Difference, Likelihood Ratio etc.
• Actor-Critic Methods
• Use Actor (policy driven) to choose actions and learn feedback from Critic (value function).
• Alphago uses this
• Summary
• Shallow sequence, no branching --> one-step TD learning
• Shallow sequence, many branching --> dynamic programming
• Deep sequences, no branching --> many-steps (MC) TD learning
• Deep sequence, many branching --> exhaustive search

## Credit card fraud detection

• (2014) Literature Survey

• algorithms
• HMM
• NN
• Decision Tree
• SVM
• Genetic Algorithm
• Meta Learning Strategy
• Biological Immune System

## Weather Classification

• Overall Summary as of [2018-10]

There is no agreed-upon public dataset and there are very few DL papers dedicated to the topic.

The common dataset used is the (2014) sunny/cloudy dataset with 10k images. Other recent papers (2018) have constructed their own datasets, which are not open to the public yet. However, the BDD100K dataset also has weather attributes labeled, so we should consider using that.

There are 3 types of models proposed thus far.

1. (2014) traditional feature engineering then use SVM/other clustering methods.
2. (2015) pure CNN feature extraction then classify
3. (2018) CNN-RNN and/or the combination of DL and traditional features.

So far, the DL methods did outperform the traditional ones.

New alternative would be to add new sensor data (temperature/humidity) and ensemble with CNN model. For that matter, how accurate would predictions from sensor data alone be?

• (2018) (2 Dataset) A CNN–RNN architecture for multi-label weather recognition (use sci-hub to get the link)

• keypoints
• recognizes that weather classes are not mutually exclusive (for example, it can be both sunny and foggy), so it should classify accordingly (not using softmax or binary)
• add 2 new datasets (8k - 7 classes) and (10k - 5 classes) for multi-labeling comparison
• use CNNs as feature extractor
• use "channel-wise attention", which is a set of weights to amplify/lower each channel's response.
• use "Convolutional" LSTM to retain spatial information (not flattening to 1-D vectors)
• flatten the output "hidden state" to predict weather class
• then we repeat the step (LSTM + getting new attention weights) to predict the next weather class. If there are 5 classes, the LSTM runs for 5 steps. (This is weird... because the problem is not time-based, and this runs from a single image input.)
• keypoints
• new dataset (3K) - use 3 classes (rain, fog, snow) with equal split
• later add sunny/cloudy from past dataset to get 5k (again, equal split)
• In addition to the raw image, they use superpixels (an algorithm that clusters pixels together for further processing - google it) to overlay on the image, then feed it to CNN feature extractors
• finally, use some sort of SVMs as binary classifier for each class
• overall achieved around 80-90% accuracy, with Resnet50 being the best extractor overall.
• however, there is no mention of a baseline (w/o superpixel) comparison. No justification for these choices, not even running their model on the old sunny/cloudy dataset for comparison. Bad paper.
• keypoints
• tried Resnet-18 with various experiments on custom 400k rain-no-rain dataset
• just bad all around. specific optimization to specific dataset. no baseline model. not useful.
• keypoints
• use sunny/cloudy 10k dataset
• applies AlexNet architecture to this problem
• also compared AlexNet pretrained on ImageNet + SVM vs training on weather data from scratch - the conclusion is the earlier base layers are quite general
• achieved 91% accuracy (82% normalized)
• keypoints
• introduces the 10k weather dataset with 2 classes - sunny and cloudy
• use traditional computer vision method to classify
• custom feature engineering extracting 5 features -- sky, shadow, reflection, contrast, haze.
• concat all features into 621-D vectors, then use complex voting schemes to classify based on the existence of combinations of features. Tried SVM but it didn't work well.
• achieved 76% accuracy (53% normalized)

## Face Detection

• Dataset: WiderFace
• 30K images, 400k faces.
• metric is PR curve, split by easy / medium / hard cases
• (2004) Robust Real-time Object Detection (Viola-Jones)
• Traditional system with impressive performance

Input = 384x288 grayscale image, 15 FPS on 700 Mhz Intel Pentium III

1. Features = sum of two regions and diffs with each other (for every pixel coordinate)
2. Since there are a lot of features, use AdaBoost to select a set of the strongest weak classifiers. A weak classifier is basically this --> H = if single feature > threshold then 1 else 0
3. Attentional cascade - train a simple 2-feature classifier to simply reject no-face image. Then queue up all the sub-windows (overlap cropping?), evaluate and reject, then use stronger classifier from #2 on the remaining sub-windows.
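Steps 1-2 can be sketched in numpy. The integral image gives O(1) rectangle sums, and a Haar-like feature is just a difference of two `rect_sum` calls; names and the polarity parameter follow the usual Viola-Jones formulation:

```python
import numpy as np

def integral_image(img):
    # ii[r, c] = sum of img[0:r+1, 0:c+1]
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    # sum over img[r0:r1, c0:c1] in O(1) using the integral image
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def weak_classifier(feature_value, threshold, polarity=1):
    # H = 1 if polarity * feature > polarity * threshold else 0
    return 1 if polarity * feature_value > polarity * threshold else 0
```

AdaBoost then searches over (feature, threshold, polarity) triples, keeping the ones with the lowest weighted error as the strong classifier's terms.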
• (2014) One millisecond face alignment with an ensemble of regression trees - Dlib uses this
• Uses the cascade-of-regressors method to detect facial landmarks (given that the image is already cropped to the face area). Claims 1 ms performance on an unknown CPU. Has an error rate of 0.049 on the HELEN face dataset (2,000 training / 500 test images).
• Algo = Default positions + features + gradient boosting + cascade
• we can set up default landmark positions (a smiley face) in the image center, or use an average of positions from a big dataset.
• then we regress -- computing an update for each landmark's x,y --> moving them closer to the face in the image.
• the features for the regressors are diffs in pixel intensities; the pixel coordinates are relative to the default face shape.
• (2017) FaceBoxes: A CPU Real-time Face Detector with High Accuracy
• custom (lightweight) CNN architecture. No novel idea. (The paper has a good summary of past papers, however.)
• runs at 20 FPS on a single CPU core and 125 FPS using a GPU for VGA (640x480) images.
• some strategies for a lightweight architecture
• reduce the spatial size of the input as quickly as possible
• choose suitable kernel sizes - in their case 7x7, 5x5, 3x3
• reduce the number of output channels
• use multi-scale anchor boxes output, but know where to have "dense" number of predictions.
• postprocessing is the common pipeline: lots of predictions > probability thresholding > NMS.
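The NMS step of that pipeline, as a minimal numpy sketch with boxes given as (x1, y1, x2, y2):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # greedily keep the highest-scoring box, drop boxes overlapping it too much
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # intersection rectangle between box i and the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep
```

Thresholding probabilities first keeps the candidate set small, so this O(n²) loop stays cheap in practice.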
• (2017) Deep Face Recognition: A Survey
• Good review of modern face recognition systems; a collection of recent techniques. It's not face detection though.
• (2018) SFace: An Efficient Network for Face Detection in Large Scale Variations (Megvii Inc. Face++)
• A new dataset called 4K-Face is also introduced to evaluate the performance of face detection with extreme large scale variations.
• The SFace architecture shows promising results on the new 4K-Face benchmarks.
• In addition, our method can run at 50 frames per second (fps) with an accuracy of 80% AP on the standard WIDER FACE dataset, which outperforms the state-of-art algorithms by almost one order of magnitude in speed while achieves comparative performance.
• Benchmark - Labeled Faces in the Wild (LFW) dataset - state of the art results
• most commercial systems get > 99.0% classification accuracy, including Dlib
• update as of beginning of 2018

## Own discovery of Research Papers

• (2017) Mobilenets
• keypoints
• iterates on the 2010 paper, adding unpooling reconstructions with switches (location info for the max-pooled values)
• they are able to re-create the input-size map for all layers
• (2010) Deconvolutional Networks

• keypoints
• using the "global average pooling" method on each feature map of the last CNN layer.
• then we can use the FC weights to combine the GAP values.
• this effectively "focuses" the network activations before connecting to the FC layer.
• with this we can generate a heatmap to see the activation overlays
• this is basically an autoencoder, except with a CNN architecture. Also uses the final targets as the segmentation labels.
• precisely simulating the brain chemically is very difficult. However, we can possibly create a brain model that is "computationally" accurate. We could even use this model to experiment and fix what's wrong with our brains.
• Computationally means understanding the subject's functions -- enough to create a replica of them. For example, we don't yet understand everything about kidneys, but we can create artificial ones that work well now.
• What we know now: very little, but we know some "constraint" rules
• brain component allometry -- relative size of the brain components vs overall size. The relationship holds across all animal size.
• telencephalic uniformity -- neurons throughout the forebrain have similar, repeatable designs with only a few exceptions. This means there is a general representation across a wide variety of tasks -- audio, visual, touch etc.
• anatomical and physiological imprecision -- the neurons are slow and sloppy (probabilistic). However, the brain is overall working in a robust way.. how?
• task specification -- a classification given freeform input. One example is a call support desk. Given a free-form input, direct the customer to appropriate channels. It is highly contextual and no hard rules applied.
• parallel processing -- the neuron circuits are painfully slow compared to a computer CPU; it seems that the power of the brain lies in its massively parallel computing power.
• Current progress
• basal ganglia -- this is the area that receives sensory input, manages the reward and punishment mechanism, and learns motor skills. We are close to computationally simulating this.
• neocortex -- yeah, no way we are close. Interestingly, the neocortex is connected with basal ganglia through a loop. We are close to successfully creating all the sensory prosthetics, but no way close to simulating the neocortex (higher thoughts).
• the most exciting area of research today is about how the neocortex encode the internal representations of concepts and objects.

## Other papers still unassorted

• ABSTRACT:
• Transfer and multi-task learning have traditionally focused on either a single source-target pair or very few, similar tasks.
• Ideally, the linguistic levels of morphology, syntax and semantics would benefit each other by being trained in a single model. We introduce such a joint many-task model together with a strategy for successively growing its depth to solve increasingly complex tasks. All layers include shortcut connections to both word representations and lower-level task predictions.
• We use a simple regularization term to allow for optimizing all model weights to improve one task’s loss without exhibiting catastrophic interference of the other tasks. Our single end-to-end trainable model obtains state-of-the-art results on chunking, dependency parsing, semantic relatedness and textual entailment.
• It also performs competitively on POS tagging. Our dependency parsing layer relies only on a single feed-forward pass and does not require a beam search.
• This is kind of like ensembling models, but they are more "joined" at the end (softmax layer and feature layer), rather than just averaging results from softmax.
• ABSTRACT:
• Memory networks are neural networks with an explicit memory component that can be both read and written to by the network.
• The memory is often addressed in a soft way using a softmax function, making end-to-end training with backpropagation possible.
• However, this is not computationally scalable for applications which require the network to read from extremely large memories.
• On the other hand, it is well known that hard attention mechanisms based on reinforcement learning are challenging to train successfully.
• In this paper, we explore a form of hierarchical memory network, which can be considered as a hybrid between hard and soft attention memory networks.
• The memory is organized in a hierarchical structure such that reading from it is done with less computation than soft attention over a flat memory, while also being easier to train than hard attention over a flat memory.
• Specifically, we propose to incorporate Maximum Inner Product Search (MIPS) in the training and inference procedures for our hierarchical memory network.
• We explore the use of various state-of-the art approximate MIPS techniques and report results on SimpleQuestions, a challenging large scale factoid question answering task.

## Articles and Videos

• Part 2: Understanding Medicine
• Most of the tasks medical doctors do are related to "perception", not "decision making". The latter part is relatively fast and has been done better by machines since MYCIN.
• perceptual tasks like identifying tree-shape patterns in X-rays -- Deep learning is very good at it.
• Most susceptible specialties are Radiology and Pathology, comprising 25% of doctors (in Australia).
• Part 3: Understanding Automation
• Automation replaces tasks, not jobs. How much time the task takes a human determines how many jobs are lost.
• Machines that “help” or “augment” humans still destroy jobs and lower wages.
• Hybrid-chess does not prove that human/machine teams are better than computers alone. STOP SAYING THIS, tech people!
• Deep learning threatens tasks that make up a terrifyingly large portion of doctors’ jobs.
• In the developed world, demand for medical services may be unable to increase as prices fall due to automation, which normally protects jobs.
• Part 4: Radiology Escape Velocity
• even at an automation rate of 5% per year, in 30 years there will still be one-third of the current radiologist workforce remaining.
• Part 5: Understanding Regulation
• In the case of the USA, it usually takes 3 to 10 years to go through the whole process from concept to approval for use in the medical industry.
• "measurements"-related technology can opt to go through the class-I (low-risk) route with a substantially shorter time to approval.
• There are two approaches to using computer technology
• measurements to aid doctors' decisions (CADe) -- doctors disliked them, so they are not doing well as a result.
• measurements AND diagnosis (CADx) -- never been approved by FDA before.
• Conclusion: current regulation in developed countries is SUPER conservative and so it will take a lot of time and money to get new technology adopted. Not so for developing world, we might see it much faster there.
• Part 6: Current State-of-the-Art results and impact
• Stanford (and collaborators) trained a system to identify skin lesions that need a biopsy. Skin cancer is the most common malignancy in light-skinned populations.
• This is a useful clinical task, and is a large part of current dermatological practice.
• They used 130,000 skin lesion photographs for training, and enriched their training and test sets with more positive cases than would be typical clinically.
• The images were downsampled heavily, discarding around 90% of the pixels.
• They used a “tree ontology” to organise the training data, allowing them to improve their accuracy by training to recognise 757 classes of disease. This even improved their results on higher level tasks, like “does this lesion need a biopsy?”
• They were better than individual dermatologists at identifying lesions that needed biopsy, with more true positives and fewer false positives.
• While there are possible regulatory issues, the team appears to have a working smartphone application already. I would expect something like this to be available to consumers in the next year or two.
• The impact on dermatology is unclear. We could actually see shortages of dermatologists as demand for biopsy services increases, at least in the short term.
• keypoints
• Identical twins (Alex & Michael) -- study and worked in the same field (Computer Vision)
• Invented what became the Kinect camera sensor
• Keys for recognizing face:
• Humans actually recognize people based on "texture" appearance, not the 3D geometry
• facial expressions change the texture as projected to 2D, but not the actual texture when mapped onto the surface
• Therefore, we can use the "geodesic" distance instead of euclidean distance to measure the actual distance between important face features. If the distances are approximately the same, then it's the same face.
• These kinds of techniques have been used to recognize different faces, including identical twins.
• Geometric deep learning: applying CNNs on 3D surface via heat diffusion equation.
• Use Case: Recognition, social network analysis, recommender systems
• keypoints
• Shannon's Entropy formula - H(X)
• this is a way to estimate how many bits are needed to encode given information with a certain distribution
• the estimated bits assume the best possible encoding ("optimal")
• H(X) = sum over x of P(x)*log2(1/P(x)), where P(x) means the probability of X = x
• some interesting manipulations give the conditional forms
• P(X,Y) = P(X)*P(Y|X) = P(Y)*P(X|Y)
• H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
• H(X|Y) = sum{P(X,Y)*log2(1/P(X|Y))}
• then we can derive "mutual" information [I] and "variation of" information [V]
• I(X,Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
• V(X,Y) = H(X,Y) - I(X,Y)
• KL-divergence [D] or [K]
• Dy(x) = K(X||Y) = H(X,Y) - H(X), where H(X,Y) here means the cross-entropy between X and Y (not the joint entropy above)
• This is a way to see how close the new distribution (Y) is to the original distribution (X)
• if they are the same, then KL is zero; otherwise it is positive.
• this is not a symmetric measure: K(X||Y) ≠ K(Y||X)
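These formulas can be checked numerically from probability tables; a small numpy sketch with entropies in bits (base-2 logs):

```python
import numpy as np

def entropy(p):
    # H = sum p * log2(1/p), skipping zero-probability entries
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def mutual_information(p_xy):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), computed from a joint probability table
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

def kl_divergence(p, q):
    # K(P||Q) = cross-entropy(P,Q) - H(P); zero iff P == Q, and not symmetric
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))
```

For independent X and Y the joint factors into the product of marginals, so the mutual information comes out to exactly zero, matching the identities above.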