## Multiway Data Analysis

Johan Westerhuis, Biosystems Data Analysis, Swammerdam Institute for Life Sciences, Universiteit van Amsterdam

**The “future” science faculty of the Universiteit van Amsterdam**

The Biosystems Data Analysis group officially started in 2004 as a follow-up of the process analysis group at the Universiteit van Amsterdam. Its aim is the development and validation of new data analysis methods for summarizing and visualizing complex structured biological data (metabolomics / proteomics).

- Three-way data
- Three-way models
- Three-way applications

**Three-way data**
- Three-way data is a set of two-way matrices of the same objects and variables.
- IR, Raman and NMR spectra of the same samples do not give a three-way data set, but a multi-block data set.

[Figure: separate IR, Raman and NMR blocks.]

**Examples of three-way data**
- Chromatography: samples × chromatogram × UV
- Fluorescence: samples × excitation × emission
- Batch process: batches × process variables × time
- Sensory analysis: products × attributes × judges
- Image analysis: image × image × RGB

**From no-way to multi-way**

[Figure: arrays of increasing order: scalar, 1-way (I), 2-way (I × J), 3-way (I × J × K), 4-way (I × J × K × L), 5-way (I × J × K × L × M).]

**Slabs and tubes**

[Figure labels: horizontal slab, vertical slab, frontal slab; horizontal tube, vertical tube, lateral tube.]

**Three slabs of fluorescence data**
5 samples × 60 excitation × 200 emission

[Figure axes: samples, excitation, emission.]

**Three-way batch process data**
- ‘Engineering’ process data, i.e. temperature, pressure, flow rate
- Spectroscopic process data, i.e. NIR, Raman, UV-Vis

[Figure: one batch, X (J × K), process variable × time; a series of batches, X (I × J × K), batch × process variable × time.]

**Spectroscopic three-way batch data**
2 batch runs of a reaction followed with UV-Vis spectroscopy during 45 minutes

**Batch fermentation in two steps: three-way multiblock**

[Figure: an inoculum block and a fermentation block, each batches × time × variables; API.]

[Figure labels: composition, conditions; what we measure, what we want.]
Four-way data in combinatorial catalysis

**Multiway data from the Omics age**

[Figure: two arrays, experiments × time × metabolites and experiments × time × gene expression.]

**Some history**
M.C. Escher: a small problem with orthogonality.

**More history**
- Psychometrics (1944-1980)
  - Cattell 1944: Parallel proportional profiles (common factors fitted simultaneously to many data matrices).
  - Tucker 1964: Tucker models
  - Carroll & Chang 1970: Canonical decomposition (CANDECOMP)
  - Harshman 1970: Parallel factor analysis (PARAFAC)
- Chemistry
  - Ho 1978: Rank annihilation (close to Parafac) on fluorescence data.
  - End of the 1980s, beginning of the 1990s: three-way methods to resolve LC-UV data.

**Multiway PCA: unfolding of three-way data**

[Figure: X (I × J × K) unfolded to I × JK (MacGregor) or to IK × J (Wold).]

**Two ways of unfolding: different assumptions in MSPC**
- Wold
  - Nonlinear behavior in the data
  - Batch trajectories are monitored
  - Online monitoring
- MacGregor
  - Nonlinearities removed
  - Whole batch is considered a measurement
  - Off-line monitoring

**Extension of SVD to Parafac**

[Diagram: the SVD $X = USV^{T}$ drawn as a sum of rank-one terms built from $u_1, v_1$ and $u_2, v_2$, next to the Parafac model of a three-way X with loadings A, B, C and core G, drawn as a sum of triads $a_1, b_1, c_1$ and $a_2, b_2, c_2$.]

**Parafac / Candecomp**
- Parafac is not sequential
  - The whole model must be re-estimated when more components are calculated (no deflation).
- The Parafac solution is unique
  - No rotational freedom
  - Changing parameters will reduce the fit.
- NB! A PCA model is not unique
  - $X = TP^{T} + E = TRR^{-1}P^{T} + E = CS^{T} + E$
- Unique ≠ true

**Extension of Two Mode Component Analysis (TMCA)**

[Diagram: two-mode component analysis and its three-way extension, the Tucker III model, with loadings A, B, C and a core G of dimensions P × Q × R.]

**Tucker models**
- Tucker I (equals MPCA)
- Tucker II
- Tucker III

[Diagram: X decomposed with one, two or three loading matrices (A; A and C; A, B and C) and a core G.]

**Tucker models**
- The core array can be fully filled
- P × Q × R triads (1,1,1 / 1,1,2 / 1,2,1 etc.)
- Not unique: rotational freedom
- Components can be rotated towards orthogonality.
- Not sequential
- Restricted Tucker models can be developed when using prior chemical knowledge

**Number of parameters**
- X (I × J × K), example: I = 50, J = 9, K = 100
- P = Q = R = 3
- Parafac: R(I + J + K) = 477
- Tucker3: PI + QJ + RK + PQR = 504
- MPCA: R(I + JK) = 2850
- Fit: MPCA > Parafac (overfit?)

**Soft models vs hard models**
- Two-way bilinear model:
  - Beer’s law
  - PCA
- Trilinear model:
  - Parafac
  - Fluorescence

[Slide annotations: no orthogonal constraints; orthogonal constraints; no orthogonal constraints.]

**Multiway Regression I**
- Two-step approach:
  - Decomposition of X into A and a model (can be Parafac, Tucker, MPCA, etc.)
  - Regression of y on A
- No information from Y is used in the decomposition.
- Similar to the PCR method.

**Multiway Regression II**
- Direct approach: X is now decomposed with y in mind. This leads to a non-optimal decomposition of X but an improved fit of y.

**When data are not exactly 3-way**

[Figure labels: batch, time, process variable; time / variable; indicator variable versus time.]

**Alignment problems**
- Peak shifts in LC-MS / GC-MS
- Warping methods to align the peaks
  - Dynamic time warping
  - Correlation optimized warping

**Fluorescence data**
- 5 samples with varying concentrations of tyrosine, tryptophan and phenylalanine dissolved in phosphate-buffered water.
- Excitation wavelength: 240–300 nm
- Emission wavelength: 250–450 nm

**Unfold PCA model of fluorescence data**
- 99.97% explained with 3 PCs
- Loadings refolded into excitation / emission form
- Overfit of the data: loading 2 has negative parts, which is not in accordance with fluorescence theory.

**Parafac model of fluorescence data**
99.93% explained variation: a good fit. The loadings are readily interpretable.
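The trilinear Parafac model described on these slides can be illustrated with a minimal alternating least squares (ALS) sketch. Everything below is an assumption made for illustration and is not the presenter's code or data: the data are synthetic (Gaussian-shaped "excitation" and "emission" profiles and a made-up concentration matrix), the helper names `khatri_rao`, `parafac_als` and `gaussian_profiles` are hypothetical, and "fit" is taken to mean one minus the residual sum of squares over the total (uncentered) sum of squares.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (m x R) and (n x R) -> (m*n x R)."""
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def parafac_als(X, R, n_iter=1000, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] by plain ALS."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
    # Unfold X along each mode so that every update is an ordinary least squares step.
    X1 = X.reshape(I, J * K)                     # columns ordered by (j, k)
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)  # columns ordered by (i, k)
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)  # columns ordered by (i, j)
    for _ in range(n_iter):
        # All components are re-estimated together: there is no deflation.
        A = X1 @ np.linalg.pinv(khatri_rao(B, C)).T
        B = X2 @ np.linalg.pinv(khatri_rao(A, C)).T
        C = X3 @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

def gaussian_profiles(n_points, centers, width):
    """Hypothetical non-negative 'spectra': one Gaussian peak per component."""
    axis = np.linspace(0.0, 1.0, n_points)[:, None]
    return np.exp(-0.5 * ((axis - np.asarray(centers)[None, :]) / width) ** 2)

# Synthetic trilinear data: 5 samples x 30 "excitation" x 40 "emission" points.
conc = np.array([[1.0, 0.2, 0.5],
                 [0.3, 1.0, 0.4],
                 [0.6, 0.5, 1.0],
                 [0.9, 0.8, 0.1],
                 [0.2, 0.6, 0.7]])
B_true = gaussian_profiles(30, [0.2, 0.5, 0.8], 0.10)
C_true = gaussian_profiles(40, [0.3, 0.5, 0.7], 0.08)
X = np.einsum("ir,jr,kr->ijk", conc, B_true, C_true)

A, B, C = parafac_als(X, R=3)
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
fit = 1.0 - np.sum((X - X_hat) ** 2) / np.sum(X ** 2)

# Compare recovered and true mode-2 profiles; order and scale are arbitrary.
corr = np.abs(np.corrcoef(B_true.T, B.T)[:3, 3:])
match = corr.max(axis=1)
print(f"fit: {100 * fit:.2f}%, best profile correlations: {np.round(match, 3)}")
```

Because the Parafac solution is unique only up to the order and scaling of the components, the comparison uses correlations between profiles rather than the raw loading values. Real fluorescence data would also need handling of scatter and missing values, which this sketch ignores.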
Intensity in the A mode can be related to concentration. [Figure panels: B and C mode; A mode.]

**Fluorescence data**
Fluorescence data perfectly fit the trilinear model that is applied by Parafac. Due to the uniqueness property of Parafac, the loadings found will perfectly resemble the emission spectra and excitation spectra of the three compounds in the mixtures. This is a nice example of mathematical chromatography.

**Batch reaction monitoring**
- Pseudo-first-order reaction: A + B → C → D + E
- UV-Vis spectrum (300-500 nm) measured every 10 seconds.
- Obeys the Lambert-Beer law
- 35 NOC (normal operating conditions) batches: X (35 × 201 × 271)
- In addition, some disturbed batches were measured:
  - pH disturbance during the reaction
  - Temperature change
  - Impurity

**Aims and goals of research I**
- Data modelling:
  - Improve understanding of the process by interpretation of model parameters
- Analysis of historical batches:
  - Are the current process measurements able to distinguish between ‘good’ and ‘bad’ batches?
- On-line monitoring:
  - Rapid fault detection
  - Easier fault diagnosis: what is the cause of the fault?
  - Prediction of batch duration

**Aims and goals of research II**
Which batch is different?

**Unfold PCA model**
- Unfold keeping the batch direction (I × JK): $X = TP^{T} + E$

**Unfold PCA model**
Many parameters are estimated, so the model is likely to overfit the data.

**Unrestricted Parafac model**
- The simplest three-way model is the PARAFAC model.

[Diagram: X (batch × time × wavelengths) decomposed into loadings A, B and C with an identity core I, plus residuals E.]

**Unrestricted Parafac model**
- Loadings are highly correlated, so the solution may be unstable.
- The model is difficult to interpret.
- 99.4% fit
- Can external knowledge of the process be used to improve the model?

**Grey modelling of batch data**
- ‘Black-box’ or ‘soft’ models are empirical models which aim to fit the data as well as possible, e.g. PCA, neural networks.
- ‘White’ or ‘hard’ models use known external knowledge of the process, e.g. a physicochemical model, mass-energy balances.
- ‘Grey’ or ‘hybrid’ models combine the two.

[Slide annotations: + easy to interpret; not always available; good fit; difficult to interpret; good fit.]

**Modelling batch data**
X = white part + black part + E
- X: total variation
- White part: systematic variation due to known causes
- Black part: systematic variation due to unknown causes
- E: unsystematic variation

**External information**
- Sources: pure spectra, reaction kinetics
- Incorporating external information can
  - increase model interpretability
  - increase model stability

**Restricted ‘white’ model**
- External information is introduced in the form of parameter restrictions.

[Diagram: X (batch × time × wavelengths) modelled with loadings A, B, C and core G, plus residuals E; restrictions come from known spectra, reaction kinetics and the Lambert-Beer law.]

**Restricted Tucker model**
- The model is stable.
- 97.6% fit, lower than for the black model
- Some systematic variation in the data is left unexplained by this model.

**Grey model**

[Figure labels: white components, black components; describe known effects; can be interpreted.]

- 99.8% fit (corresponds well with the estimated level of spectral noise of 0.13%)
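The two unfoldings and the parameter counts quoted on the earlier slides can be checked numerically. The sketch below uses the slide's example sizes (I = 50, J = 9, K = 100, P = Q = R = 3) on a dummy array. The ordering of rows and columns inside the unfolded matrices is an assumption, since the slides give only the dimensions (I × JK and IK × J), and the variable names are hypothetical.

```python
import numpy as np

# Example sizes from the "Number of parameters" slide.
I, J, K = 50, 9, 100   # batches x variables x time points
P = Q = R = 3          # number of components per mode

# Dummy three-way array; each entry is a distinct number so positions can be traced.
X = np.arange(I * J * K, dtype=float).reshape(I, J, K)

# Batch-wise unfolding (I x JK): every batch becomes one long row.
X_batch = X.reshape(I, J * K)

# Variable-wise unfolding (IK x J): every time point of every batch becomes a row.
X_variable = X.transpose(0, 2, 1).reshape(I * K, J)

# Parameter counts, using the formulas on the slide.
n_parafac = R * (I + J + K)
n_tucker3 = P * I + Q * J + R * K + P * Q * R
n_mpca = R * (I + J * K)

print("batch-wise unfolding:", X_batch.shape)
print("variable-wise unfolding:", X_variable.shape)
print("parameters:", {"Parafac": n_parafac, "Tucker3": n_tucker3, "MPCA": n_mpca})
```

With these sizes the unfolded PCA model needs roughly six times as many parameters as Parafac, which is the source of the overfitting concern raised on the slides.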