ABSTRACT
Advances in silicon technology have enabled increasing support for hardware parallelism in embedded processors. Vector units, multiple processors/cores, multithreading, special-purpose accelerators such as DSPs or cryptographic engines, or a combination of the above have appeared in a number of processors. They serve to address the increasing performance requirements of modern embedded applications. The extent to which the available hardware parallelism can be exploited depends directly on the amount of parallelism inherent in the given application and on the congruence between the granularity of hardware and application parallelism. This paper discusses how loop-level parallelism in embedded applications can be exploited in hardware and software. Specifically, it evaluates the efficacy of automatic loop parallelization and the performance potential of different types of parallelism, viz., true thread-level parallelism (TLP), speculative thread-level parallelism, and vector parallelism, when executing loops. Additionally, it discusses the interaction between parallelization and vectorization. Applications from both the industry-standard EEMBC®¹ 1.1 and EEMBC 2.0 suites and the academic MiBench embedded benchmark suite are analyzed using the Intel®² C compiler. The results show the performance that can be achieved today on real hardware using a production compiler, provide upper bounds on the performance potential of the different types of thread-level parallelism, and point out a number of issues that need to be addressed to improve performance. The latter include parallelization of libraries such as libc and the design of parallel algorithms to allow maximal exploitation of parallelism. The results also point to the need for developing new benchmark suites more suitable to parallel compilation and execution.
¹ Other names and brands may be claimed as the property of others.
² Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
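As a rough illustration of the loop-level parallelism the abstract refers to, the following C sketch contrasts a loop whose iterations are independent with one that carries a dependence between iterations. It is not taken from the paper: the function names `saxpy` and `prefix_sum` are hypothetical, and the OpenMP pragma is only one assumed way to request thread-level parallelism (a compiler built without OpenMP support simply ignores it and runs the loop serially).

```c
#include <stddef.h>

/* Independent iterations: each y[i] depends only on x[i] and the old y[i].
 * Such a loop is a candidate for thread-level parallelism and, because the
 * restrict qualifiers rule out aliasing, also for vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* Assumed directive for illustration; ignored when OpenMP is disabled. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        y[i] = a * x[i] + y[i];
}

/* Loop-carried dependence: iteration i reads the value written by
 * iteration i-1, so the iterations cannot simply run concurrently. */
void prefix_sum(size_t n, float *y)
{
    for (size_t i = 1; i < n; i++)
        y[i] += y[i - 1];
}
```

The first loop is the kind a parallelizing or vectorizing compiler can usually handle without help; the second would need a different algorithm before it could run in parallel, which is in the spirit of the abstract's remark about designing parallel algorithms.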
INDEX TERMS
Primary Classification:
D. Software
  D.1 PROGRAMMING TECHNIQUES
    D.1.3 Concurrent Programming
Additional Classification:
D. Software
  D.2 SOFTWARE ENGINEERING
    D.2.1 Requirements/Specifications
      Subjects: Languages; Methodologies (e.g., object-oriented, structured)
    D.2.8 Metrics
      Subjects: Performance measures
General Terms: Measurement, Performance
Keywords: Multi-cores, libraries, multithreading, parallel loops, programming models, system-on-chip (SoC), thread-level speculation, vectorization