Secrets of Performance
How to Improve Program Performance
Bin Lin
Development Manager
Microsoft Research, China
Ways to Improve Performance
Faster hardware - $$$ = speed
Use the right language, compiler optimizations
Design scalable applications
– Architectural design, cache -> RAM -> disk
– Choose the right data structures and algorithms
Tune code
– Avoid slow OS APIs
– Tune, measure, tune, measure, tune, measure…
Design: A Case Study
Design a scalable SMTP server
– Scalability is the key
– 2-CPU, 4-CPU, 8-CPU machines
– Handle as many requests as possible, with relatively fast response time
Design: A Case Study
A simple SMTP server
// Read SMTP commands/data from sockets
if (ReadFile( … )) {
    // various housekeeping removed…
}
// Parse SMTP recipients and other headers
if (!ParseSMTPHeaders(…)) {
    // handle errors…
}
// Parse bodies
if (!ParseSMTPBodies(…)) {
    // handle errors…
}
Design: A Case Study (cont.)
// Local delivery or routing
if (LocalDelivery( … )) {
    Deliver( … );
} else {
    Route( … );
}
// Send SMTP response through the socket
if (WriteFile(…)) {
    // various housekeeping skipped…
}
Traditional Thread Architecture
[Diagram: SMTP requests arrive at a receiver (socket) thread, which dispatches them to a pool of worker threads]
1 thread to receive and dispatch SMTP requests
64 worker threads doing:
– Parse SMTP headers
– Parse SMTP bodies
– Local delivery
– Routing
– All in the same thread sequentially…
The Evolution of Hardware
[Chart: Relative Performance (Latency) over time, 1992-2000, comparing CPU, RAM, and Disk]
Bridge the Gap - Caches
CPU L1 cache
– 8K instruction cache, plus
– 8K data cache
– Closely coupled
– 0.333 clock/instruction – practical 1 CPI
CPU L2 cache
– 512K static RAM
– Coupled with a full clock-speed, 64-bit cache bus
– Latency: 4-1-1-1 – 7 clocks/instruction
I/O caches (RAM based file caches)
The Price of Failure
Let’s look at the costs:
– Assume 1 second to zero a register
– L1 cache hit - 1 second (1x)
– L2 cache hit - 4 seconds (plus 3 seconds extra work - 7x)
– RAM hit - 25-150 seconds (25x-150x)
– Disk or net hit - 3 weeks (2,000,000x)
SMTP Server Architecture
1 thread to receive and dispatch SMTP requests
1 worker thread per CPU
Event Loops (4 stages)
– Parse SMTP headers/bodies
– Local delivery
– Routing
– Socket send and file I/O
One queue for each event loop
Drain each queue to empty before a context switch (see the sketch after the diagram)
[Diagram: SMTP requests flow from the Dispatcher into per-stage queues for the SMTP Header/Body Parser, Local Delivery, Routing, and the I/O Manager]
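A minimal sketch of the drain-to-empty idea, using hypothetical StageQueue/WorkItem types and handler names that are not in the original slides: each stage owns a queue, and a worker drains that queue completely before touching any other stage, so one stage's code and data stay hot in the caches.

typedef struct WorkItem { struct WorkItem *next; void *request; } WorkItem;
typedef struct StageQueue { WorkItem *head; } StageQueue;

static WorkItem *Dequeue(StageQueue *q)
{
    WorkItem *item = q->head;            // single-threaded sketch: no locking shown
    if (item)
        q->head = item->next;
    return item;
}

// One event-loop stage: drain the queue to empty before the worker
// thread touches any other stage (header/body parsing, local delivery,
// routing, or socket/file I/O).
static void RunStage(StageQueue *q, void (*handler)(void *request))
{
    WorkItem *item;
    while ((item = Dequeue(q)) != NULL)  // drain to empty
        handler(item->request);          // finished items are queued for the next stage
}

Each worker thread (one per CPU) would then cycle through the four stage queues, calling RunStage on each in turn.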
Performance Improvement
[Chart: Overall Performance – operations per second (0 to 30,000) against the number of CPUs (2, 4, 6, 8), comparing the SMTP server architecture with the traditional thread architecture]
Putting it all together
Cache effects are a significant factor in
overall performance and scalability
Batching offers significant benefits
Spatial and temporal locality matters
– Cache lines (or pages) contain related data
– Burst or smooth usage pattern (not random)
– Significant repeated use of the cached data
Thread-per-user architectures are expensive and don't scale well
Now let’s look at memory
The footprint of Windows is large
16MB Win98 systems and 32MB Windows 2000 systems cannot host most applications
MS applications need more memory than
Windows can provide
To increase performance, we need to reduce our footprints
What is the Footprint?
The footprint is the unconstrained working
set of an application
– All image code and data recently used by the
application
– All dynamic data recently used by the app
– System resources used to support the app
The working set consists of resident pages
that are using real physical memory
Virtual Memory Usage
Your address space starts out empty
Code and static data from your .exe are part of the address space
Code and static data from the DLLs that your app uses are part of the address space
Dynamic data from heap and VirtualAlloc allocations are part of the address space
OS support structures, like page table pages, are part of the address space
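As a rough illustration (not from the original slides), you can see these pieces of your own address space by walking it with VirtualQuery: MEM_IMAGE regions are EXE/DLL code and static data, MEM_PRIVATE regions are heap/VirtualAlloc data.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    MEMORY_BASIC_INFORMATION mbi;
    unsigned char *p = NULL;

    // Walk the user-mode address space region by region.
    while (VirtualQuery(p, &mbi, sizeof(mbi)) != 0) {
        if (mbi.State == MEM_COMMIT)
            printf("%p %8lu KB %s\n",
                   mbi.BaseAddress,
                   (unsigned long)(mbi.RegionSize / 1024),
                   mbi.Type == MEM_IMAGE  ? "image (EXE/DLL code + static data)" :
                   mbi.Type == MEM_MAPPED ? "mapped file" :
                                            "private (heap / VirtualAlloc)");
        p = (unsigned char *)mbi.BaseAddress + mbi.RegionSize;
    }
    return 0;
}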
Working Set
Paging reduces the working set
The pages remaining in physical memory are the current working set
A page fault will bring in a page, adding to the working set
Application Thrashing
[Chart: Page Fault Rate vs. Memory – page faults/sec against memory in megabytes (0 to 256), annotated with three regions:]
Machine is way too small for the app
Machine is a little small but OK
Machine handles the app just fine
How to Determine Working Set?
NT Task Manager
– Processes Tab shows per process data
– Mem Usage shows active Working Set
– Mem Delta shows Working Set changes
– Page Faults shows total number of faults
NT pmon utility Mem Usage column
NT pstat utility Ws column
Working Set
[Diagram: a working set size scale from 0K to 1.5 MB, with wsMin, wsMax, and wsCur markers between physical memory and the hard disk]
The OS manages the working set size of all the applications
It uses two settings in its paging algorithm: wsMax and wsMin
wsCur's relation to these settings, along with memory constraints, drives the paging decisions
When memory is ample and a page fault occurs, wsCur is allowed to grow unconstrained
Working Set (Cont.)
[Diagram: the same working set size scale, with wsCur growing past wsMax]
When memory is ample and a page fault occurs, wsCur is allowed to grow above wsMax
At some point memory gets tight…
If memory is tight and wsCur is larger than wsMax, the OS takes an aggressive position:
it causes the app to page out one of its own pages to free up room,
then it pages in the faulted page. The overall working set size remains the same.
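A brief sketch (an assumption, not something shown on the slides) of how a process can query the wsMin/wsMax limits the OS is applying to it, using the Win32 GetProcessWorkingSetSize call:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T wsMin = 0, wsMax = 0;

    // Ask the OS for this process's working set minimum and maximum.
    if (GetProcessWorkingSetSize(GetCurrentProcess(), &wsMin, &wsMax))
        printf("wsMin = %lu KB, wsMax = %lu KB\n",
               (unsigned long)(wsMin / 1024), (unsigned long)(wsMax / 1024));
    return 0;
}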
Working Set Growth
When a page fault occurs, the new page is added to the process's working set. If memory is tight, a page is removed from your working set; otherwise, your working set grows
In ample memory conditions, page faults cause WsCur to grow even above WsMax
In tight memory conditions, page faults cause you to page against yourself
Working Set Trim
In tight memory conditions your working
set will not grow
The WsMgr thread steals memory from working sets to make more memory available
The more your working set is over WsMax,
the more pages that get stolen
Small machines (16MB NT, 8MB Win95) almost always operate in this mode!
How to be a Good Citizen of the Operating System!
VirtualUnlock() will kick a range of pages out of
your working set
SetProcessWorkingSetSize(-1,-1)
– empties entire working set
– when a window is minimized, this happens to the owning process
link /WS:AGGRESSIVE marks a process for
aggressive trimming
– Marking helps free memory quickly
– Marked processes can have their working set trimmed
to WsMin (~80k) if they are idle.
– NT marks Explorer, Spooler, services…
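A sketch of the two "good citizen" calls above; the buffer, its size, and the helper name are illustrative only:

#include <windows.h>

void TrimMyWorkingSet(void *buffer, SIZE_T cbBuffer)
{
    // Kick a specific range of pages out of the working set.  The pages
    // are not locked, so the call returns FALSE, but it still removes
    // the range from the working set.
    VirtualUnlock(buffer, cbBuffer);

    // Empty the entire working set -- the full form of the
    // SetProcessWorkingSetSize(-1,-1) call on the slide.  This is what
    // happens to a process when its window is minimized.
    SetProcessWorkingSetSize(GetCurrentProcess(), (SIZE_T)-1, (SIZE_T)-1);
}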
Demo
Watch IE 5.0's working set go from 23MB to 8MB by browse, minimize, restore, refresh
Watch PowerPoint go from 6-7MB to 2-3MB by edit, minimize, restore, change view
Spot the defect
#define PAGE_COUNT 16
#define PAGE_SIZE 4096
DWORD rgdwArray[PAGE_COUNT][PAGE_SIZE];
// code to fill the array skipped…

// access the array
for (int y = 0; y < PAGE_SIZE; y++) {
for (int x = 0; x < PAGE_COUNT; x++) {
DWORD dw = rgdwArray[x][y];
}
}
Spot the defect
#define PAGE_COUNT 16
#define PAGE_SIZE 4096
DWORD rgdwArray[PAGE_COUNT][PAGE_SIZE];
// code to fill the array skipped…

// access the array
for (int y = 0; y < PAGE_SIZE; y++) {
for (int x = 0; x < PAGE_COUNT; x++) {
DWORD dw = rgdwArray[x][y];
}
}
4KB pages – may cause 16*4k page faults.
25-150 sec vs. 3 weeks! (10 TickCounts)
A better version
#define PAGE_COUNT 16
#define PAGE_SIZE 4096
DWORD rgdwArray[PAGE_COUNT][PAGE_SIZE];
// code to fill the array skipped…

// access the array
for (int x = 0; x < PAGE_COUNT; x++) {
for (int y = 0; y < PAGE_SIZE; y++) {
DWORD dw = rgdwArray[x][y];
}
}
Possibly very few page faults
…about 16 (0 TickCount)
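The TickCount figures on these slides can be reproduced with something like the following sketch: GetTickCount is the real API, and the loop body simply repeats the row-major access shown above (the array and constants are redeclared so the sketch stands alone).

#include <windows.h>
#include <stdio.h>

#define PAGE_COUNT 16
#define PAGE_SIZE  4096

static DWORD rgdwArray[PAGE_COUNT][PAGE_SIZE];

int main(void)
{
    DWORD start = GetTickCount();
    volatile DWORD dw = 0;                // volatile keeps the reads from being optimized away

    for (int x = 0; x < PAGE_COUNT; x++)  // row-major: walk each row sequentially
        for (int y = 0; y < PAGE_SIZE; y++)
            dw = rgdwArray[x][y];

    printf("Elapsed: %lu ms\n", (unsigned long)(GetTickCount() - start));
    return 0;
}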
Spot the defect
#define MAX_LENGTH 1024*256*64
int rgLookup[MAX_LENGTH];
// code to fill the array skipped…
// 1. access the array
for (i = 0; i < MAX_LENGTH; i++)
temp = rgLookup[i];
// 2. access the array
for (i = 0; i < MAX_LENGTH; i++)
temp = i+i;
Spot the defect
#define MAX_LENGTH 1024*256*64
int rgLookup[MAX_LENGTH];
// code to fill the array skipped…
// 1. access the array
for (i = 0; i < MAX_LENGTH; i++)
temp = rgLookup[i];
// 2. access the array
for (i = 0; i < MAX_LENGTH; i++)
temp = i+i;
TickCount: loop 1 = 581, loop 2 = 300 – reading the 64MB lookup table pays for cache and page misses, while computing i+i runs entirely in registers
For more information
Microsoft Research China website:
http://www.microsoft.com/china/research
Microsoft Research China email: msrchina@microsoft.com
Recruiting email: msrcnjob@microsoft.com
Bin Lin's email: binlin@microsoft.com